🧠 NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
---
✨ Model Highlights
📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference
---
🧠 Intended Uses
• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging
---
🚫 Limitations
• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization (see the tokenization sketch below)
• Limited performance on long sequences; inputs are truncated at 128 tokens
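For illustration (this snippet is not part of the original card), BERT's WordPiece tokenizer splits rare or out-of-vocabulary words into several sub-tokens, which is why token-level predictions must be realigned to word boundaries:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Rare words are typically split into several "##"-prefixed WordPiece pieces,
# so one word can receive multiple token-level predictions.
print(tokenizer.tokenize("Szczecin hosted the summit"))
```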
---
πŸ‹οΈβ€β™‚οΈ Training Details
| Field | Value |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased` |
| **Dataset** | CoNLL-2003 |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs** | 5 |
| **Batch Size** | 16 |
| **Max Length** | 128 tokens |
| **Optimizer** | AdamW |
| **Loss** | CrossEntropyLoss (token-level) |
| **Device** | Trained on CUDA-enabled GPU |
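
The exact training script is not published with this card. The sketch below shows how the hyperparameters in the table could be reproduced with the 🤗 Transformers `Trainer` (which uses AdamW and token-level cross-entropy by default); the learning rate and the dataset-loading details are assumptions, not values reported above.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

# Dataset id assumed; loading details may vary with your `datasets` version.
dataset = load_dataset("conll2003")

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)      # special tokens are ignored by the loss
            elif word_id != previous:
                labels.append(word_labels[word_id])
            else:
                labels.append(-100)      # label only the first subword of each word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(
    output_dir="ner-bert-conll2003",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,                  # assumption: not reported in the table
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```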
---
📊 Evaluation Metrics
| Metric | Score |
| ----------------------------------------------- | ----- |
| Accuracy | 0.98 |
| F1-Score | 0.97 |
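
The card does not state which split these scores were computed on or whether they are token-level or entity-level. As a reference point, NER systems are commonly scored with `seqeval` on word-aligned label sequences; a minimal sketch with made-up example sequences:

```python
# pip install seqeval
from seqeval.metrics import accuracy_score, f1_score

# y_true / y_pred are lists of label sequences (strings), aligned at the word level.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```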
---
🔎 Label Mapping
| Label ID | Entity Type |
| -------- | ----------- |
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
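
The same mapping should be exposed by the hosted model's config; a quick sanity check (assuming the config on the Hub carries `id2label`):

```python
from transformers import AutoConfig

# Compare the hosted config's id-to-label mapping against the table above.
config = AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(config.id2label)
```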
---
🚀 Usage
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
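
`predict_tokens` returns one label per WordPiece sub-token. If you need one label per word instead (avoiding the subword-boundary issue noted under Limitations), the following sketch keeps the prediction of each word's first sub-token using the fast tokenizer's `word_ids()`; it reuses the `tokenizer` and `model` objects from the snippet above.

```python
def predict_words(text):
    """Word-level predictions: keep the label of the first subword of each word."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0]

    words, labels, previous_word_id = [], [], None
    for idx, word_id in enumerate(inputs.word_ids(batch_index=0)):
        if word_id is None or word_id == previous_word_id:
            continue  # skip special tokens and non-initial subwords
        start, end = inputs.word_to_chars(word_id)
        words.append(text[start:end])
        labels.append(model.config.id2label[predictions[idx].item()])
        previous_word_id = word_id
    return list(zip(words, labels))

print(predict_words("Barack Obama visited Google in California."))
```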
---
🧩 Quantization
Post-training static quantization was applied using PyTorch to reduce model size and improve inference speed on edge devices.
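
The exact quantization recipe is not included in this card. As an illustration only, PyTorch's post-training *dynamic* quantization (a related but different technique from static quantization) can shrink the fine-tuned model's Linear layers in a few lines:

```python
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

# Convert Linear-layer weights to int8; activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```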
---
🗂 Repository Structure
```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model weights in safetensors format
└── README.md            # Model card
```
---
🤝 Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.