# NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
## Model Highlights

- Based on bert-base-cased (by Google)
- Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
- Supports prediction of 4 entity types: PER, LOC, ORG, MISC
- Available in both full and quantized versions for fast inference
## Intended Uses

- Resume and document parsing
- News article analysis
- Question answering pipelines
- Chatbots and virtual assistants
- Information retrieval and tagging
## Limitations

- Trained on English-only NER data (CoNLL-2003)
- May not perform well on informal text (e.g., tweets, slang)
- Entity boundaries may be misaligned with subword tokenization (see the word-alignment sketch below)
- Limited performance on sequences longer than the 128-token training limit
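Because the model predicts one label per subword piece, downstream code usually has to map predictions back to whole words. The sketch below shows one possible post-processing step; it is not part of the released model, and the helper name `predict_words` and the first-subword labeling choice are assumptions.

```python
# Hedged sketch: align subword predictions back to whole words using the fast
# tokenizer's word_ids()/word_to_chars(). Helper name and strategy are ours.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_words(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    word_ids = inputs.word_ids(batch_index=0)

    results, seen = [], set()
    for token_idx, word_id in enumerate(word_ids):
        if word_id is None or word_id in seen:  # skip [CLS]/[SEP] and subword continuations
            continue
        seen.add(word_id)
        span = inputs.word_to_chars(word_id)    # character span of the whole word
        results.append((text[span.start:span.end], model.config.id2label[pred_ids[token_idx]]))
    return results

print(predict_words("Angela Merkel met reporters in Berlin."))
```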
## Training Details
Field | Value |
---|---|
Base Model | bert-base-cased |
Dataset | CoNLL-2003 |
Framework | PyTorch with Hugging Face Transformers |
Epochs | 5 |
Batch Size | 16 |
Max Length | 128 tokens |
Optimizer | AdamW |
Loss | CrossEntropyLoss (token-level) |
Device | Trained on CUDA-enabled GPU |
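The training script itself is not published with this card. For readers who want to reproduce a comparable run, the sketch below wires the hyperparameters from the table into the Hugging Face Transformers Trainer; the dataset loading call, learning rate, and label-alignment strategy are assumptions, not the exact recipe used for this checkpoint.

```python
# Hedged sketch of a comparable fine-tuning setup (not the exact script used
# for this checkpoint). Hyperparameters mirror the table above; the learning
# rate and label-alignment strategy are assumptions.
from datasets import load_dataset
from transformers import (
    BertForTokenClassification,
    BertTokenizerFast,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=128)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        labels, prev = [], None
        for wid in word_ids:
            # label only the first subword of each word; ignore the rest (-100)
            labels.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="ner_bert_conll2003",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # assumption: not stated in the table
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()  # AdamW + token-level cross-entropy are the Trainer defaults
```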
## Evaluation Metrics
Metric | Score |
---|---|
Accuracy | 0.98 |
F1-Score | 0.97 |
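The card does not state how these scores were computed. For CoNLL-style NER, a common choice is entity-level scoring with the seqeval package; the sketch below uses toy data only and assumes predictions and references are parallel lists of per-sentence label strings (with ignored -100 positions already filtered out).

```python
# Hedged sketch: entity-level scoring with seqeval; `predictions` and
# `references` here are toy stand-ins for real per-sentence label sequences.
from seqeval.metrics import accuracy_score, classification_report, f1_score

references  = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]
predictions = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]

print("Accuracy:", accuracy_score(references, predictions))
print("F1-score:", f1_score(references, predictions))
print(classification_report(references, predictions))
```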
## Label Mapping
Label ID | Entity Type |
---|---|
0 | O |
1 | B-PER |
2 | I-PER |
3 | B-ORG |
4 | I-ORG |
5 | B-LOC |
6 | I-LOC |
7 | B-MISC |
8 | I-MISC |
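The same mapping should also be retrievable programmatically from the checkpoint's config (assuming the config ships with these id2label entries rather than generic LABEL_0-style names):

```python
# Hedged sketch: read the label mapping straight from the model config.
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(model.config.id2label)
# expected to match the table above, e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', ...}
```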
## Usage

```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
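If you prefer entity-level spans over per-token labels, the generic token-classification pipeline can wrap the same checkpoint (a sketch, assuming the checkpoint's config carries the label mapping shown above):

```python
# Hedged sketch: let the pipeline group subword pieces into entity spans.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="AventIQ-AI/ner_bert_conll2003",
    aggregation_strategy="simple",
)
print(ner("Barack Obama visited Google in California."))
```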
## Quantization

Post-training static quantization was applied with PyTorch to reduce model size and improve inference performance on edge devices.
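The exact quantization recipe is not included in this card. As a rough illustration only, the snippet below applies PyTorch dynamic quantization to the linear layers, a simpler post-training scheme than the static quantization described above, and not necessarily how the released quantized weights were produced.

```python
# Illustration only: dynamic (not static) post-training quantization of the
# Linear layers; the released quantized checkpoint may have been built differently.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized.bert.encoder.layer[0].attention.self.query)  # repr shows DynamicQuantizedLinear
```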
## Repository Structure

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
```
## Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.