|
🧠 NERClassifier-BERT-CoNLL2003
|
|
|
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems. |
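
For a quick sanity check, the model can also be loaded through the 🤗 Transformers pipeline API. This is a minimal sketch, assuming the hosted checkpoint `AventIQ-AI/ner_bert_conll2003` (used in the Usage section below) exposes the label mapping described in this card:

```python
from transformers import pipeline

# Token-classification pipeline; aggregation_strategy="simple" merges
# B-/I- subword predictions into whole entity spans.
ner = pipeline(
    "token-classification",
    model="AventIQ-AI/ner_bert_conll2003",
    aggregation_strategy="simple",
)

print(ner("Barack Obama visited Google in California."))
# Expected entity groups (illustrative): Barack Obama -> PER, Google -> ORG, California -> LOC
```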
|
|
|
--- |
|
✨ Model Highlights
|
|
|
🚀 Based on bert-base-cased (by Google)

🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset

⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC

💾 Available in both full and quantized versions for fast inference
|
|
|
--- |
|
🧠 Intended Uses
|
|
|
• Resume and document parsing

• News article analysis

• Question answering pipelines

• Chatbots and virtual assistants

• Information retrieval and tagging
|
|
|
--- |
|
🚫 Limitations
|
|
|
• Trained on English-only NER data (CoNLL-2003)

• May not perform well on informal text (e.g., tweets, slang)

• Entity boundaries may be misaligned with subword tokenization

• Limited performance on extremely long sequences (>128 tokens); see the chunking sketch after this list
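
Texts longer than 128 tokens can still be handled by splitting them into overlapping windows and running the model on each window separately. The following is a minimal sketch using the fast tokenizer's overflow handling; the `stride` value and the per-window decoding (overlaps are not merged) are illustrative choices, not part of the released model:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_long_text(text, max_length=128, stride=32):
    # Split the input into overlapping windows of at most `max_length` tokens.
    enc = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
    )
    with torch.no_grad():
        logits = model(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
        ).logits
    preds = logits.argmax(dim=-1)

    # Decode each window independently; predictions in the overlap regions are
    # duplicated rather than merged in this sketch.
    windows = []
    for i in range(enc["input_ids"].shape[0]):
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][i])
        labels = [model.config.id2label[p.item()] for p in preds[i]]
        windows.append(list(zip(tokens, labels)))
    return windows
```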
|
|
|
--- |
|
🏋️‍♂️ Training Details
|
|
|
| Field | Value | |
|
| -------------- | ------------------------------ | |
|
| **Base Model** | `bert-base-cased` | |
|
| **Dataset** | CoNLL-2003 | |
|
| **Framework**  | PyTorch with 🤗 Transformers   |
|
| **Epochs** | 5 | |
|
| **Batch Size** | 16 | |
|
| **Max Length** | 128 tokens | |
|
| **Optimizer** | AdamW | |
|
| **Loss** | CrossEntropyLoss (token-level) | |
|
| **Device** | Trained on CUDA-enabled GPU | |
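
For orientation, the configuration above maps onto the standard 🤗 Transformers token-classification recipe roughly as follows. This is an illustrative sketch, not the original training script; the learning rate, data collator, and label-alignment strategy are assumptions.

```python
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))

def tokenize_and_align(examples):
    # Tokenize pre-split words; copy each word's tag to its first subword and
    # mark remaining subwords and special tokens with -100 so the built-in
    # token-level CrossEntropyLoss ignores them.
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=128,
    )
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            labels.append(-100 if word_id is None or word_id == previous else tags[word_id])
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="ner_bert_conll2003",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # assumed; not stated in the table
)

# Trainer uses AdamW by default, matching the optimizer listed above.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```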
|
|
|
--- |
|
📊 Evaluation Metrics
|
|
|
| Metric | Score | |
|
| ----------------------------------------------- | ----- | |
|
| Accuracy | 0.98 | |
|
| F1-Score | 0.97 | |
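
The card does not state whether these scores are token-level or entity-level. A common way to compute comparable numbers is the seqeval package, which scores complete entity spans; here is a small sketch with illustrative gold/predicted tag sequences in the same BIO scheme as the label mapping below:

```python
from seqeval.metrics import accuracy_score, f1_score

# One list of BIO tags per sentence (illustrative data, not the real test set).
y_true = [["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```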
|
|
|
--- |
|
🔖 Label Mapping
|
|
|
| Label ID | Entity Type | |
|
| -------- | ----------- | |
|
| 0 | O | |
|
| 1 | B-PER | |
|
| 2 | I-PER | |
|
| 3 | B-ORG | |
|
| 4 | I-ORG | |
|
| 5 | B-LOC | |
|
| 6 | I-LOC | |
|
| 7 | B-MISC | |
|
| 8 | I-MISC | |
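
The same mapping should be retrievable from the model configuration, which is a quick way to confirm it matches this table (assuming the hosted checkpoint stores `id2label`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(config.id2label)
# Expected to match the table, e.g. {0: "O", 1: "B-PER", 2: "I-PER", ...}
```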
|
|
|
--- |
|
🚀 Usage
|
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Tokenize to the model's 128-token training length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Pick the highest-scoring label for each token.
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
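
`predict_tokens` returns WordPiece sub-tokens (e.g. `Cali`, `##fornia`), each with its own label. To get word-level entities instead, the fast tokenizer's `word_ids()` and `word_to_chars()` can be used to keep only the first sub-token of each word; a minimal sketch building on the objects defined above:

```python
def predict_words(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0]

    word_ids = inputs.word_ids(batch_index=0)
    results, previous_word = [], None
    for idx, word_id in enumerate(word_ids):
        # Skip special tokens ([CLS], [SEP]) and continuation sub-tokens.
        if word_id is None or word_id == previous_word:
            previous_word = word_id
            continue
        previous_word = word_id
        start, end = inputs.word_to_chars(word_id)
        results.append((text[start:end], model.config.id2label[predictions[idx].item()]))
    return results

print(predict_words("Barack Obama visited Google in California."))
# e.g. [('Barack', 'B-PER'), ('Obama', 'I-PER'), ..., ('California', 'B-LOC')]
```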
|
--- |
|
🧩 Quantization
|
|
|
Post-training static quantization was applied using PyTorch to reduce model size and improve inference performance on edge devices.
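
The exact static-quantization procedure is not included in this card. As a point of reference, the snippet below shows PyTorch's post-training dynamic quantization of the linear layers, a commonly used alternative for BERT-style models; it is an illustration, not the recipe used to produce the released quantized weights:

```python
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

# Store nn.Linear weights in int8 and quantize activations on the fly at inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized_model.state_dict(), "ner_bert_conll2003_quantized.pt")
```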
|
|
|
--- |
|
📁 Repository Structure
|
``` |
|
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
|
|
|
``` |
|
--- |
|
🤝 Contributing
|
|
|
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model. |
|
|