🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies each token in a text into one of four predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The model is suited to information extraction, document tagging, and question answering systems.


✨ Model Highlights

📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference


🧠 Intended Uses

• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging


🚫 Limitations

• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned because of subword tokenization
• Limited performance on sequences longer than 128 tokens, the maximum length used in training
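
One common way to work around the 128-token limit is to split a long input into overlapping windows and tag each window separately. The sketch below illustrates the splitting step on an already-tokenized list. `split_into_windows`, the window size, and the stride are hypothetical choices for illustration, not part of this model's released code.

```python
def split_into_windows(tokens, window=126, stride=96):
    """Split a token list into overlapping (start_offset, window_tokens) pairs.

    window=126 leaves room for [CLS] and [SEP] inside a 128-token limit
    (an assumption about how inputs are built). A stride smaller than the
    window makes consecutive windows overlap, so an entity cut off at one
    window's edge is likely to appear whole in the next.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return windows


# Hypothetical 300-token input, represented by integers for brevity
for offset, chunk in split_into_windows(list(range(300))):
    print(offset, len(chunk))
```

Predictions for tokens covered by two windows then need a merge rule, for example keeping the prediction from the window in which the token sits furthest from an edge.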


πŸ‹οΈβ€β™‚οΈ Training Details

Field Value
Base Model bert-base-cased
Dataset CoNLL-2003
Framework PyTorch with πŸ€— Transformers
Epochs 5
Batch Size 16
Max Length 128 tokens
Optimizer AdamW
Loss CrossEntropyLoss (token-level)
Device Trained on CUDA-enabled GPU
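
A token-level cross-entropy loss over WordPiece inputs needs one label per sub-token, while CoNLL-2003 provides one label per word. This card does not say how that gap was handled. A common convention, shown here as an assumption, is to label only the first sub-token of each word and assign every other position the index -100, which PyTorch's `CrossEntropyLoss` ignores by default. `align_labels` is a hypothetical helper, and the example `word_ids` list mimics what a fast tokenizer's `word_ids()` method returns.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Expand word-level label IDs to one label per sub-token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:          # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:    # first sub-token of a word keeps its label
            aligned.append(word_labels[word_id])
        else:                        # continuation sub-token is masked out
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two pieces.
# Word labels use this card's label IDs: 1 = B-PER, 2 = I-PER, 0 = O.
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_ids, [1, 2, 0]))
```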

📊 Evaluation Metrics

| Metric | Score |
|---|---|
| Accuracy | 0.98 |
| F1-Score | 0.97 |
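
This card does not state how these scores were computed. CoNLL-2003 results are conventionally reported as entity-level F1, where a predicted entity counts as correct only if both its type and its exact span match the gold annotation. A minimal sketch of that calculation follows; `entity_f1` and the example spans are hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level precision, recall, and F1 over (type, start, end) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact type-and-span matches
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the predicted LOC entity starts one token too early,
# so it counts as both a false positive and a missed gold entity.
gold = {("PER", 0, 2), ("ORG", 3, 4), ("LOC", 5, 6)}
pred = {("PER", 0, 2), ("ORG", 3, 4), ("LOC", 4, 6)}
print(entity_f1(gold, pred))
```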

🔎 Label Mapping

| Label ID | Entity Type |
|---|---|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
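
The B- and I- prefixes follow the BIO scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity. To illustrate how a sequence of label IDs turns into entities, here is a small decoder. `decode_entities` is a hypothetical helper, and its handling of a stray I- tag (treating it as the start of a new entity) is one common choice rather than documented behavior of this model.

```python
ID2LABEL = {
    0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC",
}


def decode_entities(words, label_ids):
    """Group word-level BIO labels into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, label_id in zip(words, label_ids):
        prefix, _, entity_type = ID2LABEL[label_id].partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the open entity, if any, before starting something new
            if current_type is not None:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
            if prefix in ("B", "I"):  # a stray I- tag also opens an entity
                current_type = entity_type
        if current_type is not None:
            current_words.append(word)
    if current_type is not None:  # flush an entity that ends the sentence
        entities.append((current_type, " ".join(current_words)))
    return entities


words = ["Barack", "Obama", "visited", "Google", "in", "California", "."]
label_ids = [1, 2, 0, 3, 0, 5, 0]  # hand-written labels, not model output
print(decode_entities(words, label_ids))
# [('PER', 'Barack Obama'), ('ORG', 'Google'), ('LOC', 'California')]
```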

🚀 Usage

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

def predict_tokens(text):
    """Return a (token, label) pair for every WordPiece token in `text`."""
    # Inputs longer than 128 tokens are truncated
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)
    predictions = torch.argmax(logits, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    # The result includes the special tokens [CLS] and [SEP]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
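
`predict_tokens` returns one pair per WordPiece token, including `[CLS]` and `[SEP]`, and words missing from the vocabulary come back as several pieces prefixed with `##`. A minimal post-processing sketch drops the special tokens and merges the pieces back into words, keeping the label of each word's first piece. `merge_wordpieces` is a hypothetical helper, and the sample input is hand-written for illustration, not real model output.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(token_label_pairs):
    """Merge '##' continuation pieces into whole words with one label each."""
    merged = []
    for token, label in token_label_pairs:
        if token in SPECIAL_TOKENS:
            continue  # special tokens carry no entity information
        if token.startswith("##") and merged:
            # Append the piece to the previous word and keep its first label
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged


sample = [
    ("[CLS]", "O"), ("Ob", "B-PER"), ("##ama", "I-PER"),
    ("visited", "O"), ("[SEP]", "O"),
]
print(merge_wordpieces(sample))
# [('Obama', 'B-PER'), ('visited', 'O')]
```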

🧩 Quantization

Post-training static quantization was applied using PyTorch to reduce model size and speed up inference on edge devices.
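
To make the idea concrete: static quantization observes the range of values on calibration data, then maps floats to 8-bit integers using a fixed scale and zero point. The plain-Python sketch below illustrates that arithmetic only. It is not PyTorch's implementation, and the function names and sample values are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive an affine int8 scale and zero point from observed values."""
    # The representable range must include 0 so that 0.0 maps to an integer
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value, clipping anything outside the range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical calibration values standing in for observed activations
scale, zero_point = calibrate([-1.0, 0.5, 1.55])
q = quantize(0.5, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

Each stored value shrinks from 32 bits to 8, at the cost of a rounding error of at most about half the scale for values inside the calibrated range.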


🗂 Repository Structure

.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors/   # Fine-tuned model in safetensors format
├── README.md            # Model card

🤝 Contributing

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.