
🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). It is suited to information extraction, document tagging, and question-answering systems.

✨ Model Highlights

📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference

🧠 Intended Uses

• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging

🚫 Limitations

• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Labels are predicted per subword token, so entity boundaries may not align with word boundaries
• Limited performance on long sequences: the model was trained with a max length of 128 tokens, and longer inputs are truncated
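One common way to work around the 128-token limit is to split a long document into overlapping windows and tag each window separately. The sketch below is only an illustration: the function name `chunk_words` and the `max_words` and `stride` values are assumptions, not settings from this model card. The window is kept well below 128 words because one word can expand into several subword tokens.

```python
def chunk_words(words, max_words=100, stride=50):
    """Split a list of words into overlapping windows.

    Returns (start_index, window) pairs. max_words and stride are
    illustrative values, not settings taken from this model card.
    """
    if stride <= 0 or stride > max_words:
        raise ValueError("stride must be between 1 and max_words")
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + max_words]))
        # Stop once the current window reaches the end of the input.
        if start + max_words >= len(words):
            break
        start += stride
    return chunks


words = [f"w{i}" for i in range(230)]
chunks = chunk_words(words)
print([start for start, _ in chunks])
```

Each window can then be joined back into text and passed to the model. How predictions in the overlapping regions are reconciled (for example, preferring the window where the word is farther from an edge) is a design choice left open here.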

🏋️‍♂️ Training Details

Field	Value
Base Model	`bert-base-cased`
Dataset	CoNLL-2003
Framework	PyTorch with 🤗 Transformers
Epochs	5
Batch Size	16
Max Length	128 tokens
Optimizer	AdamW
Loss	CrossEntropyLoss (token-level)
Device	Trained on CUDA-enabled GPU
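Token-level cross-entropy on a subword tokenizer usually needs the word-level CoNLL labels mapped onto subword positions. The card does not say how this was done, so the following is a sketch of one widely used convention, assumed rather than confirmed: label the first subword of each word and mask the rest with `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The `word_ids` input has the shape returned by a fast tokenizer's `word_ids()` method; the example values are hand-written.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Map word-level label IDs onto subword token positions.

    word_ids: one entry per token, giving the index of its source word,
              or None for special tokens such as [CLS] and [SEP].
    word_labels: one label ID per word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token: no loss
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword carries the label
        else:
            aligned.append(IGNORE_INDEX)      # later subwords are masked
        previous = wid
    return aligned


# Hypothetical three-word input whose last word splits into two subwords.
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [1, 0, 5]  # B-PER, O, B-LOC
print(align_labels(word_ids, word_labels))  # [-100, 1, 0, 5, -100, -100]
```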

📊 Evaluation Metrics

Metric	Score
Accuracy	0.98
F1-Score	0.97
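The card does not state whether these scores are computed per token or per entity, and the two can differ noticeably because most tokens are `O`. As a rough illustration with hand-made data (not results from this model), the sketch below computes token accuracy and an exact-match entity-level F1. The function names and the `(type, start, end)` span format are assumptions chosen for the example.

```python
def token_accuracy(gold_tags, pred_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


def entity_f1(gold_spans, pred_spans):
    """Exact-match F1 over (type, start, end) entity spans."""
    gold_spans, pred_spans = set(gold_spans), set(pred_spans)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hand-made example: the prediction misses one of three entities.
gold_tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "O", "O"]
gold_spans = {("PER", 0, 2), ("ORG", 3, 4), ("LOC", 5, 6)}
pred_spans = {("PER", 0, 2), ("ORG", 3, 4)}

accuracy = token_accuracy(gold_tags, pred_tags)
f1 = entity_f1(gold_spans, pred_spans)
print(round(accuracy, 3), round(f1, 3))
```

In this toy case one missed entity costs a single token in accuracy but a third of the recall, which is why entity-level F1 tends to be the stricter number.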

🔎 Label Mapping

Label ID	Entity Type
0	O
1	B-PER
2	I-PER
3	B-ORG
4	I-ORG
5	B-LOC
6	I-LOC
7	B-MISC
8	I-MISC
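The table above follows the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A minimal sketch of turning a sequence of label IDs into entity spans is shown below. The helper name `decode_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is a lenient choice made for this example, not something the card specifies.

```python
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}


def decode_spans(label_ids):
    """Return (entity_type, start, end) tuples, with end exclusive."""
    spans = []
    current = None  # (entity_type, start) of the span being built
    for i, label_id in enumerate(label_ids):
        tag = ID2LABEL[label_id]
        prefix, etype = ("O", None) if tag == "O" else tag.split("-", 1)
        continues = prefix == "I" and current is not None and current[0] == etype
        # Close the open span unless this tag continues it.
        if current is not None and not continues:
            spans.append((current[0], current[1], i))
            current = None
        # Open a new span on B-, or on an I- tag with nothing to continue.
        if prefix == "B" or (prefix == "I" and current is None):
            current = (etype, i)
    if current is not None:
        spans.append((current[0], current[1], len(label_ids)))
    return spans


# B-PER I-PER O B-ORG O B-LOC O
print(decode_spans([1, 2, 0, 3, 0, 5, 0]))
# [('PER', 0, 2), ('ORG', 3, 4), ('LOC', 5, 6)]
```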

🚀 Usage

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Tokenize into WordPiece subwords; inputs beyond 128 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, seq_len, num_labels)
    predictions = torch.argmax(logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    # One (token, label) pair per subword, including [CLS] and [SEP].
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
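The pairs returned by `predict_tokens` are per subword and include the special tokens `[CLS]` and `[SEP]`. A possible post-processing step, sketched here under the assumption that each word should take the label of its first subword, drops special tokens and glues `##` continuation pieces back onto the preceding token. The helper name `merge_wordpieces` and the sample pairs are hypothetical; the sample is hand-written, not actual model output.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(token_label_pairs):
    """Collapse (subword, label) pairs into (word, label) pairs.

    Each word keeps the label predicted for its first subword.
    """
    words = []
    for token, label in token_label_pairs:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hand-written pairs with a made-up subword split, for illustration only.
pairs = [
    ("[CLS]", "O"), ("Barack", "B-PER"), ("Obama", "I-PER"),
    ("visited", "O"), ("Cal", "B-LOC"), ("##ifornia", "I-LOC"),
    (".", "O"), ("[SEP]", "O"),
]
print(merge_wordpieces(pairs))
```

Punctuation stays a separate item because the BERT tokenizer splits it off before WordPiece is applied.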

🧩 Quantization

Post-training static quantization was applied using PyTorch to reduce model size and speed up inference on edge devices.
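The card does not give the quantization configuration, so the following is not the repository's code. It is a plain-Python sketch of the generic affine int8 mapping that static quantization relies on: a calibration pass records the value range, from which a fixed scale and zero point are derived and then reused at inference time. The function names and calibration values are made up for illustration.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive a fixed scale and zero point from observed calibration values."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value, clamping to the representable range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximate float from its int8 representation."""
    return (q - zero_point) * scale


scale, zero_point = calibrate([-1.0, 0.5, 2.0])
for x in [-1.0, 0.0, 0.5, 2.0]:
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

For values inside the calibrated range, the round-trip error is at most half of `scale`; values outside it are clamped, which is why the calibration data should resemble real inputs.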

🗂 Repository Structure

.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors/   # Fine-tuned model in safetensors format
└── README.md            # Model card

🤝 Contributing

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.