🧠 NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
---
✨ Model Highlights
📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference
---
🧠 Intended Uses
• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging
---
🚫 Limitations
• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization (see the tokenization sketch below)
• Limited performance on long sequences; inputs are truncated at 128 tokens
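For illustration (this snippet is not part of the original card), BERT's WordPiece tokenizer splits rare or out-of-vocabulary words into several sub-tokens, which is why token-level predictions must be realigned to word boundaries:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Rare words are typically split into several "##"-prefixed WordPiece pieces,
# so one word can receive multiple token-level predictions.
print(tokenizer.tokenize("Szczecin hosted the summit"))
```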
---
πŸ‹οΈβ€β™‚οΈ Training Details
| Field | Value |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased` |
| **Dataset** | CoNLL-2003 |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs** | 5 |
| **Batch Size** | 16 |
| **Max Length** | 128 tokens |
| **Optimizer** | AdamW |
| **Loss** | CrossEntropyLoss (token-level) |
| **Device** | Trained on CUDA-enabled GPU |
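
The exact training script is not published with this card. The sketch below shows how the hyperparameters in the table could be reproduced with the 🤗 Transformers `Trainer` (which uses AdamW and token-level cross-entropy by default); the learning rate and the dataset-loading details are assumptions, not values reported above.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

# Dataset id assumed; loading details may vary with your `datasets` version.
dataset = load_dataset("conll2003")

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)      # special tokens are ignored by the loss
            elif word_id != previous:
                labels.append(word_labels[word_id])
            else:
                labels.append(-100)      # label only the first subword of each word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(
    output_dir="ner-bert-conll2003",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,                  # assumption: not reported in the table
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```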
---
📊 Evaluation Metrics
| Metric | Score |
| ----------------------------------------------- | ----- |
| Accuracy | 0.98 |
| F1-Score | 0.97 |
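
The card does not state which split these scores were computed on or whether they are token-level or entity-level. As a reference point, NER systems are commonly scored with `seqeval` on word-aligned label sequences; a minimal sketch with made-up example sequences:

```python
# pip install seqeval
from seqeval.metrics import accuracy_score, f1_score

# y_true / y_pred are lists of label sequences (strings), aligned at the word level.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```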
---
🔎 Label Mapping
| Label ID | Entity Type |
| -------- | ----------- |
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
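
The same mapping should be exposed by the hosted model's config; a quick sanity check (assuming the config on the Hub carries `id2label`):

```python
from transformers import AutoConfig

# Compare the hosted config's id-to-label mapping against the table above.
config = AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(config.id2label)
```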
---
🚀 Usage
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
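
`predict_tokens` returns one label per WordPiece sub-token. If you need one label per word instead (avoiding the subword-boundary issue noted under Limitations), the following sketch keeps the prediction of each word's first sub-token using the fast tokenizer's `word_ids()`; it reuses the `tokenizer` and `model` objects from the snippet above.

```python
def predict_words(text):
    """Word-level predictions: keep the label of the first subword of each word."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0]

    words, labels, previous_word_id = [], [], None
    for idx, word_id in enumerate(inputs.word_ids(batch_index=0)):
        if word_id is None or word_id == previous_word_id:
            continue  # skip special tokens and non-initial subwords
        start, end = inputs.word_to_chars(word_id)
        words.append(text[start:end])
        labels.append(model.config.id2label[predictions[idx].item()])
        previous_word_id = word_id
    return list(zip(words, labels))

print(predict_words("Barack Obama visited Google in California."))
```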
---
🧩 Quantization
Post-training static quantization was applied using PyTorch to reduce model size and improve inference speed on edge devices.
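
The exact quantization recipe is not included in this card. As an illustration only, PyTorch's post-training *dynamic* quantization (a related but different technique from static quantization) can shrink the fine-tuned model's Linear layers in a few lines:

```python
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

# Convert Linear-layer weights to int8; activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```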
---
🗂 Repository Structure
```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model weights in safetensors format
└── README.md            # Model card
```
---
🤝 Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.