🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.

---
✨ Model Highlights

📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference

---
🧠 Intended Uses

• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging

---
🚫 Limitations

• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization (see the word-level alignment sketch after this list)
• Limited performance on extremely long sequences (>128 tokens)

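Because the model predicts one label per subword piece, a common mitigation for the boundary issue above is to map predictions back onto whole words with the fast tokenizer's `word_ids()`. The sketch below is illustrative only (it is not part of the released code) and reuses the checkpoint name from the Usage section below:

```python
# Minimal sketch: align subword predictions back to whole words via word_ids().
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_words(words):
    # `words` is a pre-split sentence, e.g. ["Barack", "Obama", "visited", ...]
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=128)
    with torch.no_grad():
        pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
    word_labels = {}
    for token_idx, word_idx in enumerate(enc.word_ids(batch_index=0)):
        # Keep the first subword's prediction for each word; skip special tokens.
        if word_idx is not None and word_idx not in word_labels:
            word_labels[word_idx] = model.config.id2label[pred_ids[token_idx]]
    return [(word, word_labels[i]) for i, word in enumerate(words)]

print(predict_words(["Barack", "Obama", "visited", "Google", "in", "California", "."]))
```
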
---
🏋️‍♂️ Training Details

| Field          | Value                          |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased`              |
| **Dataset**    | CoNLL-2003                     |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs**     | 5                              |
| **Batch Size** | 16                             |
| **Max Length** | 128 tokens                     |
| **Optimizer**  | AdamW                          |
| **Loss**       | CrossEntropyLoss (token-level) |
| **Device**     | Trained on CUDA-enabled GPU    |

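The training script itself is not included in this repository. Purely as an illustration, the sketch below shows how the hyperparameters from the table could be wired into a 🤗 Transformers `Trainer` run; the learning rate is an assumption (the card does not state one), and the preprocessing follows the standard token-classification recipe.

```python
# Illustrative sketch only -- not the actual training script for this checkpoint.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
)

def tokenize_and_align(batch):
    # Tokenize pre-split words and attach each word's NER tag to its first subword.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=128)       # Max Length: 128 tokens
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        labels, prev = [], None
        for word_idx in enc.word_ids(batch_index=i):
            if word_idx is None or word_idx == prev:
                labels.append(-100)                         # ignored by the token-level loss
            else:
                labels.append(tags[word_idx])
            prev = word_idx
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="ner-bert-conll2003",
    num_train_epochs=5,                                     # Epochs: 5
    per_device_train_batch_size=16,                         # Batch Size: 16
    learning_rate=2e-5,                                     # assumption, not stated in the card
)

# AdamW is the Trainer default optimizer; the model applies token-level cross-entropy internally.
Trainer(model=model, args=args,
        train_dataset=tokenized["train"], eval_dataset=tokenized["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```
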
---
📊 Evaluation Metrics

| Metric   | Score |
| -------- | ----- |
| Accuracy | 0.98  |
| F1-Score | 0.97  |

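The card reports overall scores only. A common way to compute entity-level precision, recall, and F1 for NER is the `seqeval` metric; in the sketch below, `predictions` and `references` stand in for word-aligned IOB tag sequences produced by your own evaluation loop:

```python
# Sketch: entity-level metrics with seqeval via the `evaluate` library.
# `predictions` / `references` are lists of IOB label sequences, aligned per word.
import evaluate

seqeval = evaluate.load("seqeval")
predictions = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]]
references  = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"], results["overall_accuracy"])
```
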
---
🔎 Label Mapping

| Label ID | Entity Type |
| -------- | ----------- |
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
| 7        | B-MISC      |
| 8        | I-MISC      |

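For convenience, the same mapping written as the `id2label` / `label2id` dictionaries that a Transformers config carries (the mapping stored in the released checkpoint's `config.json` is authoritative):

```python
# The label mapping above as Transformers-style config dictionaries.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
label2id = {label: idx for idx, label in id2label.items()}

# To double-check against the published checkpoint:
# from transformers import AutoConfig
# print(AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003").id2label)
```
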
68
+ ---
69
+ πŸš€ Usage
70
+ ```python
71
+ from transformers import BertTokenizerFast, BertForTokenClassification
72
+ import torch
73
+
74
+ model_name = "AventIQ-AI/ner_bert_conll2003"
75
+ tokenizer = BertTokenizerFast.from_pretrained(model_name)
76
+ model = BertForTokenClassification.from_pretrained(model_name)
77
+ model.eval()
78
+
79
+ def predict_tokens(text):
80
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
81
+ with torch.no_grad():
82
+ outputs = model(**inputs).logits
83
+ predictions = torch.argmax(outputs, dim=2)
84
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
85
+ labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
86
+ return list(zip(tokens, labels))
87
+
88
+ # Test example
89
+ print(predict_tokens("Barack Obama visited Google in California."))
90
+
91
+ ```
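
The function above returns one label per subword token (including `[CLS]`/`[SEP]`). If you prefer grouped, word-level entity spans, the token-classification pipeline can aggregate subwords; a short sketch with the same checkpoint (the output shown in the comment is indicative only):

```python
# Alternative: let the pipeline merge subword pieces into whole entity spans.
from transformers import pipeline

ner = pipeline("token-classification",
               model="AventIQ-AI/ner_bert_conll2003",
               aggregation_strategy="simple")

print(ner("Barack Obama visited Google in California."))
# e.g. [{'entity_group': 'PER', 'word': 'Barack Obama', ...}, ...]
```
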
92
+ ---
93
+ 🧩 Quantization
94
+
95
+ Post-training static quantization applied using PyTorch to reduce model size and improve inference performance on edge devices.
96
+
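The exact quantization recipe is not reproduced here. As a rough illustration of post-training quantization in PyTorch, the dynamic-quantization call below converts the model's `Linear` layers to int8; it is a sketch, not necessarily the procedure used to produce the quantized files in this repo, and the output filename is hypothetical:

```python
# Rough illustration only: PyTorch dynamic quantization of the Linear layers to int8.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "ner_bert_conll2003_int8.pt")  # hypothetical filename
```
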
97
+ ---
98
+ πŸ—‚ Repository Structure
99
+ ```
100
+ .
101
+ β”œβ”€β”€ model/ # Quantized model files
102
+ β”œβ”€β”€ tokenizer_config/ # Tokenizer and vocab files
103
+ β”œβ”€β”€ model.safensors/ # Fine-tuned model in safetensors format
104
+ β”œβ”€β”€ README.md # Model card
105
+
106
+ ```
107
+ ---
108
+ 🀝 Contributing
109
+
110
+ Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.