Upload 7 files

Files changed (7)

README_NER_Model.md +144 -0
config.json +48 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +59 -0
vocab.txt +0 -0

README_NER_Model.md ADDED

	@@ -0,0 +1,144 @@

+# BERT-Based Named Entity Recognition (NER) Model
+This repository contains a BERT-base-cased model fine-tuned for Named Entity Recognition (NER) on the WNUT-17 dataset. The model was trained with the Hugging Face Transformers and Datasets libraries. This README covers training, inference, and conversion to float16 for deployment in resource-constrained environments.
+---
+## Model Details
+- **Model Name:** BERT-Base-Cased NER
+- **Model Architecture:** BERT Base
+- **Task:** Named Entity Recognition (NER)
+- **Dataset:** WNUT-17 (from Hugging Face Datasets)
+- **Quantization:** Float16 (half-precision cast)
+- **Fine-tuning Framework:** Hugging Face Transformers
+---
+## Usage
+### Installation
+```bash
+pip install transformers datasets evaluate seqeval scikit-learn torch
+```
+### Training the Model
+```python
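+from transformers import Trainer, TrainingArguments
+# Hyperparameters follow the "Training Configuration" section of this README.
+# output_dir is an assumption taken from the ner_output/ entry in "Repository Structure".
+training_args = TrainingArguments(
+    output_dir="./ner_output",
+    num_train_epochs=3,
+    per_device_train_batch_size=16,
+    learning_rate=2e-5,
+    eval_strategy="epoch",  # called evaluation_strategy in older Transformers releases
+)
+# model, tokenizer, tokenized_datasets, data_collator and compute_metrics
+# must already be defined before this point.
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation"],
+    tokenizer=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics
+)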
+trainer.train()
+```
+### Saving the Model
+```python
+model.save_pretrained("./saved_model")
+tokenizer.save_pretrained("./saved_model")
+```
+### Testing the Saved Model
+```python
+from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
+model = AutoModelForTokenClassification.from_pretrained("./saved_model")
+tokenizer = AutoTokenizer.from_pretrained("./saved_model")
+ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+sample_sentences = [
+    "Barack Obama visited Microsoft headquarters in Redmond.",
+    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
+    "Google is launching a new AI product in California."
+]
+for sentence in sample_sentences:
+    print(f"Sentence: {sentence}")
+    print(ner_pipeline(sentence))
+```
+### Quantizing the Model
+```python
+import torch
+quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
+quantized_model.save_pretrained("quantized-model")
+tokenizer.save_pretrained("quantized-model")
+```
+### Testing the Quantized Model
+```python
+import torch
+from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
+model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
+tokenizer = AutoTokenizer.from_pretrained("quantized-model")
+ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+```
+---
+## Performance Metrics
+- **Accuracy:** Evaluated with seqeval on the validation split
+- **Precision, Recall, F1 Score:** Computed with seqeval from per-token label predictions, excluding positions marked with the ignore index
+---
+## Fine-Tuning Details
+### Dataset
+The model was fine-tuned on the WNUT-17 dataset, a benchmark for emerging and rare named entities. Preprocessing includes:
+- Tokenization with the BERT tokenizer
+- Alignment of word-level labels to wordpiece tokens
+### Training Configuration
+- **Epochs:** 3
+- **Batch Size:** 16
+- **Learning Rate:** 2e-5
+- **Max Length:** 128 tokens (implicitly handled by tokenizer)
+- **Evaluation Strategy:** Per epoch
+### Quantization
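+The model weights were cast to half precision (float16) with PyTorch. This roughly halves the memory footprint relative to float32 and can speed up inference on hardware with native float16 support. Note that this is a precision reduction rather than integer quantization.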
+---
+## Repository Structure
+```
+.
+├── saved_model/           # Fine-Tuned BERT Model and Tokenizer
+├── quantized-model/       # Quantized Model for Deployment
+├── ner_output/            # Training Logs and Checkpoints
+└── README.md              # Documentation
+```
+---
+## Limitations
+- May not generalize well to domains or entity types outside those covered by WNUT-17
+- The float16 model may lose a small amount of accuracy in exchange for lower memory use and faster inference
+---
+## Contributing
+Contributions are welcome! Please open an issue or pull request for improvements, bug fixes, or feature additions.
+---
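
The README's preprocessing notes mention aligning labels to wordpiece tokens and computing metrics that exclude ignored indices, but the diff does not include that code. The following is a minimal sketch of one common approach, not necessarily the author's. It assumes the tokenizer supplies one word index per token (as `word_ids()` does for Hugging Face fast tokenizers) and that `-100` is the ignore index. The names `align_labels` and `strip_ignored` are hypothetical.

```python
IGNORE_INDEX = -100  # assumption: the default ignore index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label ids onto wordpiece tokens.

    word_ids: one entry per token; None for special tokens, else the word index.
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] are never scored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First wordpiece of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later wordpieces are ignored unless the caller opts in.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


def strip_ignored(predictions, labels):
    """Drop positions whose gold label is IGNORE_INDEX before scoring."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    return [p for p, _ in kept], [l for _, l in kept]


# Tokens "[CLS] Red ##mond rocks [SEP]" with word-level labels [7, 0] (B-LOC, O).
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [7, 0]))  # [-100, 7, -100, 0, -100]
```

Ignoring the trailing wordpieces keeps each word from being counted more than once in the loss and in the seqeval scores; passing `label_all_subwords=True` is the usual alternative.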

config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "_num_labels": 9,
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "O",
+    "1": "B-MISC",
+    "2": "I-MISC",
+    "3": "B-PER",
+    "4": "I-PER",
+    "5": "B-ORG",
+    "6": "I-ORG",
+    "7": "B-LOC",
+    "8": "I-LOC"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B-LOC": 7,
+    "B-MISC": 1,
+    "B-ORG": 5,
+    "B-PER": 3,
+    "I-LOC": 8,
+    "I-MISC": 2,
+    "I-ORG": 6,
+    "I-PER": 4,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float16",
+  "transformers_version": "4.51.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 28996
+}
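
The `id2label` map above follows a BIO scheme with four entity types (MISC, PER, ORG, LOC). As a rough illustration of how predicted class ids become entity spans, here is a small sketch. `decode_entities` is a hypothetical helper, not part of Transformers, and the label ids in the example are assigned by hand rather than produced by the model. The pipeline's own `aggregation_strategy` logic is more involved.

```python
# Copied from the id2label block of config.json above.
ID2LABEL = {
    0: "O",
    1: "B-MISC", 2: "I-MISC",
    3: "B-PER", 4: "I-PER",
    5: "B-ORG", 6: "I-ORG",
    7: "B-LOC", 8: "I-LOC",
}


def decode_entities(words, label_ids, id2label=ID2LABEL):
    """Group word-level BIO label ids into (entity_type, text) spans."""
    entities = []
    current = None  # (entity_type, [words]) for the span being built
    for word, label_id in zip(words, label_ids):
        label = id2label[label_id]
        if label == "O":
            current = None
            continue
        prefix, entity_type = label.split("-", 1)
        # Start a new span on B-, or on an I- tag that does not continue the open span.
        if prefix == "B" or current is None or current[0] != entity_type:
            current = (entity_type, [word])
            entities.append(current)
        else:
            current[1].append(word)
    return [(entity_type, " ".join(span)) for entity_type, span in entities]


words = ["Barack", "Obama", "visited", "Microsoft", "in", "Redmond"]
label_ids = [3, 4, 0, 5, 0, 7]  # hand-assigned for illustration
print(decode_entities(words, label_ids))
# [('PER', 'Barack Obama'), ('ORG', 'Microsoft'), ('LOC', 'Redmond')]
```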

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5c5d159e14df95b39629e99909509193807b026b0efa977704d7068833fce608
+size 215476426
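
As a sanity check on the float16 claim, the file size in the LFS pointer can be compared with a parameter count derived from `config.json`. The sketch below is a back-of-the-envelope estimate under stated assumptions: the model has no pooler layer (which is how `BertForTokenClassification` builds its encoder), every stored tensor is float16 at 2 bytes per value, and the small remainder is presumably the safetensors header. The function name is hypothetical.

```python
def bert_token_classifier_params(vocab_size, hidden_size, intermediate_size,
                                 num_hidden_layers, max_position_embeddings,
                                 type_vocab_size, num_labels):
    """Count trainable parameters of a BERT encoder plus a linear token classifier."""
    layer_norm = 2 * hidden_size  # weight and bias
    embeddings = (
        (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
        + layer_norm
    )
    # Query, key, value and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers (up and down projection) with biases, plus LayerNorm.
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)
    classifier = hidden_size * num_labels + num_labels
    return embeddings + encoder + classifier


# Values taken from config.json above.
params = bert_token_classifier_params(
    vocab_size=28996, hidden_size=768, intermediate_size=3072,
    num_hidden_layers=12, max_position_embeddings=512,
    type_vocab_size=2, num_labels=9,
)
weight_bytes = params * 2   # float16 stores 2 bytes per parameter
file_bytes = 215476426      # size reported in the LFS pointer above

print(params)                     # 107726601
print(weight_bytes)               # 215453202
print(file_bytes - weight_bytes)  # 23224
```

The estimate lands about 23 KB under the reported size, which is consistent with the weights being stored in half precision; the same parameters in float32 would take roughly twice as much space.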

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,59 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_len": 512,
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
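
To make the special-token ids and `model_max_length` above concrete, here is a simplified sketch of how a single sequence is typically framed for BERT: `[CLS]` in front, `[SEP]` at the end, truncation to the maximum length, and `[PAD]` with a zero attention mask for padding. `build_inputs` is a hypothetical helper for illustration only; in practice the tokenizer does this itself, and the wordpiece ids in the example are arbitrary placeholders.

```python
# Ids taken from added_tokens_decoder in tokenizer_config.json above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
MODEL_MAX_LENGTH = 512  # model_max_length in tokenizer_config.json


def build_inputs(token_ids, pad_to=None, max_length=MODEL_MAX_LENGTH):
    """Frame wordpiece ids as [CLS] ... [SEP], truncating and padding as needed."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding  # padded positions are masked out
    return input_ids, attention_mask


ids, mask = build_inputs([1000, 1001, 1002], pad_to=8)  # placeholder wordpiece ids
print(ids)   # [101, 1000, 1001, 1002, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```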

vocab.txt ADDED

The diff for this file is too large to render.