# Medical Entity Extraction with BERT
## 📌 Overview
This repository hosts a quantized version of the `bert-base-cased` model fine-tuned for Medical Entity Extraction on the `tner/bc5cdr` dataset. The model is designed to recognize entities related to **Disease, Drug, Symptom, and Treatment**. It has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
## 🏗 Model Details
- **Model Architecture**: BERT Base Cased
- **Task**: Medical Entity Extraction
- **Dataset**: Hugging Face's `tner/bc5cdr`
- **Quantization**: Float16
- **Fine-tuning Framework**: Hugging Face Transformers
---
## 🚀 Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Model
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "AventIQ-AI/bert-medical-entity-extraction"
model = BertForTokenClassification.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)
```
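To see which label names the checkpoint actually exposes (useful before applying the label map in the next section), the model config can be inspected:
```python
# Optional: inspect the id-to-label mapping shipped with the checkpoint
print(model.config.id2label)
```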
### Named Entity Recognition Inference
```python
from transformers import pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, device=device)
test_sentence = "An overdose of Ibuprofen can lead to severe gastric issues."
ner_results = ner_pipeline(test_sentence)
label_map = {
    "LABEL_0": "O",  # Outside (not an entity)
    "LABEL_1": "Drug",
    "LABEL_2": "Disease",
    "LABEL_3": "Symptom",
    "LABEL_4": "Treatment",
}
def merge_tokens(ner_results):
    merged_entities = []
    current_word = ""
    current_label = ""
    current_score = 0
    count = 0
    for entity in ner_results:
        word = entity["word"]
        label = entity["entity"]  # Model's output (e.g., LABEL_1, LABEL_2)
        score = entity["score"]
        # Merge subwords
        if word.startswith("##"):
            current_word += word[2:]  # Remove '##' and append
            current_score += score
            count += 1
        else:
            if current_word:  # Store the previous merged word
                mapped_label = label_map.get(current_label, "Unknown")
                merged_entities.append((current_word, mapped_label, current_score / count))
            current_word = word
            current_label = label
            current_score = score
            count = 1
    # Add the last word
    if current_word:
        mapped_label = label_map.get(current_label, "Unknown")
        merged_entities.append((current_word, mapped_label, current_score / count))
    return merged_entities

print("\n🩺 Medical NER Predictions:")
for word, label, score in merge_tokens(ner_results):
    if label != "O":  # Skip non-entities
        print(f"🔹 Entity: {word} | Category: {label} | Score: {score:.4f}")
```
### **🔹 Labeling Scheme (BIO Format)**
- **B-XYZ (Beginning)**: Indicates the beginning of an entity of type XYZ (e.g., B-PER for the beginning of a person's name).
- **I-XYZ (Inside)**: Represents subsequent tokens inside an entity (e.g., I-PER for the second part of a person's name).
- **O (Outside)**: Denotes tokens that are not part of any named entity.
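For illustration, a BIO-tagged version of a short sentence could look like the following. The tag names here are examples only; the hosted checkpoint itself emits `LABEL_0` … `LABEL_4`, which are mapped as shown earlier.
```python
# Illustrative BIO tagging of a short sentence (not model output)
tokens   = ["Ibuprofen", "overdose", "can", "cause", "gastric",   "bleeding"]
bio_tags = ["B-Drug",    "O",        "O",   "O",     "B-Disease", "I-Disease"]

for token, tag in zip(tokens, bio_tags):
    print(f"{token:<10} {tag}")
```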
---
## 📊 Evaluation Results for Quantized Model
### **🔹 Overall Performance**
- **Accuracy**: **93.27%** ✅
- **Precision**: **92.31%**
- **Recall**: **93.27%**
- **F1 Score**: **92.31%**
---
### **🔹 Performance by Entity Type**
| Entity Type | Precision | Recall | F1 Score | Number of Entities |
|------------|-----------|--------|----------|--------------------|
| **Disease** | **91.46%** | **92.07%** | **91.76%** | 3,000 |
| **Drug** | **71.25%** | **72.83%** | **72.03%** | 1,266 |
| **Symptom** | **89.83%** | **93.02%** | **91.40%** | 3,524 |
| **Treatment** | **88.83%** | **92.02%** | **90.40%** | 3,124 |
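Entity-level precision, recall, and F1 scores like those above are typically computed with `seqeval`; the snippet below is a minimal illustration with toy tag sequences, not the evaluation script used for this model.
```python
from seqeval.metrics import classification_report

# Toy gold vs. predicted BIO tag sequences (illustrative only)
y_true = [["B-Drug", "O", "O", "B-Disease", "I-Disease"]]
y_pred = [["B-Drug", "O", "O", "B-Disease", "O"]]

print(classification_report(y_true, y_pred, digits=4))
```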
---
#### ⏳ **Inference Speed Metrics**
- **Total Evaluation Time**: 15.89 sec
- **Samples Processed per Second**: 217.26
- **Steps per Second**: 27.18
- **Epochs Completed**: 3
---
## Fine-Tuning Details
### Dataset
The Hugging Face `tner/bc5cdr` dataset was used; it contains biomedical texts annotated with their NER tags.
### 🔧 Training Details
- **Number of epochs**: 3
- **Batch size**: 8
- **Evaluation strategy**: epoch
- **Learning Rate**: 2e-5
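A minimal fine-tuning sketch using these hyperparameters is shown below. It is illustrative rather than the exact training script: the dataset field names (`tokens`, `tags`), split names, output path, and label count are assumptions based on the `tner/bc5cdr` schema and the label map above.
```python
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

dataset = load_dataset("tner/bc5cdr")  # assumed fields: "tokens", "tags"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=5)

def tokenize_and_align_labels(batch):
    # Tokenize pre-split words and align word-level tags to sub-word tokens;
    # special tokens and continuation pieces are masked with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["tags"]):
        word_ids = enc.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            labels.append(-100 if word_id is None or word_id == previous else tags[word_id])
            previous = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="bert-medical-ner",       # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```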
### ⚡ Quantization
Post-training quantization to float16 was applied using PyTorch's built-in half-precision support to reduce the model size and improve inference efficiency.
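A minimal sketch of how such float16 post-training quantization can be applied is shown below; the checkpoint paths are illustrative, and the approach assumes the float16 conversion listed in the model details above.
```python
from transformers import BertForTokenClassification

# Load the fine-tuned full-precision checkpoint (path is illustrative)
model = BertForTokenClassification.from_pretrained("bert-medical-ner")

# Convert all weights to float16 (half precision), roughly halving memory and disk footprint
model = model.half()

# Save the quantized model for deployment
model.save_pretrained("bert-medical-ner-fp16")
```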
---
## 📂 Repository Structure
```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```
---
## ⚠️ Limitations
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.
---
## 🤝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.