
Medical Entity Extraction with BERT

📌 Overview

This repository hosts a quantized version of the bert-base-cased model, fine-tuned for Medical Entity Extraction on the tner/bc5cdr dataset. The model is designed to recognize entities related to Disease, Symptom, and Drug. It has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

πŸ— Model Details

  • Model Architecture: BERT Base Cased
  • Task: Medical Entity Extraction
  • Dataset: Hugging Face's tner/bc5cdr
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

🚀 Usage

Installation

pip install transformers torch

Loading the Model

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/bert-medical-entity-extraction"
model = BertForTokenClassification.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)

Named Entity Recognition Inference

from transformers import pipeline

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, device=device)  # reuse the model loaded above
test_sentence = "An overdose of Ibuprofen can lead to severe gastric issues."
ner_results = ner_pipeline(test_sentence)
label_map = {
    "LABEL_0": "O",  # Outside (not an entity)
    "LABEL_1": "Drug",
    "LABEL_2": "Disease",
    "LABEL_3": "Symptom",
    "LABEL_4": "Treatment"
}

def merge_tokens(ner_results):
    merged_entities = []
    current_word = ""
    current_label = ""
    current_score = 0
    count = 0

    for entity in ner_results:
        word = entity["word"]
        label = entity["entity"]  # Model's output (e.g., LABEL_1, LABEL_2)
        score = entity["score"]

        # Merge subwords
        if word.startswith("##"):
            current_word += word[2:]  # Remove '##' and append
            current_score += score
            count += 1
        else:
            if current_word:  # Store the previous merged word
                mapped_label = label_map.get(current_label, "Unknown")
                merged_entities.append((current_word, mapped_label, current_score / count))
            current_word = word
            current_label = label
            current_score = score
            count = 1

    # Add the last word
    if current_word:
        mapped_label = label_map.get(current_label, "Unknown")
        merged_entities.append((current_word, mapped_label, current_score / count))

    return merged_entities

print("\n🩺 Medical NER Predictions:")
for word, label, score in merge_tokens(ner_results):
    if label != "O":  # Skip non-entities
        print(f"🔹 Entity: {word} | Category: {label} | Score: {score:.4f}")

🔹 Labeling Scheme (BIO Format)

  • B-XYZ (Beginning): Indicates the beginning of an entity of type XYZ (e.g., B-PER for the beginning of a person's name).
  • I-XYZ (Inside): Represents subsequent tokens inside an entity (e.g., I-PER for the second part of a person's name).
  • O (Outside): Denotes tokens that are not part of any named entity.
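
The merge_tokens helper above only collapses subword pieces; for BIO-tagged output in general, grouping tokens into entity spans might look like the minimal sketch below. The tokens and tags are illustrative, not output from this model:

def bio_to_spans(tokens, tags):
    """Group (token, BIO-tag) pairs into (label, entity text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [token]]          # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)              # continue the current entity
        else:
            if current:
                spans.append(current)
            current = None                        # "O" or inconsistent tag
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(bio_to_spans(
    ["Ibuprofen", "can", "cause", "gastric", "bleeding"],
    ["B-Drug", "O", "O", "B-Disease", "I-Disease"],
))
# [('Drug', 'Ibuprofen'), ('Disease', 'gastric bleeding')]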

📊 Evaluation Results for Quantized Model

🔹 Overall Performance

  • Accuracy: 93.27% ✅
  • Precision: 92.31%
  • Recall: 93.27%
  • F1 Score: 92.31%

🔹 Performance by Entity Type

| Entity Type | Precision | Recall | F1 Score | Number of Entities |
|-------------|-----------|--------|----------|--------------------|
| Disease     | 91.46%    | 92.07% | 91.76%   | 3,000              |
| Drug        | 71.25%    | 72.83% | 72.03%   | 1,266              |
| Symptom     | 89.83%    | 93.02% | 91.40%   | 3,524              |
| Treatment   | 88.83%    | 92.02% | 90.40%   | 3,124              |
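
For reference, entity-level scores like those above are conventionally computed with the seqeval library (pip install seqeval; it is not a dependency of this repo). A minimal sketch with illustrative tag sequences:

from seqeval.metrics import classification_report

# Gold and predicted tag sequences, one inner list per sentence.
# These sequences are illustrative only.
y_true = [["B-Disease", "I-Disease", "O", "B-Drug"]]
y_pred = [["B-Disease", "I-Disease", "O", "O"]]

print(classification_report(y_true, y_pred))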

⏳ Inference Speed Metrics

  • Total Evaluation Time: 15.89 sec
  • Samples Processed per Second: 217.26
  • Steps per Second: 27.18
  • Epochs Completed: 3
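
A rough way to reproduce a throughput figure like the one above, reusing the ner_pipeline defined earlier (batching, sequence length, and hardware all affect the number):

import time

# Illustrative workload: one sentence repeated 100 times.
sentences = ["An overdose of Ibuprofen can lead to severe gastric issues."] * 100

start = time.perf_counter()
for sentence in sentences:
    ner_pipeline(sentence)
elapsed = time.perf_counter() - start

print(f"Processed {len(sentences) / elapsed:.2f} samples/sec in {elapsed:.2f} sec")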

Fine-Tuning Details

Dataset

The Hugging Face tner/bc5cdr dataset was used; it contains texts annotated with NER tags.

📊 Training Details

  • Number of epochs: 3
  • Batch size: 8
  • Evaluation strategy: epoch
  • Learning Rate: 2e-5
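
A minimal sketch of a Hugging Face Trainer setup matching these hyperparameters; train_ds and eval_ds are placeholders for the tokenized, label-aligned tner/bc5cdr splits (preprocessing elided):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="bert-medical-ner",     # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)

# train_ds / eval_ds are placeholders for the preprocessed dataset splits.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()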

⚡ Quantization

Post-training quantization to float16 was applied using PyTorch's built-in support to reduce the model size and improve inference efficiency.
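
The tensor type listed in the model details is F16; a minimal sketch of one way to produce such a checkpoint is half-precision casting via PyTorch, saved in safetensors format. The exact procedure used for this repository is not documented, and the input path below is a placeholder:

from transformers import BertForTokenClassification

# "path/to/finetuned-model" is a placeholder for the full-precision checkpoint.
model = BertForTokenClassification.from_pretrained("path/to/finetuned-model")
model = model.half()                       # cast all weights to torch.float16
model.save_pretrained("quantized-model")   # writes model.safetensors in F16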


📂 Repository Structure

.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation

⚠️ Limitations

  • The model may not generalize well to domains outside the fine-tuning dataset.
  • Quantization may result in minor accuracy degradation compared to full-precision models.

🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
