---
license: apache-2.0
datasets:
- rigonsallauka/portugese_ner_dataset
language:
- pt
metrics:
- f1
- precision
- recall
- confusion_matrix
base_model:
- google-bert/bert-base-cased
pipeline_tag: token-classification
tags:
- NER
- medical
- symptoms
- extraction
- portugese
---

# Portuguese Medical NER

## Use

- **Primary Use Case**: This model extracts medical entities such as symptoms, diagnostic tests, and treatments from clinical text in Portuguese.
- **Applications**: Suitable for healthcare professionals, clinical data analysis, and research into medical text processing.
- **Supported Entity Types**:
  - `PROBLEM`: Diseases, symptoms, and medical conditions.
  - `TEST`: Diagnostic procedures and laboratory tests.
  - `TREATMENT`: Medications, therapies, and other medical interventions.

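Token-classification models emit one tag per token, so downstream code usually has to merge those tags into entity spans. The following is a minimal sketch of that step, assuming the labels follow the common `B-`/`I-`/`O` prefix scheme over the three entity types listed above (the card does not state the exact label names). `group_entities` is a hypothetical helper, not part of any library, and the tags in the example are written by hand rather than produced by the model.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO tags into (entity_type, text) spans.

    Assumes tags look like "B-PROBLEM", "I-PROBLEM", or "O"; this label
    format is an assumption, not something the model card confirms.
    """
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            current_tokens.append(token)  # continue the open span
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            current_type, current_tokens = label, [token]  # open a new span
        else:
            current_type, current_tokens = None, []  # "O" or an unknown tag
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example tags, for illustration only.
example_tokens = ["fortes", "dores", "de", "cabeça", "e", "náusea", "paracetamol"]
example_tags = ["B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O",
                "B-PROBLEM", "B-TREATMENT"]
example_spans = group_entities(example_tokens, example_tags)
print(example_spans)
```

An `I-` tag whose type differs from the open span is treated as the start of a new span, which is one common way to tolerate inconsistent tag sequences.
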
## Training Data

- **Data Sources**: Annotated datasets, including clinical data and translations of English medical text into Portuguese.
- **Data Augmentation**: Data augmentation was applied to the training set to improve the model's ability to generalize to different text structures.
- **Dataset Split**:
  - **Training Set**: 80%
  - **Validation Set**: 10%
  - **Test Set**: 10%

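An 80/10/10 split like the one listed above can be sketched as follows. The card does not say how the split was produced, so the shuffling, the fixed seed, and the helper name `split_80_10_10` are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into train/validation/test (80/10/10).

    The seed and the shuffle-then-slice strategy are assumptions; the
    model card only states the proportions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
```
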
## Model Training

- **Training Configuration**:
  - **Optimizer**: AdamW
  - **Learning Rate**: 3e-5
  - **Batch Size**: 64
  - **Epochs**: 200
  - **Loss Function**: Focal Loss to handle class imbalance
- **Frameworks**: PyTorch, Hugging Face Transformers, SimpleTransformers

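Focal loss addresses class imbalance by scaling the cross-entropy term by `(1 - p_t) ** gamma`, where `p_t` is the probability the model assigns to the true class, so tokens that are already classified confidently contribute little to the loss. The sketch below shows the formula for a single token in plain Python. The value `gamma = 2.0` is an assumption (the card does not state the focusing parameter or any class weights), and this is an illustration of the formula rather than the training code used for this model.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token, given the probability of its true class.

    gamma=2.0 is an assumed value; the card does not state the one used.
    With gamma=0 this reduces to ordinary cross-entropy, -log(p_true).
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


easy = focal_loss(0.95)  # confidently correct token, e.g. from a majority class
hard = focal_loss(0.30)  # poorly classified token, e.g. from a rare class
print(f"easy={easy:.5f}  hard={hard:.5f}")
```

Compared with plain cross-entropy, the easy token's loss shrinks by a factor of `(1 - 0.95) ** 2`, while the hard token's loss shrinks far less, which shifts the gradient signal toward the rarer, harder classes.
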
## Evaluation Metrics

- eval_loss = 0.34290624315439794
- f1_score = 0.7720704622812219
- precision = 0.7724936121316581
- recall = 0.7716477757556993

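The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A quick sanity check using only the numbers listed above:

```python
precision = 0.7724936121316581
recall = 0.7716477757556993
reported_f1 = 0.7720704622812219

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1 = {f1:.6f}")
assert abs(f1 - reported_f1) < 1e-6  # agrees with the reported value
```
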
## How to Use

Load the model with the Hugging Face `transformers` library and run inference as follows:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rigonsallauka/portugese_medical_ner"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Sample text for inference
text = "O paciente reclamou de fortes dores de cabeça e náusea que persistiram por dois dias. Para aliviar os sintomas, foi prescrito paracetamol e recomendado descansar e beber bastante líquidos."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")