MediAlbertina

The first publicly available medical language models trained with real European Portuguese data.

MediAlbertina is a family of DeBERTaV2-based encoders from the BERT family, obtained by continuing the pre-training of PORTULAN's Albertina models on Electronic Medical Records shared by Portugal's largest public hospital.

Like its predecessors, MediAlbertina models are distributed under the MIT license.

Model Description

MediAlbertina PT-PT 1.5B was created through domain adaptation of Albertina PT-PT 1.5B on real European Portuguese EMRs, using masked language modeling. It was evaluated by fine-tuning on two Information Extraction (IE) tasks, Named Entity Recognition (NER) and Assertion Status (AStatus), using more than 10k manually annotated entities belonging to the following classes: Diagnosis, Symptom, Vital Sign, Result, Medical Procedure, Medication, Dosage, and Progress. In both tasks, MediAlbertina outperformed its predecessors in nearly every setting reported here (the exception is the Progress NER model at 1.5B), demonstrating the effectiveness of this domain adaptation and its potential for medical AI in Portugal.
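As a rough illustration of the fine-tuning setup described above, the sketch below builds a token-classification label set from the eight entity classes and attaches it to the encoder. The BIO tagging scheme, the label spellings, and the helper names (`build_bio_labels`, `load_for_ner`) are assumptions made for illustration; this card does not state which tagging scheme was used. The repository id is the one from the usage example in this card.

```python
# Entity classes listed in the model description.
ENTITY_CLASSES = [
    "Diagnosis", "Symptom", "Vital Sign", "Result",
    "Medical Procedure", "Medication", "Dosage", "Progress",
]


def build_bio_labels(classes):
    """Return 'O' plus a B-/I- tag pair per class (BIO scheme is an assumption)."""
    labels = ["O"]
    for name in classes:
        tag = name.upper().replace(" ", "_")
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


LABELS = build_bio_labels(ENTITY_CLASSES)
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def load_for_ner(model_id="portugueseNLP/medialbertina_pt-pt_1.5b"):
    """Load the tokenizer and the encoder with a new token-classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Calling `load_for_ner()` downloads the 1.5B checkpoint; the classification head it returns is randomly initialized and still has to be trained on annotated data.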

Model	NER Single Model	NER Multi-Models (Diag+Symp)	NER Multi-Models (Med+Dos)	NER Multi-Models (MP+VS+R)	NER Multi-Models (Prog)	Assertion Status (Diag)	Assertion Status (Symp)	Assertion Status (Med)
	F1-score	F1-score	F1-score	F1-score	F1-score	F1-score	F1-score	F1-score
Albertina PT-PT 900M	0.813	0.771	0.886	0.777	0.784	0.703	0.803	0.556
Albertina PT-PT 1.5B	0.838	0.801	0.924	0.836	0.877	0.772	0.881	0.862
MediAlbertina PT-PT 900M	0.832	0.801	0.916	0.810	0.864	0.722	0.823	0.723
MediAlbertina PT-PT 1.5B	0.843	0.813	0.926	0.851	0.858	0.789	0.886	0.868
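To make the comparison easier to read, the short sketch below computes the F1 gain of each MediAlbertina model over the Albertina model of the same size. The scores are copied from the table; the shortened column names and the helper `f1_gains` are introduced here for illustration only.

```python
# Column order follows the results table (names shortened for illustration).
COLUMNS = [
    "NER Single", "NER Diag+Symp", "NER Med+Dos", "NER MP+VS+R",
    "NER Prog", "AStatus Diag", "AStatus Symp", "AStatus Med",
]

# F1-scores copied from the results table.
SCORES = {
    "Albertina PT-PT 900M": [0.813, 0.771, 0.886, 0.777, 0.784, 0.703, 0.803, 0.556],
    "Albertina PT-PT 1.5B": [0.838, 0.801, 0.924, 0.836, 0.877, 0.772, 0.881, 0.862],
    "MediAlbertina PT-PT 900M": [0.832, 0.801, 0.916, 0.810, 0.864, 0.722, 0.823, 0.723],
    "MediAlbertina PT-PT 1.5B": [0.843, 0.813, 0.926, 0.851, 0.858, 0.789, 0.886, 0.868],
}


def f1_gains(size):
    """Per-column F1 difference: MediAlbertina minus Albertina at the same size."""
    base = SCORES[f"Albertina PT-PT {size}"]
    adapted = SCORES[f"MediAlbertina PT-PT {size}"]
    return {col: round(a - b, 3) for col, b, a in zip(COLUMNS, base, adapted)}


for size in ("900M", "1.5B"):
    print(size, f1_gains(size))
```

On these numbers, the 900M model improves in every column, and the 1.5B model improves in every column except NER Prog.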

Data

MediAlbertina PT-PT 1.5B was trained on more than 15M sentences and 300M tokens from 2.6M fully anonymized and unique Electronic Medical Records (EMRs) from Portugal's largest public hospital. This data was acquired under the framework of the FCT project DSAIPA/AI/0122/2020 AIMHealth-Mobile Applications Based on Artificial Intelligence.

How to use

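from transformers import pipeline

# Load the model into a fill-mask pipeline (downloads the weights on first use).
unmasker = pipeline('fill-mask', model='portugueseNLP/medialbertina_pt-pt_1.5b')
# Predict the most likely tokens for the [MASK] position.
unmasker("Analgesia com morfina em perfusão (15 [MASK]/kg/h)")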
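The pipeline call above returns a list of candidate fills, each a dictionary with keys such as `token_str` and `score`. As a minimal sketch, the hypothetical helpers below (`load_unmasker` and `top_candidates` are not part of the model or of transformers) reduce that output to the top few tokens with their probabilities. The sketch assumes the input contains a single `[MASK]`; with several masks the pipeline returns one list per mask.

```python
def load_unmasker(model_id="portugueseNLP/medialbertina_pt-pt_1.5b"):
    """Build the fill-mask pipeline for the given model repository."""
    # Imported lazily so top_candidates can be used with any compatible callable.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_candidates(unmasker, text, k=5):
    """Return (token, score) pairs for the k most likely fills of a single [MASK]."""
    predictions = unmasker(text, top_k=k)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]
```

For example, `top_candidates(load_unmasker(), "Analgesia com morfina em perfusão (15 [MASK]/kg/h)")` lists the five most probable tokens for the masked unit.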

Citation

MediAlbertina is developed by a joint team from ISCTE-IUL, Portugal, and Select Data, CA, USA. For a fully detailed description, see the corresponding publication:

@article{MediAlbertina_PT-PT,
      title={MediAlbertina: An European Portuguese medical language model},
      author={Miguel Nunes and João Boné and João Ferreira
              and Pedro Chaves and Luís Elvas},
      year={2024},
      journal={Computers in Biology and Medicine},
      volume={182},
      url={https://doi.org/10.1016/j.compbiomed.2024.109233}
}

Please use the above canonical reference when using or citing this model.

Acknowledgements

This work was financially supported by Project Blockchain.PT – Decentralize Portugal with Blockchain Agenda (Project no. 51), WP2, Call no. 02/C05-i01.01/2022, funded by the Portuguese Recovery and Resilience Program (PRR), the Portuguese Republic, and the European Union (EU) under the framework of the Next Generation EU Program.
