Dutch Biomedical Entity Linking
Summary
- RoBERTa-based base model trained from scratch on Dutch hospital notes (medRoBERTa.nl).
- Second-phase pretraining using self-alignment on a UMLS-derived Dutch biomedical ontology (see the sketch below the summary).
- Fine-tuned on an automatically generated, weakly labelled corpus from Wikipedia.
- Evaluation results on the Mantra GSC corpus can be found in the report.
All code for generating the training data, training the model, and evaluating it can be found in the GitHub repository.
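Self-alignment pretraining pulls together names that refer to the same UMLS concept. The snippet below is only a minimal illustration using an InfoNCE-style objective on hypothetical synonym pairs; the base model name, the example pairs, and the exact loss are assumptions and may differ from the actual training code in the repository.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# assumed base model name; the actual run starts from medRoBERTa.nl
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModel.from_pretrained("CLTL/MedRoBERTa.nl")

def embed(names):
    toks = tokenizer(names, padding=True, truncation=True, max_length=25, return_tensors="pt")
    return model(**toks)[0][:, 0, :]  # CLS representations

# hypothetical synonym pairs that share a UMLS CUI
pairs = [("hartaanval", "myocardinfarct"), ("versnelde ademhaling", "tachypneu")]
a = F.normalize(embed([p for p, _ in pairs]), dim=-1)
b = F.normalize(embed([q for _, q in pairs]), dim=-1)

logits = a @ b.T / 0.07                # temperature-scaled similarity matrix
labels = torch.arange(len(pairs))      # the matching synonym is the positive
loss = F.cross_entropy(logits, labels) # other pairs in the batch act as negatives
loss.backward()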
Usage
The following script (adapted from the original SapBERT repository) computes embeddings for a list of input entities (strings):
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking")
model = AutoModel.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking").cuda()

# replace with your own list of entity names
dutch_biomedical_entities = ["versnelde ademhaling", "Coronavirus infectie", "aandachtstekort/hyperactiviteitstoornis", "hartaanval"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(dutch_biomedical_entities), bs)):
    toks = tokenizer.batch_encode_plus(dutch_biomedical_entities[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()  # move input tensors to GPU
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
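As a quick sanity check (not part of the original script), the embeddings can be compared directly with cosine similarity; the sketch below scores the first entity in the list against the remaining ones using the all_embs array produced above.

# cosine similarity between the first entity and the remaining ones
normed = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
sims = normed[1:] @ normed[0]
for entity, sim in zip(dutch_biomedical_entities[1:], sims):
    print(f"{entity}: {sim:.3f}")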
For (Dutch) biomedical entity linking, the following steps should be performed:
- Request a UMLS (and SNOMED NL) license
- Precompute embeddings for all entities in the UMLS with the fine-tuned model
- Compute the embedding of the new, unseen mention with the fine-tuned model
- Perform a nearest-neighbour search (or search a FAISS index) to link the embedding of the new mention to its most similar embedding from the UMLS (a sketch follows below)
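The last step can be implemented with FAISS. The snippet below is a minimal sketch: umls_embs (a float32 array of precomputed UMLS entity embeddings) and umls_cuis (the matching concept identifiers) are assumed placeholders you build yourself; all_embs and dutch_biomedical_entities come from the usage script above.

import faiss
import numpy as np

# umls_embs: (N, d) float32 array of precomputed UMLS entity embeddings (placeholder)
# umls_cuis: list of N concept identifiers aligned with umls_embs (placeholder)
umls_embs = np.ascontiguousarray(umls_embs, dtype=np.float32)
query_embs = np.ascontiguousarray(all_embs, dtype=np.float32)

faiss.normalize_L2(umls_embs)   # L2-normalise so inner product equals cosine similarity
faiss.normalize_L2(query_embs)

index = faiss.IndexFlatIP(umls_embs.shape[1])  # exact inner-product search
index.add(umls_embs)

scores, neighbours = index.search(query_embs, 5)  # top-5 UMLS candidates per mention
for mention, idx_row in zip(dutch_biomedical_entities, neighbours):
    print(mention, "->", [umls_cuis[j] for j in idx_row])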