Dutch Biomedical Entity Linking
Summary
- RoBERTa-based base model trained from scratch on Dutch hospital notes (medRoBERTa.nl).
- Second-phase pretraining using self-alignment on a UMLS-derived Dutch biomedical ontology (see the sketch below the summary).
- Fine-tuned on an automatically generated, weakly labelled corpus from Wikipedia.
- Evaluation results on the Mantra GSC corpus can be found in the report.
All code for generating the training data, training the model, and evaluating it can be found in the GitHub repository.
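Self-alignment pretraining pulls together names that refer to the same UMLS concept. The snippet below is only a minimal illustration using an InfoNCE-style objective on hypothetical synonym pairs; the base model name, the example pairs, and the exact loss are assumptions and may differ from the actual training code in the repository.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# assumed base model name; the actual run starts from medRoBERTa.nl
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModel.from_pretrained("CLTL/MedRoBERTa.nl")

def embed(names):
    toks = tokenizer(names, padding=True, truncation=True, max_length=25, return_tensors="pt")
    return model(**toks)[0][:, 0, :]  # CLS representations

# hypothetical synonym pairs that share a UMLS CUI
pairs = [("hartaanval", "myocardinfarct"), ("versnelde ademhaling", "tachypneu")]
a = F.normalize(embed([p for p, _ in pairs]), dim=-1)
b = F.normalize(embed([q for _, q in pairs]), dim=-1)

logits = a @ b.T / 0.07                # temperature-scaled similarity matrix
labels = torch.arange(len(pairs))      # the matching synonym is the positive
loss = F.cross_entropy(logits, labels) # other pairs in the batch act as negatives
loss.backward()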
Usage
The following script (adapted from the original SapBERT repository) computes embeddings for a list of input entities (strings):
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking")
model = AutoModel.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking").cuda()

# replace with your own list of entity names
dutch_biomedical_entities = ["versnelde ademhaling", "Coronavirus infectie", "aandachtstekort/hyperactiviteitstoornis", "hartaanval"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(dutch_biomedical_entities), bs)):
    toks = tokenizer.batch_encode_plus(dutch_biomedical_entities[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()  # move input tensors to GPU
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
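As a quick sanity check (not part of the original script), the embeddings can be compared directly with cosine similarity; the sketch below scores the first entity in the list against the remaining ones using the all_embs array produced above.

# cosine similarity between the first entity and the remaining ones
normed = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
sims = normed[1:] @ normed[0]
for entity, sim in zip(dutch_biomedical_entities[1:], sims):
    print(f"{entity}: {sim:.3f}")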
For (Dutch) biomedical entity linking, the following steps should be performed:
- Request a UMLS (and SNOMED NL) license
- Precompute embeddings for all entities in the UMLS with the fine-tuned model
- Compute the embedding of the new, unseen mention with the fine-tuned model
- Perform a nearest-neighbour search (or search a FAISS index) to link the embedding of the new mention to its most similar embedding from the UMLS (a sketch follows below)
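The last step can be implemented with FAISS. The snippet below is a minimal sketch: umls_embs (a float32 array of precomputed UMLS entity embeddings) and umls_cuis (the matching concept identifiers) are assumed placeholders you build yourself; all_embs and dutch_biomedical_entities come from the usage script above.

import faiss
import numpy as np

# umls_embs: (N, d) float32 array of precomputed UMLS entity embeddings (placeholder)
# umls_cuis: list of N concept identifiers aligned with umls_embs (placeholder)
umls_embs = np.ascontiguousarray(umls_embs, dtype=np.float32)
query_embs = np.ascontiguousarray(all_embs, dtype=np.float32)

faiss.normalize_L2(umls_embs)   # L2-normalise so inner product equals cosine similarity
faiss.normalize_L2(query_embs)

index = faiss.IndexFlatIP(umls_embs.shape[1])  # exact inner-product search
index.add(umls_embs)

scores, neighbours = index.search(query_embs, 5)  # top-5 UMLS candidates per mention
for mention, idx_row in zip(dutch_biomedical_entities, neighbours):
    print(mention, "->", [umls_cuis[j] for j in idx_row])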