---
language: multilingual
tags:
- biomedical
- lexical-semantics
- cross-lingual
datasets:
- UMLS
---
**[news]** A cross-lingual extension of SapBERT will appear in the main conference of **ACL 2021**! <br>
**[news]** SapBERT will appear in the conference proceedings of **NAACL 2021**!
### SapBERT-XLMR
SapBERT [(Liu et al. 2020)](https://arxiv.org/pdf/2010.11784.pdf) trained on [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2020AB, using [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) as the base model. Use the `[CLS]` token representation as the embedding of the input.
#### Extracting embeddings from SapBERT
The following script converts a list of strings (entity names) into embeddings.
```python
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

# NOTE: this snippet loads the monolingual SapBERT checkpoint; for the
# cross-lingual model described in this card, substitute the corresponding
# XLM-R-based checkpoint.
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    # tokenize one batch of entity names, padded/truncated to a fixed length
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the [CLS] representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)
```
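The resulting embeddings can be used for biomedical entity linking via nearest-neighbour search: encode both the query mentions and an ontology's entity names with the script above, then match each mention to its closest entity vector. Below is a minimal sketch using cosine similarity; `query_embs` and `link_mentions` are illustrative placeholders, not part of the released code.

```python
# Minimal sketch: link query mentions to ontology entries by cosine similarity.
# Assumes `all_embs` holds embeddings of the ontology entity names (rows aligned
# with `all_names`) and `query_embs` holds embeddings of the query mentions,
# both produced with the extraction script above.
import numpy as np

def link_mentions(query_embs: np.ndarray, all_embs: np.ndarray, all_names: list):
    # L2-normalise so that the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    e = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
    sims = q @ e.T                 # (num_queries, num_entities) similarity matrix
    nearest = sims.argmax(axis=1)  # index of the best-matching entity per query
    return [(all_names[j], sims[i, j]) for i, j in enumerate(nearest)]
```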
For more details about training and evaluation, see the SapBERT [GitHub repo](https://github.com/cambridgeltl/sapbert).
### Citation
```bibtex
@inproceedings{liu2021learning,
  title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
  author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle={Proceedings of ACL-IJCNLP 2021},
  month={aug},
  year={2021}
}
``` |