---
license: gpl-3.0
---

These pre-trained word embeddings were built from the text of published biomedical manuscripts. The embeddings have 100 dimensions and were trained with the word2vec algorithm on all available manuscripts in the [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). See the paper here: https://pubmed.ncbi.nlm.nih.gov/34920127/

Citation:

```
@article{flamholz2022word,
  title={Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information},
  author={Flamholz, Zachary N and Crane-Droesch, Andrew and Ungar, Lyle H and Weissman, Gary E},
  journal={Journal of Biomedical Informatics},
  volume={125},
  pages={103971},
  year={2022},
  publisher={Elsevier}
}
```

## Quick start

The word embeddings are stored in a format compatible with the [`gensim` Python package](https://radimrehurek.com/gensim/).

First, download the files from this archive. Then load the embeddings into Python.

```python
from gensim.models import Word2Vec

# Load the model
model = Word2Vec.load('w2v_oa_all_100d.bin')

# Return the 100-dimensional vector representation of each word
# (index lookup on model.wv; word_vec() is a legacy alias for the same lookup)
model.wv['diabetes']
model.wv['cardiac_arrest']
model.wv['lymphangioleiomyomatosis']

# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
```
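
For readers who want to see what the similarity score measures, here is a minimal sketch of cosine similarity in plain Python. It is not part of this archive. The names `cosine_similarity`, `safe_similarity`, and `toy_vectors` are hypothetical, and the 3-dimensional vectors are made up for illustration (the real embeddings have 100 dimensions). The sketch also shows one way to handle words that are missing from the vocabulary, on the assumption that you would rather get `None` than an exception.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def safe_similarity(vectors, word1, word2):
    """Return the cosine similarity of two words, or None if either is missing.

    `vectors` is any object that supports `word in vectors` and `vectors[word]`.
    """
    if word1 not in vectors or word2 not in vectors:
        return None
    return cosine_similarity(vectors[word1], vectors[word2])


# Toy vectors, invented for this example only (not taken from the model).
toy_vectors = {
    "term_a": [1.0, 2.0, 3.0],
    "term_b": [2.0, 4.0, 6.0],   # points in the same direction as term_a
    "term_c": [-3.0, 0.0, 1.0],  # orthogonal to term_a
}

print(safe_similarity(toy_vectors, "term_a", "term_b"))   # parallel vectors
print(safe_similarity(toy_vectors, "term_a", "term_c"))   # orthogonal vectors
print(safe_similarity(toy_vectors, "term_a", "missing"))  # word not present
```

Because `model.wv` supports both membership tests and index lookup, the same `safe_similarity` helper should also accept it in place of `toy_vectors`, for example `safe_similarity(model.wv, 'copd', 'lam')`.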