garyw/clinical-embeddings-100d-w2v-oa-all

	---
	license: gpl-3.0
	---

	Pre-trained word embeddings trained on the text of published biomedical manuscripts. The embeddings have 100 dimensions and were trained with the word2vec algorithm on all available manuscripts in the [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). The accompanying paper is available at https://pubmed.ncbi.nlm.nih.gov/34920127/

	Citation:

	```
	@article{flamholz2022word,
	title={Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information},
	author={Flamholz, Zachary N and Crane-Droesch, Andrew and Ungar, Lyle H and Weissman, Gary E},
	journal={Journal of Biomedical Informatics},
	volume={125},
	pages={103971},
	year={2022},
	publisher={Elsevier}
	}
	```

	## Quick start

	The embeddings are stored in a format compatible with the [`gensim` Python package](https://radimrehurek.com/gensim/).

	First download the files from this archive. Then load the embeddings into Python.


	```python
	from gensim.models import Word2Vec

	# Load the model (the file must already be downloaded from this repository)
	model = Word2Vec.load('w2v_oa_all_100d.bin')

	# Look up the 100-dimensional vector for each term.
	# Indexing model.wv is used here because word_vec() is deprecated in recent gensim releases.
	# A term that is not in the vocabulary raises a KeyError.
	model.wv['diabetes']
	model.wv['cardiac_arrest']
	model.wv['lymphangioleiomyomatosis']

	# Cosine similarity between pairs of terms
	model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
	model.wv.similarity('myocardial_infarction', 'heart_attack')
	model.wv.similarity('lymphangioleiomyomatosis', 'lam')
	```
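
The examples above suggest that vocabulary entries are lowercase, with multi-word terms joined by underscores (for example `cardiac_arrest`). The sketch below is a minimal illustration under that assumption: it normalizes a free-text phrase into such a token, returns `None` instead of raising when a term is missing, and computes cosine similarity by hand to show what `similarity` measures. The helper names (`to_token`, `cosine`, `safe_similarity`) and the toy 3-dimensional vectors are hypothetical and not part of this repository or of `gensim`. A plain dictionary stands in for `model.wv` so the block runs without the model files.

```python
import math


def to_token(phrase):
    """Lowercase a phrase and join its words with underscores.

    Assumption: this mirrors the token convention seen in the examples
    above (e.g. 'cardiac_arrest'); the real preprocessing may differ.
    """
    return "_".join(phrase.lower().split())


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def safe_similarity(vectors, first, second):
    """Return the cosine similarity of two phrases, or None if either
    normalized token is missing from the vocabulary."""
    a, b = to_token(first), to_token(second)
    if a not in vectors or b not in vectors:
        return None
    return cosine(vectors[a], vectors[b])


# Toy stand-in for model.wv: made-up 3-dimensional vectors, not real embeddings.
toy_vectors = {
    "heart_attack": [1.0, 2.0, 0.0],
    "myocardial_infarction": [2.0, 4.0, 0.0],  # same direction as heart_attack
    "copd": [0.0, 0.0, 3.0],                   # orthogonal to heart_attack
}

print(safe_similarity(toy_vectors, "Heart attack", "myocardial infarction"))
print(safe_similarity(toy_vectors, "heart attack", "COPD"))
print(safe_similarity(toy_vectors, "heart attack", "not a real term"))
```

Because the helpers only rely on membership tests and indexing, passing `model.wv` in place of `toy_vectors` should work the same way, although that has not been verified against the published files.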