---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- static-embeddings
language: en
license: apache-2.0
---
# PubMedBERT Embeddings 1M
This is a pruned version of [PubMedBERT Embeddings 2M](https://huggingface.co/NeuML/pubmedbert-base-embeddings-2M). The vocabulary is pruned to the top 50% most frequently used tokens.
See [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) for background on pruning vocabularies to build smaller models.
## Usage (txtai)
This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).
```python
import txtai

# Create embeddings
embeddings = txtai.Embeddings(
    path="neuml/pubmedbert-base-embeddings-1M",
    content=True,
)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
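The `documents` function above stands in for any iterable of data to index; txtai accepts plain text, dictionaries or `(id, text, tags)` tuples. A minimal sketch with hypothetical hardcoded records:

```python
def documents():
    # Hypothetical example records as (id, text, tags) tuples
    yield (0, "Maternal fever during labor is associated with neonatal sepsis", None)
    yield (1, "Vitamin D supplementation improves bone density in older adults", None)
```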
## Usage (Sentence-Transformers)
Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-1M")
model = SentenceTransformer(modules=[static])

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```
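From here, the standard sentence-transformers API applies. For example, assuming sentence-transformers 3.x, where `SentenceTransformer.similarity` is available, pairwise similarities can be computed directly:

```python
# Compute pairwise cosine similarities between the embeddings above
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```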
## Usage (Model2Vec)
The model can also be used directly with Model2Vec.
```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Compute text embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```
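Because this model normalizes its output vectors (`normalize` is set to `True` in the training script below), cosine similarity reduces to a dot product. A minimal, self-contained sketch:

```python
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])

# Vectors are normalized at encode time, so a dot product is a cosine similarity
print(embeddings @ embeddings.T)
```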
## Evaluation Results
The following compares the performance of this model against the models previously evaluated with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). These datasets were used to evaluate model performance.
- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
- Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
- Split: test, Pair: (title, text)
  - _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available, but a similar dataset is used here_
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
- Subset: pubmed, Split: validation, Pair: (article, abstract)
The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| -------------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| pubmedbert-base-embeddings-8M-M2V (No training) | 69.84 | 70.77 | 71.30 | 70.64 |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 74.56 | 84.65 | 81.84 | 80.35 |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 86.03 | 91.71 | 91.25 | 89.66 |
| [**pubmedbert-base-embeddings-1M**](https://hf.co/neuml/pubmedbert-base-embeddings-1M) | **87.87** | **92.80** | **92.87** | **91.18** |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | 88.62 | 93.08 | 93.24 | 91.65 |
As the table shows, the accuracy tradeoff versus the original 2M model is relatively minimal.
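For illustration of the metric only, here is a minimal sketch of a Pearson-style pair evaluation. The actual benchmark harness differs, and the toy pairs and binary labels below are hypothetical:

```python
import numpy as np

from model2vec import StaticModel
from scipy.stats import pearsonr

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Hypothetical (text1, text2, label) pairs, where the label marks a true pairing
pairs = [
    ("Does exercise lower blood pressure?", "Aerobic exercise reduces systolic blood pressure", 1),
    ("Does exercise lower blood pressure?", "Checklist for laboratory safety procedures", 0),
    ("What causes iron deficiency anemia?", "Chronic blood loss is a common cause of iron deficiency", 1),
    ("What causes iron deficiency anemia?", "Solar panel efficiency improved again this year", 0),
]

e1 = model.encode([p[0] for p in pairs])
e2 = model.encode([p[1] for p in pairs])

# Cosine similarity per pair
scores = np.sum(e1 * e2, axis=1) / (np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))

# Pearson correlation between similarity scores and labels
print(pearsonr(scores, [p[2] for p in pairs]))
```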
## Runtime performance
As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.
```python
from datasets import load_dataset
from tqdm import tqdm
from txtai import Embeddings

ds = load_dataset("ccdv/pubmed-summarization", split="train")

embeddings = Embeddings(path="path to model", content=True, backend="numpy")
embeddings.index(tqdm(ds["abstract"]))
```
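Index time can be measured by wrapping the `index` call with a timer. Continuing from the snippet above, with `time.time` as an assumed stand-in for the original measurement harness:

```python
import time

start = time.time()
embeddings.index(tqdm(ds["abstract"]))
print(f"Index time: {time.time() - start:.0f}s")
```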
| Model | Model Size (MB) | Index time (s) |
| -------------------------------------------------------------------------------------- | ---------- | -------------- |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 0.2 | 19 |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 1.0 | 17 |
| **[pubmedbert-base-embeddings-1M](https://hf.co/neuml/pubmedbert-base-embeddings-1M)** | **2.0** | 17 |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | 7.5 | 17 |
Vocabulary pruning doesn't change runtime performance in this case, but the model is much smaller. Vectors are stored at `int16` precision, which can be beneficial for smaller/lower-powered embedded devices and could lead to faster vectorization times.
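The model sizes follow directly from the parameter counts: at `int16` precision each value takes 2 bytes, so 1M parameters is roughly 2 MB. A quick check (the 2M model appears to be stored at a higher precision, which would account for its larger footprint):

```python
# 1M parameters x 2 bytes per int16 value = ~2.0 MB, matching the table above
params, bytes_per_value = 1_000_000, 2
print(f"{params * bytes_per_value / 1e6:.1f} MB")
```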
## Training
This model's vocabulary was pruned using the following script.
```python
import json
import os

from collections import Counter
from pathlib import Path

import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import collect_means_and_texts
from tokenizers import Tokenizer
from tqdm import tqdm
from txtai.scoring import ScoringFactory

def tokenize(tokenizer):
    # Tokenize text into a dataset of (id, tokens, tags) tuples
    dataset = []
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            dataset.append((None, e.ids, None))

    return dataset

def tokenweights(tokenizer):
    dataset = tokenize(tokenizer)

    # Build scoring index
    scoring = ScoringFactory.create({"method": "bm25", "terms": True})
    scoring.index(dataset)

    # Calculate mean value of weights array per token
    tokens = np.zeros(tokenizer.get_vocab_size())
    for x in scoring.idf:
        tokens[x] = np.mean(scoring.terms.weights(x)[1])

    return tokens

# See PubMedBERT Embeddings 2M model for details on this data
features = "features"
paths = sorted(Path(features).glob("*.json"))
texts, _ = collect_means_and_texts(paths)

# Output model parameters
output = "output path"
params, dims = 1000000, 64

path = "pubmedbert-base-embeddings-2M_unweighted"
model = StaticModel.from_pretrained(path)

os.makedirs(output, exist_ok=True)

with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Calculate number of tokens to keep
tokencount = params // model.dim

# Calculate term frequency
freqs = Counter()
for _, ids, _ in tokenize(model.tokenizer):
    freqs.update(ids)

# Select top N most common tokens, always keeping special tokens like [CLS]
uids = set(x for x, _ in freqs.most_common(tokencount))
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")]

# Get embeddings for uids
model.embedding = model.embedding[uids]

# Select pruned tokens and map them to new sequential ids
pairs, index = [], 0
for token, uid in config["model"]["vocab"].items():
    if uid in uids:
        pairs.append((token, index))
        index += 1

config["model"]["vocab"] = dict(pairs)

# Write new tokenizer
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json")

# Re-weight tokens with BM25 weights from the pruned tokenizer
weights = tokenweights(model.tokenizer)

# Remove NaNs from embedding, if any
embedding = np.nan_to_num(model.embedding)

# Apply PCA
embedding = PCA(n_components=dims).fit_transform(embedding)

# Apply weights
embedding *= weights[:, None]

# Update model embedding and normalize
model.embedding, model.normalize = embedding.astype(np.int16), True

model.save_pretrained(output)
```
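A quick sanity check after pruning is to compare the pruned model's vectors against the original for the same input. A sketch, reusing the output path placeholder from the script above:

```python
import numpy as np

from model2vec import StaticModel

original = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-2M")
pruned = StaticModel.from_pretrained("output path")

sentences = ["Metformin is a first-line treatment for type 2 diabetes"]
a, b = original.encode(sentences)[0], pruned.encode(sentences)[0]

# Cosine similarity between original and pruned vectors for the same input
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```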
## Acknowledgement
This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
Read more at the following links.
- [Model2Vec](https://github.com/MinishLab/model2vec)
- [Tokenlearn](https://github.com/MinishLab/tokenlearn)
- [Minish Lab Blog](https://minishlab.github.io/)