|
--- |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
- embeddings |
|
- static-embeddings |
|
language: en |
|
license: apache-2.0 |
|
--- |
|
|
|
# PubMedBERT Embeddings 1M |
|
|
|
This is a pruned version of [PubMedBERT Embeddings 2M](https://huggingface.co/NeuML/pubmedbert-base-embeddings-2M). The vocabulary is pruned to keep the top 50% most frequently used tokens.
|
|
|
See [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) for background on pruning vocabularies to build smaller models. |
|
|
|
## Usage (txtai) |
|
|
|
This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).
|
|
|
```python |
|
import txtai |
|
|
|
# Create embeddings |
|
embeddings = txtai.Embeddings( |
|
path="neuml/pubmedbert-base-embeddings-1M", |
|
content=True, |
|
) |
|
# Index data - documents() is a placeholder, see the sketch below
embeddings.index(documents())
|
|
|
# Run a query |
|
embeddings.search("query to run") |
|
``` |
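`documents()` above is a placeholder for a data loader. As a minimal sketch, `index` accepts an iterable of plain text strings or `(id, text, tags)` tuples:

```python
def documents():
    # Example records - any iterable of (id, text, tags) tuples works
    data = [
        "Aspirin reduces the risk of cardiovascular events",
        "Metformin is a first-line therapy for type 2 diabetes",
    ]

    for uid, text in enumerate(data):
        yield uid, text, None
```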
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net). |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from sentence_transformers.models import StaticEmbedding |
|
|
|
# Initialize a StaticEmbedding module |
|
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-1M") |
|
model = SentenceTransformer(modules=[static]) |
|
|
|
sentences = ["This is an example sentence", "Each sentence is converted"] |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
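Similarity scores can then be computed with the standard Sentence Transformers API (version 3.0 or later):

```python
# Pairwise cosine similarity, the default similarity function
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```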
|
|
|
## Usage (Model2Vec) |
|
|
|
The model can also be used directly with Model2Vec. |
|
|
|
```python |
|
from model2vec import StaticModel |
|
|
|
# Load a pretrained Model2Vec model |
|
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M") |
|
|
|
# Compute text embeddings |
|
sentences = ["This is an example sentence", "Each sentence is converted"] |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
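The training script below sets `model.normalize = True`, so the encoded vectors are unit length and cosine similarity reduces to a dot product:

```python
# Vectors are L2-normalized, so the dot product equals cosine similarity
print(embeddings @ embeddings.T)
```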
|
|
|
## Evaluation Results |
|
|
|
The following table compares the performance of this model against the models previously compared with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). These datasets were used to evaluate model performance.
|
|
|
- [PubMed QA](https://huggingface.co/datasets/pubmed_qa) |
|
- Subset: pqa_labeled, Split: train, Pair: (question, long_answer) |
|
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k) |
|
- Split: test, Pair: (title, text) |
|
- _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available but a similar dataset is used here_ |
|
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers) |
|
- Subset: pubmed, Split: validation, Pair: (article, abstract) |
|
|
|
The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric. |
|
|
|
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average | |
|
| -------------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- | |
|
| pubmedbert-base-embeddings-8M-M2V (No training) | 69.84 | 70.77 | 71.30 | 70.64 | |
|
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 74.56 | 84.65 | 81.84 | 80.35 | |
|
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 86.03 | 91.71 | 91.25 | 89.66 | |
|
| [**pubmedbert-base-embeddings-1M**](https://hf.co/neuml/pubmedbert-base-embeddings-1M) | **87.87** | **92.80** | **92.87** | **91.18** | |
|
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | 88.62 | 93.08 | 93.24 | 91.65 | |
|
|
|
As we can see, the accuracy tradeoff is relatively minimal compared to the base 2M model.
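For reference, below is an illustrative sketch of how such a correlation could be computed with this model. The matched vs. mismatched pair construction here is an assumption for illustration, not the exact evaluation harness used.

```python
import numpy as np

from datasets import load_dataset
from model2vec import StaticModel
from scipy.stats import pearsonr

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Encode both sides of the (question, long_answer) pairs
ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")
questions, answers = model.encode(ds["question"]), model.encode(ds["long_answer"])

# Vectors are normalized, so a row-wise dot product is cosine similarity
matched = (questions * answers).sum(axis=1)
mismatched = (questions * np.roll(answers, 1, axis=0)).sum(axis=1)

# Pearson correlation between similarities and binary relevance labels
similarities = np.concatenate([matched, mismatched])
labels = np.concatenate([np.ones(len(matched)), np.zeros(len(mismatched))])
print(pearsonr(similarities, labels))
```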
|
|
|
## Runtime performance |
|
|
|
As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.
|
|
|
```python |
|
from datasets import load_dataset |
|
from tqdm import tqdm |
|
from txtai import Embeddings |
|
|
|
ds = load_dataset("ccdv/pubmed-summarization", split="train") |
|
|
|
# "path to model" is a placeholder for each model being timed
embeddings = Embeddings(path="path to model", content=True, backend="numpy")
|
embeddings.index(tqdm(ds["abstract"])) |
|
``` |
|
|
|
| Model | Model Size (MB) | Index time (s) | |
|
| -------------------------------------------------------------------------------------- | ---------- | -------------- | |
|
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 0.2 | 19 | |
|
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 1.0 | 17 | |
|
| **[pubmedbert-base-embeddings-1M](https://hf.co/neuml/pubmedbert-base-embeddings-1M)** | **2.0** | 17 | |
|
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | 7.5 | 17 | |
|
|
|
Vocabulary pruning doesn't change runtime performance in this case, but the model is much smaller. Vectors are stored at `int16` precision, which can be beneficial on smaller/lower-powered embedded devices and could lead to faster vectorization times.
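The model sizes above follow directly from the parameter count and storage precision. For this model (ignoring tokenizer and config overhead):

```python
# 15,625 tokens x 64 dimensions = 1M parameters at int16 (2 bytes each)
print(15_625 * 64 * 2 / 1e6)  # ~2.0 MB
```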
|
|
|
## Training |
|
|
|
This model's vocabulary was pruned using the following script. It keeps the most frequent tokens, rebuilds the tokenizer, re-weights tokens with BM25, reduces dimensionality with PCA and stores the embeddings at `int16` precision.
|
|
|
```python |
|
import json |
|
import os |
|
|
|
from collections import Counter |
|
from pathlib import Path |
|
|
|
import numpy as np |
|
|
|
from model2vec import StaticModel |
|
from more_itertools import batched |
|
from sklearn.decomposition import PCA |
|
from tokenlearn.train import collect_means_and_texts |
|
from tokenizers import Tokenizer |
|
from tqdm import tqdm |
|
from txtai.scoring import ScoringFactory |
|
|
|
def tokenize(tokenizer): |
|
# Tokenize into dataset |
|
dataset = [] |
|
for t in tqdm(batched(texts, 1024)): |
|
encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False) |
|
for e in encodings: |
|
dataset.append((None, e.ids, None)) |
|
|
|
return dataset |
|
|
|
def tokenweights(tokenizer): |
|
dataset = tokenize(tokenizer) |
|
|
|
# Build scoring index |
|
scoring = ScoringFactory.create({"method": "bm25", "terms": True}) |
|
scoring.index(dataset) |
|
|
|
# Calculate mean value of weights array per token |
|
tokens = np.zeros(tokenizer.get_vocab_size()) |
|
for x in scoring.idf: |
|
tokens[x] = np.mean(scoring.terms.weights(x)[1]) |
|
|
|
return tokens |
|
|
|
# See PubMedBERT Embeddings 2M model for details on this data |
|
features = "features" |
|
paths = sorted(Path(features).glob("*.json")) |
|
texts, _ = collect_means_and_texts(paths) |
|
|
|
# Output model parameters |
|
output = "output path" |
|
params, dims = 1000000, 64 |
|
|
|
path = "pubmedbert-base-embeddings-2M_unweighted" |
|
model = StaticModel.from_pretrained(path) |
|
|
|
os.makedirs(output, exist_ok=True) |
|
|
|
with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f: |
|
config = json.load(f) |
|
|
|
# Calculate number of tokens to keep |
|
tokencount = params // model.dim |
|
|
|
# Calculate term frequency |
|
freqs = Counter() |
|
for _, ids, _ in tokenize(model.tokenizer): |
|
freqs.update(ids) |
|
|
|
# Select top N most common tokens |
|
uids = set(x for x, _ in freqs.most_common(tokencount)) |
|
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")] |
|
|
|
# Get embeddings for uids |
|
model.embedding = model.embedding[uids] |
|
|
|
# Select pruned tokens |
|
pairs, index = [], 0 |
|
for token, uid in config["model"]["vocab"].items(): |
|
if uid in uids: |
|
pairs.append((token, index)) |
|
index += 1 |
|
|
|
config["model"]["vocab"] = dict(pairs) |
|
|
|
# Write new tokenizer |
|
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f: |
|
json.dump(config, f, indent=2) |
|
|
|
model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json") |
|
|
|
# Re-weight tokens |
|
weights = tokenweights(model.tokenizer) |
|
|
|
# Remove NaNs from embedding, if any |
|
embedding = np.nan_to_num(model.embedding) |
|
|
|
# Apply PCA |
|
embedding = PCA(n_components=dims).fit_transform(embedding) |
|
|
|
# Apply weights |
|
embedding *= weights[:, None] |
|
|
|
# Update model embedding and normalize |
|
model.embedding, model.normalize = embedding.astype(np.int16), True |
|
|
|
model.save_pretrained(output) |
|
``` |
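As a quick sanity check after running the script, the pruned model can be reloaded and inspected. This assumes the `output` path used above:

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("output path")

# Expect roughly params // dims rows (plus retained special tokens) at 64 dimensions
print(model.embedding.shape)
```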
|
|
|
## Acknowledgement |
|
|
|
This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled). |
|
|
|
Read more at the following links. |
|
|
|
- [Model2Vec](https://github.com/MinishLab/model2vec) |
|
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) |
|
- [Minish Lab Blog](https://minishlab.github.io/) |
|
|