---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- static-embeddings
language: en
license: apache-2.0
---

# PubMedBERT Embeddings 1M

This is a pruned version of [PubMedBERT Embeddings 2M](https://huggingface.co/NeuML/pubmedbert-base-embeddings-2M). It prunes the vocabulary to keep only the top 50% most frequently used tokens.

See [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) for background on pruning vocabularies to build smaller models.
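As a rough illustration of the idea (the exact procedure is in the Training section below), frequency-based pruning boils down to counting how often each token appears in a reference corpus and keeping only the most common half:

```python
from collections import Counter

# Hypothetical token id sequences produced by tokenizing a reference corpus
token_ids = [[2, 45, 7, 45], [7, 2, 99], [45, 7, 7]]

# Count how often each token id is used
freqs = Counter(tid for ids in token_ids for tid in ids)

# Keep the top 50% most frequently used tokens
keep = {tid for tid, _ in freqs.most_common(len(freqs) // 2)}
print(keep)  # {7, 45} in this toy example (set order may vary)
```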

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# Create embeddings
embeddings = txtai.Embeddings(
  path="neuml/pubmedbert-base-embeddings-1M",
  content=True,
)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
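The `documents()` call above is a placeholder for any iterable txtai can index, such as `(id, text, tags)` tuples or plain strings. A minimal, hypothetical example:

```python
def documents():
    # Any iterable of (id, text, tags) tuples or plain strings works
    data = [
        "Aspirin irreversibly inhibits cyclooxygenase enzymes",
        "Metformin lowers hepatic glucose production"
    ]

    for uid, text in enumerate(data):
        yield uid, text, None
```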

## Usage (Sentence-Transformers)

Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-1M")
model = SentenceTransformer(modules=[static])

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```
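The resulting vectors can then be compared with cosine similarity, for example with numpy (any similarity utility would work):

```python
import numpy as np

# Cosine similarity between the two example sentence embeddings
a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```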

## Usage (Model2Vec)

The model can also be used directly with Model2Vec.

```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Compute text embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```

## Evaluation Results

The following table compares the performance of this model against the models previously compared with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). The datasets below were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
  - _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available, but a similar dataset is used here_
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model                                                                                  | PubMed QA | PubMed Subset | PubMed Summary | Average   |
| -------------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| pubmedbert-base-embeddings-8M-M2V (No training)                                        | 69.84     | 70.77         | 71.30          | 70.64     |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 74.56     | 84.65         | 81.84          | 80.35     |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 86.03     | 91.71         | 91.25          | 89.66     |
| [**pubmedbert-base-embeddings-1M**](https://hf.co/neuml/pubmedbert-base-embeddings-1M) | **87.87** | **92.80**     | **92.87**      | **91.18** |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)     | 88.62     | 93.08         | 93.24          | 91.65     |

As we can see, the accuracy tradeoff is relatively minimal compared to the original 2M model.

## Runtime performance

As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.

```python
from datasets import load_dataset
from tqdm import tqdm
from txtai import Embeddings

ds = load_dataset("ccdv/pubmed-summarization", split="train")

embeddings = Embeddings(path="path to model", content=True, backend="numpy")
embeddings.index(tqdm(ds["abstract"]))
```
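The timing itself isn't shown above; a simple way to measure it (an assumption, not necessarily the exact harness used for the table below) is to wrap the `index` call from the snippet above:

```python
import time

start = time.time()
embeddings.index(tqdm(ds["abstract"]))
print(f"Index time: {time.time() - start:.0f}s")
```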

| Model                                                                                  | Model Size (MB) | Index time (s) |
| -------------------------------------------------------------------------------------- | ----------      | -------------- |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K) | 0.2             | 19             |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K) | 1.0             | 17             |
| **[pubmedbert-base-embeddings-1M](https://hf.co/neuml/pubmedbert-base-embeddings-1M)** | **2.0**         | 17             |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)     | 7.5             | 17             |

Vocabulary pruning doesn't change runtime performance in this case, but the model is much smaller. Vectors are stored at `int16` precision, which can be beneficial on smaller/lower-powered embedded devices and could lead to faster vectorization times.
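To put the size column in perspective, the footprint is dominated by the embedding matrix; a back-of-the-envelope estimate based on the 1M parameter budget and `int16` storage lines up with the 2.0 MB figure above:

```python
# 1M parameters at 64 dimensions, stored as int16 (2 bytes per value)
tokens, dims, bytes_per_value = 1_000_000 // 64, 64, 2
print(f"~{tokens * dims * bytes_per_value / 1_000_000:.1f} MB")  # ~2.0 MB
```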

## Training

The vocabulary of this model was pruned using the following script.

```python
import json
import os

from collections import Counter
from pathlib import Path

import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import collect_means_and_texts
from tokenizers import Tokenizer
from tqdm import tqdm
from txtai.scoring import ScoringFactory

def tokenize(tokenizer):
    # Tokenize into dataset
    dataset = []
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            dataset.append((None, e.ids, None))

    return dataset

def tokenweights(tokenizer):
    dataset = tokenize(tokenizer)

    # Build scoring index
    scoring = ScoringFactory.create({"method": "bm25", "terms": True})
    scoring.index(dataset)

    # Calculate mean value of weights array per token
    tokens = np.zeros(tokenizer.get_vocab_size())
    for x in scoring.idf:
        tokens[x] = np.mean(scoring.terms.weights(x)[1])

    return tokens

# See PubMedBERT Embeddings 2M model for details on this data
features = "features"
paths = sorted(Path(features).glob("*.json"))
texts, _ = collect_means_and_texts(paths)

# Output model parameters
output = "output path"
params, dims = 1000000, 64

path = "pubmedbert-base-embeddings-2M_unweighted"
model = StaticModel.from_pretrained(path)

os.makedirs(output, exist_ok=True)

with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Calculate number of tokens to keep
tokencount = params // model.dim

# Calculate term frequency
freqs = Counter()
for _, ids, _ in tokenize(model.tokenizer):
    freqs.update(ids)

# Select top N most common tokens
uids = set(x for x, _ in freqs.most_common(tokencount))
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")]

# Get embeddings for uids
model.embedding = model.embedding[uids]

# Select pruned tokens
pairs, index = [], 0
for token, uid in config["model"]["vocab"].items():
    if uid in uids:
        pairs.append((token, index))
        index += 1

config["model"]["vocab"] = dict(pairs)

# Write new tokenizer
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json")

# Re-weight tokens
weights = tokenweights(model.tokenizer)

# Remove NaNs from embedding, if any
embedding = np.nan_to_num(model.embedding)

# Apply PCA
embedding = PCA(n_components=dims).fit_transform(embedding)

# Apply weights
embedding *= weights[:, None]

# Update model embedding and normalize
model.embedding, model.normalize = embedding.astype(np.int16), True

model.save_pretrained(output)
```
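After the script runs, the pruned model can be sanity-checked by loading it back and confirming the vocabulary and embedding matrix line up (a quick check, assuming `output path` is the directory written above):

```python
from model2vec import StaticModel

pruned = StaticModel.from_pretrained("output path")

# Vocabulary size and embedding rows should match (~15.6K tokens x 64 dims)
print(pruned.tokenizer.get_vocab_size(), pruned.embedding.shape)
```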

## Acknowledgement

This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

Read more at the following links.

- [Model2Vec](https://github.com/MinishLab/model2vec)
- [Tokenlearn](https://github.com/MinishLab/tokenlearn)
- [Minish Lab Blog](https://minishlab.github.io/)