davidmezzetti committed
Commit 69392bf · Parent: 938b8a9

Initial version

Files changed (4)
  1. README.md +237 -0
  2. config.json +1 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
README.md ADDED
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- static-embeddings
language: en
license: apache-2.0
---

# PubMedBERT Embeddings 1M

This is a pruned version of [PubMedBERT Embeddings 2M](https://huggingface.co/NeuML/pubmedbert-base-embeddings-2M). It prunes the vocabulary, keeping only the top 50% most frequently used tokens.

See [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) for background on pruning vocabularies to build smaller models.

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# Create embeddings
embeddings = txtai.Embeddings(
  path="neuml/pubmedbert-base-embeddings-1M",
  content=True,
)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
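
In the snippet above, `documents()` stands in for whatever iterable of records is being indexed. A minimal sketch of such a generator is shown below; the example abstracts are placeholders, and txtai also accepts plain strings instead of `(id, text, tags)` tuples.

```python
def documents():
    # Placeholder data: any iterable of (id, text, tags) tuples or plain strings works
    abstracts = [
        "Aspirin irreversibly inhibits cyclooxygenase.",
        "Metformin decreases hepatic glucose production.",
    ]
    for uid, text in enumerate(abstracts):
        yield (uid, text, None)
```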

## Usage (Sentence-Transformers)

Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-1M")
model = SentenceTransformer(modules=[static])

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```

## Usage (Model2Vec)

The model can also be used directly with Model2Vec.

```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Compute text embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```
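
Because the model is configured with `normalize: true` (see `config.json` below), the returned vectors should be unit length, so a plain dot product acts as cosine similarity. A small, self-contained sketch under that assumption:

```python
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])

# With normalized vectors, the dot product equals the cosine similarity
print(np.dot(embeddings[0], embeddings[1]))
```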

## Evaluation Results

The table below compares the performance of this model against the models previously compared with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
  - _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available, but a similar dataset is used here_
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model                                                                                   | PubMed QA | PubMed Subset | PubMed Summary | Average   |
| --------------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| pubmedbert-base-embeddings-8M-M2V (No training)                                         | 69.84     | 70.77         | 71.30          | 70.64     |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K)  | 74.56     | 84.65         | 81.84          | 80.35     |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K)  | 86.03     | 91.71         | 91.25          | 89.66     |
| [**pubmedbert-base-embeddings-1M**](https://hf.co/neuml/pubmedbert-base-embeddings-1M)  | **87.87** | **92.80**     | **92.87**      | **91.18** |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)      | 88.62     | 93.08         | 93.24          | 91.65     |

As we can see, the accuracy tradeoff is relatively minimal compared to the original model.
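
As a rough illustration of this style of evaluation, the sketch below encodes both sides of each text pair, scores them with cosine similarity and correlates the scores with gold labels via `scipy.stats.pearsonr`. The pairs and labels here are hypothetical placeholders, not the benchmark data; see the original [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) card for the actual evaluation setup.

```python
import numpy as np

from model2vec import StaticModel
from scipy.stats import pearsonr

# Hypothetical evaluation pairs and gold similarity labels
pairs = [
    ("Does aspirin reduce the risk of heart attacks?", "Aspirin lowers the risk of cardiovascular events."),
    ("Does aspirin reduce the risk of heart attacks?", "Metformin decreases hepatic glucose production."),
    ("What is the mechanism of metformin?", "Metformin decreases hepatic glucose production."),
]
labels = [1.0, 0.0, 1.0]

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Encode each side; with normalized vectors, the row-wise dot product is the cosine similarity
x = model.encode([a for a, _ in pairs])
y = model.encode([b for _, b in pairs])
scores = np.sum(x * y, axis=1)

# Pearson correlation between model similarities and gold labels
print(pearsonr(scores, labels))
```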

## Runtime performance

As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.

```python
from datasets import load_dataset
from tqdm import tqdm
from txtai import Embeddings

ds = load_dataset("ccdv/pubmed-summarization", split="train")

embeddings = Embeddings(path="path to model", content=True, backend="numpy")
embeddings.index(tqdm(ds["abstract"]))
```

| Model                                                                                   | Model Size (MB) | Index time (s) |
| --------------------------------------------------------------------------------------- | --------------- | -------------- |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K)  | 0.2             | 19             |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K)  | 1.0             | 17             |
| **[pubmedbert-base-embeddings-1M](https://hf.co/neuml/pubmedbert-base-embeddings-1M)**  | **2.0**         | 17             |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)      | 7.5             | 17             |

Vocabulary pruning doesn't change the runtime performance in this case, but the model is much smaller. Vectors are stored at `int16` precision. This can be beneficial to smaller/lower-powered embedded devices and could lead to faster vectorization times.
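
One quick way to see the effect of pruning and `int16` storage is to inspect the embedding matrix directly. This sketch assumes the `model.embedding` attribute used by the training script below; the exact dtype reported may depend on the Model2Vec version and how the weights are loaded.

```python
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Vocabulary size x dimensions, storage dtype and in-memory size
embedding = np.asarray(model.embedding)
print(embedding.shape, embedding.dtype)
print(f"{embedding.nbytes / 1024 / 1024:.1f} MB")
```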

## Training

This model was vocabulary pruned using the following script.

```python
import json
import os

from collections import Counter
from pathlib import Path

import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import collect_means_and_texts
from tokenizers import Tokenizer
from tqdm import tqdm
from txtai.scoring import ScoringFactory

def tokenize(tokenizer):
    # Tokenize into dataset
    dataset = []
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            dataset.append((None, e.ids, None))

    return dataset

def tokenweights(tokenizer):
    dataset = tokenize(tokenizer)

    # Build scoring index
    scoring = ScoringFactory.create({"method": "bm25", "terms": True})
    scoring.index(dataset)

    # Calculate mean value of weights array per token
    tokens = np.zeros(tokenizer.get_vocab_size())
    for x in scoring.idf:
        tokens[x] = np.mean(scoring.terms.weights(x)[1])

    return tokens

# See PubMedBERT Embeddings 2M model for details on this data
features = "features"
paths = sorted(Path(features).glob("*.json"))
texts, _ = collect_means_and_texts(paths)

# Output model parameters
output = "output path"
params, dims = 1000000, 64

path = "pubmedbert-base-embeddings-2M_unweighted"
model = StaticModel.from_pretrained(path)

os.makedirs(output, exist_ok=True)

with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Calculate number of tokens to keep
tokencount = params // model.dim

# Calculate term frequency
freqs = Counter()
for _, ids, _ in tokenize(model.tokenizer):
    freqs.update(ids)

# Select top N most common tokens
uids = set(x for x, _ in freqs.most_common(tokencount))
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")]

# Get embeddings for uids
model.embedding = model.embedding[uids]

# Select pruned tokens
pairs, index = [], 0
for token, uid in config["model"]["vocab"].items():
    if uid in uids:
        pairs.append((token, index))
        index += 1

config["model"]["vocab"] = dict(pairs)

# Write new tokenizer
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json")

# Re-weight tokens
weights = tokenweights(model.tokenizer)

# Remove NaNs from embedding, if any
embedding = np.nan_to_num(model.embedding)

# Apply PCA
embedding = PCA(n_components=dims).fit_transform(embedding)

# Apply weights
embedding *= weights[:, None]

# Update model embedding and normalize
model.embedding, model.normalize = embedding.astype(np.int16), True

model.save_pretrained(output)
```
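
With `params = 1000000` and `dims = 64`, the script keeps `1000000 // 64 = 15625` of the most frequent tokens plus the special tokens starting with `[`, for roughly one million embedding parameters, matching the 1M in the model name. A quick sanity check of the pruned output, assuming the placeholder `output` path from the script above:

```python
from model2vec import StaticModel

# Load the pruned model written by the script above ("output path" is a placeholder)
model = StaticModel.from_pretrained("output path")

# Embedding rows should match the pruned tokenizer vocabulary, with 64 dimensions each
print(model.embedding.shape)
print(model.tokenizer.get_vocab_size())
```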

## Acknowledgement

This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

Read more at the following links.

- [Model2Vec](https://github.com/MinishLab/model2vec)
- [Tokenlearn](https://github.com/MinishLab/tokenlearn)
- [Minish Lab Blog](https://minishlab.github.io/)

config.json ADDED
{"model_type": "model2vec", "architectures": ["StaticModel"], "tokenizer_name": "neuml/pubmedbert-base-embeddings", "apply_pca": 64, "apply_zipf": true, "hidden_dim": 64, "seq_length": 1000000, "normalize": true}

model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c24936ae1ff2217a8ec58846ed8e086001091359e253d8f260f43584cf4e54bc
size 2000472

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff