davidmezzetti committed on
Commit
6f07fc3
·
1 Parent(s): f7da929

Initial version

Files changed (4)
  1. README.md +214 -0
  2. config.json +1 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,214 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+ - embeddings
+ - static-embeddings
+ language: en
+ license: apache-2.0
+ ---
+
+ # PubMedBERT Embeddings 2M
+
+ This is a distilled version of [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) using the [Model2Vec](https://github.com/MinishLab/model2vec) library. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical.
+
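To make the idea of static embeddings concrete, here is a minimal sketch in plain Python: each token maps to a fixed vector, and a text embedding is the normalized mean of its token vectors, so no transformer forward pass is needed at inference time. The toy lookup table, the whitespace tokenizer and the `embed` helper are illustrative assumptions, not the Model2Vec implementation.

```python
import math

# Hypothetical toy lookup table: token -> fixed vector.
# A real static model stores one row per vocabulary entry.
TABLE = {
    "heart": [0.9, 0.1, 0.0],
    "attack": [0.7, 0.3, 0.1],
    "aspirin": [0.2, 0.8, 0.1],
}


def embed(text):
    """Mean-pool the static vectors of known tokens, then L2-normalize."""
    vectors = [TABLE[t] for t in text.lower().split() if t in TABLE]
    if not vectors:
        # No known tokens: return a zero vector
        return [0.0] * 3

    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


vector = embed("Heart attack")
print(vector)
```

Because the only work per text is a table lookup and an average, the cost grows with the number of tokens rather than with the size of a transformer.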
+ ## Usage (txtai)
+
+ This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
+
+ ```python
+ import txtai
+
+ # Create embeddings
+ embeddings = txtai.Embeddings(
+     path="neuml/pubmedbert-base-embeddings-2M",
+     content=True,
+ )
+
+ # documents() is a placeholder for an iterable of the texts to index
+ embeddings.index(documents())
+
+ # Run a query
+ embeddings.search("query to run")
+ ```
+
+ ## Usage (Sentence-Transformers)
+
+ Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.models import StaticEmbedding
+
+ # Initialize a StaticEmbedding module
+ static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-2M")
+ model = SentenceTransformer(modules=[static])
+
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ ## Usage (Model2Vec)
+
+ The model can also be used directly with Model2Vec.
+
+ ```python
+ from model2vec import StaticModel
+
+ # Load a pretrained Model2Vec model
+ model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-2M")
+
+ # Compute text embeddings
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ ## Evaluation Results
+
+ The table below compares this model against the models previously evaluated with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). The following datasets were used to evaluate model performance.
+
+ - [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
+   - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+ - [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+   - Split: test, Pair: (title, text)
+   - _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available, so a similar dataset is used here._
+ - [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
+   - Subset: pubmed, Split: validation, Pair: (article, abstract)
+
+ The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | ---------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 91.02 | 95.82 | 94.49 | 93.78 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.90 | 96.24 | 95.37 |
+ | [**pubmedbert-base-embeddings-2M**](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | **88.62** | **93.08** | **93.24** | **91.65** |
+ | [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.68 | 93.54 | 92.69 |
+
+ Although this model is not the top scoring model, it is competitive: its average score is about 4 points below the best model (91.65 vs. 95.62).
+
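As a rough illustration of the metric, the sketch below computes the Pearson correlation between two lists of scores in plain Python. How scores and labels are paired for each dataset is not spelled out here, so the `predicted` and `expected` values are made-up placeholders and `pearson` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n

    # Covariance numerator and the two standard deviation terms
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Placeholder values: model similarity scores vs. reference labels
predicted = [0.91, 0.15, 0.78, 0.30]
expected = [1.0, 0.0, 1.0, 0.0]

score = pearson(predicted, expected)
print(score)
```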
+ ## Runtime performance
+
+ As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.
+
+ ```python
+ from datasets import load_dataset
+ from tqdm import tqdm
+ from txtai import Embeddings
+
+ ds = load_dataset("ccdv/pubmed-summarization", split="train")
+
+ embeddings = Embeddings(path="path to model", content=True, backend="numpy")
+ embeddings.index(tqdm(ds["abstract"]))
+ ```
+
+ | Model | Params (M) | Index time (s) |
+ | ---------------------------------------------------------------------------------- | ---------- | -------------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 22 | 117 |
+ | BM25 | - | 18 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 109 | 518 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 109 | 523 |
+ | [**pubmedbert-base-embeddings-2M**](https://hf.co/neuml/pubmedbert-base-embeddings-2M) | **2** | **17** |
+ | [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 8 | 18 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 109 | 462 |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 109 | 465 |
+
+ Clearly, a static model's main upside is speed. If storage savings is the only concern, take a look at [PubMedBERT Embeddings Matryoshka](https://huggingface.co/NeuML/pubmedbert-base-embeddings-matryoshka). Both its 256 dimension and 64 dimension models score higher than this model. The tradeoff is that runtime performance is still as slow as the base model.
+
+ If runtime performance is the major concern, then a static model offers the best blend of accuracy and speed. Model2Vec models only need CPUs to run; no GPU is required. Note that this model takes about the same time to index as BM25 (17s vs. 18s), even though BM25 indexing is normally an order of magnitude faster than vector models.
+
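Index times like those in the table can be approximated by timing the `index` call directly. The helper below is a hypothetical sketch using only the standard library; how the reported times were actually measured is not stated, so treat this as one possible approach.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload. With txtai this could be: timed(embeddings.index, texts)
result, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.3f}s")
```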
+ ## Training
+
+ This model was trained using the [Tokenlearn](https://github.com/MinishLab/tokenlearn) library. First, the data was featurized with the following command.
+
+ ```bash
+ python -m tokenlearn.featurize --model-name "neuml/pubmedbert-base-embeddings" --dataset-path "training-articles" --output-dir "features"
+ ```
+
+ _Note that the same random sample of articles [described here](https://medium.com/neuml/embeddings-for-medical-literature-74dae6abf5e0) is used for the dataset `training-articles`._
+
+ From there, the following training script builds the model. The final model is weighted using [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) instead of the default SIF weighting method.
+
+ ```python
+ from pathlib import Path
+
+ import numpy as np
+
+ from model2vec import StaticModel
+ from more_itertools import batched
+ from sklearn.decomposition import PCA
+ from tokenlearn.train import collect_means_and_texts, train_model
+ from tqdm import tqdm
+ from txtai.scoring import ScoringFactory
+
+ def tokenweights(model, texts):
+     """Compute the mean BM25 term weight for each token id in the vocabulary."""
+     tokenizer = model.tokenizer
+
+     # Tokenize into dataset
+     dataset = []
+     for t in tqdm(batched(texts, 1024)):
+         encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
+         for e in encodings:
+             dataset.append((None, e.ids, None))
+
+     # Build scoring index
+     scoring = ScoringFactory.create({"method": "bm25", "terms": True})
+     scoring.index(dataset)
+
+     # Calculate mean value of weights array per token
+     tokens = np.zeros(tokenizer.get_vocab_size())
+     for token in scoring.idf:
+         tokens[token] = np.mean(scoring.terms.weights(token)[1])
+
+     return tokens
+
+ # Collect paths for training data
+ paths = sorted(Path("features").glob("*.json"))
+ texts, vectors = collect_means_and_texts(paths)
+
+ # Train the model
+ model = train_model("neuml/pubmedbert-base-embeddings", texts, vectors)
+
+ # Weight the model
+ weights = tokenweights(model, texts)
+
+ # Remove NaNs from embedding, if any
+ embedding = np.nan_to_num(model.embedding)
+
+ # Apply PCA
+ embedding = PCA(n_components=embedding.shape[1]).fit_transform(embedding)
+
+ # Apply weights (one scalar per token row)
+ embedding *= weights[:, None]
+
+ # Update model embedding and normalize
+ model.embedding, model.normalize = embedding, True
+
+ # Save model
+ model.save_pretrained("output path")
+ ```
+
+ The following table compares accuracy for each weighting method, using the 8M model variants.
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | -----------------------------------------------------| --------- | ------------- | -------------- | --------- |
+ | **pubmedbert-base-embeddings-8M-BM25** | **90.05** | **94.29** | **94.15** | **92.83** |
+ | pubmedbert-base-embeddings-8M-M2V (No training) | 69.84 | 70.77 | 71.30 | 70.64 |
+ | pubmedbert-base-embeddings-8M-SIF | 88.75 | 93.78 | 93.05 | 91.86 |
+
+ As we can see, the BM25 weighted model has the best results for the evaluated datasets.
+
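For intuition about the weighting step, the sketch below computes a mean Okapi BM25 weight per token id over a tiny tokenized corpus in plain Python. It uses the textbook BM25 formula with `k1=1.2` and `b=0.75`. These defaults, the `bm25_token_weights` helper and the toy corpus are assumptions for illustration and may differ from what txtai's scoring computes internally.

```python
import math
from collections import Counter


def bm25_token_weights(docs, vocab_size, k1=1.2, b=0.75):
    """Return the mean BM25 weight of each token id across the documents containing it."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency per token id
    df = Counter(t for d in docs for t in set(d))

    sums, counts = [0.0] * vocab_size, [0] * vocab_size
    for d in docs:
        for token, tf in Counter(d).items():
            idf = math.log(1 + (n - df[token] + 0.5) / (df[token] + 0.5))
            weight = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
            sums[token] += weight
            counts[token] += 1

    # Token ids never seen keep a weight of 0, like the np.zeros initialization in the training script
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Toy corpus of token ids: id 0 appears in every document, id 3 in only one
docs = [[0, 1, 2], [0, 1], [0, 3]]
weights = bm25_token_weights(docs, vocab_size=5)
print(weights)
```

Rare tokens end up with larger weights than tokens that appear everywhere, so multiplying each embedding row by its weight down-weights very common tokens when vectors are averaged.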
+ ## Acknowledgement
+
+ This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
+
+ Read more at the following links.
+
+ - [Model2Vec](https://github.com/MinishLab/model2vec)
+ - [Tokenlearn](https://github.com/MinishLab/tokenlearn)
+ - [Minish Lab Blog](https://minishlab.github.io/)
config.json ADDED
@@ -0,0 +1 @@
+ {"model_type": "model2vec", "architectures": ["StaticModel"], "tokenizer_name": "neuml/pubmedbert-base-embeddings", "apply_pca": 64, "apply_zipf": true, "hidden_dim": 64, "seq_length": 1000000, "normalize": true}
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0f5f1fd6dbb25e22edc1bd1cf8ffa5089690a33cb29ab49896eed80ef687d110
+ size 7813720
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff