davidmezzetti committed
Commit 69392bf · Parent: 938b8a9

Initial version

Files changed (4)
  1. README.md +237 -0
  2. config.json +1 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
README.md ADDED
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- static-embeddings
language: en
license: apache-2.0
---

# PubMedBERT Embeddings 1M

This is a pruned version of [PubMedBERT Embeddings 2M](https://huggingface.co/NeuML/pubmedbert-base-embeddings-2M). It prunes the vocabulary, keeping only the top 50% most frequently used tokens.

See [Extremely Small BERT Models from Mixed-Vocabulary Training](https://arxiv.org/abs/1909.11687) for background on pruning vocabularies to build smaller models.

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# Create embeddings
embeddings = txtai.Embeddings(
  path="neuml/pubmedbert-base-embeddings-1M",
  content=True,
)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
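
In the snippet above, `documents()` stands in for whatever iterable of records is being indexed. A minimal sketch of such a generator is shown below; the example abstracts are placeholders, and txtai also accepts plain strings instead of `(id, text, tags)` tuples.

```python
def documents():
    # Placeholder data: any iterable of (id, text, tags) tuples or plain strings works
    abstracts = [
        "Aspirin irreversibly inhibits cyclooxygenase.",
        "Metformin decreases hepatic glucose production.",
    ]
    for uid, text in enumerate(abstracts):
        yield (uid, text, None)
```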

## Usage (Sentence-Transformers)

Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-1M")
model = SentenceTransformer(modules=[static])

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```

## Usage (Model2Vec)

The model can also be used directly with Model2Vec.

```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Compute text embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```
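
Because the model is configured with `normalize: true` (see `config.json` below), the returned vectors should be unit length, so a plain dot product acts as cosine similarity. A small, self-contained sketch under that assumption:

```python
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])

# With normalized vectors, the dot product equals the cosine similarity
print(np.dot(embeddings[0], embeddings[1]))
```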

## Evaluation Results

The table below compares the performance of this model against the models previously compared with [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings#evaluation-results). The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
  - _Note: The previously used [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new) dataset is no longer available, but a similar dataset is used here_
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model                                                                                   | PubMed QA | PubMed Subset | PubMed Summary | Average   |
| --------------------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| pubmedbert-base-embeddings-8M-M2V (No training)                                         | 69.84     | 70.77         | 71.30          | 70.64     |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K)  | 74.56     | 84.65         | 81.84          | 80.35     |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K)  | 86.03     | 91.71         | 91.25          | 89.66     |
| [**pubmedbert-base-embeddings-1M**](https://hf.co/neuml/pubmedbert-base-embeddings-1M)  | **87.87** | **92.80**     | **92.87**      | **91.18** |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)      | 88.62     | 93.08         | 93.24          | 91.65     |

As we can see, the accuracy tradeoff is relatively minimal compared to the original model.
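
As a rough illustration of this style of evaluation, the sketch below encodes both sides of each text pair, scores them with cosine similarity and correlates the scores with gold labels via `scipy.stats.pearsonr`. The pairs and labels here are hypothetical placeholders, not the benchmark data; see the original [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) card for the actual evaluation setup.

```python
import numpy as np

from model2vec import StaticModel
from scipy.stats import pearsonr

# Hypothetical evaluation pairs and gold similarity labels
pairs = [
    ("Does aspirin reduce the risk of heart attacks?", "Aspirin lowers the risk of cardiovascular events."),
    ("Does aspirin reduce the risk of heart attacks?", "Metformin decreases hepatic glucose production."),
    ("What is the mechanism of metformin?", "Metformin decreases hepatic glucose production."),
]
labels = [1.0, 0.0, 1.0]

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Encode each side; with normalized vectors, the row-wise dot product is the cosine similarity
x = model.encode([a for a, _ in pairs])
y = model.encode([b for _, b in pairs])
scores = np.sum(x * y, axis=1)

# Pearson correlation between model similarities and gold labels
print(pearsonr(scores, labels))
```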

## Runtime performance

As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.

```python
from datasets import load_dataset
from tqdm import tqdm
from txtai import Embeddings

ds = load_dataset("ccdv/pubmed-summarization", split="train")

embeddings = Embeddings(path="path to model", content=True, backend="numpy")
embeddings.index(tqdm(ds["abstract"]))
```

| Model                                                                                   | Model Size (MB) | Index time (s) |
| --------------------------------------------------------------------------------------- | --------------- | -------------- |
| [pubmedbert-base-embeddings-100K](https://hf.co/neuml/pubmedbert-base-embeddings-100K)  | 0.2             | 19             |
| [pubmedbert-base-embeddings-500K](https://hf.co/neuml/pubmedbert-base-embeddings-500K)  | 1.0             | 17             |
| **[pubmedbert-base-embeddings-1M](https://hf.co/neuml/pubmedbert-base-embeddings-1M)**  | **2.0**         | 17             |
| [pubmedbert-base-embeddings-2M](https://hf.co/neuml/pubmedbert-base-embeddings-2M)      | 7.5             | 17             |

Vocabulary pruning doesn't change the runtime performance in this case, but the model is much smaller. Vectors are stored at `int16` precision. This can be beneficial to smaller/lower-powered embedded devices and could lead to faster vectorization times.
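
One quick way to see the effect of pruning and `int16` storage is to inspect the embedding matrix directly. This sketch assumes the `model.embedding` attribute used by the training script below; the exact dtype reported may depend on the Model2Vec version and how the weights are loaded.

```python
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-1M")

# Vocabulary size x dimensions, storage dtype and in-memory size
embedding = np.asarray(model.embedding)
print(embedding.shape, embedding.dtype)
print(f"{embedding.nbytes / 1024 / 1024:.1f} MB")
```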

## Training

This model was vocabulary pruned using the following script.

```python
import json
import os

from collections import Counter
from pathlib import Path

import numpy as np

from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import collect_means_and_texts
from tokenizers import Tokenizer
from tqdm import tqdm
from txtai.scoring import ScoringFactory

def tokenize(tokenizer):
    # Tokenize into dataset
    dataset = []
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            dataset.append((None, e.ids, None))

    return dataset

def tokenweights(tokenizer):
    dataset = tokenize(tokenizer)

    # Build scoring index
    scoring = ScoringFactory.create({"method": "bm25", "terms": True})
    scoring.index(dataset)

    # Calculate mean value of weights array per token
    tokens = np.zeros(tokenizer.get_vocab_size())
    for x in scoring.idf:
        tokens[x] = np.mean(scoring.terms.weights(x)[1])

    return tokens

# See PubMedBERT Embeddings 2M model for details on this data
features = "features"
paths = sorted(Path(features).glob("*.json"))
texts, _ = collect_means_and_texts(paths)

# Output model parameters
output = "output path"
params, dims = 1000000, 64

path = "pubmedbert-base-embeddings-2M_unweighted"
model = StaticModel.from_pretrained(path)

os.makedirs(output, exist_ok=True)

with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Calculate number of tokens to keep
tokencount = params // model.dim

# Calculate term frequency
freqs = Counter()
for _, ids, _ in tokenize(model.tokenizer):
    freqs.update(ids)

# Select top N most common tokens
uids = set(x for x, _ in freqs.most_common(tokencount))
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")]

# Get embeddings for uids
model.embedding = model.embedding[uids]

# Select pruned tokens
pairs, index = [], 0
for token, uid in config["model"]["vocab"].items():
    if uid in uids:
        pairs.append((token, index))
        index += 1

config["model"]["vocab"] = dict(pairs)

# Write new tokenizer
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json")

# Re-weight tokens
weights = tokenweights(model.tokenizer)

# Remove NaNs from embedding, if any
embedding = np.nan_to_num(model.embedding)

# Apply PCA
embedding = PCA(n_components=dims).fit_transform(embedding)

# Apply weights
embedding *= weights[:, None]

# Update model embedding and normalize
model.embedding, model.normalize = embedding.astype(np.int16), True

model.save_pretrained(output)
```
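
With `params = 1000000` and `dims = 64`, the script keeps `1000000 // 64 = 15625` of the most frequent tokens plus the special tokens starting with `[`, for roughly one million embedding parameters, matching the 1M in the model name. A quick sanity check of the pruned output, assuming the placeholder `output` path from the script above:

```python
from model2vec import StaticModel

# Load the pruned model written by the script above ("output path" is a placeholder)
model = StaticModel.from_pretrained("output path")

# Embedding rows should match the pruned tokenizer vocabulary, with 64 dimensions each
print(model.embedding.shape)
print(model.tokenizer.get_vocab_size())
```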

## Acknowledgement

This model is built on the great work from the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

Read more at the following links.

- [Model2Vec](https://github.com/MinishLab/model2vec)
- [Tokenlearn](https://github.com/MinishLab/tokenlearn)
- [Minish Lab Blog](https://minishlab.github.io/)

config.json ADDED
{"model_type": "model2vec", "architectures": ["StaticModel"], "tokenizer_name": "neuml/pubmedbert-base-embeddings", "apply_pca": 64, "apply_zipf": true, "hidden_dim": 64, "seq_length": 1000000, "normalize": true}

model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c24936ae1ff2217a8ec58846ed8e086001091359e253d8f260f43584cf4e54bc
size 2000472

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff