sentence-transformers/multi-qa-mpnet-base-cos-v1

@@ -7,6 +7,7 @@ tags:
 - feature-extraction
 - sentence-similarity
 - transformers
 pipeline_tag: sentence-similarity
 ---
@@ -58,14 +59,14 @@ from transformers import AutoTokenizer, AutoModel
 import torch
 import torch.nn.functional as F
-#Mean Pooling - Take average of all tokens
 def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-#Encode text
 def encode(texts):
     # Tokenize sentences
     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
@@ -91,24 +92,50 @@ docs = ["Around 9 Million people live in London", "London is known for its finan
 tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
 model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
-#Encode query and docs
 query_emb = encode(query)
 doc_emb = encode(docs)
-#Compute dot score between query and all document embeddings
 scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
-#Combine docs & scores
 doc_score_pairs = list(zip(docs, scores))
-#Sort by decreasing score
 doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
-#Output passages & scores
 for doc, score in doc_score_pairs:
     print(score, doc)
 ```
 ## Technical Details
 In the following some technical details how this model must be used:
@@ -124,25 +151,22 @@ Note: When loaded with `sentence-transformers`, this model produces normalized e
 ----
 ## Background
 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
 contrastive learning objective. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
-We developped this model during the
 [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
-organized by Hugging Face. We developped this model as part of the project:
-[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Googles Flax, JAX, and Cloud team member about efficient deep learning frameworks.
 ## Intended uses
-Our model is intented to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.
 Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
 ## Training procedure
 The full training script is accessible in this current repository: `train_script.py`.
@@ -153,14 +177,12 @@ We use the pretrained [`mpnet-base`](https://huggingface.co/microsoft/mpnet-base
 #### Training
-We use the concatenation from multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs.
 We sampled each dataset given a weighted probability which configuration is detailed in the `data_config.json` file.
 The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.
 | Dataset                    | Number of training tuples  |
 |--------------------------------------------------------|:--------------------------:|
 | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers |  77,427,422 |
@@ -169,7 +191,7 @@ The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/
 | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges  |  21,396,559 |
 | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine |  17,579,773 |
 | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet  | 3,012,496 |
-| [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |

 - feature-extraction
 - sentence-similarity
 - transformers
+- text-embeddings-inference
 pipeline_tag: sentence-similarity
 ---
 import torch
 import torch.nn.functional as F
+# Mean Pooling - Take average of all tokens
 def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output.last_hidden_state # First element of model_output contains all token embeddings
     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+# Encode text
 def encode(texts):
     # Tokenize sentences
     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
 tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
 model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
+# Encode query and docs
 query_emb = encode(query)
 doc_emb = encode(docs)
+# Compute dot score between query and all document embeddings
 scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+# Combine docs & scores
 doc_score_pairs = list(zip(docs, scores))
+# Sort by decreasing score
 doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+# Output passages & scores
 for doc, score in doc_score_pairs:
     print(score, doc)
 ```
+## Usage (Text Embeddings Inference (TEI))
+[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing fast inference solution for text embedding models.
+- CPU:
+```bash
+docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id sentence-transformers/multi-qa-mpnet-base-cos-v1 --pooling mean --dtype float16
+```
+- NVIDIA GPU:
+```bash
+docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id sentence-transformers/multi-qa-mpnet-base-cos-v1 --pooling mean --dtype float16
+```
+Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
+```bash
+curl http://localhost:8080/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "sentence-transformers/multi-qa-mpnet-base-cos-v1",
+    "input": "How many people live in London?"
+  }'
+```
+Or check the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) instead.
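The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes a TEI container from the commands above is listening on `localhost:8080` and that the response follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding`). The helper names `build_payload`, `parse_embeddings`, and `embed` are hypothetical and not part of TEI.

```python
import json
import urllib.request
from typing import List, Sequence

# Assumption: host and port match the docker commands above.
TEI_URL = "http://localhost:8080/v1/embeddings"
MODEL_ID = "sentence-transformers/multi-qa-mpnet-base-cos-v1"


def build_payload(texts: Sequence[str]) -> bytes:
    """Serialize an OpenAI-style embeddings request body."""
    return json.dumps({"model": MODEL_ID, "input": list(texts)}).encode("utf-8")


def parse_embeddings(body: dict) -> List[List[float]]:
    """Extract the vectors from an OpenAI-style response, ordered by `index`."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: Sequence[str], url: str = TEI_URL) -> List[List[float]]:
    """POST the texts to a running TEI server and return one vector per text."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_embeddings(json.load(response))
```

With a container running, `embed(["How many people live in London?"])` would return a list holding one vector for the query.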
 ## Technical Details
 In the following some technical details how this model must be used:
 ----
 ## Background
 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
 contrastive learning objective. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
+We developed this model during the
 [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
+organized by Hugging Face. We developed this model as part of the project:
+[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks.
 ## Intended uses
+Our model is intended to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.
 Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
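One way to act on these limits is to check the word-piece count before encoding. The sketch below is illustrative only: `length_status` is a hypothetical helper, `tokenize` stands for any callable that returns the word pieces of a text (for example the `tokenize` method of the tokenizer loaded above), and special tokens added by the tokenizer are ignored, so the counts are approximate.

```python
from typing import Callable, List

MAX_INPUT_PIECES = 512  # limit stated above; longer input is truncated
TRAINED_UP_TO = 250     # training inputs were at most this long


def length_status(text: str, tokenize: Callable[[str], List[str]]) -> str:
    """Classify a text by its approximate word-piece count."""
    n = len(tokenize(text))
    if n > MAX_INPUT_PIECES:
        return "truncated"
    if n > TRAINED_UP_TO:
        return "beyond-training-length"
    return "ok"
```

Texts flagged as `beyond-training-length` still fit into the model, but quality may degrade; splitting them into shorter passages before encoding is one possible response.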
 ## Training procedure
 The full training script is accessible in this current repository: `train_script.py`.
 #### Training
+We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs.
 We sampled each dataset given a weighted probability which configuration is detailed in the `data_config.json` file.
 The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.
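To make the objective concrete, here is a dependency-free sketch of an in-batch ranking loss with cosine similarity and a scale of 20. It is an illustration of the idea, not the library's implementation; the function names are hypothetical, and each `answers[i]` is assumed to be the positive for `queries[i]` while the other answers in the batch act as negatives.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(
    queries: List[Vector], answers: List[Vector], scale: float = 20.0
) -> float:
    """Mean cross-entropy of picking answers[i] for queries[i] among all answers."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, answer) for answer in answers]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)
```

The loss is small when each query is most similar to its own answer and grows when another answer in the batch scores higher, which is what pushes paired texts together in the embedding space.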
 | Dataset                    | Number of training tuples  |
 |--------------------------------------------------------|:--------------------------:|
 | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers |  77,427,422 |
 | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges  |  21,396,559 |
 | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine |  17,579,773 |
 | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet  | 3,012,496 |
+| [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839 |
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
 | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
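
The weighted sampling mentioned above can be sketched as follows. The real weights are defined in `data_config.json`, whose contents are not shown here, so this example uses a few tuple counts from the table purely as stand-in weights; the names `ILLUSTRATIVE_WEIGHTS` and `sample_dataset_names` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

# Stand-in weights: tuple counts copied from the table above.
# Assumption: the actual weights in data_config.json may differ.
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "WikiAnswers": 77_427_422,
    "Stack Exchange (Title, Answer)": 21_396_559,
    "MS MARCO": 17_579_773,
    "GOOAQ": 3_012_496,
}


def sample_dataset_names(
    weights: Dict[str, float], num_draws: int, seed: int = 0
) -> List[str]:
    """Draw a source dataset name for each of `num_draws` draws."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)


draws = sample_dataset_names(ILLUSTRATIVE_WEIGHTS, num_draws=1000)
print(Counter(draws).most_common())
```

Datasets with larger weights are drawn more often, so the mix seen during training is controlled by the configuration rather than by the order of the data.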