Add `text-embeddings-inference` tag & snippet

#8
by alvarobartt HF Staff - opened
Files changed (1)
  1. README.md +41 -19
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
7
  - feature-extraction
8
  - sentence-similarity
9
  - transformers
 
10
  pipeline_tag: sentence-similarity
11
  ---
12
 
@@ -58,14 +59,14 @@ from transformers import AutoTokenizer, AutoModel
58
  import torch
59
  import torch.nn.functional as F
60
 
61
- #Mean Pooling - Take average of all tokens
62
  def mean_pooling(model_output, attention_mask):
63
- token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
64
  input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
65
  return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
66
 
67
 
68
- #Encode text
69
  def encode(texts):
70
  # Tokenize sentences
71
  encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
@@ -91,24 +92,50 @@ docs = ["Around 9 Million people live in London", "London is known for its finan
91
  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
92
  model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
93
 
94
- #Encode query and docs
95
  query_emb = encode(query)
96
  doc_emb = encode(docs)
97
 
98
- #Compute dot score between query and all document embeddings
99
  scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
100
 
101
- #Combine docs & scores
102
  doc_score_pairs = list(zip(docs, scores))
103
 
104
- #Sort by decreasing score
105
  doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
106
 
107
- #Output passages & scores
108
  for doc, score in doc_score_pairs:
109
  print(score, doc)
110
  ```
111
 
112
  ## Technical Details
113
 
114
  In the following some technical details how this model must be used:
@@ -124,25 +151,22 @@ Note: When loaded with `sentence-transformers`, this model produces normalized e
124
 
125
  ----
126
 
127
-
128
  ## Background
129
 
130
  The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
131
  contrastive learning objective. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
132
 
133
- We developped this model during the
134
  [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
135
- organized by Hugging Face. We developped this model as part of the project:
136
- [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Googles Flax, JAX, and Cloud team member about efficient deep learning frameworks.
137
 
138
  ## Intended uses
139
 
140
- Our model is intented to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.
141
 
142
  Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
143
 
144
-
145
-
146
  ## Training procedure
147
 
148
  The full training script is accessible in this current repository: `train_script.py`.
@@ -153,14 +177,12 @@ We use the pretrained [`mpnet-base`](https://huggingface.co/microsoft/mpnet-base
153
 
154
  #### Training
155
 
156
- We use the concatenation from multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs.
157
  We sampled each dataset given a weighted probability which configuration is detailed in the `data_config.json` file.
158
 
159
  The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.
160
 
161
 
162
-
163
-
164
  | Dataset | Number of training tuples |
165
  |--------------------------------------------------------|:--------------------------:|
166
  | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
@@ -169,7 +191,7 @@ The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/
169
  | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
170
  | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
171
  | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
172
- | [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839
173
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
174
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
175
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
 
7
  - feature-extraction
8
  - sentence-similarity
9
  - transformers
10
+ - text-embeddings-inference
11
  pipeline_tag: sentence-similarity
12
  ---
13
 
 
59
  import torch
60
  import torch.nn.functional as F
61
 
62
+ # Mean Pooling - Take average of all tokens
63
  def mean_pooling(model_output, attention_mask):
64
+ token_embeddings = model_output.last_hidden_state # First element of model_output contains all token embeddings
65
  input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
66
  return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
67
 
68
 
69
+ # Encode text
70
  def encode(texts):
71
  # Tokenize sentences
72
  encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
 
92
  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
93
  model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-cos-v1")
94
 
95
+ # Encode query and docs
96
  query_emb = encode(query)
97
  doc_emb = encode(docs)
98
 
99
+ # Compute dot score between query and all document embeddings
100
  scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
101
 
102
+ # Combine docs & scores
103
  doc_score_pairs = list(zip(docs, scores))
104
 
105
+ # Sort by decreasing score
106
  doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
107
 
108
+ # Output passages & scores
109
  for doc, score in doc_score_pairs:
110
  print(score, doc)
111
  ```
112
 
113
+ ## Usage (Text Embeddings Inference (TEI))
114
+
115
+ [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing fast inference solution for text embedding models.
116
+
117
+ - CPU:
118
+ ```bash
119
+ docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id sentence-transformers/multi-qa-mpnet-base-cos-v1 --pooling mean --dtype float16
120
+ ```
121
+
122
+ - NVIDIA GPU:
123
+ ```bash
124
+ docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id sentence-transformers/multi-qa-mpnet-base-cos-v1 --pooling mean --dtype float16
125
+ ```
126
+
127
+ Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
128
+ ```bash
129
+ curl http://localhost:8080/v1/embeddings \
130
+ -H "Content-Type: application/json" \
131
+ -d '{
132
+ "model": "sentence-transformers/multi-qa-mpnet-base-cos-v1",
133
+ "input": "How many people live in London?"
134
+ }'
135
+ ```
136
+
137
+ Or check the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) instead.
138
+
139
  ## Technical Details
140
 
141
  In the following some technical details how this model must be used:
 
151
 
152
  ----
153
 
 
154
  ## Background
155
 
156
  The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
157
  contrastive learning objective. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
158
 
159
+ We developed this model during the
160
  [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
161
+ organized by Hugging Face. We developed this model as part of the project:
162
+ [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks.
163
 
164
  ## Intended uses
165
 
166
+ Our model is intended to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.
167
 
168
  Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
169
 
 
 
170
  ## Training procedure
171
 
172
  The full training script is accessible in this current repository: `train_script.py`.
 
177
 
178
  #### Training
179
 
180
+ We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs.
181
  We sampled each dataset given a weighted probability which configuration is detailed in the `data_config.json` file.
182
 
183
  The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.
184
 
185
 
 
 
186
  | Dataset | Number of training tuples |
187
  |--------------------------------------------------------|:--------------------------:|
188
  | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
 
191
  | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
192
  | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
193
  | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
194
+ | [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839 |
195
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
196
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
197
  | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
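
The `curl` call to `/v1/embeddings` added in this PR can also be issued from Python. The following is a minimal sketch, not part of the PR itself. It assumes a TEI container started with one of the `docker run` commands above is listening on `localhost:8080`, and that the response follows the OpenAI embeddings layout (a `data` list whose items carry `index` and `embedding`). The helper names `build_request`, `parse_embeddings`, `embed`, and `rank` are hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Assumption: TEI is reachable here, as in the docker commands above.
TEI_URL = "http://localhost:8080/v1/embeddings"
MODEL_ID = "sentence-transformers/multi-qa-mpnet-base-cos-v1"


def build_request(texts, url=TEI_URL, model=MODEL_ID):
    """Build a POST request that mirrors the curl example (hypothetical helper)."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(body):
    """Return the embedding vectors from an OpenAI-style response, ordered by index."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts):
    """Send the texts to the server and return one embedding per input text."""
    with urllib.request.urlopen(build_request(texts), timeout=30) as response:
        return parse_embeddings(response.read())


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query_vec, doc_vecs, docs):
    """Pair each document with its dot score against the query, best first."""
    scored = [(dot(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With a server running, `embed(["How many people live in London?"])` should return a list containing one vector, and `rank` reproduces the sort-by-dot-score step of the `transformers` snippet on the vectors returned by the server.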