Sentence Transformers
We are forking sentence-transformers/all-MiniLM-L6-v2 because it is well suited to our target dataset and use case. For more details, please see the pre-trained model weight repository.
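For reference, a minimal usage sketch, assuming the sentence-transformers library is installed (the example sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained checkpoint from the Hugging Face Hub.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Map sentences to 384-dimensional dense vectors.
embeddings = model.encode([
    "This is an example sentence.",
    "Each sentence is mapped to a dense vector.",
])
print(embeddings.shape)  # (2, 384)
```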
Fine-tuning
- Fine-tune the model using a contrastive objective.
- Compute the cosine similarity between every possible sentence pair in the batch.
- Apply the cross-entropy loss by comparing against the true pairs (see the sketch after this list).
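A minimal PyTorch sketch of this in-batch objective (not the actual training code; the `scale` factor is an assumption, mirroring the default temperature in sentence-transformers' MultipleNegativesRankingLoss):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(
    anchor_emb: torch.Tensor,      # (batch, dim) embeddings of first sentences
    positive_emb: torch.Tensor,    # (batch, dim) embeddings of paired sentences
    scale: float = 20.0,           # assumed temperature on the cosine scores
) -> torch.Tensor:
    # Cosine similarity between every anchor and every candidate in the batch.
    sim = F.normalize(anchor_emb, dim=-1) @ F.normalize(positive_emb, dim=-1).T
    # The true pair for anchor i is candidate i, so the cross-entropy
    # labels are simply 0..batch_size-1 (the diagonal of the matrix).
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim * scale, labels)
```

Every other sentence in the batch thus serves as an implicit negative for each anchor, which is why large batch sizes help this objective.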
Hyperparameters
- Trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
- Used a learning-rate warm-up of 500 steps.
- Limited the sequence length to 128 tokens.
- Used the AdamW optimizer with a 2e-5 learning rate.
- The full training script is available in this repository: train_script.py. A hedged sketch of an equivalent setup is shown below.
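The sketch below approximates this configuration with the sentence-transformers training API. It is not the actual train_script.py; the base checkpoint name and the placeholder training pair are assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint; the real script may start from a different model.
model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")
model.max_seq_length = 128  # sequence length limited to 128 tokens

# Placeholder data: one (anchor, positive) pair per InputExample.
train_examples = [InputExample(texts=["example query", "matching passage"])]
train_dataloader = DataLoader(train_examples, batch_size=1024, shuffle=True)

# Implements the in-batch contrastive objective described above
# (cosine similarity + cross entropy against the true pairs).
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    steps_per_epoch=100_000,        # ~100k steps, folded into one "epoch" here
    warmup_steps=500,               # learning-rate warm-up of 500 steps
    optimizer_params={"lr": 2e-5},  # AdamW is model.fit's default optimizer
)
```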
Performance
| Model Name | Performance Sentence Embeddings (14 Datasets) | Performance Semantic Search (6 Datasets) | Avg. Performance | Speed (sentences/sec) | Model Size |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 69.57 | 57.02 | 63.30 | 2800 | 420 MB |
| multi-qa-mpnet-base-dot-v1 | 66.76 | 57.60 | 62.18 | 2800 | 420 MB |
| all-distilroberta-v1 | 68.73 | 50.94 | 59.84 | 4000 | 290 MB |
| all-MiniLM-L12-v2 | 68.70 | 50.82 | 59.76 | 7500 | 120 MB |
| multi-qa-distilbert-cos-v1 | 65.98 | 52.83 | 59.41 | 4000 | 250 MB |
| all-MiniLM-L6-v2 (This model) | 68.06 | 49.54 | 58.80 | 14200 | 80 MB |
| multi-qa-MiniLM-L6-cos-v1 | 64.33 | 51.83 | 58.08 | 14200 | 80 MB |
| paraphrase-multilingual-mpnet-base-v2 | 65.83 | 41.68 | 53.75 | 2500 | 970 MB |
| paraphrase-albert-small-v2 | 64.46 | 40.04 | 52.25 | 5000 | 43 MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 64.25 | 39.19 | 51.72 | 7500 | 420 MB |
| paraphrase-MiniLM-L3-v2 | 62.29 | 39.19 | 50.74 | 19000 | 61 MB |
| distiluse-base-multilingual-cased-v1 | 61.30 | 29.87 | 45.59 | 4000 | 480 MB |
| distiluse-base-multilingual-cased-v2 | 60.18 | 27.35 | 43.77 | 4000 | 480 MB |
Datasets
| Dataset | Paper | Number of training tuples |
|---|---|---|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | - | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | - | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | - | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |