---
license: mit
---

This model was first pretrained on the BEIR corpus and then fine-tuned on the MS MARCO dataset, following the approach described in the paper **COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning**. The associated GitHub repository is available at https://github.com/OpenMatch/COCO-DR.

This model uses BERT-base as its backbone, with 110M parameters. See the paper (https://arxiv.org/abs/2210.15212) for details.

## Usage

Pre-trained models can be loaded through the HuggingFace transformers library:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco")
```

Embeddings for a set of sentences can then be obtained as follows:

```python
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Take the embedding of the [CLS] token after the final layer.
embeddings = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, :1].squeeze(1)
```
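
The slicing expression used to pick out the [CLS] embedding can be easier to follow on a small tensor. The sketch below is an illustration only: `hidden` is a hand-made stand-in for the final-layer hidden states (hypothetical values, not model output), and it shows that `[:, :1].squeeze(1)` selects the same vectors as indexing position 0 directly.

```python
import torch

# Stand-in for final-layer hidden states with shape
# (batch=2, seq_len=4, hidden_size=3). Hypothetical values, not model output.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)

# Slicing form: keep the first token, then drop the length-1 sequence axis.
cls_sliced = hidden[:, :1].squeeze(1)

# Direct index of the first token in every sequence.
cls_indexed = hidden[:, 0]

print(cls_indexed.shape)                     # torch.Size([2, 3])
print(torch.equal(cls_sliced, cls_indexed))  # True
```

Both forms yield one vector per input sentence, which is what the dot-product scoring expects.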

Similarity scores between the sentences are then computed as the dot product of their embeddings:

```python
score01 = embeddings[0] @ embeddings[1]  # 216.9792
score02 = embeddings[0] @ embeddings[2]  # 216.6684
```
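
In a retrieval setting, the same dot product is typically used to rank several passages against one query. The following is a minimal sketch of that step and is not taken from the COCO-DR repository: `rank_passages` is a hypothetical helper, and the tensors are small hand-written stand-ins rather than real model embeddings.

```python
import torch


def rank_passages(query_embedding, passage_embeddings):
    """Sort passages by descending dot-product score against one query.

    Hypothetical helper for illustration only.
    Returns (passage indices, matching scores) as Python lists.
    """
    # (num_passages, dim) @ (dim,) -> (num_passages,)
    scores = passage_embeddings @ query_embedding
    order = torch.argsort(scores, descending=True)
    return order.tolist(), scores[order].tolist()


# Stand-in embeddings (assumed values, not model output).
query = torch.tensor([1.0, 0.0, 2.0])
passages = torch.tensor([
    [0.5, 1.0, 0.0],  # dot product with query: 0.5
    [1.0, 0.0, 1.0],  # dot product with query: 3.0
    [0.0, 2.0, 0.5],  # dot product with query: 1.0
])

order, scores = rank_passages(query, passages)
print(order)   # [1, 2, 0]
print(scores)  # [3.0, 1.0, 0.5]
```

With real embeddings, `query` would correspond to `embeddings[0]` and `passages` to `embeddings[1:]` from the example above.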