This model was first pretrained on the BEIR corpus and then fine-tuned on MS MARCO.
This model is trained with BERT-base as the backbone and has 110M parameters.

## Usage

Pre-trained models can be loaded through the HuggingFace transformers library:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco")
```

Then embeddings for different sentences can be obtained as follows:

```python
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, :1].squeeze(1)  # the embedding of the [CLS] token after the final layer
```

Then similarity scores between the different sentences are obtained with a dot product between the embeddings:

```python
score01 = embeddings[0] @ embeddings[1]  # 216.9792
score02 = embeddings[0] @ embeddings[2]  # 216.6684
```
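For retrieval, these dot-product scores can be used directly to rank candidate passages against a query embedding. The sketch below illustrates that ranking step with NumPy and toy vectors standing in for real model outputs; the `rank_by_dot_product` helper is illustrative and not part of this model card (real embeddings from the snippet above could be converted with `embeddings.detach().numpy()`):

```python
import numpy as np

def rank_by_dot_product(query_emb, passage_embs):
    # Hypothetical helper: score each passage by its dot product with the
    # query embedding, then return indices from most to least similar.
    scores = passage_embs @ query_emb
    return np.argsort(-scores)

# Toy vectors standing in for real model embeddings.
query = np.array([1.0, 0.0, 1.0])
passages = np.array([
    [0.9, 0.1, 0.8],   # near the query direction
    [-1.0, 0.5, 0.0],  # far from the query direction
])

order = rank_by_dot_product(query, passages)
print(order.tolist())  # [0, 1]: passage 0 is the better match
```

With the example sentences above, the same idea applies: the Marie Curie biography (score 216.9792) would be ranked ahead of the Pierre Curie distractor (score 216.6684).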