Create README.md
README.md
ADDED
@@ -0,0 +1,111 @@
This model is the context encoder of the Wiki BM25 Lexical Model (Λ) from the SPAR paper:

[Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?](https://arxiv.org/abs/2110.06918)
<br>
Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih
<br>
**Meta AI**

The associated GitHub repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar

This model is a BERT-base sized dense retriever trained on Wikipedia articles to imitate the behavior of BM25.
The following models are also available:
Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path
|---|---|---|---|---|---
Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder
PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder
MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder
MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder
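
The encoder paths listed in the table appear to follow one naming pattern, `facebook/spar-<corpus>-<teacher>-lexmodel-<role>-encoder`. As a convenience, a small helper along the following lines could build both paths for a given row. The function name `lexmodel_paths`, its arguments, and the `AVAILABLE` set are hypothetical, and the pattern is inferred only from the four rows shown here.

```python
from typing import Tuple

# Hypothetical helper: builds model paths from the naming pattern observed
# in the table. Only the four (corpus, teacher) pairs listed there are accepted.
AVAILABLE = {
    ("wiki", "bm25"),
    ("paq", "bm25"),
    ("marco", "bm25"),
    ("marco", "unicoil"),
}


def lexmodel_paths(corpus: str, teacher: str) -> Tuple[str, str]:
    """Return (query_encoder_path, context_encoder_path) for one table row."""
    key = (corpus.lower(), teacher.lower())
    if key not in AVAILABLE:
        raise ValueError(
            f"No Λ model listed for corpus={corpus!r}, teacher={teacher!r}"
        )
    prefix = f"facebook/spar-{key[0]}-{key[1]}-lexmodel"
    return f"{prefix}-query-encoder", f"{prefix}-context-encoder"


query_path, context_path = lexmodel_paths("wiki", "bm25")
print(query_path)    # facebook/spar-wiki-bm25-lexmodel-query-encoder
print(context_path)  # facebook/spar-wiki-bm25-lexmodel-context-encoder
```

The returned strings can be passed to `AutoModel.from_pretrained` in the same way as the literal paths used elsewhere in this README.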

# Using the Lexical Model (Λ) Alone

This model should be used together with the associated query encoder, similar to the [DPR](https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/dpr) model.

```
import torch
from transformers import AutoTokenizer, AutoModel

# The tokenizer is the same for the query and context encoder
tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Apply tokenizer
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings: take the last-layer hidden state of the [CLS] token
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]

# Compute similarity scores using dot product
score1 = query_emb @ ctx_emb[0]  # 341.3268
score2 = query_emb @ ctx_emb[1]  # 340.1626
```
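
The example above scores each context separately. With more than a handful of contexts it is usually more convenient to rank them all at once. The sketch below does this in plain Python on made-up 3-dimensional vectors, so it runs without downloading any model. `rank_by_dot_product` is a hypothetical helper and not part of the SPAR codebase.

```python
from typing import List, Sequence, Tuple


def rank_by_dot_product(
    query_vec: Sequence[float],
    ctx_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (context_index, score) pairs sorted by descending dot product."""
    scores = [
        sum(q * c for q, c in zip(query_vec, ctx_vec))
        for ctx_vec in ctx_vecs
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only (a BERT-base encoder produces
# 768-dimensional embeddings).
toy_query = [1.0, 2.0, 0.0]
toy_contexts = [
    [0.5, 0.5, 1.0],  # score = 0.5 + 1.0 + 0.0 = 1.5
    [2.0, 1.0, 3.0],  # score = 2.0 + 2.0 + 0.0 = 4.0
]

ranking = rank_by_dot_product(toy_query, toy_contexts)
print(ranking)  # [(1, 4.0), (0, 1.5)]
```

With real embeddings, one option would be to pass `query_emb[0].detach().tolist()` and `ctx_emb.detach().tolist()` to the same helper.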

# Using the Lexical Model (Λ) with a Base Dense Retriever as in SPAR
Because Λ learns lexical matching from a sparse teacher retriever, it can be combined with a standard dense retriever (e.g. [DPR](https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/dpr#dpr), [Contriever](https://huggingface.co/facebook/contriever-msmarco)) to build a dense retriever that excels at both lexical and semantic matching.

The following example shows how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 Λ.
```
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# DPR model
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

# Wiki BM25 Λ model
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output
dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output

# Compute Λ embeddings: take the last-layer hidden state of the [CLS] token
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**lexmodel_query_input).last_hidden_state[:, 0, :]
lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**lexmodel_ctx_input).last_hidden_state[:, 0, :]

# Form SPAR embeddings via concatenation

# The concatenation weight is only applied to query embeddings
# Refer to the SPAR paper for details
concat_weight = 0.7

spar_query_emb = torch.cat(
    [dpr_query_emb, concat_weight * lexmodel_query_emb],
    dim=-1,
)
spar_ctx_emb = torch.cat(
    [dpr_ctx_emb, lexmodel_ctx_emb],
    dim=-1,
)

# Compute similarity scores
score1 = spar_query_emb @ spar_ctx_emb[0]  # 317.6931
score2 = spar_query_emb @ spar_ctx_emb[1]  # 314.6144
```
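
Because the dot product of two concatenated vectors is the sum of the dot products of their parts, the SPAR score computed above equals the DPR score plus `concat_weight` times the Λ score. The plain-Python sketch below checks this identity on made-up 2-dimensional vectors. The helper `dot`, the variable names, and all values are illustrative and not taken from the SPAR codebase.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


concat_weight = 0.5  # illustrative value; the example above uses 0.7

# Toy stand-ins for the DPR and Λ embeddings of one query and one context.
dpr_q, lex_q = [1.0, 2.0], [4.0, 0.0]
dpr_c, lex_c = [3.0, 1.0], [2.0, 5.0]

# Concatenate, applying the weight to the query-side Λ embedding only.
spar_q = dpr_q + [concat_weight * x for x in lex_q]
spar_c = dpr_c + lex_c

concat_score = dot(spar_q, spar_c)
summed_score = dot(dpr_q, dpr_c) + concat_weight * dot(lex_q, lex_c)

print(concat_score, summed_score)  # 9.0 9.0
```

This also illustrates why applying the weight on the query side is sufficient: scaling the Λ part of either vector scales the Λ term of the score by the same factor.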