---
tags:
- feature-extraction
pipeline_tag: feature-extraction
---

This model is the context encoder of the PAQ BM25 Lexical Model (Λ) from the SPAR paper:

[Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?](https://arxiv.org/abs/2110.06918)
<br>
Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih
<br>
**Meta AI**

The associated GitHub repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar

This model is a BERT-base sized dense retriever trained using PAQ questions as queries to imitate the behavior of BM25.
The following models are also available:

| Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
|---|---|---|---|---|---|
| Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
| PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
| MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
| MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |

# Using the Lexical Model (Λ) Alone

Use this model together with its associated query encoder, in the same way as the [DPR](https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/dpr) model.

```
import torch
from transformers import AutoTokenizer, AutoModel

# This example loads the Wiki BM25 Λ checkpoints; for this model, substitute
# the PAQ paths from the table above (expect different scores).
# The tokenizer is the same for the query and context encoder
tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Apply tokenizer
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings: take the last-layer hidden state of the [CLS] token
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]

# Compute similarity scores using dot product
score1 = query_emb @ ctx_emb[0]  # 341.3268
score2 = query_emb @ ctx_emb[1]  # 340.1626
```
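
With more than two contexts, the per-context dot products above are usually turned into a ranking. The following is a minimal sketch of that step, not code from the SPAR repo: `dot` and `rank_contexts` are hypothetical helper names, and the toy 3-dimensional vectors are made up so that the snippet runs without downloading any checkpoint.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_contexts(query_emb, ctx_embs):
    """Return (context_index, score) pairs sorted by descending dot-product score."""
    scores = [dot(query_emb, ctx) for ctx in ctx_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy vectors for illustration only (assumption: not real model outputs)
toy_query = [1.0, 0.0, 2.0]
toy_contexts = [
    [0.5, 1.0, 0.0],  # score 0.5
    [1.0, 0.0, 1.0],  # score 3.0
    [0.0, 2.0, 0.5],  # score 1.0
]

ranking = rank_contexts(toy_query, toy_contexts)
print(ranking)  # [(1, 3.0), (2, 1.0), (0, 0.5)]
```

With the real tensors from the example above, the same ordering should be obtainable with `torch.argsort(query_emb @ ctx_emb.T, descending=True)`.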

# Using the Lexical Model (Λ) with a Base Dense Retriever as in SPAR

Because Λ learns lexical matching from a sparse teacher retriever, it can be combined with a standard dense retriever (e.g., [DPR](https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/dpr#dpr), [Contriever](https://huggingface.co/facebook/contriever-msmarco)) to build a dense retriever that excels at both lexical and semantic matching.

The following example shows how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 Λ.

```
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# DPR model
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

# Wiki BM25 Λ model
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output
dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output

# Compute Λ embeddings: take the last-layer hidden state of the [CLS] token
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**lexmodel_query_input).last_hidden_state[:, 0, :]
lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**lexmodel_ctx_input).last_hidden_state[:, 0, :]

# Form SPAR embeddings via concatenation

# The concatenation weight is only applied to query embeddings
# Refer to the SPAR paper for details
concat_weight = 0.7

spar_query_emb = torch.cat(
    [dpr_query_emb, concat_weight * lexmodel_query_emb],
    dim=-1,
)
spar_ctx_emb = torch.cat(
    [dpr_ctx_emb, lexmodel_ctx_emb],
    dim=-1,
)

# Compute similarity scores
score1 = spar_query_emb @ spar_ctx_emb[0]  # 317.6931
score2 = spar_query_emb @ spar_ctx_emb[1]  # 314.6144
```
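
It may help to see why scaling only the query side is sufficient. The dot product of two concatenated vectors splits into the sum of the dot products of their parts, so the SPAR score is a weighted sum of the DPR score and the Λ score. The notation below is ours, introduced for illustration: $w$ stands for `concat_weight`, and $\mathbf{q}$, $\mathbf{c}$ denote the query and context embeddings of each model.

$$
\begin{aligned}
s_{\text{SPAR}}(q, c)
  &= \big[\,\mathbf{q}_{\text{DPR}};\; w\,\mathbf{q}_{\Lambda}\,\big] \cdot \big[\,\mathbf{c}_{\text{DPR}};\; \mathbf{c}_{\Lambda}\,\big] \\
  &= \mathbf{q}_{\text{DPR}} \cdot \mathbf{c}_{\text{DPR}} \;+\; w\,\big(\mathbf{q}_{\Lambda} \cdot \mathbf{c}_{\Lambda}\big)
\end{aligned}
$$

Because $w$ appears only on the query side, the context embeddings do not depend on it, so changing the weight does not require re-encoding the contexts.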