Marissa committed on
Commit
133d029
1 Parent(s): 7fee798

Add model card


Hi! This PR has a preliminary model card, based on the format we are using as part of our effort to standardize model cards at Hugging Face. It is very similar to the proposed model card at https://huggingface.co/facebook/dpr-ctx_encoder-single-nq-base/discussions/1

Feel free to merge if you are ok with the changes! (cc @Ezi @Meg @Nazneen)

Files changed (1)
  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ language: en
+ license: cc-by-nc-4.0
+ tags:
+ - dpr
+ datasets:
+ - nq_open
+ ---
+
+ # `dpr-question_encoder-single-nq-base`
+
+ ## Table of Contents
+ - [Model Details](#model-details)
+ - [How to Get Started with the Model](#how-to-get-started-with-the-model)
+ - [Uses](#uses)
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Environmental Impact](#environmental-impact)
+ - [Technical Specifications](#technical-specifications)
+ - [Citation Information](#citation-information)
+ - [Model Card Authors](#model-card-authors)
+
+ ## Model Details
+
+ **Model Description:** [Dense Passage Retrieval (DPR)](https://github.com/facebookresearch/DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. `dpr-question_encoder-single-nq-base` is the question encoder trained using the [Natural Questions (NQ) dataset](https://huggingface.co/datasets/nq_open) ([Lee et al., 2019](https://aclanthology.org/P19-1612/); [Kwiatkowski et al., 2019](https://aclanthology.org/Q19-1026/)).
+
+ - **Developed by:** See [GitHub repo](https://github.com/facebookresearch/DPR) for model developers
+ - **Model Type:** BERT-based encoder
+ - **Language(s):** English
+ - **License:** [CC-BY-NC-4.0](https://github.com/facebookresearch/DPR/blob/main/LICENSE), also see [Code of Conduct](https://github.com/facebookresearch/DPR/blob/main/CODE_OF_CONDUCT.md)
+ - **Related Models:**
+   - [`dpr-ctx_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-ctx_encoder-single-nq-base)
+   - [`dpr-reader-single-nq-base`](https://huggingface.co/facebook/dpr-reader-single-nq-base)
+   - [`dpr-ctx_encoder-multiset-base`](https://huggingface.co/facebook/dpr-ctx_encoder-multiset-base)
+   - [`dpr-question_encoder-multiset-base`](https://huggingface.co/facebook/dpr-question_encoder-multiset-base)
+   - [`dpr-reader-multiset-base`](https://huggingface.co/facebook/dpr-reader-multiset-base)
+ - **Resources for more information:**
+   - [Research Paper](https://arxiv.org/abs/2004.04906)
+   - [GitHub Repo](https://github.com/facebookresearch/DPR)
+   - [Hugging Face DPR docs](https://huggingface.co/docs/transformers/main/en/model_doc/dpr)
+   - [BERT Base Uncased Model Card](https://huggingface.co/bert-base-uncased)
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```python
+ from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
+
+ # Load the tokenizer and the question encoder weights from the Hub.
+ tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
+ model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
+ # Tokenize a question and return PyTorch tensors.
+ input_ids = tokenizer("Hello, is my dog cute ?", return_tensors="pt")["input_ids"]
+ # pooler_output holds one embedding vector per input question.
+ embeddings = model(input_ids).pooler_output
+ ```
+
+ ## Uses
+
+ #### Direct Use
+
+ `dpr-question_encoder-single-nq-base`, [`dpr-ctx_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-ctx_encoder-single-nq-base), and [`dpr-reader-single-nq-base`](https://huggingface.co/facebook/dpr-reader-single-nq-base) can be used for the task of open-domain question answering.
+
+ #### Misuse and Out-of-scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people. In addition, the set of DPR models was not trained to be factual or true representations of people or events, and therefore using the models to generate such content is out-of-scope for the abilities of this model.
+
+ ## Risks, Limitations and Biases
+
+ **CONTENT WARNING: Readers should be aware this section may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.**
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al., 2021](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al., 2021](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Training
+
+ #### Training Data
+
+ This model was trained using the [Natural Questions (NQ) dataset](https://huggingface.co/datasets/nq_open) ([Lee et al., 2019](https://aclanthology.org/P19-1612/); [Kwiatkowski et al., 2019](https://aclanthology.org/Q19-1026/)). The model authors write that:
+ > [The dataset] was designed for end-to-end question answering. The questions were mined from real Google search queries and the answers were spans in Wikipedia articles identified by annotators.
+
+ #### Training Procedure
+
+ The training procedure is described in the [associated paper](https://arxiv.org/pdf/2004.04906.pdf):
+
+ > Given a collection of M text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the reader at run-time.
+
+ > Our dense passage retriever (DPR) uses a dense encoder E_P(·) which maps any text passage to a d-dimensional real-valued vectors and builds an index for all the M passages that we will use for retrieval. At run-time, DPR applies a different encoder E_Q(·) that maps the input question to a d-dimensional vector, and retrieves k passages of which vectors are the closest to the question vector.
+
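The retrieval step quoted above can be illustrated with a minimal sketch. It assumes passages and the question have already been encoded into vectors and, following the DPR paper, scores each pair by inner product. The toy vectors, the `dot` helper, and `retrieve_top_k` are hypothetical names used only for illustration, and the brute-force scan stands in for the FAISS index used in practice.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest-scoring passages."""
    scored = ((i, dot(question_vec, p)) for i, p in enumerate(passage_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional vectors standing in for E_P(passage) and E_Q(question).
# Real DPR embeddings come from the encoders, not from hand-written lists.
passages = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
question = [1.0, 0.5, 0.0]

top2 = retrieve_top_k(question, passages, k=2)
print([index for index, _ in top2])
```

A linear scan like this costs one inner product per passage, which is why an approximate or optimized index becomes necessary for collections with millions of passages.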
+ The authors report that, for the encoders, they used two independent BERT ([Devlin et al., 2019](https://aclanthology.org/N19-1423/)) networks (base, uncased) and used FAISS ([Johnson et al., 2017](https://arxiv.org/abs/1702.08734)) at inference time to encode and index passages. See the paper for further details on training, including encoders, inference, positive and negative passages, and in-batch negatives.
+
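As a rough illustration of the in-batch negatives idea mentioned above, the sketch below computes the mean negative log-likelihood over a batch in which passage `i` is the positive for question `i` and every other passage in the batch acts as a negative. It is a plain-Python approximation of the objective described in the paper, not the authors' training code; the function name and the toy vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def in_batch_negatives_loss(
    question_vecs: Sequence[Sequence[float]],
    passage_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean negative log-likelihood of the positive passage for each question.

    Passage i is treated as the positive for question i; all other
    passages in the batch are treated as negatives.
    """
    if len(question_vecs) != len(passage_vecs):
        raise ValueError("Expected one positive passage per question")
    losses: List[float] = []
    for i, q in enumerate(question_vecs):
        # Inner-product similarity between this question and every passage.
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical vectors): aligned pairs should give a lower loss
# than the same batch with the passages swapped.
questions = [[2.0, 0.0], [0.0, 2.0]]
aligned = in_batch_negatives_loss(questions, [[2.0, 0.0], [0.0, 2.0]])
swapped = in_batch_negatives_loss(questions, [[0.0, 2.0], [2.0, 0.0]])
print(aligned < swapped)
```

Reusing the other passages in a batch as negatives means a batch of B question-passage pairs yields B - 1 negatives per question without encoding any extra passages.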
+ ## Evaluation
+
+ The following evaluation information is extracted from the [associated paper](https://arxiv.org/pdf/2004.04906.pdf).
+
+ #### Testing Data, Factors and Metrics
+
+ The model developers report the performance of the model on five QA datasets, using the top-k accuracy (k ∈ {20, 100}). The datasets were [NQ](https://huggingface.co/datasets/nq_open), [TriviaQA](https://huggingface.co/datasets/trivia_qa), [WebQuestions (WQ)](https://huggingface.co/datasets/web_questions), [CuratedTREC (TREC)](https://huggingface.co/datasets/trec), and [SQuAD v1.1](https://huggingface.co/datasets/squad).
+
+ #### Results
+
+ | Top-k   |  NQ  | TriviaQA |  WQ  | TREC | SQuAD |
+ |:-------:|:----:|:--------:|:----:|:----:|:-----:|
+ | Top 20  | 78.4 |   79.4   | 73.2 | 79.8 | 63.2  |
+ | Top 100 | 85.4 |   85.0   | 81.4 | 89.1 | 77.2  |
+
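For readers unfamiliar with the metric, top-k retrieval accuracy is the fraction of questions for which at least one of the top k retrieved passages contains an answer. The sketch below is a simplified illustration: `simple_match` is a hypothetical case-insensitive substring matcher, which may differ from the answer-matching rules used in the paper, and the example passages are invented.

```python
from typing import Callable, Sequence


def simple_match(passage: str, golds: Sequence[str]) -> bool:
    """Hypothetical matcher: case-insensitive substring match."""
    text = passage.lower()
    return any(g.lower() in text for g in golds)


def top_k_accuracy(
    retrieved: Sequence[Sequence[str]],
    answers: Sequence[Sequence[str]],
    k: int,
    contains_answer: Callable[[str, Sequence[str]], bool] = simple_match,
) -> float:
    """Fraction of questions whose top-k passages contain a gold answer."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        # Only the first k passages (highest ranked) count for this question.
        if any(contains_answer(p, golds) for p in passages[:k]):
            hits += 1
    return hits / len(retrieved)


# Invented ranked retrieval results for two questions.
retrieved = [
    ["Paris is the capital of France.", "Berlin is in Germany."],
    ["The Nile is a river.", "Mount Everest is the highest mountain."],
]
answers = [["Paris"], ["Everest"]]

print(top_k_accuracy(retrieved, answers, k=1))
print(top_k_accuracy(retrieved, answers, k=2))
```

Because a larger k only adds candidate passages, accuracy can never decrease as k grows, which matches the pattern of Top 100 scores exceeding Top 20 scores in the table.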
+ ## Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). We present the hardware type based on the [associated paper](https://arxiv.org/abs/2004.04906).
+
+ - **Hardware Type:** 8 32GB GPUs
+ - **Hours used:** Unknown
+ - **Cloud Provider:** Unknown
+ - **Compute Region:** Unknown
+ - **Carbon Emitted:** Unknown
+
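Once the missing values are known, a back-of-the-envelope estimate can be computed as energy used multiplied by the carbon intensity of the grid. The sketch below assumes that simple model; the calculator linked above may account for additional factors. Only the GPU count comes from this card. The per-GPU power draw, training hours, and carbon intensity are placeholders that must be replaced with real figures, and `estimate_emissions_kg` is a hypothetical helper.

```python
def estimate_emissions_kg(
    num_gpus: int,
    gpu_power_kw: float,
    hours: float,
    carbon_intensity_kg_per_kwh: float,
) -> float:
    """Rough CO2-equivalent estimate: energy (kWh) times grid carbon intensity."""
    energy_kwh = num_gpus * gpu_power_kw * hours
    return energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs for illustration only; they are NOT reported values.
PLACEHOLDER_GPU_POWER_KW = 0.3
PLACEHOLDER_HOURS = 10.0
PLACEHOLDER_INTENSITY = 0.5

estimate = estimate_emissions_kg(
    num_gpus=8,  # from the hardware type listed above
    gpu_power_kw=PLACEHOLDER_GPU_POWER_KW,
    hours=PLACEHOLDER_HOURS,
    carbon_intensity_kg_per_kwh=PLACEHOLDER_INTENSITY,
)
print(f"{estimate:.1f} kg CO2eq (placeholder inputs)")
```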
+ ## Technical Specifications
+
+ See the [associated paper](https://arxiv.org/abs/2004.04906) for details on the modeling architecture, objective, compute infrastructure, and training details.
+
+ ## Citation Information
+
+ ```bibtex
+ @inproceedings{karpukhin-etal-2020-dense,
+     title = "Dense Passage Retrieval for Open-Domain Question Answering",
+     author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
+     booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
+     month = nov,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
+     doi = "10.18653/v1/2020.emnlp-main.550",
+     pages = "6769--6781",
+ }
+ ```
+
+ ## Model Card Authors
+
+ This model card was written by the team at Hugging Face.