zhichao-geng
committed on
Update README.md
README.md
CHANGED
@@ -14,9 +14,23 @@ tags:
# opensearch-neural-sparse-encoding-v1
This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight indicates the importance of that token. The similarity score is the inner product of the query and document sparse vectors. In real-world use cases, the search performance of opensearch-neural-sparse-encoding-v1 is comparable to BM25.

+This model is trained on the MS MARCO dataset.

OpenSearch's neural sparse feature supports learned sparse retrieval with a Lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
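To make the inner-product scoring described above concrete, here is a minimal sketch, assuming each sparse vector is stored as a Python dict that maps a token to its weight. The `sparse_dot` helper and every weight below are illustrative assumptions, not real output from the model.

```python
def sparse_dot(query_vec, doc_vec):
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the shorter vector.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    # Only tokens present in both vectors contribute to the score.
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Made-up weights, for illustration only.
query = {"weather": 1.2, "ny": 0.8, "rain": 0.5}
doc = {"weather": 0.9, "manhattan": 1.1, "rain": 0.4, "sunny": 0.7}

score = sparse_dot(query, doc)
print(f"similarity score: {score:.2f}")
```

Only the tokens shared by both vectors (`weather` and `rain` here) contribute, which is why an inverted index can serve this kind of scoring efficiently.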

+## Select the model
+The model should be selected considering search relevance, model inference speed, and retrieval efficiency (FLOPS). We benchmark the models' **zero-shot performance** on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
+
+Overall, the v2 series of models has better search relevance, efficiency, and inference speed than the v1 series. The specific advantages and disadvantages may vary across datasets.
+
+| Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
+|-------|------------------------------|------------------|-------------|-----------|
+| [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | | 133M | 0.524 | 11.4 |
+| [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | | 67M | 0.528 | 8.3 |
+| [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | ✔️ | 133M | 0.490 | 2.3 |
+| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
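
The trade-off in the table above can be sketched as a small selection routine. This is only an illustration: the `ModelInfo` class and the `pick_model` helper are hypothetical names, an empty "Inference-free for Retrieval" cell is assumed to mean "no", and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    inference_free: bool   # assumption: an empty table cell means False
    avg_ndcg_at_10: float
    avg_flops: float


# Values copied from the comparison table.
MODELS = [
    ModelInfo("opensearch-neural-sparse-encoding-v1", False, 0.524, 11.4),
    ModelInfo("opensearch-neural-sparse-encoding-v2-distill", False, 0.528, 8.3),
    ModelInfo("opensearch-neural-sparse-encoding-doc-v1", True, 0.490, 2.3),
    ModelInfo("opensearch-neural-sparse-encoding-doc-v2-distill", True, 0.504, 1.8),
]


def pick_model(models, require_inference_free=False, max_flops=None):
    """Return the candidate with the highest average NDCG@10, or None."""
    candidates = [
        m for m in models
        if (m.inference_free or not require_inference_free)
        and (max_flops is None or m.avg_flops <= max_flops)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.avg_ndcg_at_10)


best_overall = pick_model(MODELS)
best_inference_free = pick_model(MODELS, require_inference_free=True)
print(best_overall.name)
print(best_inference_free.name)
```

Averages hide per-dataset differences, so a choice made this way is best checked against the dataset closest to your own workload.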
+
+
## Usage (HuggingFace)
This model is intended to run inside an OpenSearch cluster, but you can also use it outside the cluster with the HuggingFace models API.
@@ -116,5 +130,21 @@ for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reve

The above code sample shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.

-##
-
+## Detailed Search Relevance
+
+| Dataset | [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) |
+|---------|-------|-------|-------|-------|
+| Trec Covid | 0.771 | 0.775 | 0.707 | 0.690 |
+| NFCorpus | 0.360 | 0.347 | 0.352 | 0.343 |
+| NQ | 0.553 | 0.561 | 0.521 | 0.528 |
+| HotpotQA | 0.697 | 0.685 | 0.677 | 0.675 |
+| FiQA | 0.376 | 0.374 | 0.344 | 0.357 |
+| ArguAna | 0.508 | 0.551 | 0.461 | 0.496 |
+| Touche | 0.278 | 0.278 | 0.294 | 0.287 |
+| DBPedia | 0.447 | 0.435 | 0.412 | 0.418 |
+| SCIDOCS | 0.164 | 0.173 | 0.154 | 0.166 |
+| FEVER | 0.821 | 0.849 | 0.743 | 0.818 |
+| Climate FEVER | 0.263 | 0.249 | 0.202 | 0.224 |
+| SciFact | 0.723 | 0.722 | 0.716 | 0.715 |
+| Quora | 0.856 | 0.863 | 0.788 | 0.841 |
+| **Average** | **0.524** | **0.528** | **0.490** | **0.504** |
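
As a quick sanity check, the **Average** row can be recomputed from the per-dataset scores. The sketch below assumes the average is a plain arithmetic mean over the 13 datasets rounded to three decimals; the short dictionary keys are shorthand introduced here for the four model names.

```python
from statistics import mean

# NDCG@10 per dataset, in table order (Trec Covid ... Quora), copied from the table.
NDCG_AT_10 = {
    "v1": [0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278,
           0.447, 0.164, 0.821, 0.263, 0.723, 0.856],
    "v2-distill": [0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278,
                   0.435, 0.173, 0.849, 0.249, 0.722, 0.863],
    "doc-v1": [0.707, 0.352, 0.521, 0.677, 0.344, 0.461, 0.294,
               0.412, 0.154, 0.743, 0.202, 0.716, 0.788],
    "doc-v2-distill": [0.690, 0.343, 0.528, 0.675, 0.357, 0.496, 0.287,
                       0.418, 0.166, 0.818, 0.224, 0.715, 0.841],
}

# Arithmetic mean per model, rounded to three decimals like the table.
averages = {name: round(mean(scores), 3) for name, scores in NDCG_AT_10.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
```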