zhichao-geng Frinkleko committed on
Commit 3d820d9 · verified · 1 Parent(s): 30db8be

Update README.md (#1)

- Update README.md (5a8533b1f4fb44a554ccb8cdcd998821f6b93ff2)


Co-authored-by: Xinjie Shen <[email protected]>

Files changed (1)
  1. README.md +164 -3
README.md CHANGED
@@ -1,3 +1,164 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language: en
3
+ license: apache-2.0
4
+ tags:
5
+ - learned sparse
6
+ - opensearch
7
+ - transformers
8
+ - retrieval
9
+ - passage-retrieval
10
+ - document-expansion
11
+ - bag-of-words
12
+ ---
13
+
14
+ # opensearch-neural-sparse-encoding-doc-v3-distill
15
+
16
+ ## Select the model
17
+ Select a model by weighing search relevance, model inference cost, and retrieval efficiency (FLOPS). We benchmark the models on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, Quora.
18
+
19
+ Overall, the v3 series of models has better search relevance, efficiency, and inference speed than the v1 and v2 series. The specific advantages and disadvantages may vary across datasets.
20
+
21
+ | Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
22
+ |-------|------------------------------|------------------|-------------|-----------|
23
+ | [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | | 133M | 0.524 | 11.4 |
24
+ | [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | | 67M | 0.528 | 8.3 |
25
+ | [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | ✔️ | 133M | 0.490 | 2.3 |
26
+ | [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
27
+ | [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
28
+ | [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | ✔️ | 67M | 0.517 | 1.8 |
29
+
30
+ ## Overview
31
+
32
+ - **Paper**: Coming Soon
33
+ - **Code**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample/tree/l0_enhance)
34
+
35
+ This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight reflects the importance of that token. The similarity score is the inner product of the query and document sparse vectors.
36
+
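The scoring scheme described above can be sketched in a few lines of plain Python. This is only a toy illustration: the whitespace tokenizer stands in for the real WordPiece tokenizer, and every token weight in `toy_idf` and `toy_doc` is a made-up value rather than an output of this model.

```python
# Toy illustration of inference-free query encoding and sparse scoring.
# All tokens and weights below are invented for the example.

def encode_query(query, idf):
    """Build a sparse query vector: each known token gets its look-up weight."""
    tokens = query.lower().split()  # stand-in for the real tokenizer
    return {tok: idf[tok] for tok in tokens if tok in idf}

def sparse_dot(query_vec, doc_vec):
    """Inner product over the tokens that are non-zero in both vectors."""
    return sum(w * doc_vec[tok] for tok, w in query_vec.items() if tok in doc_vec)

# Hypothetical look-up table (query side) and document expansion (document side).
toy_idf = {"weather": 4.0, "ny": 6.0, "in": 0.5}
toy_doc = {"weather": 1.0, "ny": 0.5, "rainy": 1.5, "york": 1.25}

query_vec = encode_query("weather in ny", toy_idf)
score = sparse_dot(query_vec, toy_doc)
print(query_vec)
print(score)
```

Only tokens present in both vectors contribute to the score, which is why a document-side expansion that assigns weight to `ny` can match a query even when the document text never contains that token.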
37
+ The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, scifact.
38
+
39
+ The OpenSearch neural sparse feature supports learned sparse retrieval with a Lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
40
+
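As a rough sketch of what a search request could look like, the snippet below builds a `neural_sparse` query body as a Python dictionary. The field name `passage_embedding` and the model ID are placeholders chosen for illustration, and the helper `build_neural_sparse_query` is hypothetical; consult the linked OpenSearch documentation for the options that apply to your cluster.

```python
import json

def build_neural_sparse_query(field, query_text, model_id, size=10):
    """Return a search body that uses the neural_sparse query clause.

    `field` is the index field holding the sparse document vectors and
    `model_id` is the ID of the model registered in your cluster.
    Both are assumptions supplied by the caller.
    """
    return {
        "size": size,
        "query": {
            "neural_sparse": {
                field: {"query_text": query_text, "model_id": model_id}
            }
        },
    }

# Placeholder values: replace with your own field name and model ID.
body = build_neural_sparse_query(
    field="passage_embedding",
    query_text="What's the weather in ny now?",
    model_id="<your-model-id>",
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a `_search` request with whichever OpenSearch client you use.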
41
+
42
+ ## Usage (HuggingFace)
43
+ This model is intended to run inside an OpenSearch cluster, but you can also use it outside the cluster with the HuggingFace models API.
44
+
45
+ ```python
46
+ import json
47
+ import itertools
48
+ import torch
49
+
50
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
51
+
52
+
53
+ # get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
54
+ def get_sparse_vector(feature, output):
55
+     values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
56
+     # note we update the activation for v3 model
57
+     values = torch.log(1 + torch.log(1 + torch.relu(values)))
58
+     values[:,special_token_ids] = 0
59
+     return values
60
+
61
+ # transform the sparse vector to a dict of (token, weight)
62
+ def transform_sparse_vector_to_dict(sparse_vector):
63
+     sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
64
+     non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
65
+     number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
66
+     tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]
67
+
68
+     output = []
69
+     end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
70
+     for i in range(len(end_idxs)-1):
71
+         token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
72
+         weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
73
+         output.append(dict(zip(token_strings, weights)))
74
+     return output
75
+
76
+ # download the idf file from model hub. idf is used to give weights for query tokens
77
+ def get_tokenizer_idf(tokenizer):
78
+     from huggingface_hub import hf_hub_download
79
+     local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill", filename="idf.json")
80
+     with open(local_cached_path) as f:
81
+         idf = json.load(f)
82
+     idf_vector = [0]*tokenizer.vocab_size
83
+     for token,weight in idf.items():
84
+         _id = tokenizer._convert_token_to_id_with_added_voc(token)
85
+         idf_vector[_id]=weight
86
+     return torch.tensor(idf_vector)
87
+
88
+ # load the model
89
+ model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
90
+ tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
91
+ idf = get_tokenizer_idf(tokenizer)
92
+
93
+ # set the special tokens and id_to_token transform for post-process
94
+ special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
95
+ get_sparse_vector.special_token_ids = special_token_ids
96
+ id_to_token = ["" for i in range(tokenizer.vocab_size)]
97
+ for token, _id in tokenizer.vocab.items():
98
+     id_to_token[_id] = token
99
+ transform_sparse_vector_to_dict.id_to_token = id_to_token
100
+
101
+
102
+
103
+ query = "What's the weather in ny now?"
104
+ document = "Currently New York is rainy."
105
+
106
+ # encode the query
107
+ feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
108
+ input_ids = feature_query["input_ids"]
109
+ batch_size = input_ids.shape[0]
110
+ query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
111
+ query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
112
+ query_sparse_vector = query_vector*idf
113
+
114
+ # encode the document
115
+ feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
116
+ output = model(**feature_document)[0]
117
+ document_sparse_vector = get_sparse_vector(feature_document, output)
118
+
119
+
120
+ # get similarity score
121
+ sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
122
+ print(sim_score) # tensor(11.1105, grad_fn=<DotBackward0>)
123
+
124
+
125
+ query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
126
+ document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
127
+ for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
128
+     if token in document_query_token_weight:
129
+         print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
130
+
131
+
132
+
133
+ # result:
134
+ # score in query: 5.7729, score in document: 0.8049, token: ny
135
+ # score in query: 4.5684, score in document: 0.9710, token: weather
136
+ # score in query: 3.5895, score in document: 0.4720, token: now
137
+ # score in query: 3.3313, score in document: 0.0286, token: ?
138
+ # score in query: 2.7699, score in document: 0.0787, token: what
139
+ # score in query: 0.4989, score in document: 0.0417, token: in
140
+ ```
141
+
142
+ The code sample above shows an example of neural sparse search. Although the original query and document share no tokens, the model still produces a good match.
143
+
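The comment in `get_sparse_vector` notes that the activation was updated for the v3 model to `log(1 + log(1 + relu(x)))`. To get a feel for what the nested logarithm does, the sketch below compares it with a single-log variant on a few arbitrary input values. The single-log function is included purely for comparison, and the inputs are illustrative numbers, not real model logits.

```python
import math

def single_log(x):
    """log(1 + relu(x)), shown only as a point of comparison."""
    return math.log1p(max(x, 0.0))

def double_log(x):
    """log(1 + log(1 + relu(x))), the activation used in the code sample."""
    return math.log1p(math.log1p(max(x, 0.0)))

# Arbitrary example inputs; negative values are clipped to zero by the relu.
for logit in [-1.0, 0.0, 1.0, 5.0, 20.0]:
    print("x=%5.1f  single=%.4f  double=%.4f" % (logit, single_log(logit), double_log(logit)))
```

The nested form grows more slowly for large inputs, so very large activations are compressed more strongly than with a single logarithm.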
144
+ ## Detailed Search Relevance
145
+
146
+ <div style="overflow-x: auto;">
147
+
148
+ | Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
149
+ |-------|---------|------------|----------|----|----------|------|---------|--------|---------|---------|-------|---------------|---------|-------|
150
+ | [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
151
+ | [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
152
+ | [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
153
+ | [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
154
+ | [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
155
+ | [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
156
+ </div>
157
+
158
+ ## License
159
+
160
+ This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).
161
+
162
+ ## Copyright
163
+
164
+ Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.