File size: 15,298 Bytes

---
language: en
license: apache-2.0
tags:
- learned sparse
- opensearch
- transformers
- retrieval
- passage-retrieval
- query-expansion
- document-expansion
- bag-of-words
- sentence-transformers
- sparse-encoder
- sparse
- splade
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# opensearch-neural-sparse-encoding-v2-distill

## Select the model
The model should be selected considering search relevance, model inference and retrieval efficiency(FLOPS). We benchmark models' **zero-shot performance** on a subset of BEIR benchmark: TrecCovid,NFCorpus,NQ,HotpotQA,FiQA,ArguAna,Touche,DBPedia,SCIDOCS,FEVER,Climate FEVER,SciFact,Quora.

Overall, the v2 series of models have better search relevance, efficiency and inference speed than the v1 series. The specific advantages and disadvantages may vary across different datasets.

| Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
|-------|------------------------------|------------------|-------------|-----------|
| [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) |  | 133M | 0.524 | 11.4 |
| [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) |  | 67M | 0.528 | 8.3 |
| [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | ✔️ | 133M | 0.490 | 2.3 |
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |

## Overview
- **Paper**: [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
- **Fine-tuning sample**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample)

This is a learned sparse retrieval model. It encodes the queries and documents to 30522 dimensional **sparse vectors**. The non-zero dimension index means the corresponding token in the vocabulary, and the weight means the importance of the token.

The training datasets includes MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer.

OpenSearch neural sparse feature supports learned sparse retrieval with lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. The indexing and search can be performed with OpenSearch high-level API.

## Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers.sparse_encoder import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

query_embed = model.encode_query(query)
document_embed = model.encode_document(document)

sim = model.similarity(query_embed, document_embed)
print(f"Similarity: {sim}")
# Similarity: tensor([[38.6113]])

decoded_query = model.decode(query_embed)
decoded_document = model.decode(document_embed)

for i in range(len(decoded_query)):
    query_token, query_score = decoded_query[i]
    doc_score = next((score for token, score in decoded_document if token == query_token), 0)
    if doc_score != 0:
        print(f"Token: {query_token}, Query score: {query_score:.4f}, Document score: {doc_score:.4f}")

# Token: york, Query score: 2.7273, Document score: 2.9088
# Token: now, Query score: 2.5734, Document score: 0.9208
# Token: ny, Query score: 2.3895, Document score: 1.7237
# Token: weather, Query score: 2.2184, Document score: 1.2368
# Token: current, Query score: 1.8693, Document score: 1.4146
# Token: today, Query score: 1.5888, Document score: 0.7450
# Token: sunny, Query score: 1.4704, Document score: 0.9247
# Token: nyc, Query score: 1.4374, Document score: 1.9737
# Token: currently, Query score: 1.4347, Document score: 1.6019
# Token: climate, Query score: 1.1605, Document score: 0.9794
# Token: upstate, Query score: 1.0944, Document score: 0.7141
# Token: forecast, Query score: 1.0471, Document score: 0.5519
# Token: verve, Query score: 0.9268, Document score: 0.6692
# Token: huh, Query score: 0.9126, Document score: 0.4486
# Token: greene, Query score: 0.8960, Document score: 0.7706
# Token: picturesque, Query score: 0.8779, Document score: 0.7120
# Token: pleasantly, Query score: 0.8471, Document score: 0.4183
# Token: windy, Query score: 0.8079, Document score: 0.2140
# Token: favorable, Query score: 0.7537, Document score: 0.4925
# Token: rain, Query score: 0.7519, Document score: 2.1456
# Token: skies, Query score: 0.7277, Document score: 0.3818
# Token: lena, Query score: 0.6995, Document score: 0.8593
# Token: sunshine, Query score: 0.6895, Document score: 0.2410
# Token: johnny, Query score: 0.6621, Document score: 0.3016
# Token: skyline, Query score: 0.6604, Document score: 0.1933
# Token: sasha, Query score: 0.6117, Document score: 0.2197
# Token: vibe, Query score: 0.5962, Document score: 0.0414
# Token: hardly, Query score: 0.5381, Document score: 0.7560
# Token: prevailing, Query score: 0.4583, Document score: 0.4243
# Token: unpredictable, Query score: 0.4539, Document score: 0.5073
# Token: presently, Query score: 0.4350, Document score: 0.8463
# Token: hail, Query score: 0.3674, Document score: 0.2496
# Token: shivered, Query score: 0.3324, Document score: 0.5506
# Token: wind, Query score: 0.3281, Document score: 0.1964
# Token: rudy, Query score: 0.3052, Document score: 0.5785
# Token: looming, Query score: 0.2797, Document score: 0.0357
# Token: atmospheric, Query score: 0.2712, Document score: 0.0870
# Token: vicky, Query score: 0.2471, Document score: 0.3490
# Token: sandy, Query score: 0.2247, Document score: 0.2383
# Token: crowded, Query score: 0.2154, Document score: 0.5737
# Token: chilly, Query score: 0.1723, Document score: 0.1857
# Token: blizzard, Query score: 0.1700, Document score: 0.4110
# Token: ##cken, Query score: 0.1183, Document score: 0.0613
# Token: unrest, Query score: 0.0923, Document score: 0.6363
# Token: russ, Query score: 0.0624, Document score: 0.2127
# Token: blackout, Query score: 0.0558, Document score: 0.5542
# Token: kahn, Query score: 0.0549, Document score: 0.1589
# Token: 2020, Query score: 0.0160, Document score: 0.0566
# Token: nighttime, Query score: 0.0125, Document score: 0.3753
```

## Usage (HuggingFace)
This model is supposed to run inside OpenSearch cluster. But you can also use it outside the cluster, with HuggingFace models API. 

```python
import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt')
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(38.6112, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.7273, score in document: 2.9088, token: york
# score in query: 2.5734, score in document: 0.9208, token: now
# score in query: 2.3895, score in document: 1.7237, token: ny
# score in query: 2.2184, score in document: 1.2368, token: weather
# score in query: 1.8693, score in document: 1.4146, token: current
# score in query: 1.5887, score in document: 0.7450, token: today
# score in query: 1.4704, score in document: 0.9247, token: sunny
# score in query: 1.4374, score in document: 1.9737, token: nyc
# score in query: 1.4347, score in document: 1.6019, token: currently
# score in query: 1.1605, score in document: 0.9794, token: climate
# score in query: 1.0944, score in document: 0.7141, token: upstate
# score in query: 1.0471, score in document: 0.5519, token: forecast
# score in query: 0.9268, score in document: 0.6692, token: verve
# score in query: 0.9126, score in document: 0.4486, token: huh
# score in query: 0.8960, score in document: 0.7706, token: greene
# score in query: 0.8779, score in document: 0.7120, token: picturesque
# score in query: 0.8471, score in document: 0.4183, token: pleasantly
# score in query: 0.8079, score in document: 0.2140, token: windy
# score in query: 0.7537, score in document: 0.4925, token: favorable
# score in query: 0.7519, score in document: 2.1456, token: rain
# score in query: 0.7277, score in document: 0.3818, token: skies
# score in query: 0.6995, score in document: 0.8593, token: lena
# score in query: 0.6895, score in document: 0.2410, token: sunshine
# score in query: 0.6621, score in document: 0.3016, token: johnny
# score in query: 0.6604, score in document: 0.1933, token: skyline
# score in query: 0.6117, score in document: 0.2197, token: sasha
# score in query: 0.5962, score in document: 0.0414, token: vibe
# score in query: 0.5381, score in document: 0.7560, token: hardly
# score in query: 0.4582, score in document: 0.4243, token: prevailing
# score in query: 0.4539, score in document: 0.5073, token: unpredictable
# score in query: 0.4350, score in document: 0.8463, token: presently
# score in query: 0.3674, score in document: 0.2496, token: hail
# score in query: 0.3324, score in document: 0.5506, token: shivered
# score in query: 0.3281, score in document: 0.1964, token: wind
# score in query: 0.3052, score in document: 0.5785, token: rudy
# score in query: 0.2797, score in document: 0.0357, token: looming
# score in query: 0.2712, score in document: 0.0870, token: atmospheric
# score in query: 0.2471, score in document: 0.3490, token: vicky
# score in query: 0.2247, score in document: 0.2383, token: sandy
# score in query: 0.2154, score in document: 0.5737, token: crowded
# score in query: 0.1723, score in document: 0.1857, token: chilly
# score in query: 0.1700, score in document: 0.4110, token: blizzard
# score in query: 0.1183, score in document: 0.0613, token: ##cken
# score in query: 0.0923, score in document: 0.6363, token: unrest
# score in query: 0.0624, score in document: 0.2127, token: russ
# score in query: 0.0558, score in document: 0.5542, token: blackout
# score in query: 0.0549, score in document: 0.1589, token: kahn
# score in query: 0.0160, score in document: 0.0566, token: 2020
# score in query: 0.0125, score in document: 0.3753, token: nighttime
```

The above code sample shows an example of neural sparse search. Although there is no overlap token in original query and document, but this model performs a good match. 

## Detailed Search Relevance

<div style="overflow-x: auto;">
    
| Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|-------|---------|------------|----------|----|----------|------|---------|--------|---------|---------|-------|---------------|---------|-------|
| [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
| [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
| [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
    
</div>

## License

This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).

## Copyright

Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.