opensearch-project/opensearch-neural-sparse-encoding-doc-v1

The Sentence Transformers library will soon support sparse models through the SparseEncoder class.
This PR adds the files needed for this model to load correctly with it.
We changed as little as possible, so any other custom loading logic you have should keep working.

First, install the development branch of the library:

pip install git+https://github.com/arthurbr11/sentence-transformers.git@sparse_implementation

To test this PR before merging, pass revision="refs/pr/5" when loading (in AutoTokenizer, AutoModelForMaskedLM, etc.), either in your own custom code or in the example below:

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v1", revision="refs/pr/5")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

# Let's decode our embeddings to be able to interpret them
decoded_batch = model.decode(embeddings, top_k=10)
for decoded, sentence in zip(decoded_batch, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
    print()
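
For custom loading code that goes through transformers directly, the same revision argument applies. A minimal sketch, assuming only the standard `from_pretrained` calls; the helper name `load_pr_checkpoint` and the two constants are made up for illustration and are not part of this PR:

```python
MODEL_ID = "opensearch-project/opensearch-neural-sparse-encoding-doc-v1"
PR_REVISION = "refs/pr/5"  # Hub ref of this pull request


def load_pr_checkpoint(model_id: str = MODEL_ID, revision: str = PR_REVISION):
    """Load the tokenizer and masked-LM weights from the PR ref instead of main.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so that defining the helper does not
    # require transformers to be installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
    return tokenizer, model
```

Calling `load_pr_checkpoint()` downloads the files from the PR ref and returns the tokenizer and model; whatever custom pooling you already apply on top of the masked-LM logits stays as it is.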

cc @tomaarsen

Arthur BRESNU

arthurbresnu changed pull request status to open Jun 10

tomaarsen

Jun 10

Feel free to let us know if you have any questions about the files that we're proposing to add here.

For additional context, this is what we'd expect to get as outputs for the similarity and the decoded embeddings:

tensor([[    7.2913,     1.8760,     0.0035],
        [    1.8760,     6.4976,     0.1080],
        [    0.0035,     0.1080,     9.8219]], device='cuda:0')
Sentence: The weather is lovely today.
Decoded: [('weather', 1.405685305595398), ('today', 1.1451733112335205), ('lovely', 0.8350375890731812), ('climate', 0.6556388735771179), ('forecast', 0.5856578946113586), ('beautiful', 0.5536007881164551), ('day', 0.5009242296218872), ('tomorrow', 0.4879005551338196), ('yesterday', 0.481747567653656), ('nice', 0.44232678413391113)]

Sentence: It's so sunny outside!
Decoded: [('outside', 0.9889683723449707), ('sunny', 0.8924372792243958), ('weather', 0.8884875774383545), ('lyrics', 0.6884512901306152), ('so', 0.6462645530700684), ('outdoors', 0.6106253862380981), ('sunshine', 0.5346807241439819), ('outdoor', 0.5262255668640137), ('song', 0.4600299596786499), ('out', 0.4593561589717865)]

Sentence: He drove to the stadium.
Decoded: [('stadium', 0.9480016231536865), ('drive', 0.7638173699378967), ('driving', 0.704725444316864), ('drove', 0.6354855298995972), ('stadiums', 0.5713939070701599), ('driver', 0.5664985775947571), ('football', 0.5491527318954468), ('car', 0.519797682762146), ('baseball', 0.46990716457366943), ('drivers', 0.4272884130477905)]
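
As a quick sanity check on these numbers, the decoded tokens already explain most of the similarity matrix. The sketch below assumes the similarity is a plain dot product over vocabulary dimensions and uses only the top-10 weights printed above (rounded to 4 decimals), so it gives a partial score rather than the full one. The function name `shared_token_score` is made up for illustration.

```python
# Top-10 (token, weight) pairs copied from the decoded output above, rounded.
weather = {"weather": 1.4057, "today": 1.1452, "lovely": 0.8350, "climate": 0.6556,
           "forecast": 0.5857, "beautiful": 0.5536, "day": 0.5009,
           "tomorrow": 0.4879, "yesterday": 0.4817, "nice": 0.4423}
sunny = {"outside": 0.9890, "sunny": 0.8924, "weather": 0.8885, "lyrics": 0.6885,
         "so": 0.6463, "outdoors": 0.6106, "sunshine": 0.5347,
         "outdoor": 0.5262, "song": 0.4600, "out": 0.4594}
stadium = {"stadium": 0.9480, "drive": 0.7638, "driving": 0.7047, "drove": 0.6355,
           "stadiums": 0.5714, "driver": 0.5665, "football": 0.5492,
           "car": 0.5198, "baseball": 0.4699, "drivers": 0.4273}


def shared_token_score(a: dict, b: dict) -> float:
    """Dot product restricted to tokens that appear in both truncated embeddings."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


print(shared_token_score(weather, sunny))    # only 'weather' is shared
print(shared_token_score(weather, stadium))  # no shared tokens in the top 10
print(shared_token_score(sunny, stadium))    # no shared tokens in the top 10
```

The first two sentences share only 'weather' in their top 10, which contributes roughly 1.25 of the 1.8760 in the matrix; the rest comes from lower-weight tokens outside the top 10. The stadium sentence shares no top-10 tokens with the other two, which is consistent with its near-zero scores.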

Tom Aarsen

Update README.md cf44a30a

zhichao-geng changed pull request status to merged Jun 11
