Model Card for INDUS-Retriever-small

INDUS-Retriever-small (previously nasa-smd-ibm-st.38m) is a bi-encoder sentence transformer model fine-tuned from a distilled version of the nasa-smd-ibm-v0.1 encoder model. It is a smaller version of nasa-smd-ibm-st that performs better while using fewer parameters (shown below). It was trained on 362 million examples along with a domain-specific dataset of 2.6 million examples from documents curated by the NASA Science Mission Directorate (SMD). With this model, we aim to enhance natural language technologies such as information retrieval and intelligent search as they apply to SMD NLP applications.

A bigger model is also available here: https://huggingface.co/nasa-impact/nasa-smd-ibm-st-v2

Model Details

Base Encoder Model: INDUS (https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1)
Tokenizer: Custom
Parameters: 38M
Training Strategy: Sentence pairs, each with a score indicating relevance. The model encodes the two sentences of a pair independently and computes their cosine similarity. The similarity is then optimized against the relevance score.
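
The training strategy above can be illustrated with a small, dependency-free sketch. It is not the training code used for this model: the exact loss function is not stated in this card, so the squared-error form below is an assumption, and the function names and two-dimensional toy embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(emb_a, emb_b, relevance):
    """Squared error between the pair's cosine similarity and its
    relevance score (assumed loss form, for illustration only)."""
    return (cosine_similarity(emb_a, emb_b) - relevance) ** 2

# Toy embeddings standing in for independently encoded sentences (hypothetical).
emb_query = [1.0, 0.0]
emb_relevant = [1.0, 0.0]
emb_unrelated = [0.0, 1.0]

# Both pairs are labeled fully relevant (score 1.0); only the first pair's
# embeddings already agree with that label.
loss_good = pair_loss(emb_query, emb_relevant, 1.0)
loss_bad = pair_loss(emb_query, emb_unrelated, 1.0)
print(loss_good, loss_bad)
```

Under this assumed loss, a pair whose embeddings already match its relevance score contributes no loss, while a mismatched pair contributes a positive loss that training would push the encoder to reduce.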

Training Data

Figure: dataset sources for sentence transformers (362M in total)

Additionally, 2.6M abstract + title pairs were collected from NASA SMD documents.

Training Procedure

Framework: PyTorch 1.9.1
sentence-transformers version: 4.30.2
Strategy: Sentence Pairs

Evaluation

The following models are evaluated:

All-MiniLM-l6-v2 [sentence-transformers/all-MiniLM-L6-v2]
BGE-base [BAAI/bge-base-en-v1.5]
RoBERTa-base [roberta-base]
nasa-smd-ibm-rtvr_v0.1 [nasa-impact/nasa-smd-ibm-st]

Figure: BEIR and NASA-IR Evaluation Metrics

Uses

Information Retrieval
Sentence Similarity Search

For NASA SMD-related scientific use cases.

Usage


from sentence_transformers import SentenceTransformer, util

# Load the retriever model from the Hugging Face Hub
model = SentenceTransformer("nasa-impact/nasa-smd-ibm-st.38m")

# Each query carries a "query: " prefix
input_queries = [
    "query: how much protein should a female eat",
    "query: summit define",
]
input_passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. "
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

# Encode queries and passages independently
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Cosine similarity matrix: one row per query, one column per passage
print(util.cos_sim(query_embeddings, passage_embeddings))
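
The similarity matrix printed above has one row per query and one column per passage, so retrieval amounts to sorting each row. The following is a minimal sketch of that step, assuming the matrix has been converted to nested lists (for example with `.tolist()` on the returned tensor). The `rank_passages` helper and the score values are hypothetical and are not output from this model.

```python
def rank_passages(similarity_rows, top_k=1):
    """For each query's row of scores, return the indices of the
    top_k passages, ordered from most to least similar."""
    ranked = []
    for row in similarity_rows:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append(order[:top_k])
    return ranked

# Hypothetical scores for 2 queries x 2 passages (not real model output).
scores = [
    [0.82, 0.11],
    [0.07, 0.75],
]

best = rank_passages(scores)
print(best)  # [[0], [1]]
```

With these illustrative scores, the first query is matched to the first passage and the second query to the second passage. Raising `top_k` returns a longer ranked list per query.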

Note

This Sentence Transformer Model is released in support of the training and evaluation of the encoder language model "Indus".

The accompanying paper can be found here: https://arxiv.org/abs/2405.10725

Citation

If you find this work useful, please cite it using the following BibTeX entry:

@misc {nasa-impact_2024,
    author       = { {NASA-IMPACT} },
    title        = { nasa-ibm-st.38m (Revision 9c1989c) },
    year         = 2024,
    url          = { https://huggingface.co/nasa-impact/nasa-ibm-st.38m },
    doi          = { 10.57967/hf/1875 },
    publisher    = { Hugging Face }
}

Attribution

IBM Research

Aashka Trivedi
Masayasu Muraoka
Bishwaranjan Bhattacharjee
Takuma Udagawa

NASA SMD

Muthukumaran Ramasubramanian
Iksha Gurung
Rahul Ramachandran
Manil Maskey
Kaylin Bugbee
Mike Little
Elizabeth Fancher
Lauren Sanders
Sylvain Costes
Sergi Blanco-Cuaresma
Kelly Lockhart
Thomas Allen
Felix Grazes
Megan Ansdell
Alberto Accomazzi
Sanaz Vahidinia
Ryan McGranaghan
Armin Mehrabian
Tsendgar Lee

Disclaimer

This sentence-transformer model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.