tags:
  - sentence-transformers
  - sparse-encoder
  - sparse
  - splade
  - generated_from_trainer
  - loss:SpladeLoss
  - loss:SparseMultipleNegativesRankingLoss
  - loss:FlopsLoss
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
  - active_dims
  - sparsity_ratio
model-index:
  - name: SPLADE Sparse Encoder
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          type: pubmed-similarity
          name: PubMed Similarity
        metrics:
          - type: pearson_cosine
            value: 0.9422980731390805
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8870061609483617
            name: Spearman Cosine
          - type: active_dims
            value: 34.0018196105957
            name: Active Dims
          - type: sparsity_ratio
            value: 0.9988859897906233
            name: Sparsity Ratio
language: en
license: apache-2.0

PubMedBERT SPLADE

This is a SPLADE Sparse Encoder model fine-tuned from PubMedBERT-base using sentence-transformers. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
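To give a sense of how sparse these vectors are, the short sketch below relates the two sparsity figures reported in the model card metadata (`active_dims` and `sparsity_ratio`). It assumes the sparsity ratio is simply the fraction of the 30522 dimensions that are zero; the helper name `sparsity_ratio` is illustrative and not part of any library.

```python
# Vocabulary size of the sparse space, as stated in the model description.
VOCAB_SIZE = 30522


def sparsity_ratio(active_dims: float, vocab_size: int = VOCAB_SIZE) -> float:
    """Fraction of dimensions that are zero, given the mean number of non-zero ones.

    Assumption: sparsity is defined as 1 - active_dims / vocab_size.
    """
    return 1.0 - active_dims / vocab_size


# Mean number of active dimensions reported in the model card metadata.
reported_active_dims = 34.0018196105957

ratio = sparsity_ratio(reported_active_dims)
print(f"{ratio:.4f}")  # close to the sparsity_ratio value in the metadata
```

Under that assumption, roughly 34 of 30522 dimensions are non-zero per input on average, which is what makes these embeddings suitable for inverted-index style retrieval.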

The training dataset was generated using a random sample of PubMed title-abstract pairs along with similar title pairs.

For medical literature, PubMedBERT SPLADE produces higher-quality sparse embeddings than general-purpose models. Further fine-tuning on a specific medical subdomain is likely to improve performance further.

Usage (txtai)

This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

Note: txtai 9.0+ is required for sparse vector scoring support.

import txtai

embeddings = txtai.Embeddings(
  sparse="neuml/pubmedbert-base-splade",
  content=True
)
# documents() is a placeholder: supply your own iterable of texts to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")

Usage (Sentence-Transformers)

Alternatively, the model can be loaded with sentence-transformers.

from sentence_transformers import SparseEncoder
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SparseEncoder("neuml/pubmedbert-base-splade")
embeddings = model.encode(sentences)
print(embeddings)

Evaluation Results

Performance of this model compared to the top base models on the MTEB leaderboard is shown below. A popular smaller model was also evaluated along with the most downloaded PubMed similarity model on the Hugging Face Hub.

The following datasets were used to evaluate model performance.

  • PubMed QA
    • Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
  • PubMed Subset
    • Split: test, Pair: (title, text)
  • PubMed Summary
    • Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.

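| Model                      | PubMed QA | PubMed Subset | PubMed Summary | Average |
| -------------------------- | --------- | ------------- | -------------- | ------- |
| all-MiniLM-L6-v2           | 90.40     | 95.92         | 94.07          | 93.46   |
| bge-base-en-v1.5           | 91.02     | 95.82         | 94.49          | 93.78   |
| gte-base                   | 92.97     | 96.90         | 96.24          | 95.37   |
| pubmedbert-base-embeddings | 93.27     | 97.00         | 96.58          | 95.62   |
| pubmedbert-base-splade     | 90.76     | 96.20         | 95.87          | 94.28   |
| S-PubMedBert-MS-MARCO      | 90.86     | 93.68         | 93.54          | 92.69   |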

While this model wasn't the highest-scoring model on the Pearson metric, it does well when measured by the Spearman rank correlation coefficient.

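| Model                      | PubMed QA | PubMed Subset | PubMed Summary | Average |
| -------------------------- | --------- | ------------- | -------------- | ------- |
| all-MiniLM-L6-v2           | 85.77     | 86.52         | 86.32          | 86.20   |
| bge-base-en-v1.5           | 85.71     | 86.58         | 86.35          | 86.21   |
| gte-base                   | 86.44     | 86.60         | 86.55          | 86.53   |
| pubmedbert-base-embeddings | 86.29     | 86.57         | 86.47          | 86.44   |
| pubmedbert-base-splade     | 86.80     | 89.12         | 88.60          | 88.17   |
| S-PubMedBert-MS-MARCO      | 85.71     | 86.37         | 86.13          | 86.07   |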

This suggests that the SPLADE model may be better at ordering pairs correctly (ranking), even where its raw similarity scores track the reference values less linearly than those of the dense models.
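The difference between the two metrics can be illustrated with a small, self-contained sketch. The data below is hypothetical toy data, not taken from the evaluation sets, and the helper functions are plain-Python stand-ins written for this example. The predicted scores preserve the order of the reference values exactly but are spread nonlinearly, so the Spearman coefficient is perfect while the Pearson coefficient is not.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: measures linear agreement between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical reference values and model scores (same order, nonlinear spacing).
reference = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [0.01, 0.02, 0.05, 0.30, 0.95]

print(f"pearson:  {pearson(reference, scores):.2f}")
print(f"spearman: {spearman(reference, scores):.2f}")
```

For retrieval, where only the order of results matters, a high rank correlation is often the more relevant property.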

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
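As a rough illustration of what the two modules do together, the sketch below applies the standard SPLADE pooling formulation to a tiny set of made-up logits: each masked-language-model logit is passed through log(1 + ReLU(x)), and the result is max-pooled over the input tokens to give one weight per vocabulary entry. This is an assumption-based sketch in plain Python, not the library's implementation. The function name `splade_pool`, the 5-entry vocabulary, and the logit values are all hypothetical, and details such as attention masks and batching are omitted.

```python
from math import log1p


def splade_pool(token_logits):
    """Pool per-token vocabulary logits into one sparse weight vector.

    token_logits: one list of vocabulary logits per input token.
    Returns one non-negative weight per vocabulary entry, computed as
    max over tokens of log(1 + relu(logit)).
    """
    vocab_size = len(token_logits[0])
    return [
        max(log1p(max(0.0, row[v])) for row in token_logits)
        for v in range(vocab_size)
    ]


# Hypothetical logits: 3 input tokens over a 5-entry vocabulary.
logits = [
    [-1.2, 0.0, 2.0, -0.5, 0.3],
    [-0.7, -3.0, 0.5, -0.1, 1.0],
    [-2.0, -1.0, -0.4, -0.9, 0.0],
]

vector = splade_pool(logits)

# Only vocabulary entries with a positive logit for some token stay active.
active = [i for i, weight in enumerate(vector) if weight > 0]
print(active)
```

Because ReLU zeroes every negative logit, most vocabulary entries end up with a weight of exactly zero, which is where the sparsity of the output comes from.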

More Information

The training data for this model is the same as described in this article. See this article for more on the training scripts.