
# ModernBERT Cross-Encoder: Semantic Similarity (STS)

Cross encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1. I've found the cross-encoders/roberta-large-stsb model very useful for building evaluators for LLM outputs. They're simple to use, fast, and very accurate.
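As a minimal sketch of that evaluator use case (the reference answer, generated output, and 0.8 threshold below are made up for illustration):

```python
from sentence_transformers import CrossEncoder

# Any STS cross encoder works here; this sketch uses the model from this card.
model = CrossEncoder("dleemiller/ModernCE-base-sts")

# Hypothetical reference answer and LLM output to grade against it.
reference = "The capital of France is Paris."
llm_output = "Paris is France's capital city."

# predict() returns a similarity score in [0, 1] for each pair.
score = model.predict([(reference, llm_output)])[0]
print(f"similarity: {score:.3f}")

# Example policy: treat the output as faithful if similarity clears a threshold.
# The 0.8 cutoff is arbitrary and should be tuned on your own data.
if score >= 0.8:
    print("LLM output matches the reference closely.")
```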

Like many people, I was excited about the architecture and training uplift from ModernBERT (answerdotai/ModernBERT-base), so I've applied it to the stsb cross encoder, which is a very handy model. I've also added pretraining on my much larger semi-synthetic dataset, dleemiller/wiki-sim, which targets this kind of objective.


## Features

- **High performing:** Achieves Pearson 0.9162 and Spearman 0.9122 on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-base design (149M parameters), offering faster inference speeds.
- **Extended context length:** Processes sequences up to 8192 tokens, well suited to evaluating long LLM outputs (see the sketch after this list).
- **Diversified training:** Pretrained on dleemiller/wiki-sim and fine-tuned on sentence-transformers/stsb.
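A small sketch of the extended-context setup; `max_length` is the standard CrossEncoder constructor argument, and the long strings are placeholders:

```python
from sentence_transformers import CrossEncoder

# Raise max_length to use the full 8192-token context window
# (otherwise the tokenizer/model default applies).
model = CrossEncoder("dleemiller/ModernCE-base-sts", max_length=8192)

# Placeholder long texts, e.g. a lengthy LLM answer vs. a reference document.
long_reference = " ".join(["The report describes quarterly revenue growth."] * 200)
long_answer = " ".join(["Revenue grew over the quarter, the report says."] * 200)

score = model.predict([(long_reference, long_answer)])[0]
print(score)
```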

## Performance

| Model | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed |
|---|---|---|---|---|---|
| ModernCE-base-sts | 0.9162 | 0.9122 | 8192 | 149M | Fast |
| roberta-large-stsb | 0.9147 | 0.9115 | 512 | 355M | Slow |
| distilroberta-base-stsb | 0.8792 | 0.8765 | 512 | 66M | Fast |
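A rough sketch of how the STS-B test numbers above could be reproduced; the dataset column names (`sentence1`, `sentence2`, `score`) are taken from the sentence-transformers/stsb dataset card, where scores are already normalized to [0, 1]:

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-base-sts")

# STS-B test split with gold similarity scores normalized to [0, 1].
test = load_dataset("sentence-transformers/stsb", split="test")

pairs = list(zip(test["sentence1"], test["sentence2"]))
predictions = model.predict(pairs, batch_size=64)

print("Pearson: ", pearsonr(predictions, test["score"])[0])
print("Spearman:", spearmanr(predictions, test["score"])[0])
```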

## Usage

To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face sentence-transformers library:

```python
from sentence_transformers import CrossEncoder

# Load ModernCE model
model = CrossEncoder("dleemiller/ModernCE-base-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output

The model returns similarity scores in the range [0, 1], where higher scores indicate stronger semantic similarity.


## Training Details

### Pretraining

The model was pretrained on the pair-score-sampled subset of the dleemiller/wiki-sim dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.

- **Classifier Dropout:** 0.3, to introduce regularization and reduce overfitting.
- **Objective:** Regression on STS-B-style similarity scores produced by roberta-large-stsb (a rough sketch of this setup follows below).
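The exact pretraining script isn't included here; the following is only a sketch of that kind of score-distillation objective, using the classic `CrossEncoder.fit` API (sentence-transformers v3 and earlier) and the public `cross-encoder/stsb-roberta-large` checkpoint as the teacher. The wiki-sim subset and column names are assumptions.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample

# Teacher: the roberta-large STS-B cross encoder referenced above.
teacher = CrossEncoder("cross-encoder/stsb-roberta-large")

# Sentence pairs from the wiki-sim pretraining data (subset name assumed).
data = load_dataset("dleemiller/wiki-sim", "pair-score-sampled", split="train")
pairs = list(zip(data["sentence1"], data["sentence2"]))  # column names assumed

# Label each pair with the teacher's similarity score.
labels = teacher.predict(pairs, batch_size=64)
examples = [InputExample(texts=list(p), label=float(s)) for p, s in zip(pairs, labels)]

# Student: ModernBERT-base with a single regression head.
# (A classifier dropout of 0.3 would be set on the underlying model config.)
student = CrossEncoder("answerdotai/ModernBERT-base", num_labels=1, max_length=8192)

# CrossEncoder.fit attaches its own collate function to the dataloader.
train_dataloader = DataLoader(examples, shuffle=True, batch_size=32)
student.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=1000)
```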

### Fine-Tuning

Fine-tuning was performed on the sentence-transformers/stsb dataset.
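Again only a sketch, not the exact recipe: it continues from a pretrained checkpoint (the path below is a placeholder) and uses the same classic fit API on the gold STS-B labels.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample

# Placeholder path for a checkpoint produced by the pretraining step above.
model = CrossEncoder("path/to/pretrained-checkpoint", num_labels=1)

# Gold STS-B pairs with scores normalized to [0, 1].
stsb = load_dataset("sentence-transformers/stsb", split="train")
examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for s1, s2, score in zip(stsb["sentence1"], stsb["sentence2"], stsb["score"])
]

train_dataloader = DataLoader(examples, shuffle=True, batch_size=32)
model.fit(train_dataloader=train_dataloader, epochs=3, warmup_steps=500)
model.save("moderce-base-sts-finetuned")
```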

### Validation Results

The model achieved the following test set performance after fine-tuning:

- **Pearson Correlation:** 0.9162
- **Spearman Correlation:** 0.9122

Training and evaluation logs are included in the repository.


## Applications

  1. Semantic Search: Retrieve relevant documents or text passages based on query similarity (see the reranking sketch after this list).
  2. Retrieval-Augmented Generation (RAG): Enhance generative models by providing contextually relevant information.
  3. Content Moderation: Automatically classify or rank content based on similarity to predefined guidelines.
  4. Code Search: Leverage the model's ability to understand code and natural language for large-scale programming tasks.
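For the semantic-search use case in item 1, a minimal reranking sketch; the query and candidate passages are made up, and a first-stage retriever (BM25 or a bi-encoder) would normally supply the candidates:

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-base-sts")

query = "How do I rotate an API key?"
# Hypothetical candidates returned by a first-stage retriever.
candidates = [
    "Rotating credentials: generate a new key, then revoke the old one.",
    "Our office relocated to the third floor last spring.",
    "API keys can be regenerated from the account settings page.",
]

# Score each (query, passage) pair and print passages by descending similarity.
scores = model.predict([(query, passage) for passage in candidates])
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {candidates[idx]}")
```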

## Model Card

- **Architecture:** ModernBERT-base
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** dleemiller/wiki-sim (pair-score-sampled)
- **Fine-Tuning Data:** sentence-transformers/stsb

## Thank You

Thanks to the AnswerAI team for providing the ModernBERT models, and the Sentence Transformers team for their leadership in transformer encoder models.


## Citation

If you use this model in your research, please cite:

```bibtex
@misc{moderncestsb2025,
  author = {Miller, D. Lee},
  title = {ModernCE STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/ModernCE-base-sts},
}
```

## License

This model is licensed under the MIT License.