---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- answerdotai/ModernBERT-large
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
---
# ModernBERT Cross-Encoder: Semantic Similarity (STS)

Cross-encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for building evaluators for LLM outputs.
They're simple to use, fast, and very accurate.

Like many people, I was excited about the architecture and training improvements behind ModernBERT (`answerdotai/ModernBERT-large`).
So I've applied it to the STS-B cross-encoder task, which yields a very handy model. Additionally, I've added
pretraining on my much larger semi-synthetic dataset `dleemiller/wiki-sim`, which targets this kind of objective.
The inference efficiency, expanded context length, and simplicity make this a really nice platform for an evaluator model.

---

## Features
- **High performing:** Achieves **Pearson: 0.9256** and **Spearman: 0.9215** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-large design (395M parameters), offering faster inference than earlier stsb cross-encoders.
- **Extended context length:** Processes sequences up to 8192 tokens, well suited to evaluating long LLM outputs.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

---

## Performance

| Model                      | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed      |
|----------------------------|--------------------|---------------------|----------------|------------|------------|
| `ModernCE-large-sts`       | **0.9256**         | **0.9215**          | **8192**       | 395M       | **Medium** |
| `ModernCE-base-sts`        | **0.9162**         | **0.9122**          | **8192**       | 149M       | **Fast**   |
| `stsb-roberta-large`       | 0.9147             | -                   | 512            | 355M       | Slow       |
| `stsb-distilroberta-base`  | 0.8792             | -                   | 512            | 66M        | Fast       |

---

## Usage

To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load the ModernCE model
model = CrossEncoder("dleemiller/ModernCE-large-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
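
Because the score is a bounded 0-1 similarity, it can be used directly as an LLM-output evaluator. Below is a minimal sketch of that pattern; the `is_semantically_equivalent` helper, the raised `max_length`, and the 0.85 threshold are illustrative assumptions, not recommendations from this model card.

```python
from sentence_transformers import CrossEncoder

# Raise max_length to take advantage of the 8192-token context window
# when comparing long generations against long references.
model = CrossEncoder("dleemiller/ModernCE-large-sts", max_length=8192)

def is_semantically_equivalent(generated: str, reference: str, threshold: float = 0.85) -> bool:
    """Hypothetical helper: treat a pair as equivalent if its similarity clears a threshold."""
    score = model.predict([(generated, reference)])[0]
    return float(score) >= threshold

print(is_semantically_equivalent(
    "The capital of France is Paris.",
    "Paris is France's capital city.",
))
```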

---

## Training Details

### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** A relatively large classifier dropout of 0.3 is used to reduce over-reliance on the teacher scores.
- **Objective:** STS-B scores from `cross-encoder/stsb-roberta-large` (a rough sketch of this scoring step is shown below).
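
To make the objective concrete, here is a minimal sketch of how teacher scores of this kind can be produced. It is an illustration only, not the actual data-generation script behind `dleemiller/wiki-sim`, and the example pairs are invented.

```python
from sentence_transformers import CrossEncoder

# Teacher model whose STS-B predictions serve as soft labels for pretraining.
teacher = CrossEncoder("cross-encoder/stsb-roberta-large")

# Invented unlabeled pairs standing in for the wiki-sim pair-score-sampled subset.
pairs = [
    ("A cat sits on the mat.", "A feline is resting on a rug."),
    ("The stock market fell today.", "Bird migration peaks in autumn."),
]

# Scores in [0, 1] become regression targets for the student cross-encoder.
soft_labels = teacher.predict(pairs)
print(soft_labels)
```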

### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
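
For readers who want to reproduce a similar setup, here is a minimal fine-tuning sketch using the classic `CrossEncoder.fit` API from `sentence-transformers`. The hyperparameters, the placeholder checkpoint path, and the assumed `sentence1`/`sentence2`/`score` column names are illustrative, not the exact recipe used to train this model.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample

# STS-B pairs; scores in this dataset are already normalized to [0, 1].
stsb = load_dataset("sentence-transformers/stsb", split="train")
train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for row in stsb
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# Placeholder path: in this model's pipeline, fine-tuning starts from the
# wiki-sim pretrained checkpoint rather than the raw ModernBERT weights.
model = CrossEncoder("path/to/wiki-sim-pretrained-checkpoint", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=3, warmup_steps=200)
model.save("ModernCE-large-sts")
```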

### Validation Results
The model achieved the following test set performance after fine-tuning:
- **Pearson Correlation:** 0.9256
- **Spearman Correlation:** 0.9215

Logs for training and evaluation are included in the [training logs](output/eval/sts-test-results.csv).
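
A rough way to check these correlations yourself is sketched below. It assumes the `sentence-transformers/stsb` test split with `sentence1`, `sentence2`, and `score` columns, and it computes the metrics with SciPy rather than the original evaluation script.

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

# STS-B test split; gold scores are normalized to [0, 1].
test = load_dataset("sentence-transformers/stsb", split="test")
pairs = list(zip(test["sentence1"], test["sentence2"]))
gold = test["score"]

pred = model.predict(pairs)
print("Pearson: ", pearsonr(pred, gold)[0])
print("Spearman:", spearmanr(pred, gold)[0])
```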

---

## Applications

1. **Semantic Search:** Retrieve relevant documents or text passages based on query similarity (a small reranking sketch follows this list).
2. **Retrieval-Augmented Generation (RAG):** Enhance generative models by providing contextually relevant information.
3. **Content Moderation:** Automatically classify or rank content based on similarity to predefined guidelines.
4. **Code Search:** Leverage the model's ability to understand code and natural language for large-scale programming tasks.
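
For the semantic search and RAG items above, the simplest pattern is to score each (query, passage) pair and sort by similarity. The query and passages below are invented examples for illustration.

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

query = "How do I reset my password?"
passages = [
    "To change your password, open Settings and choose Security.",
    "Our office is closed on public holidays.",
    "Password resets can be requested from the login page.",
]

# Score each (query, passage) pair and list passages from most to least similar.
scores = model.predict([(query, passage) for passage in passages])
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {passages[idx]}")
```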

---

## Model Card

- **Architecture:** ModernBERT-large
- **Tokenizer:** Inherited from `answerdotai/ModernBERT-large`, supporting long-context inputs.
- **Pretraining Data:** `dleemiller/wiki-sim` (`pair-score-sampled`)
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the AnswerAI team for providing the ModernBERT models, and to the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{moderncestsb2025,
  author = {Miller, D. Lee},
  title = {ModernCE STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/ModernCE-large-sts},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).