# ModernBERT Cross-Encoder: Semantic Similarity (STS)

Cross encoders are high-performing encoder models that compare two texts and output a score between 0 and 1. I've found the `cross-encoders/roberta-large-stsb` model to be very useful for building evaluators for LLM outputs. Models like it are simple to use, fast and very accurate.

Like many people, I was excited about the architecture and training improvements in ModernBERT (`answerdotai/ModernBERT-base`), so I've applied it to the STS-B cross encoder. Additionally, I've added pretraining on my much larger semi-synthetic dataset, `dleemiller/wiki-sim`, which targets this kind of objective.

---

## Features

- **High performing:** Achieves **Pearson: 0.9162** and **Spearman: 0.9122** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-base design (149M parameters), offering faster inference speeds.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

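
Since the main use case named here is evaluating LLM outputs, the following is a minimal sketch of how a similarity score could be turned into a pass/fail check. The helper `evaluate_outputs`, the `threshold` value of 0.8, and the token-overlap stub scorer are all hypothetical names and choices made for illustration; they are not part of this model or of `sentence-transformers`. The stub only exists so the example runs without downloading weights.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]
# Any callable that maps a list of text pairs to one score per pair.
Scorer = Callable[[List[Pair]], Sequence[float]]


def evaluate_outputs(
    scorer: Scorer,
    references: List[str],
    candidates: List[str],
    threshold: float = 0.8,  # assumed cutoff; tune it on your own data
) -> List[Dict[str, object]]:
    """Score each (reference, candidate) pair and flag those at or above the threshold."""
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")
    pairs = list(zip(references, candidates))
    scores = scorer(pairs)
    return [
        {"reference": ref, "candidate": cand, "score": float(s), "passed": float(s) >= threshold}
        for (ref, cand), s in zip(pairs, scores)
    ]


def stub_scorer(pairs: List[Pair]) -> List[float]:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens (not the model)."""
    scores = []
    for a, b in pairs:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        union = ta | tb
        scores.append(len(ta & tb) / len(union) if union else 0.0)
    return scores


results = evaluate_outputs(
    stub_scorer,
    references=["the cat sat on the mat", "the cat sat on the mat"],
    candidates=["the cat sat on the mat", "stocks fell sharply today"],
    threshold=0.5,
)
for row in results:
    print(row["passed"], round(row["score"], 3), row["candidate"])
```

To use the actual model instead of the stub, the `predict` method of a loaded `CrossEncoder("dleemiller/ModernCE-base-sts")` can be passed as `scorer`, since it accepts a list of text pairs and returns one score per pair.
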
---

## Performance

| Model                     | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed*   |
|---------------------------|--------------------|---------------------|----------------|------------|----------|
| **ModernCE-base-sts**     | **0.9162**         | **0.9122**          | **8192**       | 149M       | **Fast** |
| `roberta-large-stsb`      | 0.9147             | 0.9115              | 512            | 355M       | Slow     |
| `distilroberta-base-stsb` | 0.8792             | 0.8765              | 512            | 66M        | Fast     |

---

## Usage

To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load ModernCE model
model = CrossEncoder("dleemiller/ModernCE-base-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output

The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.

---

## Training Details

### Pretraining

The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.

- **Classifier Dropout:** 0.3, to introduce regularization and reduce overfitting.
- **Objective:** STS-B scores from `roberta-large-stsb`.

### Fine-Tuning

Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

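
The Pearson and Spearman figures quoted for this model are correlations between predicted scores and gold similarity labels. As a rough, dependency-free sketch of what those two metrics measure, the helpers `pearson`, `average_ranks` and `spearman` are written here from the textbook definitions; they are illustrative and are not the evaluation code used for this model. The `gold` and `predicted` values are made-up toy numbers, not results.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data only: five hypothetical gold labels and model predictions.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.15, 0.60, 0.70, 0.95]

print("pearson:", pearson(gold, predicted))
print("spearman:", spearman(gold, predicted))
```

Pearson rewards predictions that are linearly related to the labels, while Spearman only checks that the ordering of pairs is preserved, which is why the two numbers usually differ slightly.
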
### Validation Results

The model achieved the following test set performance after fine-tuning:

- **Pearson Correlation:** 0.9162
- **Spearman Correlation:** 0.9122

Logs for training and evaluation are included in the [training logs](output/eval/sts-test-results.csv).

---

## Applications

1. **Semantic Search:** Retrieve relevant documents or text passages based on query similarity.
2. **Retrieval-Augmented Generation (RAG):** Enhance generative models by providing contextually relevant information.
3. **Content Moderation:** Automatically classify or rank content based on similarity to predefined guidelines.
4. **Code Search:** Leverage the model's ability to understand code and natural language for large-scale programming tasks.

---

## Model Card

- **Architecture:** ModernBERT-base
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the AnswerAI team for providing the ModernBERT models, and the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{moderncestsb2025,
  author = {Miller, D. Lee},
  title = {ModernCE STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/ModernCE-base-sts},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).