---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- answerdotai/ModernBERT-large
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
---
# ModernBERT Cross-Encoder: Semantic Similarity (STS)

Cross-encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for building evaluators for LLM outputs.
They're simple to use, fast, and very accurate.

Like many people, I was excited about the architecture and training improvements behind ModernBERT (`answerdotai/ModernBERT-large`).
So I've applied it to the STS-B cross-encoder task, which yields a very handy model. Additionally, I've added
pretraining on my much larger semi-synthetic dataset `dleemiller/wiki-sim`, which targets this kind of objective.
The inference efficiency, expanded context length, and simplicity make this a really nice platform for an evaluator model.

---

## Features
- **High performing:** Achieves **Pearson: 0.9256** and **Spearman: 0.9215** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-large design (395M parameters), offering faster inference than earlier stsb cross-encoders.
- **Extended context length:** Processes sequences up to 8192 tokens, well suited to evaluating long LLM outputs.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

---

## Performance

| Model                      | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed      |
|----------------------------|--------------------|---------------------|----------------|------------|------------|
| `ModernCE-large-sts`       | **0.9256**         | **0.9215**          | **8192**       | 395M       | **Medium** |
| `ModernCE-base-sts`        | **0.9162**         | **0.9122**          | **8192**       | 149M       | **Fast**   |
| `stsb-roberta-large`       | 0.9147             | -                   | 512            | 355M       | Slow       |
| `stsb-distilroberta-base`  | 0.8792             | -                   | 512            | 66M        | Fast       |

---

## Usage

To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load the ModernCE model
model = CrossEncoder("dleemiller/ModernCE-large-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
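
Because the score is a bounded 0-1 similarity, it can be used directly as an LLM-output evaluator. Below is a minimal sketch of that pattern; the `is_semantically_equivalent` helper, the raised `max_length`, and the 0.85 threshold are illustrative assumptions, not recommendations from this model card.

```python
from sentence_transformers import CrossEncoder

# Raise max_length to take advantage of the 8192-token context window
# when comparing long generations against long references.
model = CrossEncoder("dleemiller/ModernCE-large-sts", max_length=8192)

def is_semantically_equivalent(generated: str, reference: str, threshold: float = 0.85) -> bool:
    """Hypothetical helper: treat a pair as equivalent if its similarity clears a threshold."""
    score = model.predict([(generated, reference)])[0]
    return float(score) >= threshold

print(is_semantically_equivalent(
    "The capital of France is Paris.",
    "Paris is France's capital city.",
))
```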

---

## Training Details

### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** A relatively large classifier dropout of 0.3 is used to reduce over-reliance on the teacher scores.
- **Objective:** STS-B scores from `cross-encoder/stsb-roberta-large` (a rough sketch of this scoring step is shown below).
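
To make the objective concrete, here is a minimal sketch of how teacher scores of this kind can be produced. It is an illustration only, not the actual data-generation script behind `dleemiller/wiki-sim`, and the example pairs are invented.

```python
from sentence_transformers import CrossEncoder

# Teacher model whose STS-B predictions serve as soft labels for pretraining.
teacher = CrossEncoder("cross-encoder/stsb-roberta-large")

# Invented unlabeled pairs standing in for the wiki-sim pair-score-sampled subset.
pairs = [
    ("A cat sits on the mat.", "A feline is resting on a rug."),
    ("The stock market fell today.", "Bird migration peaks in autumn."),
]

# Scores in [0, 1] become regression targets for the student cross-encoder.
soft_labels = teacher.predict(pairs)
print(soft_labels)
```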

### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
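
For readers who want to reproduce a similar setup, here is a minimal fine-tuning sketch using the classic `CrossEncoder.fit` API from `sentence-transformers`. The hyperparameters, the placeholder checkpoint path, and the assumed `sentence1`/`sentence2`/`score` column names are illustrative, not the exact recipe used to train this model.

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample

# STS-B pairs; scores in this dataset are already normalized to [0, 1].
stsb = load_dataset("sentence-transformers/stsb", split="train")
train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for row in stsb
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# Placeholder path: in this model's pipeline, fine-tuning starts from the
# wiki-sim pretrained checkpoint rather than the raw ModernBERT weights.
model = CrossEncoder("path/to/wiki-sim-pretrained-checkpoint", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=3, warmup_steps=200)
model.save("ModernCE-large-sts")
```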

### Validation Results
The model achieved the following test set performance after fine-tuning:
- **Pearson Correlation:** 0.9256
- **Spearman Correlation:** 0.9215

Logs for training and evaluation are included in the [training logs](output/eval/sts-test-results.csv).
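
A rough way to check these correlations yourself is sketched below. It assumes the `sentence-transformers/stsb` test split with `sentence1`, `sentence2`, and `score` columns, and it computes the metrics with SciPy rather than the original evaluation script.

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

# STS-B test split; gold scores are normalized to [0, 1].
test = load_dataset("sentence-transformers/stsb", split="test")
pairs = list(zip(test["sentence1"], test["sentence2"]))
gold = test["score"]

pred = model.predict(pairs)
print("Pearson: ", pearsonr(pred, gold)[0])
print("Spearman:", spearmanr(pred, gold)[0])
```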

---

## Applications

1. **Semantic Search:** Retrieve relevant documents or text passages based on query similarity (a small reranking sketch follows this list).
2. **Retrieval-Augmented Generation (RAG):** Enhance generative models by providing contextually relevant information.
3. **Content Moderation:** Automatically classify or rank content based on similarity to predefined guidelines.
4. **Code Search:** Leverage the model's ability to understand code and natural language for large-scale programming tasks.
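
For the semantic search and RAG items above, the simplest pattern is to score each (query, passage) pair and sort by similarity. The query and passages below are invented examples for illustration.

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

query = "How do I reset my password?"
passages = [
    "To change your password, open Settings and choose Security.",
    "Our office is closed on public holidays.",
    "Password resets can be requested from the login page.",
]

# Score each (query, passage) pair and list passages from most to least similar.
scores = model.predict([(query, passage) for passage in passages])
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {passages[idx]}")
```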

---

## Model Card

- **Architecture:** ModernBERT-large
- **Tokenizer:** Inherited from `answerdotai/ModernBERT-large`, supporting long-context inputs.
- **Pretraining Data:** `dleemiller/wiki-sim` (`pair-score-sampled`)
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the AnswerAI team for providing the ModernBERT models, and to the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{moderncestsb2025,
  author = {Miller, D. Lee},
  title = {ModernCE STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/ModernCE-large-sts},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).