---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- answerdotai/ModernBERT-large
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
---
# ModernBERT Cross-Encoder: Semantic Similarity (STS)

Cross-encoders are high-performing encoder models that compare two texts and output a score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for creating evaluators for LLM outputs.
They're simple to use, fast, and very accurate.

Like many people, I was excited about the architecture and training uplift in ModernBERT (`answerdotai/ModernBERT-large`).
So I've applied it to the STS-B cross-encoder, which is a very handy model. Additionally, I've added
pretraining on my much larger semi-synthetic dataset, `dleemiller/wiki-sim`, which targets this same objective.
The inference efficiency, expanded context length, and simplicity make this a really nice platform for an evaluator model.

---

## Features
- **High-performing:** Achieves **Pearson: 0.9256** and **Spearman: 0.9215** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-large design (395M parameters), offering faster inference speeds.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

---

## Performance

| Model                     | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed      |
|---------------------------|--------------------|---------------------|----------------|------------|------------|
| `ModernCE-large-sts`      | **0.9256**         | **0.9215**          | **8192**       | 395M       | **Medium** |
| `ModernCE-base-sts`       | **0.9162**         | **0.9122**          | **8192**       | 149M       | **Fast**   |
| `stsb-roberta-large`      | 0.9147             | -                   | 512            | 355M       | Slow       |
| `stsb-distilroberta-base` | 0.8792             | -                   | 512            | 66M        | Fast       |

---

## Usage

To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load the ModernCE model
model = CrossEncoder("dleemiller/ModernCE-large-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.

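Because the model handles sequences up to 8192 tokens, the same `predict` call also works for longer texts, such as scoring an LLM output against a reference answer. A minimal sketch (the texts and the 0.8 threshold are purely illustrative):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

# Score an LLM answer against a reference answer (example texts are made up).
reference = "The mitochondrion generates most of the cell's supply of ATP."
llm_output = "Mitochondria produce the majority of a cell's ATP."

score = model.predict([(reference, llm_output)])[0]

# Illustrative acceptance threshold; tune it for your own evaluation task.
if score >= 0.8:
    print(f"similar enough ({score:.3f})")
else:
    print(f"too dissimilar ({score:.3f})")
```
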
---

## Training Details

### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** A relatively large classifier dropout of 0.3 was used to reduce over-reliance on the teacher scores.
- **Objective:** Match the STS-B-style scores produced by the teacher model, `cross-encoder/stsb-roberta-large`.

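For reference, this stage amounts to a standard `sentence-transformers` cross-encoder regression run over the wiki-sim pairs. A minimal sketch, assuming the subset exposes `sentence1`/`sentence2`/`score` columns (column names and hyperparameters are illustrative, not the exact recipe used):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Teacher-scored pairs from the pair-score-sampled subset
# (column names are assumed; check the dataset card).
ds = load_dataset("dleemiller/wiki-sim", "pair-score-sampled", split="train")
train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for row in ds
]

# Start from the ModernBERT-large backbone with a 1-output regression head.
# The 0.3 classifier dropout mentioned above would be set in the model config.
model = CrossEncoder("answerdotai/ModernBERT-large", num_labels=1)

model.fit(
    train_dataloader=DataLoader(train_samples, shuffle=True, batch_size=32),
    epochs=1,
    warmup_steps=1000,
)
```
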
### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

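The fine-tuning step follows the same pattern, continuing from the wiki-sim-pretrained checkpoint. A compact sketch (the checkpoint path and hyperparameters below are placeholders):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# STS-B pairs with similarity scores already normalized to [0, 1]
stsb = load_dataset("sentence-transformers/stsb", split="train")
train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for row in stsb
]

# "wiki-sim-pretrained-checkpoint" is a placeholder for the pretrained weights.
model = CrossEncoder("wiki-sim-pretrained-checkpoint", num_labels=1)
model.fit(
    train_dataloader=DataLoader(train_samples, shuffle=True, batch_size=32),
    epochs=4,
    warmup_steps=200,
)
```
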
### Validation Results
The model achieved the following test set performance after fine-tuning:
- **Pearson Correlation:** 0.9256
- **Spearman Correlation:** 0.9215

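To sanity-check these numbers, you can rescore the STS-B test split and compute the correlations directly. A minimal sketch (small deviations from the reported values are possible depending on hardware and library versions):

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

# Score every sentence pair in the STS-B test split.
test = load_dataset("sentence-transformers/stsb", split="test")
preds = model.predict(list(zip(test["sentence1"], test["sentence2"])))

print("Pearson: ", pearsonr(test["score"], preds)[0])
print("Spearman:", spearmanr(test["score"], preds)[0])
```
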
Logs from training and evaluation are included in [sts-test-results.csv](output/eval/sts-test-results.csv).

---

## Applications

1. **Semantic Search:** Retrieve and rerank relevant documents or text passages based on query similarity (see the sketch after this list).
2. **Retrieval-Augmented Generation (RAG):** Enhance generative models by providing contextually relevant information.
3. **Content Moderation:** Automatically classify or rank content based on similarity to predefined guidelines.
4. **Code Search:** Leverage the model's ability to understand code and natural language for large-scale programming tasks.

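For the semantic search case, a cross-encoder is typically used to rerank a small candidate set returned by a cheaper retriever. A minimal sketch (the query and candidate passages are illustrative):

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-large-sts")

query = "How do I rotate text in matplotlib?"
candidates = [
    "Use the rotation keyword of plt.xticks or ax.set_xticklabels to angle tick labels.",
    "Matplotlib supports saving figures as PNG, PDF, and SVG.",
    "Text objects accept a rotation argument, e.g. ax.text(x, y, s, rotation=45).",
]

# Score each (query, candidate) pair and print candidates from most to least similar.
scores = model.predict([(query, passage) for passage in candidates])
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {candidates[idx]}")
```
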
---

## Model Card

- **Architecture:** ModernBERT-large
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim` (`pair-score-sampled`)
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the AnswerAI team for providing the ModernBERT models, and the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{moderncestsb2025,
  author = {Miller, D. Lee},
  title = {ModernCE STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/ModernCE-large-sts},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).