Improve model card: update pipeline tag and add library name (#2)
Commit 7214475326b5d36a30740783d10c0592976a1a79
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

@@ -1,9 +1,14 @@
 ---
-license: mit
 language:
 - en
-
+license: mit
+pipeline_tag: feature-extraction
+library_name: transformers
+tags:
+- modernbert
+- encoder
 ---
+
 # Ettin: an Open Suite of Paired Encoders and Decoders

 [](https://opensource.org/licenses/MIT)
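
The new `pipeline_tag` and `library_name` metadata advertise this card as a transformers feature-extraction model. As a quick illustration of what that tag implies, the sketch below runs the standard feature-extraction pipeline; the repository id `jhu-clsp/ettin-encoder-150m` is an assumption, since the card's own repo name is not visible in this diff.

```python
from transformers import pipeline

# Feature-extraction pipeline implied by `pipeline_tag: feature-extraction`;
# the model id here is an assumed member of the Ettin encoder family.
extractor = pipeline("feature-extraction", model="jhu-clsp/ettin-encoder-150m")

# Returns one embedding per token: shape [1, num_tokens, hidden_size].
features = extractor("Ettin is an open suite of paired encoders and decoders.")
print(len(features[0]), len(features[0][0]))
```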

@@ -82,11 +87,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

 Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

+1. **Identical training data** - Same high-quality mixture across all models
+2. **Open Training Data** - Fully released, including the batch-level training order for each of the 250+ checkpoints
+3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
+4. **Consistent training recipe** - Three-phase training with 2T tokens
+5. **Multiple scales** - From 17M to 1B parameters

 This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

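
As an illustration of the "matched architectures" point, the sketch below loads a paired encoder and decoder at the same scale and compares their parameter counts. The encoder repo id `jhu-clsp/ettin-encoder-150m` is an assumption (only the decoder id appears in this diff), and this is a minimal check rather than an official comparison script.

```python
from transformers import AutoModel

# Hypothetical paired checkpoints at the same scale; the encoder repo id is assumed.
encoder = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m")
decoder = AutoModel.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Matched architectures should give near-identical parameter counts.
print("encoder params:", encoder.num_parameters())
print("decoder params:", decoder.num_parameters())
```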

@@ -94,10 +99,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

 The training data is publicly available and split across different phases:

+- **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of a diverse data mixture
+- **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+- **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+- **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: `input_ids`, `step`)

 ## Model Family

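
Because these are ordinary Hugging Face datasets, they can be streamed rather than fully downloaded. A minimal sketch, assuming `jhu-clsp/ettin-data-order` loads with its default configuration, exposes a `train` split, and carries the `input_ids` and `step` columns listed above:

```python
from datasets import load_dataset

# Stream the batch-level training order instead of downloading the whole dataset.
order = load_dataset("jhu-clsp/ettin-data-order", split="train", streaming=True)

# Inspect the first training-batch record (columns: input_ids, step).
first = next(iter(order))
print(first["step"], len(first["input_ids"]))
```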

@@ -146,10 +151,10 @@ These models demonstrate what happens when you continue training encoders as dec
 |:-----|:------|:-----------|:------------|:---------|
 | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
 | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
+| Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
+| Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
+| Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
+| XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |

 **Example Usage for Cross-Objective Models:**
 ```python
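
# Illustrative sketch, not part of the original card (its own usage example continues
# outside this hunk): the cross-objective decoders in the table above load as ordinary
# causal LMs in transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")

inputs = tokenizer("Paired encoders and decoders make it possible to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))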

@@ -174,9 +179,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
 #### HuggingFace Format Checkpoints
 Each model repository contains multiple tagged versions representing different training stages:

+- **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+- **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+- **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
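
# Illustrative sketch, not part of the original snippet: a specific training stage can be
# selected with the standard `revision` argument of from_pretrained, using the tag names
# listed above; the repo id is the decoder checkpoint referenced elsewhere in this card.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")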

@@ -209,27 +214,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

 Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

+- **Identical Training Data**: Same 2T token mixture across all models
+- **Matched Architectures**: Only attention patterns and objectives differ
+- **Open Everything**: Training data, model weights, and batch-level training order
+- **Multiple Scales**: Fair comparison from 17M to 1B parameters
+- **250+ Checkpoints**: Complete training trajectory analysis

 ### Use Cases for Researchers

+- **Architecture Studies**: Compare encoder vs decoder capabilities fairly
+- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+- **Scaling Laws**: Study how architectural advantages change with scale
+- **Transfer Learning**: Investigate cross-objective training effectiveness
+- **Replication Studies**: First open replication of the ModernBERT training recipe

 ### Reproducibility

 All training artifacts are publicly available:
+- Training data with exact batch ordering
+- Model checkpoints every 8.5B tokens
+- Complete hyperparameter configurations
+- Training code and evaluation scripts

 ## Training Details

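
One way to work with the 250+ checkpoints is to enumerate a repository's tagged revisions and load the stages of interest. A minimal sketch, assuming the `jhu-clsp/ettin-decoder-150m` repository used above and the `step`/`ext`/`decay` tag scheme described earlier; it is not an official analysis script:

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

# Enumerate the tagged training-stage revisions of one model repository.
refs = list_repo_refs("jhu-clsp/ettin-decoder-150m")
tags = sorted(ref.name for ref in refs.tags)
print(f"{len(tags)} tagged checkpoints, e.g. {tags[:5]}")

# Load a couple of decay-phase checkpoints for a training-dynamics comparison.
for tag in [t for t in tags if t.startswith("decay")][:2]:
    model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision=tag)
    print(tag, model.num_parameters())
```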

@@ -238,14 +243,14 @@ All training artifacts are publicly available:
 **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

 **Training Phases:**
+- **Pre-training**: 1.7T tokens with a diverse data mixture
+- **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+- **Decay phase**: 100B tokens with premium data sources

 **Key Features:**
+- Context length: up to 8K tokens
+- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+- Deep but efficient architectures following MobileLLM principles

 ## Model Architecture

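
These figures are easy to check directly from the released configs. A minimal sketch, assuming the `jhu-clsp/ettin-decoder-150m` repository referenced above and standard transformers config fields (exact attribute names can vary by model class):

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("jhu-clsp/ettin-decoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Vocabulary size should match the 50,368-token ModernBERT tokenizer described above.
print("vocab size:", config.vocab_size, len(tokenizer))

# Depth/width settings reflect the "deep but efficient" design; field names vary by model.
print({k: v for k, v in config.to_dict().items() if "hidden" in k or "layer" in k})
```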

@@ -262,7 +267,7 @@ All training artifacts are publicly available:

 ### Encoder: Masked Language Modeling
 <details>
+<summary>Click to expand **encoder** usage examples</summary>

 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
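
# Illustrative sketch, not part of the original snippet; the encoder repo id below is an
# assumption, since this card's own repo name is not visible in the diff.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

text = f"Ettin pairs encoders and decoders trained on {tokenizer.mask_token} data."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Take the highest-scoring token at the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))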

@@ -296,7 +301,7 @@ print(f"Predictions: {predictions}")
 ### Decoder: Text Generation

 <details>
+<summary>Click to expand **decoder text generation**</summary>

 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -783,7 +788,8 @@ def main():
 model.push_to_hub(run_name)
 except Exception:
 logging.error(
+f"Error uploading model to the Hugging Face Hub:
+{traceback.format_exc()}To upload it manually, you can run "
 f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
 f"and saving it using `model.push_to_hub('{run_name}')`."
 )