orionweller nielsr HF Staff committed on
Commit befd76b · verified · 1 Parent(s): 77ab863

Improve model card: update pipeline tag and add library name (#2)


- Improve model card: update pipeline tag and add library name (7214475326b5d36a30740783d10c0592976a1a79)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +47 -41
README.md CHANGED
@@ -1,9 +1,14 @@
  ---
- license: mit
  language:
  - en
- pipeline_tag: fill-mask
  ---
  # Ettin: an Open Suite of Paired Encoders and Decoders

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -82,11 +87,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

- 1. **Identical training data** - Same high-quality mixture across all models
- 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
- 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
- 4. **Consistent training recipe** - Three-phase training with 2T tokens
- 5. **Multiple scales** - From 17M to 1B parameters

  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

@@ -94,10 +99,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

  The training data is publicly available and split across different phases:

- - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
- - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
- - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
- - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)

  ## Model Family

@@ -146,10 +151,10 @@ These models demonstrate what happens when you continue training encoders as de
  |:-----|:------|:-----------|:------------|:---------|
  | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
  | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
- | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
- | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
- | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
- | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |

  **Example Usage for Cross-Objective Models:**
  ```python
@@ -174,9 +179,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
  #### HuggingFace Format Checkpoints
  Each model repository contains multiple tagged versions representing different training stages:

- - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -209,27 +214,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- - **Identical Training Data**: Same 2T token mixture across all models
- - **Matched Architectures**: Only attention patterns and objectives differ
- - **Open Everything**: Training data, model weights, and batch-level training order
- - **Multiple Scales**: Fair comparison from 17M to 1B parameters
- - **250+ Checkpoints**: Complete training trajectory analysis

  ### Use Cases for Researchers

- - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
- - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
- - **Scaling Laws**: Study how architectural advantages change with scale
- - **Transfer Learning**: Investigate cross-objective training effectiveness
- - **Replication Studies**: First open replication of ModernBERT training recipe

  ### Reproducibility

  All training artifacts are publicly available:
- - Training data with exact batch ordering
- - Model checkpoints every 8.5B tokens
- - Complete hyperparameter configurations
- - Training code and evaluation scripts

  ## Training Details

@@ -238,14 +243,14 @@ All training artifacts are publicly available:
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

  **Training Phases:**
- - **Pre-training**: 1.7T tokens with diverse data mixture
- - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
- - **Decay phase**: 100B tokens with premium data sources

  **Key Features:**
- - Context length: Up to 8K tokens
- - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- - Deep but efficient architectures following MobileLLM principles

  ## Model Architecture

@@ -262,7 +267,7 @@

  ### Encoder: Masked Language Modeling
  <details>
- <summary>Click to expand <strong>encoder</strong> usage examples</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -296,7 +301,7 @@ print(f"Predictions: {predictions}")
  ### Decoder: Text Generation

  <details>
- <summary>Click to expand <strong>decoder text generation</strong></summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -783,7 +788,8 @@ def main():
  model.push_to_hub(run_name)
  except Exception:
  logging.error(
- f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
  f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
  f"and saving it using `model.push_to_hub('{run_name}')`."
  )
 
@@ -1,9 +1,14 @@
  ---
  language:
  - en
+ license: mit
+ pipeline_tag: feature-extraction
+ library_name: transformers
+ tags:
+ - modernbert
+ - encoder
  ---
+
  # Ettin: an Open Suite of Paired Encoders and Decoders

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
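
The new metadata declares the card as a `transformers` model served under the `feature-extraction` pipeline, with `modernbert`/`encoder` tags. As a rough illustration of that pipeline, here is a hedged sketch of pulling sentence embeddings out of an Ettin encoder; the repo id `jhu-clsp/ettin-encoder-150m` is an assumed example (this commit does not show which checkpoint the card belongs to), and mean pooling is just one common readout.

```python
# Hedged sketch: feature extraction with an Ettin encoder via plain transformers.
# Requires a transformers release with ModernBERT support; the model id is an assumption.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/ettin-encoder-150m"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer(["Ettin pairs open encoders and decoders."], return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over real (non-padding) tokens to get one vector per input.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)
```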
 
@@ -82,11 +87,11 @@ model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

  Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

+ 1. **Identical training data** - Same high-quality mixture across all models
+ 2. **Open Training Data** - Data is available now with batch-level training data for each of the 250+ checkpoints
+ 3. **Matched architectures** - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
+ 4. **Consistent training recipe** - Three-phase training with 2T tokens
+ 5. **Multiple scales** - From 17M to 1B parameters

  This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

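Because each pair is matched in size, data, and recipe, the comparison described in the list above boils down to loading two checkpoints side by side. A minimal sketch, assuming `jhu-clsp/ettin-encoder-150m` is the encoder counterpart of the `jhu-clsp/ettin-decoder-150m` id that appears in the hunk header:

```python
# Sketch: load a matched encoder/decoder pair and compare parameter counts.
# The encoder id is an assumption; the decoder id appears elsewhere in this card.
from transformers import AutoModelForMaskedLM, AutoModelForCausalLM

encoder = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")
decoder = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

def n_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"encoder parameters: {n_params(encoder):,}")
print(f"decoder parameters: {n_params(decoder):,}")
```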
 
@@ -94,10 +99,10 @@ This approach allows for true apples-to-apples comparisons between encoder and d

  The training data is publicly available and split across different phases:

+ - **Pre-training Data**: [jhu-clsp/ettin-pretraining-data](https://huggingface.co/datasets/jhu-clsp/ettin-pretraining-data) - 1.7T tokens of diverse data mixture
+ - **Mid-training/Extension Data**: [jhu-clsp/ettin-extension-data](https://huggingface.co/datasets/jhu-clsp/ettin-extension-data) - 250B tokens of higher-quality filtered data
+ - **Decay Phase Data**: [jhu-clsp/ettin-decay-data](https://huggingface.co/datasets/jhu-clsp/ettin-decay-data) - 100B tokens of premium data sources
+ - **Training Data Order**: [jhu-clsp/ettin-data-order](https://huggingface.co/datasets/jhu-clsp/ettin-data-order) - Batch-level training order (columns: input_ids, step)

  ## Model Family

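The four datasets linked in the hunk above are ordinary Hub datasets, so they can be inspected without downloading the full 1.7T-token mixture. A hedged sketch using streaming; the default config and the `train` split name are assumptions, so check the dataset cards for the exact layout:

```python
# Hedged sketch: peek at a few rows of the pretraining mixture without a full download.
from datasets import load_dataset

stream = load_dataset("jhu-clsp/ettin-pretraining-data", split="train", streaming=True)
for i, row in enumerate(stream):
    print(sorted(row.keys()))  # inspect the column names
    if i >= 2:
        break
```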
 
@@ -146,10 +151,10 @@ These models demonstrate what happens when you continue training encoders as de
  |:-----|:------|:-----------|:------------|:---------|
  | XXS | [ettin-decoder-from-encoder-17m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | 17M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
  | XS | [ettin-decoder-from-encoder-32m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | 32M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
+ | Small | [ettin-decoder-from-encoder-68m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | 68M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
+ | Base | [ettin-decoder-from-encoder-150m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | 150M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
+ | Large | [ettin-decoder-from-encoder-400m](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | 400M | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
+ | XL | [ettin-decoder-from-encoder-1b](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | 1B | Encoder → CLM continued training | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |

  **Example Usage for Cross-Objective Models:**
  ```python
 
@@ -174,9 +179,9 @@ All raw training checkpoints are available in the [jhu-clsp/ettin-checkpoints](h
  #### HuggingFace Format Checkpoints
  Each model repository contains multiple tagged versions representing different training stages:

+ - **`step{number}`** - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
+ - **`ext{number}`** - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
+ - **`decay{number}`** - Decay phase checkpoints (e.g., `decay100`, `decay500`)

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
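# Hedged aside: the step/ext/decay tags listed above are plain Hub revisions, so a
# specific training stage can presumably be loaded by passing the tag as `revision`.
# The model id and tag below are illustrative; confirm them against the full card.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m", revision="step599525")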
 
@@ -209,27 +214,27 @@ This checkpoint availability enables detailed analysis of training dynamics, los

  Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

+ - **Identical Training Data**: Same 2T token mixture across all models
+ - **Matched Architectures**: Only attention patterns and objectives differ
+ - **Open Everything**: Training data, model weights, and batch-level training order
+ - **Multiple Scales**: Fair comparison from 17M to 1B parameters
+ - **250+ Checkpoints**: Complete training trajectory analysis

  ### Use Cases for Researchers

+ - **Architecture Studies**: Compare encoder vs decoder capabilities fairly
+ - **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
+ - **Scaling Laws**: Study how architectural advantages change with scale
+ - **Transfer Learning**: Investigate cross-objective training effectiveness
+ - **Replication Studies**: First open replication of ModernBERT training recipe

  ### Reproducibility

  All training artifacts are publicly available:
+ - Training data with exact batch ordering
+ - Model checkpoints every 8.5B tokens
+ - Complete hyperparameter configurations
+ - Training code and evaluation scripts

  ## Training Details

 
@@ -238,14 +243,14 @@ All training artifacts are publicly available:
  **Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

  **Training Phases:**
+ - **Pre-training**: 1.7T tokens with diverse data mixture
+ - **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
+ - **Decay phase**: 100B tokens with premium data sources

  **Key Features:**
+ - Context length: Up to 8K tokens
+ - Vocabulary: 50,368 tokens (ModernBERT tokenizer)
+ - Deep but efficient architectures following MobileLLM principles

  ## Model Architecture

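The context-length and vocabulary figures listed above can be sanity-checked directly from a released checkpoint. A hedged sketch; the attribute names follow common `transformers` config conventions and may differ for this architecture, and the model id is simply the decoder checkpoint named earlier in the diff:

```python
# Hedged sketch: read the advertised limits from the checkpoint's config and tokenizer.
from transformers import AutoConfig, AutoTokenizer

model_id = "jhu-clsp/ettin-decoder-150m"  # named earlier in this diff
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("vocab size:", config.vocab_size)  # card says 50,368
print("max positions:", getattr(config, "max_position_embeddings", "n/a"))  # card says up to 8K
print("tokenizer entries:", len(tokenizer))
```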
 
@@ -262,7 +267,7 @@

  ### Encoder: Masked Language Modeling
  <details>
+ <summary>Click to expand **encoder** usage examples</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
 
@@ -296,7 +301,7 @@ print(f"Predictions: {predictions}")
  ### Decoder: Text Generation

  <details>
+ <summary>Click to expand **decoder text generation**</summary>

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
@@ -783,7 +788,8 @@ def main():
  model.push_to_hub(run_name)
  except Exception:
  logging.error(
+ f"Error uploading model to the Hugging Face Hub:
+ {traceback.format_exc()}To upload it manually, you can run "
  f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
  f"and saving it using `model.push_to_hub('{run_name}')`."
  )