---
library_name: transformers
license: apache-2.0
language:
- en
widget:
- text: "You will be given a question and options. Select the right answer.
QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
CHOICES:
- A: commutative semi group
- B: abelian group
- C: non-abelian group
- D: None of these
ANSWER: [unused0] [MASK]"
tags:
- fill-mask
- masked-lm
- long-context
- classification
- modernbert
pipeline_tag: fill-mask
inference: false
---
# ModernBERT-Large-Instruct
## Table of Contents
1. [Model Summary](#model-summary)
2. [Usage](#usage)
3. [Evaluation](#evaluation)
4. [Limitations](#limitations)
5. [Training](#training)
6. [License](#license)
7. [Citation](#citation)
## Model Summary
ModernBERT-Large-Instruct is a lightly instruction-tuned version of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large), trained using a mixed objective (Answer Token Prediction & Dummy MLM) on 20M examples sampled from the FLAN collection.
Despite a very straightforward training and inference pipeline, it proves to be a very strong model across a variety of tasks, in both zero-shot and fully fine-tuned settings.
For more details, we recommend checking out the [TIL Blog Post](), the [mini cookbook GitHub repository](https://github.com/AnswerDotAI/ModernBERT-Instruct-mini-cookbook) or the [Technical Report](https://arxiv.org/abs/2502.03793).
## Usage
In order to use ModernBERT-Large-Instruct, you need to install a version of `transformers` which natively supports ModernBERT (4.48+):
```sh
pip install -U "transformers>=4.48.0"
```
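If you are unsure which version is already installed, here is a quick sanity check (a minimal sketch using `packaging`, which is a dependency of `transformers`):
```python
import transformers
from packaging import version

# ModernBERT support landed in transformers 4.48; check the installed version.
installed = version.parse(transformers.__version__)
assert installed >= version.parse("4.48.0"), f"transformers {installed} is too old for ModernBERT"
```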
**⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**
```bash
pip install flash-attn
```
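If you want a single code path that works whether or not `flash-attn` is installed, one option (an illustrative sketch, not part of the original instructions) is to detect the package first and fall back to the default attention implementation:
```python
import importlib.util

# Request Flash Attention 2 only when the flash-attn package is importable;
# otherwise let transformers choose its default attention implementation.
attn_kwargs = (
    {"attn_implementation": "flash_attention_2"}
    if importlib.util.find_spec("flash_attn") is not None
    else {}
)
# Usage: AutoModelForMaskedLM.from_pretrained(model_name, **attn_kwargs)
```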
All tasks are then performed using the model's Masked Language Modelling head, loaded via `AutoModelForMaskedLM`. Here is an example of answering an MMLU question:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load model and tokenizer
model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device == 'cuda':
    # Flash Attention 2 requires a CUDA GPU with the flash-attn package installed
    model = AutoModelForMaskedLM.from_pretrained(model_name, attn_implementation="flash_attention_2")
else:
    model = AutoModelForMaskedLM.from_pretrained(model_name)
model.to(device)
# Format input for classification or multiple choice. This is a random example from MMLU.
text = """You will be given a question and options. Select the right answer.
QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
CHOICES:
- A: commutative semi group
- B: abelian group
- C: non-abelian group
- D: None of these
ANSWER: [unused0] [MASK]"""
# Get prediction (no gradients needed at inference time)
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and take the highest-scoring token as the answer
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = outputs.logits[0, mask_idx].argmax()
answer = tokenizer.decode(pred_id)
print(f"Predicted answer: {answer}")  # Outputs: B
```
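For classification-style tasks where the label set is known in advance, you can also restrict the prediction to the candidate answer tokens instead of taking an argmax over the full vocabulary. Below is a minimal sketch reusing `tokenizer`, `outputs` and `mask_idx` from the example above; how the answer letters are tokenized is an assumption here, so both the bare and space-prefixed encodings are scored:
```python
# Constrain the prediction to the candidate answer letters at the [MASK] position.
choices = ["A", "B", "C", "D"]
mask_logits = outputs.logits[0, mask_idx]

def choice_score(letter):
    # Assumption: the answer letter is a single token; keep the higher score of the
    # bare and space-prefixed encodings, since tokenization may differ.
    candidate_ids = {
        tokenizer(letter, add_special_tokens=False).input_ids[0],
        tokenizer(f" {letter}", add_special_tokens=False).input_ids[0],
    }
    return max(mask_logits[token_id].item() for token_id in candidate_ids)

scores = {letter: choice_score(letter) for letter in choices}
print(f"Constrained prediction: {max(scores, key=scores.get)}")  # Expected: B
```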
## Evaluation
Results are taken from the [technical report](https://arxiv.org/abs/2502.03793). Results for MMLU and MMLU-Pro are taken from [SmolLM2 (†)](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and the [MMLU-Pro leaderboard (‡)](https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro) whenever possible.
### Zero-Shot
| Model | MMLU | MMLU-Pro | ADEv2 | NIS | OSE | Average |
|---------------------------|-----------|-------------|------------|----------|------------|-----------|
| **0.3-0.5B** | | | | | | |
| Tasksource-NLI | 36.08 | 16.54 | _65.17_ | 58.72 | 21.11 | _39.52_ |
| RoBERTa-Large-SST | 31.30 | 13.63 | 43.61 | 75.00 | **40.67** | 40.84 |
| UniMC | 38.48 | **18.83** | 23.29 | 73.96 | 36.88 | 38.29 |
| ModernBERT-Large-Instruct | **43.06** | 17.16 | **53.31** | **85.53**| 20.62 | **43.94**|
| SmolLM2-360M               | 35.8†     | 11.38‡      | -          | -        | -          | -         |
| Qwen2-0.5B | 33.7† | 15.93‡ | - | - | - | - |
| **1B+** | | | | | | |
| Llama3.2-1B | 45.83 | 22.6 | - | - | - | - |
| SmolLM2-1.7B               | 48.44     | 18.31‡      | -          | -        | -          | -         |
| Qwen2.5-1.5B | ***59.67***| ***32.1‡***| - | - | - | - |
### Fine-Tuned
| Model | MNLI | Yahoo! | 20ng | AGNews | SST-2 | IMDB | SST-5 | Average |
|----------------------------|-----------|---------|-----------|----------|------------|---------|-----------|----------|
| ModernBERT (cls head) | 90.8† | 77.75 | **73.96** | **95.34**| **97.1†** | 96.52 | 59.28 | 84.39 |
| ModernBERT-Large-Instruct | **91.03** | **77.88**| **73.96** | 95.24 | 96.22 | **97.2**| **61.13** | **84.67**|
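The fine-tuned numbers above keep the same answer-token formulation rather than adding a classification head. Here is a minimal sketch of what such a fine-tuning setup can look like, shown on an invented binary sentiment example; the template, the "A"/"B" verbalizers, the label encoding, and the single-step loop are illustrative assumptions, not the exact recipe from the report:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative template and verbalizers for a binary sentiment task (assumptions).
TEMPLATE = (
    "You will be given a sentence and options. Select the right answer.\n"
    "SENTENCE: {sentence}\n"
    "CHOICES:\n- A: negative\n- B: positive\n"
    "ANSWER: [unused0] [MASK]"
)

def encode_example(sentence, answer_letter):
    enc = tokenizer(TEMPLATE.format(sentence=sentence), truncation=True, max_length=512)
    # Supervise only the [MASK] position; -100 labels are ignored by the MLM loss.
    labels = [-100] * len(enc["input_ids"])
    mask_pos = enc["input_ids"].index(tokenizer.mask_token_id)
    # Assumption: the target is the space-prefixed answer letter as a single token.
    labels[mask_pos] = tokenizer(f" {answer_letter}", add_special_tokens=False).input_ids[0]
    enc["labels"] = labels
    return enc

# One gradient step on a single example; a real run would use a Dataset, batching and Trainer.
example = encode_example("A wonderful film.", "B")
batch = {
    "input_ids": torch.tensor([example["input_ids"]]),
    "attention_mask": torch.tensor([example["attention_mask"]]),
    "labels": torch.tensor([example["labels"]]),
}
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**batch).loss
loss.backward()
optimizer.step()
```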
## Limitations
ModernBERT’s training data is primarily English text and code, so performance is best in these domains. ModernBERT-Large-Instruct is a first version, demonstrating the strong potential of using the MLM head for downstream tasks without complex pipelines. However, it likely has failure cases and could be improved further.
## License
Apache 2.0
## Citation
If you use ModernBERT-Large-Instruct in your work, please cite:
```bibtex
@misc{clavié2025itsmasksimpleinstructiontuning,
      title={It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers},
      author={Benjamin Clavié and Nathan Cooper and Benjamin Warner},
      year={2025},
      eprint={2502.03793},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03793},
}
```