ModernBERT fails to work without FlashAttention!
Issue: ModernBERT Without Flash Attention on Tesla V100
Problem Description
I am trying to use ModernBERT without Flash Attention, since Flash Attention requires an Ampere GPU or newer (see error below) and I only have a Tesla V100 16GB.
Figure 1: Error message indicating that Flash Attention requires an Ampere or newer GPU.
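For context, here is a quick way to confirm the hardware limitation (a minimal sketch, assuming PyTorch with CUDA is available): Flash Attention 2 requires compute capability 8.0 (Ampere) or newer, while the V100 reports 7.0.
import torch

# A Tesla V100 reports compute capability (7, 0); Flash Attention 2 needs
# Ampere or newer, i.e. (8, 0) or higher.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Meets Flash Attention 2 requirement:", (major, minor) >= (8, 0))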
To bypass Flash Attention, I explored the transformers source code and found that ModernBERT allows alternative attention implementations like SDPA or Eager. This can be set using:
model.config._attn_implementation = ['flash_attention_2', 'eager', 'sdpa'][2]
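As a sanity check, the model class also advertises which backends it supports. This is an untested sketch; the private flags below exist on recent transformers model classes but their names may change between versions.
from transformers import AutoModel

# Inspect the class-level support flags for the alternative attention backends.
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
print("flash_attention_2:", getattr(type(model), "_supports_flash_attn_2", "unknown"))
print("sdpa:", getattr(type(model), "_supports_sdpa", "unknown"))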
However, this leads to the following unexpected error:
TypeError: ModernBertUnpaddedRotaryEmbedding.forward() got an unexpected keyword argument 'position_ids'
The error originates from the ModernBertAttention module, specifically when the rotary embedding function is applied to the qkv tensor.
Figure 2: Stack trace showing the unexpected keyword argument error in ModernBertUnpaddedRotaryEmbedding.
Reproduction Code
Here is the full code to reproduce the issue:
# import flash_attn_2_cuda as flash_attn_cud
from transformers import AutoTokenizer, AutoModel
# Model name
model_name = "answerdotai/ModernBERT-base"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to('cuda')
# Force SDPA Attention instead of Flash Attention 2
model.config._attn_implementation = ['flash_attention_2', 'eager', 'sdpa'][2]
# Test the model
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model(**inputs)
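For what it's worth, printing the implementation reported by the config before and after the override (a small, untested sketch) shows that the attribute itself does change, even though the forward pass still fails:
from transformers import AutoModel

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base").to("cuda")
# On a machine with flash-attn installed, this likely reports 'flash_attention_2'.
print(model.config._attn_implementation)
model.config._attn_implementation = "sdpa"
# The config now reports 'sdpa', yet the error above still occurs at inference time.
print(model.config._attn_implementation)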
Error Location
After investigating the source code, I found that the error occurs at line 180 of the following file:
Transformers GitHub - ModernBERT
Figure 3: Source code snippet highlighting the line where the error occurs in ModernBertAttention.
Thoughts and Next Steps
This issue restricts ModernBERT's usability for those without Ampere GPUs, limiting accessibility to Open Source models. It would be helpful if:
- The model could gracefully fall back to SDPA or Eager Attention without errors (a rough sketch of such a user-side fallback follows this list).
- The position_ids issue in ModernBertUnpaddedRotaryEmbedding was addressed.
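In the meantime, a workaround along these lines might help. This is a rough, untested sketch that assumes flash-attn is installed whenever FA2 is selected; it picks the backend from the GPU's compute capability at load time:
import torch
from transformers import AutoModel

model_name = "answerdotai/ModernBERT-base"

# Use Flash Attention 2 only on Ampere (compute capability 8.0) or newer GPUs;
# otherwise fall back to SDPA, chosen at load time rather than after loading.
use_fa2 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)
attn_impl = "flash_attention_2" if use_fa2 else "sdpa"
model = AutoModel.from_pretrained(model_name, attn_implementation=attn_impl).to("cuda")
print("Using attention implementation:", model.config._attn_implementation)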
I hope this post helps locate the issue and contributes to making Open Source AI more accessible!
@tomaarsen I hope you'll have a chance to check this out!
Hello!
Odd, the unpadded rotary should not be used unless Flash Attention 2 is chosen. Additionally, Flash Attention 2 should only be chosen by default if it is compatible with your hardware and software, I believe.
Could you please experiment with this snippet:
# import flash_attn_2_cuda as flash_attn_cud
from transformers import AutoTokenizer, AutoModel
# Model name
model_name = "answerdotai/ModernBERT-base"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, attn_implementation="sdpa").to('cuda') # <- or "eager"
print(model.config._attn_implementation)
# Test the model
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model(**inputs)
In short, I suspect that loading the model with FA2 builds the entire model with the unpadded (i.e. FA2 only) layers, meaning that you cannot easily switch to another attention implementation after loading. Instead, we want to specify it during loading. Note: I've not tested this yet.
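A quick (also untested) way to check this would be to compare which rotary-embedding classes actually get instantiated for each loading mode, without relying on exact submodule paths:
from transformers import AutoModel

model_name = "answerdotai/ModernBERT-base"

# Compare the rotary-embedding classes built for the default load vs. an
# explicit SDPA load, by scanning module class names.
for impl in (None, "sdpa"):
    kwargs = {} if impl is None else {"attn_implementation": impl}
    m = AutoModel.from_pretrained(model_name, **kwargs)
    rotary = {type(mod).__name__ for mod in m.modules() if "Rotary" in type(mod).__name__}
    print(impl or "default", "->", m.config._attn_implementation, rotary)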
- Tom Aarsen
Wow, that was a super fast response!
It actually works with SDPA and eager!
Thanks so much, @tomaarsen !