ModernBERT fails to work without FlashAttention !

#56
by benhachem - opened

Issue: ModernBERT Without Flash Attention on Tesla V100

Problem Description

I am trying to use ModernBERT without Flash Attention, since Flash Attention requires an Ampere GPU or newer (see the error below) and I only have a Tesla V100 16GB.

Error Screenshot
Figure 1: Error message indicating that Flash Attention requires an Ampere or newer GPU.

To bypass Flash Attention, I explored the transformers source code and found that ModernBERT allows alternative attention implementations like SDPA or Eager. This can be set using:

model.config._attn_implementation = ['flash_attention_2', 'eager', 'sdpa'][2]

However, this leads to the following unexpected error:

TypeError: ModernBertUnpaddedRotaryEmbedding.forward() got an unexpected keyword argument 'position_ids'

The error originates from the ModernBertAttention module, specifically when applying the rotary embeddings function on the qkv tensor.

Stack Trace Screenshot
Figure 2: Stack trace showing the unexpected keyword argument error in ModernBertUnpaddedRotaryEmbedding.
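
To make the failure mode in Figure 2 concrete, here is a minimal, self-contained sketch (a toy illustration, not the actual transformers code) of how this kind of TypeError arises: a module built for the Flash Attention 2 path gets called with the keyword arguments of the SDPA/eager path.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the two rotary-embedding variants; the real
# classes in transformers' modeling_modernbert.py differ in detail.
class UnpaddedRotaryEmbedding(nn.Module):
    """FA2-style rotary: expects packed qkv plus sequence-length metadata."""
    def forward(self, qkv, cu_seqlens=None, max_seqlen=None):
        return qkv  # rotation omitted; only the signature matters here

class PaddedRotaryEmbedding(nn.Module):
    """SDPA/eager-style rotary: expects position_ids instead."""
    def forward(self, qkv, position_ids=None):
        return qkv

# The model was *built* with the FA2 variant ...
rotary = UnpaddedRotaryEmbedding()

# ... but flipping config._attn_implementation afterwards makes the attention
# forward call it with SDPA-style keyword arguments:
qkv = torch.randn(2, 8, 3, 12, 64)
position_ids = torch.arange(8).unsqueeze(0)

try:
    rotary(qkv, position_ids=position_ids)
except TypeError as e:
    print(e)  # ... forward() got an unexpected keyword argument 'position_ids'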


Reproduction Code

Here is the full code to reproduce the issue:

# import flash_attn_2_cuda as flash_attn_cud
from transformers import AutoTokenizer, AutoModel

# Model name
model_name = "answerdotai/ModernBERT-base"  

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to('cuda')

# Force SDPA Attention instead of Flash Attention 2
model.config._attn_implementation = ['flash_attention_2', 'eager', 'sdpa'][2]

# Test the model
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model(**inputs)

Error Location

After investigating the source code, I found that the error occurs at line 180 of the following file:
🔗 Transformers GitHub - ModernBERT

Source Code Screenshot
Figure 3: Source code snippet highlighting the line where the error occurs in ModernBertAttention.


Thoughts and Next Steps

This issue restricts ModernBERT's usability for those without Ampere or newer GPUs, limiting the accessibility of Open Source models. It would be helpful if:

  • The model could gracefully fall back to SDPA or Eager Attention without errors (see the fallback sketch just after this list).
  • The position_ids issue in ModernBertUnpaddedRotaryEmbedding was addressed.
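
In the meantime, a defensive loading pattern along these lines may help others on pre-Ampere GPUs. This is an untested sketch; the exact exception raised when an implementation is unavailable depends on the transformers version, so it catches broadly and simply tries the next option.

from transformers import AutoModel

model_name = "answerdotai/ModernBERT-base"

# Try the fastest implementation first, then fall back to more portable ones.
model = None
for impl in ("flash_attention_2", "sdpa", "eager"):
    try:
        model = AutoModel.from_pretrained(model_name, attn_implementation=impl)
        print(f"Loaded with attn_implementation={impl}")
        break
    except Exception as exc:  # exact exception type varies across versions
        print(f"{impl} unavailable ({exc}); trying the next option.")

model = model.to("cuda")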

I hope this post helps locate the issue and contributes to making Open Source AI more accessible!

@tomaarsen I hope you'll have a chance to check this out! 😉

benhachem changed discussion title from Model does not work without FlashAttention ! to ModernBERT fails to work without FlashAttention !
Answer.AI org

Hello!

Odd, the unpadded rotary should not be used unless Flash Attention 2 is chosen. Additionally, Flash Attention 2 should only be chosen by default if it is compatible with your hardware and software, I believe.
Could you please experiment with this snippet:

# import flash_attn_2_cuda as flash_attn_cud
from transformers import AutoTokenizer, AutoModel

# Model name
model_name = "answerdotai/ModernBERT-base"  

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, attn_implementation="sdpa").to('cuda') # <- or "eager"
print(model.config._attn_implementation)

# Test the model
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model(**inputs)

In short, I suspect that loading the model with FA2 builds the entire model with the unpadded (i.e. FA2 only) layers, meaning that you cannot easily switch to another attention implementation after loading. Instead, we want to specify it during loading. Note: I've not tested this yet.
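
For completeness, here is a small (also untested) sketch of choosing the implementation up front from the GPU's compute capability, so that pre-Ampere cards like the V100 never request Flash Attention 2 in the first place:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "answerdotai/ModernBERT-base"

# Ampere and newer GPUs report compute capability >= 8.0; the V100 is 7.0,
# so it should request SDPA (or eager) instead of Flash Attention 2.
# (flash_attention_2 additionally requires the flash-attn package.)
major, _ = torch.cuda.get_device_capability()
attn_impl = "flash_attention_2" if major >= 8 else "sdpa"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, attn_implementation=attn_impl).to("cuda")
print(model.config._attn_implementation)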

  • Tom Aarsen

Wow, that was a super fast response!

It actually works with SDPA and eager!

Thanks so much, @tomaarsen !
