No GQA in 210M model?

#5
by przvl - opened

Hello!

According to config.py and Appendix A of the paper (EuroBERT Model Architecture), the smallest model with 210M parameters has 12 attention heads and 12 key/value heads, while the larger models each have 18 attention heads and 6 key/value heads.
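For reference, this is a minimal sketch of how one could compare the head counts directly from the Hub configs; it assumes the model IDs follow the `EuroBERT/EuroBERT-<size>` naming and that the repos need `trust_remote_code=True`:

```python
from transformers import AutoConfig

# Assumed Hub IDs for the three EuroBERT checkpoints.
model_ids = ["EuroBERT/EuroBERT-210m", "EuroBERT/EuroBERT-610m", "EuroBERT/EuroBERT-2.1B"]

for model_id in model_ids:
    cfg = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    group_size = cfg.num_attention_heads // cfg.num_key_value_heads
    kind = "MHA" if group_size == 1 else f"GQA (group size {group_size})"
    print(f"{model_id}: {cfg.num_attention_heads} query heads, "
          f"{cfg.num_key_value_heads} KV heads -> {kind}")
```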

Does that mean the smallest model does not use Grouped Query Attention (GQA) but rather standard multi-head self-attention? Is there a particular reason for this?
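As I understand it, GQA with `num_key_value_heads == num_attention_heads` reduces to standard multi-head attention, since every query head then has its own key/value head. The sketch below is just an illustration of that relationship (not the actual EuroBERT modeling code):

```python
import torch

def grouped_attention(q, k, v, num_heads, num_kv_heads):
    # q: (batch, seq, num_heads, head_dim); k, v: (batch, seq, num_kv_heads, head_dim)
    group_size = num_heads // num_kv_heads
    # Repeat each key/value head so every query head has a matching KV head.
    # With num_kv_heads == num_heads the group size is 1, nothing is repeated,
    # and this is ordinary multi-head self-attention.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / q.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bkhd->bqhd", attn, v)
```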

Thanks!
