No GQA in 210M model?
#5 · opened by przvl
Hello!
According to config.py and Appendix A (EuroBERT Model Architecture) of the paper, the smallest model with 210M parameters has 12 attention heads and 12 key/value heads, while the larger models each have 18 attention heads and 6 key/value heads.
Does that mean the smallest model does not use Grouped Query Attention (GQA) but rather standard multi-head self-attention? Is there a particular reason for this?
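For context, here is a minimal sketch of the arithmetic behind my reading of those numbers. The field names `num_attention_heads` and `num_key_value_heads` are my assumption of Llama-style config naming and not verified against the actual config.py; the head counts are the ones quoted above.

```python
# Sketch: how the quoted head counts map to MHA vs. GQA.
# Field names num_attention_heads / num_key_value_heads are assumed
# (Llama-style naming), not taken verbatim from EuroBERT's config.py.

head_configs = {
    "EuroBERT-210M": {"num_attention_heads": 12, "num_key_value_heads": 12},
    "larger EuroBERT models": {"num_attention_heads": 18, "num_key_value_heads": 6},
}

for name, cfg in head_configs.items():
    q_heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    group_size = q_heads // kv_heads  # query heads sharing one key/value head
    if group_size == 1:
        kind = "standard multi-head attention (each query head has its own KV head)"
    else:
        kind = f"grouped query attention ({group_size} query heads per KV head)"
    print(f"{name}: {q_heads} Q heads / {kv_heads} KV heads -> {kind}")
```

With 12 query heads and 12 key/value heads the group size is 1, which is just ordinary multi-head self-attention, whereas 18/6 gives groups of 3 query heads per KV head.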
Thanks!