No GQA in 210M model?
#5 · opened by przvl
Hello!
According to config.py and Appendix A (EuroBERT Model Architecture) of the paper, the smallest model with 210M parameters has 12 attention heads and 12 key/value heads, while the larger models each have 18 attention heads and 6 key/value heads.
Does that mean the smallest model does not use Grouped Query Attention (GQA) but rather standard multi-head self-attention? Is there a particular reason for this?
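For context, here is a minimal sketch of the arithmetic behind my reading of those numbers. The field names `num_attention_heads` and `num_key_value_heads` are my assumption of Llama-style config naming and not verified against the actual config.py; the head counts are the ones quoted above.

```python
# Sketch: how the quoted head counts map to MHA vs. GQA.
# Field names num_attention_heads / num_key_value_heads are assumed
# (Llama-style naming), not taken verbatim from EuroBERT's config.py.

head_configs = {
    "EuroBERT-210M": {"num_attention_heads": 12, "num_key_value_heads": 12},
    "larger EuroBERT models": {"num_attention_heads": 18, "num_key_value_heads": 6},
}

for name, cfg in head_configs.items():
    q_heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    group_size = q_heads // kv_heads  # query heads sharing one key/value head
    if group_size == 1:
        kind = "standard multi-head attention (each query head has its own KV head)"
    else:
        kind = f"grouped query attention ({group_size} query heads per KV head)"
    print(f"{name}: {q_heads} Q heads / {kv_heads} KV heads -> {kind}")
```

With 12 query heads and 12 key/value heads the group size is 1, which is just ordinary multi-head self-attention, whereas 18/6 gives groups of 3 query heads per KV head.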
Thanks!