Going deeper vs wider

#5 opened by sohampnow

Hi again! How did you arrive at the specific configuration for num_hidden_layers, hidden_size and intermediate_size?

Given a fixed parameter (or latency) budget, do you have any insights into what helps model quality more: adding more layers, or reducing the hidden/intermediate sizes? A rough comparison is sketched below.
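
For concreteness, here is a minimal sketch of how two configs can be compared under roughly the same parameter budget. It assumes a standard dense decoder block with a SwiGLU-style MLP, ignores embeddings, norms, and biases, and the specific numbers are hypothetical, not taken from any released model:

```python
# Back-of-the-envelope parameter count per decoder block, to compare
# "deeper vs. wider" configs under a fixed budget.
# Assumptions: dense block, SwiGLU-style MLP (3 projections), no biases,
# embeddings/norms ignored.

def approx_params(num_hidden_layers: int, hidden_size: int, intermediate_size: int) -> int:
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    return num_hidden_layers * (attn + mlp)

# Two hypothetical configs: deeper/narrower vs. shallower/wider.
deep = approx_params(num_hidden_layers=32, hidden_size=2048, intermediate_size=8192)
wide = approx_params(num_hidden_layers=16, hidden_size=2880, intermediate_size=11520)
print(f"deep: {deep / 1e9:.2f}B  wide: {wide / 1e9:.2f}B")  # both land around ~2.1B
```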

Thanks!

Width vs. depth does not matter too much (see https://arxiv.org/abs/2001.08361), so we just went with values from prior work (except for the fine-grained experts part, which we ablate in the paper).

Thanks for the pointer! It would be interesting to revisit width vs. depth with more tokens, since a lot of trends only emerge clearly after ~30-40B tokens (even in the OLMoE paper).
If I read it correctly, https://arxiv.org/abs/2001.08361 trains models on up to ~20B tokens, which might be too early for these trends to show.

Oh good point; yes maybe their scales were too small!
