Papers
arxiv:2507.09404

Scaling Laws for Optimal Data Mixtures

Published on Jul 12
· Submitted by mshukor on Jul 16
#3 Paper of the day
Abstract

Scaling laws predict optimal data mixtures for large foundation models across different domains, improving performance and reducing trial-and-error.

AI-generated summary

Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and then used to estimate the performance at larger scales and for unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
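To make the idea concrete, here is a minimal sketch of fitting a mixture-aware scaling law from small-scale runs. The functional form below (an additive power law whose numerator depends linearly on the domain weights) is a hypothetical illustration, not the paper's actual parameterization, and the data are synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical mixture-aware scaling law (illustrative only):
#   L(N, D, h) = E + c(h) / N**alpha + B / D**beta
# with two domains, h = (h1, 1 - h1), and c(h) = c1*h1 + c2*(1 - h1).
def loss_law(X, E, c1, c2, alpha, B, beta):
    N, D, h1 = X
    c = c1 * h1 + c2 * (1.0 - h1)
    return E + c / N**alpha + B / D**beta

rng = np.random.default_rng(0)

# Synthetic "small-scale training runs": sampled (N, D, h1) with noisy losses
# generated from assumed ground-truth parameters.
true = dict(E=1.8, c1=40.0, c2=25.0, alpha=0.33, B=300.0, beta=0.28)
N = rng.uniform(1e7, 1e8, 60)
D = rng.uniform(1e8, 1e9, 60)
h1 = rng.uniform(0.0, 1.0, 60)
y = loss_law((N, D, h1), **true) + rng.normal(0.0, 1e-3, 60)

# Fit the law's parameters on the small-scale runs.
popt, _ = curve_fit(loss_law, (N, D, h1), y,
                    p0=[1.0, 10.0, 10.0, 0.3, 100.0, 0.3], maxfev=20000)

# The fitted law can then be evaluated at larger (N, D) before any training.
pred = loss_law((1e9, 1e10, 0.5), *popt)
```

Once fitted on cheap runs, the same closed form extrapolates to larger model and dataset sizes, which is the extrapolation-across-scales property the abstract describes.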

Community

Paper author Paper submitter

We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders.

Only small-scale experiments need to be run; we can then extrapolate to large-scale ones.

These laws allow:

(1) to predict the model performance, before any training, given a model size N, dataset size D, and training data mixture h (here on a mix of multimodal data domains)

(2) to predict the optimal data mixture, given a FLOPs budget (N, D)
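Point (2) can be sketched as a constrained optimization over the simplex of domain weights, using a fitted law. The per-domain functional form below (each domain contributing a power-law term in its effective token count h_k * D) is a hypothetical stand-in for the paper's law, with made-up coefficients:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical fitted law (illustrative only): each domain k contributes
# A_k / (h_k * D)**beta_k, i.e. domain k sees h_k * D effective tokens.
A = np.array([20.0, 35.0, 50.0])   # assumed per-domain coefficients
beta = np.array([0.3, 0.3, 0.3])   # assumed per-domain exponents
E, B, alpha = 1.7, 400.0, 0.34     # assumed mixture-independent terms
N, D = 1e9, 1e10                   # the given training budget

def predicted_loss(h):
    return E + B / N**alpha + np.sum(A / (h * D)**beta)

# Minimize predicted loss over the simplex: h_k >= 0, sum_k h_k = 1.
K = len(A)
res = minimize(predicted_loss, x0=np.full(K, 1.0 / K),
               bounds=[(1e-6, 1.0)] * K,
               constraints=[{"type": "eq", "fun": lambda h: h.sum() - 1.0}],
               method="SLSQP")
h_opt = res.x
```

With equal exponents beta, this toy problem has a closed-form optimum h_k proportional to A_k**(1/(1+beta)), so the solver should put more weight on domains with larger coefficients, without collapsing onto a single domain.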

Paper: https://arxiv.org/abs/2507.09404

