Best Practices for Open Multilingual LLM Evaluation

Community Article Published May 7, 2025

If you want to determine the best existing language model for a given language or you want to tell if your method improved performance for a given language, what's the most effective and reliable way to do this? At a high level, the answers are the same for all languages; however, for languages other than English (and increasingly Mandarin Chinese), language model evaluation is trickier, as the number of benchmarks is likely to be more limited.

I will highlight some of the considerations I think are most important, especially in terms of selecting benchmarks. I also talk about how to run benchmarks and report results in a way that facilitates replicability and reliable inferences about model performance.

On Benchmark Availability

As I mentioned, for most languages, there are many fewer benchmarks available than for English. There is a steep dropoff in the availability of benchmarks for even the next highest-resource languages, as is shown in the figure below from Wu et al. (2025). So, depending on the language(s) you're working with, it may not be possible to adhere to all of these best practices.

[Figure: number of available benchmarks per language, from Wu et al. (2025), showing the steep dropoff beyond English]

Best Practices

Choose tasks with appropriate difficulty level

This seems obvious, but I think this is actually a non-trivial condition to meet, especially for low-resource languages. Many benchmarks that cover a lot of languages are too easy for current models. This is in part due to a temporal disconnect between benchmark development and model evaluation. For example, HellaSwag was developed in 2019, when GPT-2 was a SOTA model. And yet, it is still being used to evaluate new models. We need to be adapting and updating benchmarks so they are appropriate for current models.

Tasks that are too easy can't help with model selection, because performance will be at ceiling. Scores will not differ enough to provide a signal about which model is better.

One example of a benchmark that does not necessarily meet this criterion for many models is the recently released MultiBLiMP benchmark, which tests grammatical knowledge. In the paper accompanying its release, many models, including very small ones, reached above 95% accuracy for most languages. Therefore, MultiBLiMP can't necessarily help much with model selection (especially for >1B models), as differences in performance are generally small and may not be statistically significant. There may be other ways to use the benchmark, e.g. to understand training dynamics.
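As a quick sanity check of whether a benchmark is still separating models, you can test whether the gap between two models is statistically meaningful. Below is a minimal sketch using a paired bootstrap over per-item correctness; the model outputs are simulated here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item correctness (1 = correct, 0 = wrong) for two models
# on the same 500 items; in practice you would load real model outputs.
model_a = rng.binomial(1, 0.96, size=500)
model_b = rng.binomial(1, 0.95, size=500)

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05):
    """Paired bootstrap CI for the difference in accuracy between two models."""
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample items with replacement
        diffs.append(a[idx].mean() - b[idx].mean())
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
# If the interval contains 0, the benchmark isn't separating the two models.
```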

Avoid machine-translated benchmarks, especially without expert verification

Machine translation adds noise to a benchmark. And the lower the resource level of the language you're evaluating, the more noise machine translation will add, because translation systems are worse for those languages. This is especially relevant if you want to compare performance across languages.

For example, the EU21 benchmark offers translation-equivalent benchmarks for 21 EU languages, but it is unwise to strictly compare performance between languages within it. Take English and Estonian as the two extremes of the languages represented: English is the highest-resource language and Estonian is the lowest. If a model exhibits lower performance in Estonian, it may be because the model genuinely performs worse in Estonian, due to less training data or less model capacity dedicated to that language. However, it is also possible that Estonian performance looks worse because the machine translation into Estonian contained errors or was of lower quality than the original English.

Machine-translated benchmarks, therefore, do not allow us to make strong inferences about model performance. Instead, performance is confounded with machine translation quality.

Avoid automatic evaluations that do not correlate with human preferences

For open-ended tasks like summarization, it is common to use automatic evaluation metrics like ROUGE or BLEU. It is also increasingly common to use LLM judges to evaluate summary quality. For example, XLSum is a widely used multilingual summarization benchmark that uses ROUGE as an evaluation metric. However, recent work shows that all of these metrics are poorly correlated with human preferences. If employed, automatic metrics or LLM judges need to be extensively evaluated and aligned for each language and domain.
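If you do rely on an automatic metric or an LLM judge, one simple check is to measure how well it correlates with human ratings on a sample of outputs in your target language. A minimal sketch, with made-up scores standing in for real metric outputs and human ratings:

```python
from scipy.stats import spearmanr

# Hypothetical paired scores for the same set of generated summaries:
# an automatic metric (e.g. ROUGE-L) and human quality ratings on a 1-5 scale.
metric_scores = [0.31, 0.45, 0.28, 0.52, 0.40, 0.36, 0.48, 0.25]
human_ratings = [3, 4, 2, 3, 5, 2, 4, 3]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low or unstable correlation means the metric is not a trustworthy
# proxy for human preferences in this language/domain.
```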

Benchmarks should be culturally appropriate and adapted to the target language

Translating benchmarks into another language, even with expert human translations, may not be sufficient to guarantee effective evaluation in that language. Benchmarks that have not only been translated but also localized show higher correlations with human preferences than ones that have just been translated. For example, the translated version of MMLU in EU21 has not been culturally adapted, while language-specific variants of MMLU like CMMLU for Chinese or KMMLU for Korean have been adapted to be culturally relevant. GlobalMMLU has both culturally agnostic and culturally specific subsets. These adapted variants are examples of benchmarks that adhere to this recommendation.

Use tasks and metrics that correlate with human judgments

Some tasks seem to be generally more predictive of human preferences than others. For example, this study found that MGSM correlates more strongly with human judgments than MMLU. In the absence of this kind of study, it's important to select benchmarks that closely correspond to the desired application: performance on reasoning and code generation tasks may not be predictive of a model's ability at creative composition.

Putting it all together

There obviously aren't any perfect benchmarks, but hopefully these recommendations are helpful in identifying benchmarks that are appropriate for the model(s) you are evaluating!


Implementation for fair comparison

Once you've selected your benchmarks, how you implement them and report the results are important. If you are evaluating on a language that has fewer available benchmarks, evaluation needs to be more precise, as each benchmark carries more weight during model selection. Whether or not differences in performance are statistically significant matters. Differences in performance due to minor differences in implementation matter.

Use (and report) confidence intervals

To illustrate the importance of determining statistical significance, here we see accuracy from one run (top) versus confidence intervals from ten runs (bottom). These lead to completely different conclusions about relative model performance. Now you might be more inclined to use GPT-4 Turbo, especially given the difference in cost. This could make a big difference once you've deployed your model for your use case or application.

[Figure: accuracy from a single run (top) versus confidence intervals from ten runs (bottom) for the compared models]
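A minimal sketch of what "report confidence intervals" can look like in practice: given accuracy from several evaluation runs (the model names and numbers below are invented), compute a mean and a 95% interval per model rather than reporting a single score.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy from ten evaluation runs of two models on the same task
# (e.g. different seeds or prompt orderings); numbers are purely illustrative.
runs = {
    "model_a": [0.712, 0.698, 0.705, 0.721, 0.690, 0.708, 0.715, 0.701, 0.695, 0.710],
    "model_b": [0.702, 0.718, 0.694, 0.709, 0.723, 0.699, 0.711, 0.705, 0.716, 0.700],
}

for name, scores in runs.items():
    scores = np.asarray(scores)
    mean = scores.mean()
    # 95% CI using the t distribution over the run-level scores
    half_width = stats.t.ppf(0.975, df=len(scores) - 1) * stats.sem(scores)
    print(f"{name}: {mean:.3f} ± {half_width:.3f}")
# Overlapping intervals mean a single-run ranking may not be reliable.
```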

Use Consistent and Replicable Implementations

Minor differences in implementation, even as small as whitespace usage, can lead to differences in performance. The LM Evaluation Harness automatically controls for implementation details and provides standard error. So, using LM Eval Harness means you don't even have to worry about these things!

[Figure: example of how minor implementation differences, such as whitespace in the prompt, change benchmark scores]
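For reference, here is a rough sketch of invoking the harness from Python. It assumes the lm_eval package's simple_evaluate entry point and uses a Hugging Face model id and task chosen only as examples; exact argument names and result keys can differ between harness versions.

```python
# Minimal sketch of running a benchmark through the LM Evaluation Harness
# Python API (pip install lm-eval); treat the details as illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=HuggingFaceTB/SmolLM2-135M",  # any HF causal LM id
    tasks=["hellaswag"],                                 # swap in your multilingual task
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)  # includes the metric value and its standard error
```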

Reporting evaluations fairly

Another important aspect of multilingual LM evaluation is disaggregating multilingual performance. Reporting only a mean over scores for all languages obscures which languages are performing worse. For some applications, it might be preferable to have a lower mean performance score but more balanced performance across languages. If there are a small number of "priority languages", it might be most important to consider scores for those languages.

For example, in MultiBLiMP, Llama 8B and Goldfish models have average accuracy scores over the same set of languages of 92.6 and 93.8, respectively. However, looking at the distribution of scores, you see that Llama has fewer high-performing languages and a much larger set of languages with relatively low performance. Goldfish models, on the other hand, have a large number of languages at the high end of performance (but fewer extremely high-accuracy languages), and have fewer languages with lower performance. All of this is extremely important when considering candidate models in a multilingual context.

[Figure: distribution of per-language MultiBLiMP accuracy for Llama 8B and Goldfish models]
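A small sketch of the kind of disaggregated reporting described above: alongside the mean, report at least the per-language minimum and spread. The per-language numbers and language codes here are invented for illustration.

```python
import numpy as np

# Hypothetical per-language accuracy for two models on the same set of languages;
# real numbers would come from your evaluation runs.
per_language = {
    "llama_8b": {"deu": 0.99, "fra": 0.98, "est": 0.81, "swa": 0.74, "isl": 0.86},
    "goldfish": {"deu": 0.96, "fra": 0.95, "est": 0.93, "swa": 0.90, "isl": 0.94},
}

for name, scores in per_language.items():
    values = np.array(list(scores.values()))
    worst = min(scores, key=scores.get)
    print(
        f"{name}: mean={values.mean():.3f}, min={values.min():.3f} ({worst}), "
        f"std={values.std():.3f}"
    )
# Report the full per-language table alongside the mean: similar means can hide
# very different distributions across languages.
```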

Looking forward to new evaluations

Thinking about how to develop new evaluations, my priority is expanding language coverage for popular tasks while keeping in mind the best practices discussed above. In particular, this means not relying on machine translation to adapt existing (English) benchmarks to new languages, and, where appropriate, adapting benchmarks to language- and culture-specific features.

Doing this is only possible through large, international collaborations. If you're interested in being involved in this kind of endeavor, join the EleutherAI Discord, which is open to everyone. It's a great place to get feedback on your ideas and find collaborators. In the #multilingual channel, we are developing ideas for new benchmarks and finding contributors for particular languages.

This is content adapted from my talk at PyTorch Day France 2025.

