Unable to Reproduce Results of Any Model on the Evaluation Leaderboard

#9
by tcy6 - opened

Hello, your work is amazing, and I would love to follow it. If I want to reproduce the MMBench scores, which code should I refer to? I tried to reproduce the results but only achieved a score of 73. Additionally, the scores listed for the closed-source models seem lower than those models' official reports. Why is this the case?

Hello, thanks for your interest in our model.

Could you please provide more information about your settings? Also, from your title it sounds like you tried all the datasets, but in your description I only see MMBench mentioned. Is the only discrepancy in MMBench?

As for the closed-source models, we did not optimize prompts and did not use special tokens (e.g., the multiple-choice token <mc> from Claude). Looking at the outputs from the closed models, we found a few refusals to answer, as well as answers that did not comply with the prompt's instructions on how to respond (either a single letter, or a single word or phrase). For example, the models would answer with very long paragraphs when a single word was expected.
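For illustration only, here is a minimal sketch of the kind of answer parsing involved. The function name and regex patterns are hypothetical, not our actual evaluation code; it simply tries to pull a single option letter out of a response and returns `None` for non-compliant outputs such as refusals or long paragraphs.

```python
import re
from typing import Optional

def parse_choice_letter(response: str, choices=("A", "B", "C", "D")) -> Optional[str]:
    """Try to extract a single multiple-choice letter from a model response.

    Returns None when the response does not follow the "answer with a single
    letter" instruction (e.g., a refusal or a long paragraph), which is what
    gets counted as a non-compliant answer.
    """
    text = response.strip()
    # Exact single-letter reply, e.g. "B", "(B)", or "B."
    match = re.fullmatch(r"\(?([A-Z])\)?[.)]?", text)
    if match and match.group(1) in choices:
        return match.group(1)
    # Recover common verbose patterns such as "The answer is C."
    match = re.search(r"\banswer\s+is\s*:?\s*\(?([A-Z])\)?", text, flags=re.IGNORECASE)
    if match and match.group(1).upper() in choices:
        return match.group(1).upper()
    return None
```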

For example, the results of Llama3-Llava-Next-8B on your leaderboard are quite different from the official results.

Differences w.r.t. Llama3-Llava-Next-8B might be due to several factors (e.g., generation parameters, prompts, etc.). We used temperature=0 and sample_len=1024; other than that, we left the generation parameters at their defaults. For example, in our evaluation of MMMU, we left the image tokens interleaved in the question or answer, as given in the dataset. Moreover, for answers that could not be parsed, we followed the MMMU paper's suggestion of substituting a random answer. All of this can contribute to variance.
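As a rough illustration of the two points above (greedy decoding and the random fallback for unparseable answers), here is a hedged Python sketch. The keyword names follow the Hugging Face transformers `generate()` API and the helper is an assumption, not our actual evaluation harness.

```python
import random
from typing import Optional, Sequence

# Assumed decoding settings matching the description above: greedy decoding
# (equivalent to temperature=0) with up to 1024 generated tokens.
GEN_KWARGS = dict(do_sample=False, max_new_tokens=1024)

def score_multiple_choice(parsed: Optional[str], gold: str,
                          options: Sequence[str],
                          rng: Optional[random.Random] = None) -> bool:
    """Score one multiple-choice item with an MMMU-style fallback.

    If the model's answer could not be parsed, a random option is substituted,
    as the MMMU paper suggests; this fallback is one source of run-to-run
    variance in the final score.
    """
    rng = rng or random.Random(0)
    if parsed is None:
        parsed = rng.choice(list(options))
    return parsed == gold
```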

Nevertheless, I will add our model to the LMMs-Eval benchmark soon so that the community can reproduce our numbers more easily.

Just to share with the community, here is the PR on LMMs-Eval: https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/87

LMMs-Eval reproduces our reported results. One note regarding MMBench-dev-en: we reported plain, raw accuracy (i.e., the predicted letter matches the expected letter), which gives the number we reported, 80.5. However, when using the gpt_eval_score, we obtained 73.7. I hope this clarifies the initial question.
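For clarity, the raw accuracy mentioned above is just exact letter matching, roughly as in the snippet below. This is an illustrative sketch, not the LMMs-Eval implementation; gpt_eval_score, as its name suggests, adds a GPT-assisted matching step, which explains the 80.5 vs. 73.7 gap.

```python
from typing import Sequence

def raw_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Plain exact-match accuracy: predicted letter must equal the expected letter."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# e.g. raw_accuracy(["A", "c", "B"], ["A", "C", "D"]) ~= 66.7
```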

nguyenbh changed discussion status to closed
