Evaluation
In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process stays consistent with the real-time outputs of the chat demo.
Currently, we mostly utilize the official toolkit or server for the evaluation.
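To illustrate why greedy decoding helps reproducibility, here is a minimal sketch on a toy next-token distribution. The scores and the helper names (greedy_pick, sample_pick) are invented for illustration and are not part of the LLaVA codebase; the point is only that always taking the highest-scoring token gives the same answer on every run, while sampling does not have to.

```python
import random


def greedy_pick(scores):
    """Return the token with the highest score (ties go to the first one)."""
    return max(scores, key=scores.get)


def sample_pick(scores, rng):
    """Return a token drawn in proportion to its non-negative score."""
    tokens = list(scores)
    weights = [scores[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token scores, invented for illustration only.
toy_scores = {"cat": 0.6, "dog": 0.3, "bird": 0.1}

# Greedy selection ignores randomness, so repeated runs agree.
greedy_runs = {greedy_pick(toy_scores) for _ in range(100)}

# Sampling depends on the random state, so different seeds may disagree.
sampled_runs = {sample_pick(toy_scores, random.Random(seed)) for seed in range(100)}

print("greedy:", greedy_runs)
print("sampled:", sampled_runs)
```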
Evaluate on Custom Datasets
You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating it with model_vqa.py.
Below we provide a general guideline for evaluating datasets with some common formats.
- Short-answer (e.g. VQAv2, MME).
<question>
Answer the question using a single word or phrase.
- Option-only for multiple-choice (e.g. MMBench, SEED-Bench).
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
- Natural QA (e.g. LLaVA-Bench, MM-Vet).
No postprocessing is needed.
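The three prompt formats above can be sketched as small helper functions. This is only an illustration: the helper names are hypothetical, and the jsonl field names (question_id, image, text) are an assumption about what model_vqa.py reads, so check them against the script before relying on them.

```python
import json

SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."
OPTION_SUFFIX = "Answer with the option's letter from the given choices directly."


def short_answer_prompt(question):
    """Short-answer format (e.g. VQAv2, MME)."""
    return f"{question}\n{SHORT_ANSWER_SUFFIX}"


def multiple_choice_prompt(question, options):
    """Option-only format for multiple-choice (e.g. MMBench, SEED-Bench)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if len(options) > len(letters):
        raise ValueError("too many options to label with single letters")
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append(OPTION_SUFFIX)
    return "\n".join(lines)


def natural_qa_prompt(question):
    """Natural QA format (e.g. LLaVA-Bench, MM-Vet): the question is used as-is."""
    return question


def to_jsonl_line(question_id, image, text):
    """Serialize one question as a single JSON line.

    The field names are an assumption, not confirmed by this document.
    """
    return json.dumps({"question_id": question_id, "image": image, "text": text})


def write_questions(path, records):
    """Write an iterable of dicts with question_id, image, and text keys to a jsonl file."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(to_jsonl_line(**record) + "\n")
```

For example, multiple_choice_prompt("Which animal is shown?", ["Cat", "Dog"]) yields the question, the lines "A. Cat" and "B. Dog", and the instruction line, each on its own line.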
Scripts
Before preparing task-specific data, you MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to ./playground/data/eval. This also provides a general structure for all datasets.
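Since every script expects its data at a fixed location under ./playground/data/eval, a quick check before launching a run can save time. The sketch below is not part of the LLaVA repository; the EXPECTED table and the missing_paths helper are hypothetical, and the table only lists paths that the per-benchmark instructions in this document name explicitly.

```python
from pathlib import Path

EVAL_ROOT = Path("./playground/data/eval")

# Paths taken from the per-benchmark instructions in this document.
# The table is illustrative and deliberately incomplete.
EXPECTED = {
    "vqav2": ["vqav2/test2015"],
    "gqa": ["gqa/data"],
    "vizwiz": ["vizwiz/test.json", "vizwiz/test"],
    "scienceqa": [
        "scienceqa/images",
        "scienceqa/pid_splits.json",
        "scienceqa/problems.json",
    ],
    "textvqa": ["textvqa/TextVQA_0.5.1_val.json"],
    "pope": ["pope/coco"],
    "mme": ["MME/eval_tool", "MME/MME_Benchmark_release_version"],
    "mmbench": ["mmbench/mmbench_dev_20230712.tsv"],
    "seed_bench": [
        "seed_bench/SEED-Bench-image",
        "seed_bench/SEED-Bench-video-image",
    ],
}


def missing_paths(benchmark, root=EVAL_ROOT):
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in EXPECTED[benchmark] if not (Path(root) / rel).exists()]


if __name__ == "__main__":
    for name in EXPECTED:
        missing = missing_paths(name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```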
VQAv2
- Download test2015 and put it under ./playground/data/eval/vqav2.
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
- Submit the results to the evaluation server: ./playground/data/eval/vqav2/answers_upload.
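The multi-GPU command passes a list of device ids through CUDA_VISIBLE_DEVICES. One common way to use such a list is to run one worker per visible GPU and give each worker a contiguous chunk of the questions. The sketch below shows that idea under this assumption; the helper names are hypothetical and the actual logic in vqav2.sh may differ.

```python
import math
import os


def visible_gpus(env=None):
    """Parse the comma-separated device ids from CUDA_VISIBLE_DEVICES."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "0")
    return [gpu.strip() for gpu in raw.split(",") if gpu.strip()]


def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous chunks of near-equal size.

    Trailing chunks may be empty when there are fewer items than chunks.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunk_size = math.ceil(len(items) / num_chunks)
    if chunk_size == 0:
        return [[] for _ in range(num_chunks)]
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # Pad so that every worker index has a (possibly empty) chunk.
    chunks += [[] for _ in range(num_chunks - len(chunks))]
    return chunks


if __name__ == "__main__":
    gpus = visible_gpus()
    questions = list(range(10))  # stand-in for the real question list
    for gpu, chunk in zip(gpus, split_into_chunks(questions, len(gpus))):
        print(f"GPU {gpu}: {len(chunk)} questions")
```

The per-worker outputs then have to be merged back in chunk order before the results are uploaded.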
GQA
- Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data. You may need to modify eval.py as this due to the missing assets in the GQA v1.2 release.
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
VizWiz
- Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
- Submit the results to the evaluation server: ./playground/data/eval/vizwiz/answers_upload.
ScienceQA
- Under ./playground/data/eval/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repo.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
TextVQA
- Download TextVQA_0.5.1_val.json and images and extract to ./playground/data/eval/textvqa.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
POPE
- Download coco from POPE and put it under ./playground/data/eval/pope.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
MME
- Download the data following the official instructions here.
- Download the images to MME_Benchmark_release_version.
- Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
MMBench
- Download mmbench_dev_20230712.tsv and put it under ./playground/data/eval/mmbench.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
- Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712.
MMBench-CN
- Download mmbench_dev_cn_20231003.tsv and put it under ./playground/data/eval/mmbench.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
- Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.
SEED-Bench
- Follow the official instructions to download the images and the videos. Put the images under ./playground/data/eval/seed_bench/SEED-Bench-image.
- Extract the middle frame from each downloaded video and put the frames under ./playground/data/eval/seed_bench/SEED-Bench-video-image. We provide our script extract_video_frames.py, modified from the official one.
- Multiple-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
- Optionally, submit the results to the leaderboard: ./playground/data/eval/seed_bench/answers_upload using the official jupyter notebook.
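For readers who want to see what taking the middle frame of a video involves, here is a minimal sketch using OpenCV. It is not the provided extract_video_frames.py; the function names are hypothetical, and it assumes the opencv-python package is installed and that the container reports a usable frame count.

```python
def middle_frame_index(frame_count):
    """Return the zero-based index of the middle frame."""
    if frame_count < 1:
        raise ValueError("video has no frames")
    return frame_count // 2


def extract_middle_frame(video_path, output_path):
    """Save the middle frame of video_path as an image at output_path."""
    # Third-party dependency (opencv-python), imported lazily so that
    # middle_frame_index stays usable without it.
    import cv2

    capture = cv2.VideoCapture(str(video_path))
    try:
        if not capture.isOpened():
            raise IOError(f"cannot open video: {video_path}")
        frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        # Seek to the middle frame, then decode it.
        capture.set(cv2.CAP_PROP_POS_FRAMES, middle_frame_index(frame_count))
        ok, frame = capture.read()
        if not ok:
            raise IOError(f"cannot read middle frame of: {video_path}")
        if not cv2.imwrite(str(output_path), frame):
            raise IOError(f"cannot write image: {output_path}")
    finally:
        capture.release()
```

Seeking by frame index can be inexact for some codecs, so the provided script remains the reference for reproducing the reported numbers.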
LLaVA-Bench-in-the-Wild
- Extract the contents of llava-bench-in-the-wild to ./playground/data/eval/llava-bench-in-the-wild.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
MM-Vet
- Extract mm-vet.zip to ./playground/data/eval/mmvet.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
- Evaluate the predictions in ./playground/data/eval/mmvet/results using the official jupyter notebook.
More Benchmarks
Below are awesome benchmarks for multimodal understanding from the research community that were not initially included in the LLaVA-1.5 release.
Q-Bench
- Download llvisionqa_dev.json (for the dev subset) and llvisionqa_test.json (for the test subset). Put them under ./playground/data/eval/qbench.
- Download and extract the images and put all of them directly under ./playground/data/eval/qbench/images_llviqionqa.
- Single-GPU inference (change dev to test for evaluation on the test set).
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/qbench.sh dev
- Submit the results following the instructions here: ./playground/data/eval/qbench/llvisionqa_dev_answers.jsonl.
Chinese-Q-Bench
- Download 质衡-问答-验证集.json (for the dev subset) and 质衡-问答-测试集.json (for the test subset). Put them under ./playground/data/eval/qbench.
- Download and extract the images and put all of them directly under ./playground/data/eval/qbench/images_llviqionqa.
- Single-GPU inference (change dev to test for evaluation on the test set).
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/qbench_zh.sh dev
- Submit the results following the instructions here: ./playground/data/eval/qbench/llvisionqa_zh_dev_answers.jsonl.