Data preparation
data for training
- The images pretraining dataset is from LLaVA.
- The images tuning dataset is from LLaVA.
- The videos pretraining dataset is from Valley.
- The videos tuning dataset is from Video-ChatGPT.
- Download the training annotations. You can download them from Baidu Disk, Google Disk, or Peking University Disk.
We also provide the processed data as follows.
After downloading all of them, organize the data as follows in DATA_ROOT.
DATA_ROOT
├── llava_image
├── llava_image_tune
├── valley
└── videochatgpt_tune
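As a quick sanity check before training, the layout above can be verified with a small shell function. This is only a sketch: `check_data_root` is a hypothetical helper name, not part of the repository, and it checks only that the four folders exist, not what they contain.

```shell
# Hypothetical helper (not part of the repository): verify that the four
# training folders named in the layout above exist under a given DATA_ROOT.
check_data_root() {
    local root="$1"
    local missing=0
    local name
    for name in llava_image llava_image_tune valley videochatgpt_tune; do
        if [ ! -d "$root/$name" ]; then
            # Report each missing folder on stderr.
            echo "missing: $root/$name" >&2
            missing=$((missing + 1))
        fi
    done
    # Exit status is 0 only when every folder was found.
    [ "$missing" -eq 0 ]
}
```

For example, `check_data_root /path/to/DATA_ROOT && echo "layout looks complete"` prints the message only when all four folders are present.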
data for validating
- For images, follow LLaVA's instructions. You MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files with LLaVA v1.5. Extract it to eval. This also provides a general structure for all datasets.
- For videos, the videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows.
After downloading all of them, organize the data as follows in eval.
eval
├── GPT_Zero_Shot_QA
│   ├── Activitynet_Zero_Shot_QA
│   ├── MSRVTT_Zero_Shot_QA
│   ├── MSVD_Zero_Shot_QA
│   └── TGIF_Zero_Shot_QA
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_en_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
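Because the evaluation scripts read their question files from this tree, it can help to list which of them are still missing before launching a run. The sketch below assumes one representative file per image benchmark, chosen from the tree above. `missing_eval_files` is a hypothetical name, and the selection of files is illustrative, not a list the repository defines.

```shell
# Hypothetical helper: print the relative path of every expected file that is
# absent under the eval root (default: ./eval). Paths come from the tree above.
missing_eval_files() {
    local root="${1:-eval}"
    local rel
    for rel in \
        gqa/llava_gqa_testdev_balanced.jsonl \
        llava-bench-in-the-wild/questions.jsonl \
        mmbench/mmbench_dev_20230712.tsv \
        MME/llava_mme.jsonl \
        mm-vet/llava-mm-vet.jsonl \
        pope/llava_pope_test.jsonl \
        scienceqa/llava_test_CQM-A.json \
        seed_bench/llava-seed-bench.jsonl \
        textvqa/llava_textvqa_val_v051_ocr.jsonl \
        vizwiz/llava_test.jsonl \
        vqav2/llava_vqav2_mscoco_test-dev2015.jsonl
    do
        # Print only the paths that do not exist.
        [ -e "$root/$rel" ] || echo "$rel"
    done
}
```

Running `missing_eval_files eval` prints nothing when every listed file is in place.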
Training
Set DATA_ROOT to the directory you organized during data preparation.
- Stage 1 pretraining script: pretrain.sh.
- Stage 2 tuning script: finetune.sh.
Validating
Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT; thanks to both for their contributions!
You can refer to the official repositories for validation, but we also provide off-the-shelf scripts.
MSRVTT-QA
- Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh
- GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_msrvtt.sh
MSVD-QA
- Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
- GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_msvd.sh
TGIF-QA
- Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh
- GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_tgif.sh
ActivityNet-QA
- Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh
- GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_activitynet.sh
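The four video QA benchmarks above follow the same two-step pattern (inference, then GPT-assisted evaluation), so they can be chained in one loop. This is a minimal sketch: `run_video_qa` and the `DRY_RUN` variable are hypothetical names introduced here. By default the function only prints the commands; set `DRY_RUN=0` to execute them from the repository root.

```shell
# Hypothetical wrapper: run inference, then GPT-assisted evaluation, for each
# video QA benchmark. Script paths are the ones listed in the sections above.
run_video_qa() {
    local gpu="${1:-0}"
    local bench infer judge
    for bench in msrvtt msvd tgif activitynet; do
        infer="scripts/v1_5/eval/run_qa_${bench}.sh"
        judge="scripts/v1_5/eval/eval_qa_${bench}.sh"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed.
            echo "CUDA_VISIBLE_DEVICES=${gpu} bash ${infer}"
            echo "bash ${judge}"
        else
            # Stop at the first failing step.
            CUDA_VISIBLE_DEVICES="${gpu}" bash "${infer}" || return 1
            bash "${judge}" || return 1
        fi
    done
}
```

Calling `run_video_qa 0` prints the eight commands in order, which is a cheap way to confirm the script names before committing GPU time.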
VQAv2
- Download test2015 and put it under eval/vqav2.
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh
- Submit the results to the evaluation server: eval/vqav2/answers_upload.
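The multi-GPU command above hard-codes eight device IDs. On a machine with a different number of GPUs, the list can be generated instead. The sketch below uses a hypothetical helper, `gpu_list`, and assumes that the evaluation script takes its set of devices from CUDA_VISIBLE_DEVICES, as the command above suggests; that has not been checked against the script itself.

```shell
# Hypothetical helper: build the comma-separated ID list "0,1,...,N-1"
# for use as the value of CUDA_VISIBLE_DEVICES.
gpu_list() {
    local n="$1"
    local ids=""
    local i=0
    while [ "$i" -lt "$n" ]; do
        # Prepend a comma only when the list is already non-empty.
        ids="${ids:+${ids},}${i}"
        i=$((i + 1))
    done
    echo "$ids"
}
```

With four GPUs, for instance: `CUDA_VISIBLE_DEVICES="$(gpu_list 4)" bash scripts/v1_5/eval/eval_image_vqav2.sh`.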
GQA
- Download the data following the official instructions here and put under eval/gqa/data.
- Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
VizWiz
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh
- Submit the results to the evaluation server: eval/vizwiz/answers_upload.
ScienceQA
- Under eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh
TextVQA
- Download TextVQA_0.5.1_val.json and images and extract to eval/textvqa.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh
POPE
- Download coco from POPE and put under eval/pope.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh
MMBench
- Download mmbench_dev_20230712.tsv and put under eval/mmbench.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh
- Submit the results to the evaluation server: eval/mmbench/answers_upload/mmbench_dev_20230712.
LLaVA-Bench-in-the-Wild
- Extract contents of llava-bench-in-the-wild to eval/llava-bench-in-the-wild.
- Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh
MM-Vet
- Extract mm-vet.zip to eval/mmvet.
- Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh