Video-LLaVA / TRAIN_AND_VALIDATE.md
LinB203

Data preparation

data for training

We provide the processed data at the following links.

| Datasets | Baidu Disk |
| --- | --- |
| Image pretraining | Link |
| Image tuning | Link |
| Video pretraining | Link |
| Video tuning | Link |

After downloading all of them, organize the data as follows in DATA_ROOT.

DATA_ROOT
├── llava_image
├── llava_image_tune
├── valley
└── videochatgpt_tune
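
Before launching training, it can help to confirm that the layout above is complete. The following is a minimal sketch, not part of the repository: the function name `missing_train_dirs` is hypothetical, and only the four directory names are taken from the tree above.

```shell
# Print every expected training subdirectory that is missing under DATA_ROOT.
# Hypothetical helper; the directory names come from the DATA_ROOT layout above.
missing_train_dirs() {
  root=$1
  rc=0
  for d in llava_image llava_image_tune valley videochatgpt_tune; do
    if [ ! -d "$root/$d" ]; then
      printf '%s\n' "$d"
      rc=1
    fi
  done
  # Non-zero exit status when at least one directory is missing.
  return "$rc"
}

# Example (replace the path with your own DATA_ROOT):
#   missing_train_dirs /path/to/DATA_ROOT || echo "DATA_ROOT is incomplete"
```

The function prints nothing and returns 0 when all four directories exist.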

data for validating

  • For images, follow LLaVA's instructions. You MUST first download eval.zip, which contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to eval. This also provides the general directory structure for all datasets.
  • For videos, the videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows.
    | Datasets | Baidu Disk | Google Disk | Peking University Disk |
    | --- | --- | --- | --- |
    | Activitynet_Zero_Shot_QA | Link | - | - |
    | MSRVTT_Zero_Shot_QA | Link | Link | - |
    | MSVD_Zero_Shot_QA | Link | Link | Link |
    | TGIF_Zero_Shot_QA | Link | Link | Link |

After downloading all of them, organize the data as follows in eval.

eval
├── GPT_Zero_Shot_QA
│   ├── Activitynet_Zero_Shot_QA
│   ├── MSRVTT_Zero_Shot_QA
│   ├── MSVD_Zero_Shot_QA
│   └── TGIF_Zero_Shot_QA
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_en_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
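
If you prefer to create the video QA folders before extracting the downloads into them, the sketch below builds the `GPT_Zero_Shot_QA` part of the tree above. The function name `make_video_eval_dirs` is hypothetical, and whether the downloaded archives already contain these folder names is an assumption you should check after extraction.

```shell
# Create the four video QA folders under <eval_root>/GPT_Zero_Shot_QA.
# Hypothetical helper; folder names are taken from the eval tree above.
make_video_eval_dirs() {
  eval_root=$1
  for d in Activitynet_Zero_Shot_QA MSRVTT_Zero_Shot_QA MSVD_Zero_Shot_QA TGIF_Zero_Shot_QA; do
    # mkdir -p is a no-op for folders that already exist.
    mkdir -p "$eval_root/GPT_Zero_Shot_QA/$d"
  done
}

# Example, run from the directory that contains eval:
#   make_video_eval_dirs eval
```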

Training

Set DATA_ROOT to the directory you organized in the data preparation step.

Validating

Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT. Thanks for their contributions!

You can refer to the official repositories for validation, but we also provide off-the-shelf scripts.

MSRVTT-QA

  1. Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh
  2. GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_msrvtt.sh

MSVD-QA

  1. Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
  2. GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_msvd.sh

TGIF-QA

  1. Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh
  2. GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_tgif.sh

ActivityNet-QA

  1. Inference to get the result.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh
  2. GPT-Assistant evaluation.
bash scripts/v1_5/eval/eval_qa_activitynet.sh
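
The four video QA benchmarks above share the same two-step pattern, so the commands can be generated in one pass. The following is a sketch only: the function name `video_qa_commands` and its GPU argument are hypothetical, while the script paths are the ones listed above.

```shell
# Print the inference command followed by the GPT-Assistant evaluation command
# for each video QA benchmark. Hypothetical helper; script paths as listed above.
video_qa_commands() {
  gpu=${1:-0}  # GPU index passed to CUDA_VISIBLE_DEVICES, defaults to 0
  for name in msrvtt msvd tgif activitynet; do
    printf 'CUDA_VISIBLE_DEVICES=%s bash scripts/v1_5/eval/run_qa_%s.sh\n' "$gpu" "$name"
    printf 'bash scripts/v1_5/eval/eval_qa_%s.sh\n' "$name"
  done
}

# Review the generated list first, then run it from the repository root:
#   video_qa_commands 0 > run_video_qa.sh
#   bash -e run_video_qa.sh
```

Writing the commands to a file first lets you inspect or reorder them, and `bash -e` stops at the first failing step so an evaluation never runs on a failed inference.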

VQAv2

  1. Download test2015 and put it under eval/vqav2.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh
  3. Submit the results in eval/vqav2/answers_upload to the evaluation server.

GQA

  1. Download the data following the official instructions here and put it under eval/gqa/data.
  2. Multi-GPU inference.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
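
The multi-GPU commands above assume an 8-GPU machine. If you have a different number of GPUs, a small helper can build the device list. This is a sketch: `gpu_list` is a hypothetical name, and it assumes the evaluation scripts use whichever devices `CUDA_VISIBLE_DEVICES` exposes, as the commands above suggest.

```shell
# Build a comma-separated list of GPU indices 0..n-1 for CUDA_VISIBLE_DEVICES.
# Hypothetical helper, written as a plain loop so it does not depend on seq.
gpu_list() {
  n=$1
  list=""
  i=0
  while [ "$i" -lt "$n" ]; do
    # Append a comma only when the list already has an entry.
    list="${list:+$list,}$i"
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

# Example with 4 GPUs:
#   CUDA_VISIBLE_DEVICES=$(gpu_list 4) bash scripts/v1_5/eval/eval_image_gqa.sh
```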

VizWiz

  1. Download test.json and extract test.zip to test. Put them under eval/vizwiz.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh
  3. Submit the results in eval/vizwiz/answers_upload to the evaluation server.

ScienceQA

  1. Under eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to eval/textvqa.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh

POPE

  1. Download coco from POPE and put it under eval/pope.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh

MMBench

  1. Download mmbench_dev_20230712.tsv and put it under eval/mmbench.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh
  3. Submit the results in eval/mmbench/answers_upload/mmbench_dev_20230712 to the evaluation server.

LLaVA-Bench-in-the-Wild

  1. Extract the contents of llava-bench-in-the-wild to eval/llava-bench-in-the-wild.
  2. Single-GPU inference and evaluate.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh

MM-Vet

  1. Extract mm-vet.zip to eval/mm-vet.
  2. Single-GPU inference.
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh