Model description

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, introduced in the paper ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models. It was trained with supervised fine-tuning on the largest high-quality dataset for cinematic language understanding to date, and it currently achieves state-of-the-art performance on ShotBench, a comprehensive benchmark for evaluating cinematography understanding in vision-language models.

Project Page: https://vchitect.github.io/ShotBench-project/

Code: https://github.com/Vchitect/ShotBench
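
Both demo scripts below load the model with attn_implementation="flash_attention_2". If flash-attn is not installed in your environment, a minimal fallback sketch is to load with PyTorch's built-in SDPA attention instead ("sdpa" is a standard transformers option, not something specific to this model):

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Fallback loading without flash-attention 2: "sdpa" selects PyTorch's
# built-in scaled-dot-product attention kernel.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
  "Vchitect/ShotVL-7B",
  device_map="balanced",
  attn_implementation="sdpa",
  torch_dtype=torch.bfloat16,
).eval()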

Demo

Image

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = "cuda"
device_map = "balanced"
dtype = torch.bfloat16
image_path = "/path/to/image.jpg"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
  "Vchitect/ShotVL-7B",
  device_map=device_map,
  attn_implementation="flash_attention_2",
  torch_dtype=dtype,
).eval()
processor = AutoProcessor.from_pretrained(
  "Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype
)

msgs = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": [
      {"type": "image", "image": image_path},
      {"type": "text", "text": "What's the shot size of this shot?"},
    ],
  },
]

text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)
inputs = processor(
  text=[text],
  images=image_inputs,
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
).to(device)

with torch.inference_mode():
  out_ids = model.generate(**inputs, max_new_tokens=640)
  
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
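
ShotBench poses its questions as multiple choice (as in the video example below). The following sketch reruns the image query in that format, reusing the model and processor loaded above; the option wording here is illustrative and is not the benchmark's official prompt.

# Ask the same image question in multiple-choice form. The options below
# are illustrative examples, not ShotBench's official option set.
mc_question = (
    "What's the shot size of this shot?\n"
    "Options:\nA. Close-up\nB. Medium shot\nC. Full shot\nD. Long shot\n"
    "Please select the most likely answer from the options above.\n"
)
msgs[1]["content"][1]["text"] = mc_question

text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)
inputs = processor(
  text=[text],
  images=image_inputs,
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
).to(device)

with torch.inference_mode():
  out_ids = model.generate(**inputs, max_new_tokens=64)

trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])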

Video

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

device = "cuda"
device_map = "balanced"
dtype = torch.bfloat16
video_path = "/path/to/video.mp4"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
  "Vchitect/ShotVL-7B",
  device_map=device_map,
  attn_implementation="flash_attention_2",
  torch_dtype=dtype,
).eval()
processor = AutoProcessor.from_pretrained(
  "Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype
)

question = (
    "What's the camera movement in this movie shot?\n"
    "Options:\nA. Boom down\nB. Boom up\nC. Push in\nD. Pull out\n"
    "Please select the most likely answer from the options above.\n"
)
msgs = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": [
      {"type": "video", "video": video_path, "max_pixels": 360*640, "fps": 12.0},
      {"type": "text", "text": question},
    ],
  },
]

text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(msgs)
inputs = processor(
  text=[text],
  images=image_inputs,
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
).to(device)

with torch.inference_mode():
  out_ids = model.generate(**inputs, max_new_tokens=640)
  
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
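
For scoring, you usually only need the option letter from the reply. A minimal parsing sketch, assuming the reply contains a standalone letter A-D (this is illustrative and not ShotBench's official evaluation code):

import re

def extract_choice(reply):
  # Return the first standalone option letter (A-D) in the reply,
  # e.g. "C" from "C. Push in"; None if no letter is found.
  m = re.search(r"\b([A-D])\b", reply)
  return m.group(1) if m else None

reply = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(extract_choice(reply))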

Evaluation Results

Abbreviations: SS = Shot Size, SF = Shot Framing, CA = Camera Angle, LS = Lens Size, LT = Lighting Type, LC = Lighting Conditions, SC = Shot Composition, CM = Camera Movement.
Our ShotVL models establish a new state of the art.
| Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **Open-Sourced VLMs** | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1 |
| Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8 |
| LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4 |
| LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1 |
| LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9 |
| InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9 |
| InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7 |
| InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4 |
| InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2 |
| Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5 |
| Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9 |
| VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4 |
| VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4 |
| VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1 |
| Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6 |
| Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0 |
| InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2 |
| InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3 |
| Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0 |
| Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1 |
| InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2 |
| **Proprietary VLMs** | | | | | | | | | |
| Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5 |
| Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5 |
| GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3 |
| **Ours** | | | | | | | | | |
| ShotVL-3B | 77.9 | 85.6 | 68.8 | 59.3 | 65.7 | 53.1 | 57.4 | 51.7 | 65.1 |
| ShotVL-7B | 81.2 | 90.1 | 78.0 | 68.5 | 70.1 | 64.3 | 45.7 | 62.9 | 70.1 |

BibTeX

@misc{liu2025shotbench,
      title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
      author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
      year={2025},
      eprint={2506.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21356},
}