VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline, which combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy when using Best-of-N sampling with VL-GenRMs. Our analysis uncovers three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, together with these experimental insights, will become a valuable resource for advancing VL-GenRMs.
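To make the evaluation setting concrete, the sketch below illustrates how a VL-GenRM can be used as a Best-of-N selector over candidate responses, the setup whose downstream MMMU-Pro gains the abstract reports as correlating with benchmark accuracy. This is not the paper's implementation: the function names (`generate`, `judge_score`, `best_of_n`, `pairwise_accuracy`) and the toy stand-ins are hypothetical, shown only to clarify the protocol.

```python
# Minimal sketch of Best-of-N selection with a generative reward model as judge.
# All model calls are abstracted behind callables; the names here are illustrative,
# not APIs from the paper or any specific library.
from typing import Callable, List, Tuple


def best_of_n(
    image: bytes,
    query: str,
    generate: Callable[[bytes, str], str],            # policy VLM: (image, query) -> response
    judge_score: Callable[[bytes, str, str], float],  # VL-GenRM: (image, query, response) -> score
    n: int = 8,
) -> Tuple[str, List[float]]:
    """Draw N candidate responses and return the one the VL-GenRM scores highest."""
    candidates = [generate(image, query) for _ in range(n)]
    scores = [judge_score(image, query, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores


def pairwise_accuracy(judgments: List[bool]) -> float:
    """Fraction of preference pairs where the judge picked the human-preferred response."""
    return sum(judgments) / len(judgments) if judgments else 0.0


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any actual model.
    dummy_generate = lambda img, q: f"candidate of length {len(q) % 3}"
    dummy_score = lambda img, q, r: float(len(r))
    best, scores = best_of_n(b"", "Describe the image.", dummy_generate, dummy_score, n=4)
    print(best, scores)
```

Under this framing, the benchmark's pairwise judgment accuracy and the Best-of-N gains measure the same underlying judging capability, which is one plausible reading of the strong correlation the abstract reports.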