
	---
	license: mit
	datasets:
	- CodeGoat24/HPD
	- CodeGoat24/LiFT-HRA
	- CodeGoat24/OIP
	- CodeGoat24/EvalMuse
	- CodeGoat24/ShareGPTVideo-DPO
	- CodeGoat24/VideoFeedback
	- CodeGoat24/LLaVA-Critic-113k
	- CodeGoat24/VideoDPO
	base_model:
	- lmms-lab/llava-onevision-qwen2-7b-ov
	---


	# UnifiedReward-7b-v1.5

	## Model Summary

	`UnifiedReward-7b-v1.5` is an enhanced version of [UnifiedReward-7b](https://huggingface.co/CodeGoat24/UnifiedReward-7b/blob/main/README.md), the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring and can be used for preference alignment of vision models.

	For further details, please refer to the following resources:
	- 📰 Paper: https://arxiv.org/pdf/2503.05236
	- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/
	- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
	- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
	- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)


	## 🏁 Compared with Current Reward Models

	| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding |
	| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
	| [PickScore](https://github.com/yuvalkirstain/PickScore) | Point | √ | | | |
	| [HPS](https://github.com/tgxs002/HPSv2) | Point | √ | | | |
	| [ImageReward](https://github.com/THUDM/ImageReward) | Point | √ | | | |
	| [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √ | | |
	| [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √ | | √ |
	| [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point | | | √ | |
	| [LiFT](https://github.com/CodeGoat24/LiFT) | Point | | | √ | |
	| [VisionReward](https://github.com/THUDM/VisionReward) | Point | √ | | √ | |
	| [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point | | | √ | |
	| UnifiedReward (Ours) | Pair/Point | √ | √ | √ | √ |


	VLRewardBench Comparison Results

	| Models | General | Hallu. | Reason. | Overall Accuracy | Macro Accuracy |
	|--------|---------|--------|---------|------------------|----------------|
	| Gemini-1.5-Pro | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
	| GPT-4o | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
	| LLaVA-Critic | 47.4 | 38.5 | 53.8 | 46.9 | 46.6 |
	| OV-7B | 32.2 | 20.1 | 57.1 | 29.6 | 36.5 |
	| [UnifiedReward](https://huggingface.co/CodeGoat24/UnifiedReward-7b/blob/main/README.md) | 60.6 | 78.4 | 60.5 | 66.1 | 66.5 |
	| UnifiedReward-v1.5 | 68.1 | 84.4 | 59.5 | 70.1 | 70.7 |
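
The Macro Accuracy column appears to be the unweighted mean of the three category accuracies (General, Hallu., Reason.). That reading is inferred from the numbers, not stated in the table, and the helper name `macro_accuracy` is hypothetical. The sketch below checks the assumption against every row of the VLRewardBench table:

```python
# Category accuracies (General, Hallu., Reason.) and the reported Macro Accuracy,
# copied from the VLRewardBench table.
ROWS = {
    "Gemini-1.5-Pro": ((50.8, 72.5, 64.2), 62.5),
    "GPT-4o": ((49.1, 67.6, 70.5), 62.4),
    "LLaVA-Critic": ((47.4, 38.5, 53.8), 46.6),
    "OV-7B": ((32.2, 20.1, 57.1), 36.5),
    "UnifiedReward": ((60.6, 78.4, 60.5), 66.5),
    "UnifiedReward-v1.5": ((68.1, 84.4, 59.5), 70.7),
}


def macro_accuracy(category_scores):
    """Unweighted mean of per-category accuracies (assumed definition)."""
    return sum(category_scores) / len(category_scores)


for name, (scores, reported) in ROWS.items():
    computed = macro_accuracy(scores)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

Under this assumption, each computed value matches the reported Macro Accuracy to one decimal place. Overall Accuracy does not follow the same formula, so it is presumably computed over all samples rather than averaged per category.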


	GenAI-Bench (Image) Comparison Results

	| Method | GenAI-Bench tau | GenAI-Bench diff |
	|------------------|------------|--------|
	| PickScore | 53.2 | 67.2 |
	| HPSv2 | 51.6 | 68.4 |
	| ImageReward | 47.8 | 65.0 |
	| VisionReward | 46.8 | 66.4 |
	| OV-7B | 39.7 | 53.2 |
	| [UnifiedReward](https://huggingface.co/CodeGoat24/UnifiedReward-7b/blob/main/README.md) | 54.8 | 70.9 |
	| UnifiedReward-v1.5 | 58.9 | 72.4 |


	GenAI-Bench (Video) and VideoGen-Reward Comparison Results

	| Method | GenAI-Bench tau | GenAI-Bench diff | VideoGen-Reward tau | VideoGen-Reward diff |
	|------------------|------------|--------|-----------------|--------|
	| VideoScore | 46.2 | 70.6 | 42.1 | 49.9 |
	| LiFT | 41.2 | 60.1 | 40.6 | 58.3 |
	| VisionReward | 52.1 | 73.1 | 57.4 | 68.2 |
	| VideoReward | 50.2 | 73.3 | 60.1 | 73.9 |
	| OV-7B | 40.8 | 51.4 | 40.4 | 50.2 |
	| [UnifiedReward](https://huggingface.co/CodeGoat24/UnifiedReward-7b/blob/main/README.md) | 60.7 | 77.2 | 66.6 | 79.3 |
	| UnifiedReward-v1.5 | 61.7 | 78.5 | 67.0 | 80.5 |


	## Quick Start
	Inference code for both pairwise ranking and pointwise scoring is available in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward).

	The example below uses image understanding assessment:
	~~~python
	# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
	import copy
	import warnings

	import requests
	import torch
	from PIL import Image

	from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
	from llava.conversation import conv_templates
	from llava.mm_utils import process_images, tokenizer_image_token
	from llava.model.builder import load_pretrained_model


	warnings.filterwarnings("ignore")

	# Load the reward model. Extra keyword arguments for the LLaVA model
	# (llava_model_args) can be passed to load_pretrained_model if needed.
	pretrained = "CodeGoat24/UnifiedReward-7b-v1.5"
	model_name = "llava_qwen"
	device = "cuda"
	device_map = "auto"
	tokenizer, model, image_processor, max_length = load_pretrained_model(
	    pretrained, None, model_name, device_map=device_map
	)

	model.eval()

	url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
	image = Image.open(requests.get(url, stream=True).raw)
	image_tensor = process_images([image], image_processor, model.config)
	image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

	conv_template = "qwen_1_5"  # Use the chat template that matches the model

	# pairwise ranking
	critic_prompt = "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\nThe second response: [This is a handwritten number seven.]\nASSISTANT:\n"

	# pointwise scoring
	# critic_prompt = "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of answer answers provided by a Large Multimodal Model (LMM). Score the response out of 100 and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe LMM response: [This is a handwritten number seven.]\nASSISTANT:\n "

	question = DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
	conv = copy.deepcopy(conv_templates[conv_template])
	conv.append_message(conv.roles[0], question)
	conv.append_message(conv.roles[1], None)
	prompt_question = conv.get_prompt()

	input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
	image_sizes = [image.size]


	# Greedy decoding: sampling is disabled, so the output is deterministic.
	cont = model.generate(
	    input_ids,
	    images=image_tensor,
	    image_sizes=image_sizes,
	    do_sample=False,
	    temperature=0,
	    max_new_tokens=4096,
	)
	text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
	print(text_outputs[0])
	~~~
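
To evaluate your own question and answers, the two prompts in the example above can be filled in programmatically. The following is a minimal sketch, not part of the official repository: the helper names `build_pairwise_prompt` and `build_pointwise_prompt` are hypothetical, and the template wording is copied verbatim from the example prompts on the assumption that the model expects exactly this phrasing.

```python
# Templates copied verbatim from the example prompts above (including their
# original wording), with the question and responses turned into placeholders.
PAIRWISE_TEMPLATE = (
    "Given an image and a corresponding question, please serve as an unbiased "
    "and fair judge to evaluate the quality of the answers provided by a Large "
    "Multimodal Model (LMM). Determine which answer is better and explain your "
    "reasoning with specific details. Your task is provided as follows:\n"
    "Question: [{question}]\n"
    "The first response: [{first}]\n"
    "The second response: [{second}]\n"
    "ASSISTANT:\n"
)

POINTWISE_TEMPLATE = (
    "Given an image and a corresponding question, please serve as an unbiased "
    "and fair judge to evaluate the quality of answer answers provided by a "
    "Large Multimodal Model (LMM). Score the response out of 100 and explain "
    "your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [{question}]\n"
    "The LMM response: [{response}]\n"
    "ASSISTANT:\n "
)


def build_pairwise_prompt(question: str, first: str, second: str) -> str:
    """Hypothetical helper: fill the pairwise-ranking template."""
    return PAIRWISE_TEMPLATE.format(question=question, first=first, second=second)


def build_pointwise_prompt(question: str, response: str) -> str:
    """Hypothetical helper: fill the pointwise-scoring template."""
    return POINTWISE_TEMPLATE.format(question=question, response=response)


critic_prompt = build_pairwise_prompt(
    question="What this image presents?",
    first="The image is a black and white sketch of a cross.",
    second="This is a handwritten number seven.",
)
print(critic_prompt)
```

The resulting `critic_prompt` can replace the hard-coded string in the example; the rest of the pipeline (prepending `DEFAULT_IMAGE_TOKEN`, building the conversation, calling `model.generate`) stays unchanged.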


	## Citation

	```
	@article{UnifiedReward,
	  title={Unified Reward Model for Multimodal Understanding and Generation},
	  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
	  journal={arXiv preprint arXiv:2503.05236},
	  year={2025}
	}
	```