OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Abstract
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.
Community
Resources:
- Paper
- GitHub: https://github.com/PhoenixZ810/OmniAlign-V
- Project page
- SFT dataset: OmniAlign-V
- DPO dataset: OmniAlign-V-DPO
In this work, we introduce three key contributions: the OmniAlign-V SFT dataset, the OmniAlign-V-DPO dataset, and MM-AlignBench.
- OmniAlign-V SFT Dataset: An SFT dataset designed to improve the alignment of Multi-modal Large Language Models (MLLMs) with human preferences. It contains 205K high-quality image-question-answer pairs, featuring open-ended, creative questions and long, knowledge-rich, comprehensive answers.
- OmniAlign-V-DPO Dataset: A specialized dataset for Direct Preference Optimization (DPO). It uses the answers from the OmniAlign-V SFT dataset as positive samples and generates negative samples with LLaVANext-InternLM-7B via rejection sampling (see the sketch after this list).
- MM-AlignBench: A benchmark for evaluating MLLMs' alignment with human preferences. It includes 252 high-quality, human-annotated samples with diverse image types and open-ended questions. Modeled after Arena-style benchmarks, it uses GPT-4o as the judge model and Claude-Sonnet-3 as the reference model.
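To make the preference-pair construction concrete, below is a minimal, hypothetical sketch of how such DPO pairs could be assembled: the OmniAlign-V answer serves as the chosen response, while responses sampled from LLaVANext-InternLM-7B are filtered by rejection sampling to obtain the rejected response. The helper names (`generate_candidates`, `score_response`) and the selection criterion are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: build DPO preference pairs from the SFT data.
# The OmniAlign-V answer is "chosen"; a low-scoring candidate sampled from
# the weaker policy model is kept as "rejected" (rejection sampling).
def build_dpo_pairs(sft_samples, policy_model, score_response, n_candidates=4):
    pairs = []
    for sample in sft_samples:
        chosen = sample["answer"]  # high-quality OmniAlign-V answer
        # Sample several candidate answers from the weaker policy model
        # (placeholder method name; the real generation call will differ).
        candidates = policy_model.generate_candidates(
            image=sample["image"], question=sample["question"], n=n_candidates
        )
        # Rejection sampling: keep the lowest-scoring candidate as the negative.
        rejected = min(candidates, key=score_response)
        pairs.append({
            "image": sample["image"],
            "question": sample["question"],
            "chosen": chosen,
            "rejected": rejected,
        })
    return pairs
```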
Dataset Performance
Our OmniAlign-V SFT dataset not only significantly improves the alignment of MLLMs with human preferences, but also boosts their performance on common downstream tasks, particularly on benchmarks such as MMVet and MMMU.
By incorporating a DPO stage using our OmniAlign-V-DPO dataset, we achieve even better alignment with human preferences. Notably, our LLaVANext-OA-32B model, built on the Qwen2.5-32B-Instruct foundation, surpasses Qwen2VL-72B on the MM-AlignBench.
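For reference, the DPO stage optimizes the standard DPO objective on the chosen/rejected pairs described above. The snippet below is a generic sketch of that loss, not the authors' exact training code; `logp_*` are summed log-probabilities of each response under the policy and a frozen reference model, and `beta` is the usual DPO temperature.

```python
# Minimal sketch of the standard DPO objective used in a DPO stage.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a batch of two responses' log-probabilities:
# loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-15.1, -14.2]),
#                 torch.tensor([-12.0, -10.1]), torch.tensor([-14.0, -13.5]))
```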
MM-AlignBench
MM-AlignBench is now supported in VLMEvalKit, a powerful toolkit for evaluating over 200 MLLMs across various benchmarks. For more details, check out the VLMEvalKit repository.
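To illustrate the Arena-style protocol mentioned above, the sketch below shows how per-sample judge verdicts could be aggregated into a win rate: the judge (GPT-4o) compares the candidate model's answer against the reference model's (Claude-Sonnet-3) answer for each of the 252 samples. `judge_compare` is a placeholder for the actual judge prompt/API call, and the tally is the generic win/tie/loss aggregation, not necessarily VLMEvalKit's exact metric.

```python
# Illustrative Arena-style scoring: tally judge verdicts into a win rate.
from collections import Counter

def arena_score(samples, candidate_answers, reference_answers, judge_compare):
    verdicts = Counter()
    for sample, cand, ref in zip(samples, candidate_answers, reference_answers):
        # judge_compare returns "win", "tie", or "loss" for the candidate answer.
        verdicts[judge_compare(sample["image"], sample["question"], cand, ref)] += 1
    total = sum(verdicts.values())
    return {"win": verdicts["win"], "tie": verdicts["tie"],
            "loss": verdicts["loss"], "win_rate": verdicts["win"] / total}
```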
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment (2025)
- LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models (2025)
- Towards Interactive Deepfake Analysis (2025)
- Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement (2025)
- Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs (2025)
- IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment (2025)
- Evaluating and Improving Graph to Text Generation with Large Language Models (2025)