
OPA-DPO LoRA for LLaVA-v1.5-13B

Introduction

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. Nonetheless, different data construction methods in existing works lead to notable performance variations. We identify a crucial factor: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Due to the implicit KL-divergence constraint, off-policy data cannot be effectively learned.
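For context, the standard DPO objective (not specific to OPA-DPO) is shown below; its log-ratio terms against the frozen reference policy $\pi_{\mathrm{ref}}$ act as an implicit KL regularizer, so responses that lie far from the reference policy's support (off-policy data) receive little effective learning signal:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$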

We propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Compared with DPO without OPA operations, OPA-DPO significantly enhances performance. It achieves SOTA performance with only 4.8k training samples, while most DPO-based algorithms require more than 10k samples.

Usage

Please refer to our GitHub repository for more details. If you wish to use our model outside of our codebase, make sure to update base_model_name_or_path in the adapter_config.json file to liuhaotian/llava-v1.5-13b.
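A minimal sketch of that edit, assuming the adapter files have been downloaded to a local directory (the directory name below is a placeholder):

```python
# Point the adapter at the LLaVA-v1.5-13B base model.
import json

cfg_path = "opa-dpo-lora/adapter_config.json"  # placeholder local path to this repo
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["base_model_name_or_path"] = "liuhaotian/llava-v1.5-13b"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```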

Please note that the LoRA modules are also applied to the vision tower. Ensure that the vision tower is loaded before attaching the LoRA adapter; a loading sketch is given below.
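A rough loading sketch under these assumptions, using the LLaVA codebase and PEFT; this is not our official inference script, and the adapter path is a placeholder:

```python
# Key point: the vision tower must be instantiated and loaded *before* the
# LoRA adapter is attached, because the adapter also targets vision-tower modules.
from llava.model.builder import load_pretrained_model  # requires the LLaVA codebase
from peft import PeftModel

# 1. Load the base model; LLaVA's loader also sets up the vision tower.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b",
)

# 2. Make sure the vision tower weights are actually loaded.
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()

# 3. Only now attach the OPA-DPO LoRA adapter (placeholder path).
model = PeftModel.from_pretrained(model, "opa-dpo-lora")
model = model.merge_and_unload()  # optional: merge LoRA weights for inference
```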

Acknowledgements

We would like to express our gratitude for the code snippets provided in LLaVA, LLaVA-RLHF, FastChat, and TRL, and the datasets provided in RLAIF-V. These resources have significantly contributed to the development of our project.
