arXiv:2407.08639

β-DPO: Direct Preference Optimization with Dynamic β

Published on Jul 11, 2024

Abstract

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter β, as well as to the quality of the preference data. We analyze the impact of β and data quality on DPO, uncovering that optimal β values vary with the informativeness of pairwise data. Addressing the limitations of static β values, we introduce a novel framework that dynamically calibrates β at the batch level, informed by data quality considerations. Additionally, our method incorporates β-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at https://github.com/junkangwu/beta-DPO.
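
To make the abstract's idea concrete, below is a minimal, illustrative sketch of a DPO loss whose β is calibrated per batch from the implicit reward margins of the preference pairs. The specific calibration rule, the hyperparameter names (alpha, m_0), and the clamping are assumptions for illustration, not the authors' implementation; see the linked repository for the reference code.

```python
# Illustrative sketch (not the authors' code): batch-level dynamic beta for DPO.
# The calibration rule and hyperparameters below are assumptions; see
# https://github.com/junkangwu/beta-DPO for the reference implementation.
import torch
import torch.nn.functional as F

def dynamic_beta_dpo_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x), shape [batch]
    policy_rejected_logps,  # log pi_theta(y_l | x), shape [batch]
    ref_chosen_logps,       # log pi_ref(y_w | x),   shape [batch]
    ref_rejected_logps,     # log pi_ref(y_l | x),   shape [batch]
    beta_0=0.1,             # base trade-off parameter
    alpha=0.5,              # sensitivity of beta to the batch margin (hypothetical)
    m_0=0.0,                # running estimate of the mean reward margin (hypothetical)
):
    # Implicit reward margin per preference pair: larger values indicate
    # pairs the current policy already separates more strongly.
    margins = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )

    # Batch-level calibration: one possible rule that raises beta when the
    # batch's mean margin exceeds the running estimate m_0 and lowers it otherwise.
    beta = beta_0 * (1.0 + alpha * (margins.mean().detach() - m_0))
    beta = beta.clamp(min=1e-3)

    # Standard DPO objective evaluated with the calibrated beta.
    losses = -F.logsigmoid(beta * margins)
    return losses.mean(), beta
```

The β-guided data filtering described in the abstract could, under the same assumptions, be layered on top of this sketch by down-weighting or discarding pairs whose margin falls far from m_0 before the loss is averaged.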
