arxiv:2410.19133

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Published on Oct 24 · Submitted by ljvmiranda921 on Oct 28

Abstract

Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, directly collecting human preferences can be expensive, time-consuming, and can have high variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations as they are more consistent, cheaper, and scale better than human annotation; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality, while reducing the total cost of human annotation. The crux of our approach is to identify preference instances that will benefit from human annotations. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations and employ a routing strategy that selects a combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the selected hybrid mixture of LM and direct human preferences using our routing framework achieves better reward model performance compared to using either one exclusively. We simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. We analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.
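The routing step described in the abstract can be read as a budget-constrained search over human/LM annotation splits, scored by a learned performance predictor. Below is a minimal sketch of that idea, assuming hypothetical helpers (`perf_model` as a callable returning a predicted reward-model score, `featurize` as a feature extractor) and a simple random search in place of the paper's actual optimization; it is not the released implementation.

```python
# Illustrative sketch of budget-constrained routing, not the authors' code.
# `perf_model` and `featurize` are assumed helpers.
import numpy as np

def route_instances(instances, perf_model, featurize, budget,
                    n_candidates=1000, rng=None):
    """Pick which preference instances to send to human annotators.

    instances  : preference instances awaiting annotation
    perf_model : callable mapping features of a candidate human/LM split
                 to a scalar predicted reward-model score
    featurize  : builds the predictor's input from a routing mask + instances
    budget     : maximum number of instances routed to human annotators
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(instances)
    best_mask, best_score = None, float("-inf")

    # Random-search stand-in for the optimization step: sample candidate
    # splits under the budget and keep the one with the highest predicted
    # reward-model performance.
    for _ in range(n_candidates):
        human_idx = rng.choice(n, size=budget, replace=False)
        mask = np.zeros(n, dtype=bool)
        mask[human_idx] = True
        score = perf_model(featurize(mask, instances))
        if score > best_score:
            best_mask, best_score = mask, score

    # True -> route to a human annotator, False -> use the LM preference label
    return best_mask, best_score
```

Any stronger search (greedy swaps, simulated annealing over the mask, etc.) could replace the random sampling; the key interface is a predictor that maps a candidate human/LM split to an expected evaluation score.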

Community

Paper author and submitter

We present a routing framework for allocating preference instances to either human or LM annotators, resulting in a set of hybrid annotations. The crux of our approach is to identify which instances will benefit the most from direct human annotation.

Key Results: We show that hybrid annotations from our routing framework outperform random sampling under a fixed human annotation budget, and that the approach generalizes to other preference datasets such as HelpSteer2, AlpacaFarm, and ChatArena when evaluated on RewardBench and on downstream tasks via best-of-N evaluation. In addition, we analyze when human annotations are most useful.
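For the best-of-N evaluation mentioned above, the trained reward model ranks N sampled completions per prompt and the top-ranked completion is scored with the downstream task metric. A minimal sketch, assuming hypothetical `generate`, `reward_model`, and `task_metric` helpers (not the paper's released code):

```python
# Hedged sketch of best-of-N evaluation; helpers are illustrative assumptions.
def best_of_n_eval(prompts, references, generate, reward_model, task_metric, n=16):
    """Sample n completions per prompt, keep the one the reward model prefers,
    and score it against the reference with the downstream metric."""
    scores = []
    for prompt, ref in zip(prompts, references):
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        scores.append(task_metric(best, ref))
    return sum(scores) / len(scores)
```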
