arxiv:2410.17131

Aligning Large Language Models via Self-Steering Optimization

Published on Oct 22
· Submitted by Tigerph on Oct 23
#3 Paper of the day

Abstract

Automated alignment develops alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses while keeping them both on-policy to suit the current policy model's learning capacity. SSO can benefit the online and offline training of the policy model, as well as enhance the training of reward models. We validate the effectiveness of SSO with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO leads to significant performance improvements across six subjective or objective benchmarks. Moreover, the preference data generated by SSO significantly enhances the performance of the reward model on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
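
As a rough illustration of the loop the abstract describes, the toy sketch below samples a chosen and a rejected response from the current policy under predefined principles and keeps only pairs whose quality gap falls in a consistent band. Everything here (ToyPolicy, the quality scores, the update rule) is an invented stand-in for exposition, not the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyPolicy:
    bias: float = 0.0  # stand-in for model parameters

    def generate(self, prompt: str, follow_principles: bool) -> tuple[str, float]:
        # Toy proxy for how well the response follows the predefined principles:
        # steering toward them raises quality, steering against lowers it.
        quality = random.random() + (self.bias if follow_principles else -self.bias)
        tag = "chosen" if follow_principles else "rejected"
        return f"{tag} response to: {prompt}", quality

def sso_round(policy: ToyPolicy, prompts: list[str],
              min_gap: float = 0.2, max_gap: float = 1.5):
    pairs = []
    for prompt in prompts:
        # Both responses come from the *current* policy, so the signal stays on-policy.
        chosen, q_chosen = policy.generate(prompt, follow_principles=True)
        rejected, q_rejected = policy.generate(prompt, follow_principles=False)
        gap = q_chosen - q_rejected
        # Keep the gap in a consistent band: large enough to be learnable,
        # small enough to match the current policy's capacity.
        if min_gap <= gap <= max_gap:
            pairs.append((prompt, chosen, rejected))
    # Stand-in for a preference-optimization update on the collected pairs.
    policy.bias += 0.05 * len(pairs)
    return policy, pairs

policy = ToyPolicy(bias=0.1)
for step in range(3):
    policy, pairs = sso_round(policy, ["Explain gradient descent.", "Summarize RLHF."])
    print(f"round {step}: kept {len(pairs)} on-policy preference pairs")
```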

Community

Paper submitter
  1. Minimal Human Intervention: The paper highlights the development of alignment systems that require minimal human intervention, which is a significant advantage in automating complex processes.

  2. Automated Preference Signals: It introduces the concept of generating learnable and accurate preference signals automatically, without the need for human annotation, which addresses a major challenge in preference learning.

  3. Self-Steering Optimization (SSO) Algorithm: The introduction of SSO is a key innovation. This algorithm autonomously generates high-quality preference signals, maintaining their accuracy by ensuring a consistent gap between chosen and rejected responses, all while keeping these responses on-policy.

  4. Versatility in Training: SSO is applicable to both online and offline training scenarios, making it a flexible tool for enhancing the training of both policy models and reward models (a sketch of a preference loss over SSO-generated pairs follows this list).

  5. Empirical Validation: The effectiveness of SSO is validated through experiments with two foundation models, Qwen2 and Llama3.1, demonstrating its capability to provide accurate, on-policy preference signals throughout iterative training.

  6. Performance Improvements: Without relying on manual annotation or external models, SSO achieves notable performance improvements across multiple benchmarks, both subjective and objective.

  7. Enhanced Reward Model Performance: The preference data generated by SSO significantly improves the performance of the reward model on RewardBench, further validating its utility.

  8. Scalability and Efficiency: The paper concludes by presenting SSO as a scalable solution for preference optimization, which could lead to more efficient and effective methods for automated alignment in various applications.
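
The paper's abstract does not pin SSO to a particular preference objective here; as one possibility, the sketch below shows how the (chosen, rejected) pairs it produces could feed a standard DPO-style loss. The random tensors are stand-ins for sequence log-probabilities under the current policy and a frozen reference model; this is an illustrative assumption, not the authors' training objective.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random stand-ins for the log-probabilities of 4 SSO-generated pairs.
torch.manual_seed(0)
loss = dpo_style_loss(torch.randn(4), torch.randn(4) - 1.0,
                      torch.randn(4), torch.randn(4))
print(loss.item())
```

The same pairs could equally serve as offline training data for a reward model, which is how the summary's point 7 on RewardBench performance would be exercised.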

