arxiv:2410.17131

Aligning Large Language Models via Self-Steering Optimization

Published on Oct 22
· Submitted by Tigerph on Oct 23
#3 Paper of the day

Abstract

Automated alignment develops alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses while keeping them both on-policy to suit the current policy model's learning capacity. SSO can benefit the online and offline training of the policy model, as well as enhance the training of reward models. We validate the effectiveness of SSO with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO leads to significant performance improvements across six subjective or objective benchmarks. Moreover, the preference data generated by SSO significantly enhances the performance of the reward model on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
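
As a rough illustration of the loop the abstract describes, the toy sketch below samples a chosen and a rejected response from the current policy under predefined principles and keeps only pairs whose quality gap falls in a consistent band. Everything here (ToyPolicy, the quality scores, the update rule) is an invented stand-in for exposition, not the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyPolicy:
    bias: float = 0.0  # stand-in for model parameters

    def generate(self, prompt: str, follow_principles: bool) -> tuple[str, float]:
        # Toy proxy for how well the response follows the predefined principles:
        # steering toward them raises quality, steering against lowers it.
        quality = random.random() + (self.bias if follow_principles else -self.bias)
        tag = "chosen" if follow_principles else "rejected"
        return f"{tag} response to: {prompt}", quality

def sso_round(policy: ToyPolicy, prompts: list[str],
              min_gap: float = 0.2, max_gap: float = 1.5):
    pairs = []
    for prompt in prompts:
        # Both responses come from the *current* policy, so the signal stays on-policy.
        chosen, q_chosen = policy.generate(prompt, follow_principles=True)
        rejected, q_rejected = policy.generate(prompt, follow_principles=False)
        gap = q_chosen - q_rejected
        # Keep the gap in a consistent band: large enough to be learnable,
        # small enough to match the current policy's capacity.
        if min_gap <= gap <= max_gap:
            pairs.append((prompt, chosen, rejected))
    # Stand-in for a preference-optimization update on the collected pairs.
    policy.bias += 0.05 * len(pairs)
    return policy, pairs

policy = ToyPolicy(bias=0.1)
for step in range(3):
    policy, pairs = sso_round(policy, ["Explain gradient descent.", "Summarize RLHF."])
    print(f"round {step}: kept {len(pairs)} on-policy preference pairs")
```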

Community

Paper submitter
  1. Minimal Human Intervention: The paper highlights the development of alignment systems that require minimal human intervention, which is a significant advantage in automating complex processes.

  2. Automated Preference Signals: It introduces the concept of generating learnable and accurate preference signals automatically, without the need for human annotation, which addresses a major challenge in preference learning.

  3. Self-Steering Optimization (SSO) Algorithm: The introduction of SSO is a key innovation. This algorithm autonomously generates high-quality preference signals, maintaining their accuracy by ensuring a consistent gap between chosen and rejected responses, all while keeping these responses on-policy.

  4. Versatility in Training: SSO is applicable to both online and offline training scenarios, making it a flexible tool for enhancing the training of both policy models and reward models (a sketch of a preference loss over SSO-generated pairs follows this list).

  5. Empirical Validation: The effectiveness of SSO is validated through experiments with two foundation models, Qwen2 and Llama3.1, demonstrating its capability to provide accurate, on-policy preference signals throughout iterative training.

  6. Performance Improvements: Without relying on manual annotation or external models, SSO achieves notable performance improvements across multiple benchmarks, both subjective and objective.

  7. Enhanced Reward Model Performance: The preference data generated by SSO significantly improves the performance of the reward model on RewardBench, further validating its utility.

  8. Scalability and Efficiency: The paper concludes by presenting SSO as a scalable solution for preference optimization, which could lead to more efficient and effective methods for automated alignment in various applications.
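
The paper's abstract does not pin SSO to a particular preference objective here; as one possibility, the sketch below shows how the (chosen, rejected) pairs it produces could feed a standard DPO-style loss. The random tensors are stand-ins for sequence log-probabilities under the current policy and a frozen reference model; this is an illustrative assumption, not the authors' training objective.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random stand-ins for the log-probabilities of 4 SSO-generated pairs.
torch.manual_seed(0)
loss = dpo_style_loss(torch.randn(4), torch.randn(4) - 1.0,
                      torch.randn(4), torch.randn(4))
print(loss.item())
```

The same pairs could equally serve as offline training data for a reward model, which is how the summary's point 7 on RewardBench performance would be exercised.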

