sh110495's Collections
- WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827)
- Self-Improving Robust Preference Optimization (arXiv:2406.01660)
- Bootstrapping Language Models with DPO Implicit Rewards (arXiv:2406.09760)
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM (arXiv:2406.12168)
- Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation (arXiv:2406.18676)
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
- Direct Preference Knowledge Distillation for Large Language Models (arXiv:2406.19774)
- Understanding Reference Policies in Direct Preference Optimization (arXiv:2407.13709)
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (arXiv:2407.18248)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (arXiv:2410.12784)