A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement • Paper • 2410.13828 • Published 11 days ago • 3
Training Language Models to Self-Correct via Reinforcement Learning • Paper • 2409.12917 • Published Sep 19 • 131
Towards a Unified View of Preference Learning for Large Language Models: A Survey • Paper • 2409.02795 • Published Sep 4 • 72
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks • Paper • 2403.04783 • Published Mar 2 • 2