RL Zero: Zero-Shot Language to Behaviors without any Supervision Paper • 2412.05718 • Published Dec 7, 2024 • 4 • 2
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Paper • 2406.02900 • Published Jun 5, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-LATEST Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.025-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 15
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-LATEST Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 13
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 12
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.025-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 12