ZHLiu627/updated_qwen2.5_code_1.5b_grpo_iter0_full_data_miao_0212__self_correction_iter1_v1 Viewer • Updated 4 days ago • 29.3k • 34
ZHLiu627/dataset_qwen2.5_code_1.5b_grpo_iter0_full_data_miao_0212_2_global_step_70filtered_v1_v1 Viewer • Updated 4 days ago • 29.3k • 8
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer Paper • 2405.16436 • Published May 26, 2024 • 1
Regularized-Preference-Optimization Collection The models trained in https://github.com/YSLIU627/Regularized-Preference-Optimization • 4 items • Updated 4 days ago
ZHLiu627/dataset_qwen2.5_code_1.5b_grpo_iter0_full_data_miao_0212_2_global_step_70filtered_v1 Viewer • Updated 10 days ago • 29.3k • 171