DPO'ed model performs even worse on RLHF benchmarks???
#1 opened by gagfafsdgsdfgs
Looks like the DPO'ed version performs worse than the pure SFT model on both AlpacaEval 2 (LC) and MT-Bench. Does this indicate that the RLHF training was not conducted properly? Any thoughts on this?