
DPO'ed model performs even worse on RLHF benchmarks???

#1
by gagfafsdgsdfgs - opened

It looks like the DPO'ed version performs worse than the pure SFT model on both AlpacaEval 2 (LC) and MT-Bench. Is this an indication that the RLHF training was not conducted properly? Any thoughts on this?
