DPO'ed model performs even worse on RLHF benchmarks???
#1 opened by gagfafsdgsdfgs
Looks like the DPO'ed version performs worse than the pure SFT model on both AlpacaEval 2 (LC) and MT-Bench. Does this indicate that the RLHF training was not conducted properly? Any thoughts on this?