Since you mentioned 'gpqa_diamond_zeroshot on LM_Eval harness,' what did the final model score, and how long did that benchmark take to run?