Can't reproduce the evaluation results on the GPQA dataset
#47 opened by Rinn000
I've tried zero-shot and few-shot prompts to evaluate the performance of this model. However, my result is far below the 60% accuracy reported in the linked blog post. Could you share your official benchmark process, prompts, or code? For what it's worth, my prompting and answer extraction follow the format of the GPQA paper (see the sketch below).
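For context, here is roughly what my setup looks like. The helper names and exact wording are my own, modeled on the GPQA paper's zero-shot multiple-choice format, so treat this as a sketch rather than their official harness:

```python
import re

def build_prompt(question: str, choices: list[str]) -> str:
    # Four-option multiple-choice prompt, roughly following the
    # GPQA paper's zero-shot format. Field layout is my assumption.
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"({l}) {c}" for l, c in zip(letters, choices))
    return (
        f"What is the correct answer to this question: {question}\n\n"
        f"Choices:\n{options}\n\n"
        'Format your response as follows: "The correct answer is (insert answer here)".'
    )

def extract_answer(completion: str) -> str | None:
    # Take the last (A)-(D) letter in the completion as the model's answer.
    matches = re.findall(r"\(([ABCD])\)", completion)
    return matches[-1] if matches else None
```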
I have the same problem: I got 37% accuracy.
Question for the Qwen team: what are your recommended hyperparameters for evaluation?
I got significantly lower results on almost all benchmarks mentioned in the presentation.
Same question here. I would also like to know whether the original article reports greedy decoding or sampling results; a sketch of the two settings I'm comparing is below.
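This is how I'm comparing the two decoding modes with `transformers`. The checkpoint is a placeholder and the sampling values are my own guesses, not the official evaluation hyperparameters we're asking about:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the model under discussion.
model_id = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the correct answer to this question: ..."  # GPQA-style prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, one fixed result per prompt.
greedy = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Sampling: varies run to run; temperature/top_p here are guesses.
sampled = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```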