Can't reproduce the evaluation results on the GPQA dataset
#47 opened by Rinn000
I've tried zero-shot and few-shot prompts to evaluate the performance of this model. However, my result is far below the 60% accuracy reported in the linked blog post. Could you share your official benchmark process, prompts, or code? For what it's worth, my prompting and answer extraction follow the format of the GPQA paper (see the sketch below).
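For context, here is roughly what my setup looks like. The helper names and exact wording are my own, modeled on the GPQA paper's zero-shot multiple-choice format, so treat this as a sketch rather than their official harness:

```python
import re

def build_prompt(question: str, choices: list[str]) -> str:
    # Four-option multiple-choice prompt, roughly following the
    # GPQA paper's zero-shot format. Field layout is my assumption.
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"({l}) {c}" for l, c in zip(letters, choices))
    return (
        f"What is the correct answer to this question: {question}\n\n"
        f"Choices:\n{options}\n\n"
        'Format your response as follows: "The correct answer is (insert answer here)".'
    )

def extract_answer(completion: str) -> str | None:
    # Take the last (A)-(D) letter in the completion as the model's answer.
    matches = re.findall(r"\(([ABCD])\)", completion)
    return matches[-1] if matches else None
```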
I have the same problem: I got 37% accuracy.
Question for the Qwen team: what are your recommended hyperparameters for evaluation?
I got significantly lower results on almost all benchmarks mentioned in the presentation.
Same question here. I would also like to know whether the original article reports greedy decoding or sampling results; a sketch of the two settings I'm comparing is below.
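This is how I'm comparing the two decoding modes with `transformers`. The checkpoint is a placeholder and the sampling values are my own guesses, not the official evaluation hyperparameters we're asking about:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the model under discussion.
model_id = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the correct answer to this question: ..."  # GPQA-style prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, one fixed result per prompt.
greedy = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Sampling: varies run to run; temperature/top_p here are guesses.
sampled = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```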