Confusion about the description of evaluation settings
Could you please provide more details on evaluation settings of GSM8K dataset?We evaluate GSM8K CoT with chat template and 8-way few shot as multiturn.
- How do you implement CoT with only chat template?
- How do you compute exact match metric, via template parse?
- What's the difference between flexible and strict extract?
Hi, these are settings that are best understood in the context of the lm-eval
harness, https://github.com/EleutherAI/lm-evaluation-harness
For example, you can replicat GSM8k evaluations like so
lm_eval --model hf --model_args pretrained=tomg-group-umd/huginn_swa_100_10_avg_0.9_merge,trust_remote_code=True,dtype="bfloat16",mean_recurrence=64 \
--tasks gsm8k_cot --batch_size=1 --output_path=outputs/evals --log_samples --apply_chat_template=True \
--system_instruction="You are a helpful assistant that can assist users with mathematical reasoning." --fewshot_as_multiturn
Pick a model checkpoint you want for the pretrained
argument, and a recurrence argument for mean_recurrence
.
You can find the exact definition of flexible extract in the eval harness here: https://github.com/EleutherAI/lm-evaluation-harness/blob/52df63b7b30da53c481ed9090598d9189fab1d91/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L55
Thanks for your reply. I also noticed that there's a gsm8k_long_cot.yaml
file in evaluate_raven
, which is different from lm_eval's gsm8k-cot.yaml
file. Is this file useful in reproducing the result?
By the way, I'd like to confirm whether "w/o sys. prompt" means --system_instruction=None
in Table 2? And does the configuration for GSM8K
and GSM8K CoT
correspond to --task gsm8k_cot_zeroshot
and --task gsm8k_cot --fewshot_as_multiturn
, respectively?
Looking forward to your reply!
Ah no, that file is only useful if you wanted to use more than 8 fewshot examples, it's not used for evaluation in this work.
w/o system prompt:--apply_chat_template=False --system_instruction=None
w system prompt:--apply_chat_template=True --system_instruction="You are a helpful assistant that can assist users with mathematical reasoning." --fewshot_as_multiturn
The GSM8k
column is the standard GSM8k setup, so --task gsm8k
.
The GSM8k CoT
column is --task gsm8k_cot
. (EDIT: this one had a cmd too many)
Why is there a --fewshot_as_multiturn
in w system prompt
setting? So the GSM8k
(which should be zero-shot) column with w system prompt
row contains --fewshot_as_multiturn
?
--fewshot_as_multiturn
is a no-op if there are no fewshot examples, it only determines that if there are fewshot examples, they should be prepared as multiple messages, instead of being in a single message together with the actual query.
Thanks again for solving all my problems!