Confusion about the description of evaluation settings

#7
by CrazyD - opened

Could you please provide more details on evaluation settings of GSM8K dataset?
We evaluate GSM8K CoT with chat template and 8-way few shot as multiturn.

  1. How do you implement CoT with only chat template?
  2. How do you compute exact match metric, via template parse?
  3. What's the difference between flexible and strict extract?
Tom Goldstein's Lab at University of Maryland, College Park org

Hi, these are settings that are best understood in the context of the lm-eval harness, https://github.com/EleutherAI/lm-evaluation-harness

For example, you can replicat GSM8k evaluations like so

lm_eval --model hf --model_args pretrained=tomg-group-umd/huginn_swa_100_10_avg_0.9_merge,trust_remote_code=True,dtype="bfloat16",mean_recurrence=64  \
  --tasks gsm8k_cot  --batch_size=1  --output_path=outputs/evals  --log_samples --apply_chat_template=True \
  --system_instruction="You are a helpful assistant that can assist users with mathematical reasoning."  --fewshot_as_multiturn

Pick a model checkpoint you want for the pretrained argument, and a recurrence argument for mean_recurrence.

Tom Goldstein's Lab at University of Maryland, College Park org

Thanks for your reply. I also noticed that there's a gsm8k_long_cot.yaml file in evaluate_raven, which is different from lm_eval's gsm8k-cot.yaml file. Is this file useful in reproducing the result?

By the way, I'd like to confirm whether "w/o sys. prompt" means --system_instruction=None in Table 2? And does the configuration for GSM8K and GSM8K CoT correspond to --task gsm8k_cot_zeroshot and --task gsm8k_cot --fewshot_as_multiturn, respectively?

Looking forward to your reply!

Tom Goldstein's Lab at University of Maryland, College Park org
edited 20 days ago

Ah no, that file is only useful if you wanted to use more than 8 fewshot examples, it's not used for evaluation in this work.

w/o system prompt:
--apply_chat_template=False --system_instruction=None

w system prompt:
--apply_chat_template=True --system_instruction="You are a helpful assistant that can assist users with mathematical reasoning." --fewshot_as_multiturn

The GSM8k column is the standard GSM8k setup, so --task gsm8k.
The GSM8k CoT column is --task gsm8k_cot. (EDIT: this one had a cmd too many)

Why is there a --fewshot_as_multiturn in w system prompt setting? So the GSM8k (which should be zero-shot) column with w system prompt row contains --fewshot_as_multiturn?

Tom Goldstein's Lab at University of Maryland, College Park org

--fewshot_as_multiturn is a no-op if there are no fewshot examples, it only determines that if there are fewshot examples, they should be prepared as multiple messages, instead of being in a single message together with the actual query.

Thanks again for solving all my problems!

CrazyD changed discussion status to closed

Sign up or log in to comment