Challenges in Reproducing Zero-Shot Accuracy for ARC-C with Llama-3.2-1B-Instruct
I've been trying to reproduce the reported 0-shot accuracy for ARC-C (59.4%) using the Llama-3.2-1B-Instruct model, but I am only achieving 39%. Here's what I've done so far:
I followed the instructions from the transformers documentation on Llama 3, adding a pad_token and resizing the token embeddings accordingly.
I successfully reproduced the results for 8-shot GSM8K(45.3% compared to 44.4% reported in the paper). Below is an example of a 1-shot prompt for GSM8k:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 21 Jan 2025
<|eot_id|><|start_header_id|>user<|end_header_id|>
Question: For every 12 cans you recycle, you receive $0.50, and for every 5 kilograms of newspapers, you receive $1.50. If your family collected 144 cans and 20 kilograms of newspapers, how much money would you receive?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Answer: There are 144/12 = <<144/12=12>>12 sets of 12 cans that the family collected.
So, the family would receive $0.50 x 12 = $<<0.50*12=6>>6 for the cans.
There are 20/5 = <<20/5=4>>4 sets of 5 kilograms of newspapers that the family collected.
So, the family would receive $1.50 x 4 = $<<1.50*4=6>>6 for the newspapers.
Therefore, the family would receive a total of $6 + $6 = $<<6+6=12>>12.
#### 12<|eot_id|><|start_header_id|>user<|end_header_id|>
Question: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step. At the end, you MUST write the answer as an integer after '####'.<|eot_id|>
Using a similar approach for ARC-C, I crafted the following prompt template for 0-shot evaluation:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 21 Jan 2025
You are an AI assistant answering multiple-choice questions. Always evaluate the question carefully and respond with only a single letter: A, B, C, or D. Do not provide any explanation or additional text.<|eot_id|><|start_header_id|>user<|end_header_id|>
Question: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?
Options:
A. Planetary density will decrease.
B. Planetary years will become longer.
C. Planetary days will become shorter.
D. Planetary gravity will become stronger.
Answer:<|eot_id|>
Despite my efforts, the accuracy remains significantly below the reported value. Is there something I might be overlooking in the setup, preprocessing, or prompt design? Any guidance would be greatly appreciated! I was able to increase the performance to 51.3% with ten-shot evaluation, but this is still much lower than the value reported in the paper.