Performance on MMLU Astronomy
#1 opened by meni12345
Based on testing via the LM Evaluation Harness, it seems this model is outperformed by the base version of Llama-2-7B on MMLU Astronomy ("hendrycksTest-astronomy"). Is there a bug in the uploaded model?
hf-causal-experimental (pretrained=universeTBD/astrollama), limit: None, provide_description: False, num_fewshot: 0, batch_size: 4
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hendrycksTest-astronomy | 1 | acc | 0.3816 | ± 0.0395 |
| | | acc_norm | 0.3816 | ± 0.0395 |
hf-causal-experimental (pretrained=meta-llama/Llama-2-7b-hf), limit: None, provide_description: False, num_fewshot: 0, batch_size: 4
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hendrycksTest-astronomy | 1 | acc | 0.4211 | ± 0.0402 |
| | | acc_norm | 0.4211 | ± 0.0402 |
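Worth noting: the reported standard errors are large relative to the gap between the two models, so the difference may not be statistically significant. A minimal sketch of a two-sample z-check using the numbers from the tables above (assuming the two runs are independent):

```python
import math

# Zero-shot accuracies and standard errors reported in the tables above
acc_astrollama, se_astrollama = 0.3816, 0.0395
acc_llama2, se_llama2 = 0.4211, 0.0402

# z-score for the difference in accuracies (independence assumed)
diff = acc_llama2 - acc_astrollama
se_diff = math.sqrt(se_astrollama**2 + se_llama2**2)
z = diff / se_diff
print(f"difference = {diff:.4f}, z = {z:.2f}")  # z well below 1.96
```

At z ≈ 0.70, the gap is well within one combined standard error, so the two accuracies are statistically indistinguishable on this 152-question subset; that said, a base model outperforming its domain-adapted variant is still worth investigating.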
Hi @meni12345, we haven't fine-tuned a chat version of the model, so it was not trained to follow QA-style instructions. We are currently working on this and will release a chat version very soon. Thank you for testing our model!