Performance on MMLU Astronomy

#1
by meni12345 - opened

Based on testing with the LM Evaluation Harness, this model appears to be outperformed by the base Llama-2-7B on MMLU Astronomy ("hendrycksTest-astronomy"). Is there a bug in the uploaded model?

hf-causal-experimental (pretrained=universeTBD/astrollama), limit: None, provide_description: False, num_fewshot: 0, batch_size: 4

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hendrycksTest-astronomy | 1 | acc | 0.3816 | ± 0.0395 |
| | | acc_norm | 0.3816 | ± 0.0395 |

hf-causal-experimental (pretrained=meta-llama/Llama-2-7b-hf), limit: None, provide_description: False, num_fewshot: 0, batch_size: 4

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hendrycksTest-astronomy | 1 | acc | 0.4211 | ± 0.0402 |
| | | acc_norm | 0.4211 | ± 0.0402 |
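
For reference, the settings listed above should correspond roughly to an invocation of the (pre-0.4, `main.py`-style) EleutherAI harness like the following; the exact flags can differ between harness versions, so treat this as a sketch rather than the exact command used:

```shell
# Approximate lm-evaluation-harness invocation matching the settings above
# (old-style CLI; swap in meta-llama/Llama-2-7b-hf for the baseline run).
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=universeTBD/astrollama \
    --tasks hendrycksTest-astronomy \
    --num_fewshot 0 \
    --batch_size 4
```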
UniverseTBD org

Hi @meni12345 , we haven't fine-tuned a chat version of the model, so it was never trained on QA-style instructions. We are currently in the process of doing so and will provide a chat version very soon. Thank you for testing our model!
