Can't Reproduce MT Bench Results
What I did should be easy to reproduce: I'm getting a score of ~5.7 on MT-Bench instead of the reported 7.81. Please let me know what I'm doing wrong!
NOTE: I also tried the template mentioned here and could not get any better results.
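For what it's worth, here is roughly how I'd inspect the prompt produced by the chat template bundled with the model's tokenizer, so it can be compared against whatever template FastChat applies (a minimal sketch; the example message is just a placeholder, not an MT-Bench question):

```python
# Sketch: print the raw prompt string built by the tokenizer's own chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-gemma-v0.1")

messages = [{"role": "user", "content": "Write a short haiku about benchmarks."}]  # placeholder message

# tokenize=False returns the prompt as a string; add_generation_prompt=True
# appends the assistant header the model should see before generating.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```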
Follow the setup instructions from here:
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
python gen_model_answer.py --model-path HuggingFaceH4/zephyr-7b-gemma-v0.1 --model-id hf_zephyr-7b-gemma-dpo
python gen_judgment.py --model-list hf_zephyr-7b-gemma-dpo --parallel 4
python show_result.py --model-list hf_zephyr-7b-gemma-dpo
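To see which conversation template FastChat resolves for this model path in gen_model_answer.py, I assume a check along these lines works (sketch using FastChat's get_conversation_template helper; the message is again a placeholder):

```python
# Sketch: inspect the conversation template FastChat picks for this model path
# and print the prompt it would build, for comparison with the tokenizer's
# chat template above.
from fastchat.model import get_conversation_template

conv = get_conversation_template("HuggingFaceH4/zephyr-7b-gemma-v0.1")
print(conv.name)  # which built-in template was matched

conv.append_message(conv.roles[0], "Write a short haiku about benchmarks.")  # placeholder user turn
conv.append_message(conv.roles[1], None)  # empty assistant slot -> generation prompt
print(conv.get_prompt())
```

The output of show_result.py: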
########## First turn ##########
model                    turn    score
hf_zephyr-7b-gemma-dpo   1       5.696203

########## Second turn ##########
model                    turn    score
hf_zephyr-7b-gemma-dpo   2       5.7

########## Average ##########
model                    score
hf_zephyr-7b-gemma-dpo   5.698113