Can't Reproduce MT Bench Results
What I did should be easy to reproduce: I'm getting a score of ~5.7 on MT-Bench instead of the reported 7.81. Please let me know what I'm doing wrong!
NOTE: I also tried the template mentioned here and could not get any better results.
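For what it's worth, here is roughly how I'd inspect the prompt produced by the chat template bundled with the model's tokenizer, so it can be compared against whatever template FastChat applies (a minimal sketch; the example message is just a placeholder, not an MT-Bench question):

```python
# Sketch: print the raw prompt string built by the tokenizer's own chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-gemma-v0.1")

messages = [{"role": "user", "content": "Write a short haiku about benchmarks."}]  # placeholder message

# tokenize=False returns the prompt as a string; add_generation_prompt=True
# appends the assistant header the model should see before generating.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```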
Follow the setup instructions from here:
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
python gen_model_answer.py --model-path HuggingFaceH4/zephyr-7b-gemma-v0.1 --model-id hf_zephyr-7b-gemma-dpo
python gen_judgment.py --model-list hf_zephyr-7b-gemma-dpo --parallel 4
python show_result.py --model-list hf_zephyr-7b-gemma-dpo
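To see which conversation template FastChat resolves for this model path in gen_model_answer.py, I assume a check along these lines works (sketch using FastChat's get_conversation_template helper; the message is again a placeholder):

```python
# Sketch: inspect the conversation template FastChat picks for this model path
# and print the prompt it would build, for comparison with the tokenizer's
# chat template above.
from fastchat.model import get_conversation_template

conv = get_conversation_template("HuggingFaceH4/zephyr-7b-gemma-v0.1")
print(conv.name)  # which built-in template was matched

conv.append_message(conv.roles[0], "Write a short haiku about benchmarks.")  # placeholder user turn
conv.append_message(conv.roles[1], None)  # empty assistant slot -> generation prompt
print(conv.get_prompt())
```

The output of show_result.py: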
########## First turn ##########
model                    turn    score
hf_zephyr-7b-gemma-dpo   1       5.696203

########## Second turn ##########
model                    turn    score
hf_zephyr-7b-gemma-dpo   2       5.7

########## Average ##########
model                    score
hf_zephyr-7b-gemma-dpo   5.698113