gx-ai-architect committed
Commit 800c176 · verified · 1 Parent(s): cd8e540

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -52,6 +52,9 @@ The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistr
 
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">
 
+ **Above shows MT-Bench score comparisons on 8 prompt domains**
+
+
  Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1].
 
  We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO-policy and the base-policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
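
The log-ratio (DPO reward) mentioned in the README context lines can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dpo_reward`, the `beta` scale factor, and the toy log-probabilities are all hypothetical. In practice the per-token log-probabilities would come from scoring the same response with Mixtral-8x7B-Instruct-v0.1 (in the policy role) and Mixtral-8x7B-v0.1 (in the reference role).

```python
from typing import Sequence


def dpo_reward(
    policy_token_logprobs: Sequence[float],
    ref_token_logprobs: Sequence[float],
    beta: float = 1.0,  # assumed scale factor; the README does not state one
) -> float:
    """Return beta * (log p_policy(y|x) - log p_ref(y|x)) for one response.

    Both inputs hold the log-probability of each response token under the
    respective model, so they must cover the same tokens.
    """
    if len(policy_token_logprobs) != len(ref_token_logprobs):
        raise ValueError("both models must score the same response tokens")
    # Sequence log-probability is the sum of per-token log-probabilities.
    policy_logp = sum(policy_token_logprobs)
    ref_logp = sum(ref_token_logprobs)
    return beta * (policy_logp - ref_logp)


# Toy numbers (made up for illustration, not real model outputs).
reward_a = dpo_reward([-0.5, -0.25], [-1.0, -0.75])  # policy likes A more than ref does
reward_b = dpo_reward([-2.0, -1.5], [-1.0, -0.5])    # policy likes B less than ref does

# The response with the higher log-ratio is treated as the preferred one.
preferred = "A" if reward_a > reward_b else "B"
```

A positive value means the instruct model assigns the response more probability than the base model does; ranking candidate responses by this value is what lets the pair act as a reward proxy without training a separate preference model.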