Update README.md
README.md CHANGED
@@ -52,6 +52,9 @@ The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistr

<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">

+**The figure above compares MT-Bench scores across the 8 prompt domains.**
+
+
Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1].

We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO-policy and the base-policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
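
To make the log-ratio reward concrete, here is a minimal sketch, assuming the Hugging Face `transformers` and `torch` libraries, of how a DPO-style reward can be computed as the difference in a response's log-likelihood under Mixtral-8x7B-Instruct-v0.1 (standing in for the DPO policy) and Mixtral-8x7B-v0.1 (standing in for the base policy). The helpers `response_logprob` and `dpo_reward`, the `beta` scale, and the example prompt are illustrative assumptions, not the authors' pipeline, and running this as written requires hardware able to hold both Mixtral checkpoints.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: score a (prompt, response) pair by the log-ratio of its likelihood
# under an instruct model (proxy for the DPO policy) vs. the base model
# (proxy for the reference policy).
POLICY_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
REFERENCE_ID = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(
    POLICY_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
reference = AutoModelForCausalLM.from_pretrained(
    REFERENCE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits
    # Log-probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:].to(log_probs.device)
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the response.
    response_start = prompt_ids.shape[1] - 1
    return token_logps[:, response_start:].sum().item()

def dpo_reward(prompt: str, response: str, beta: float = 1.0) -> float:
    """DPO-style reward: beta * (log p_policy(response) - log p_reference(response))."""
    return beta * (
        response_logprob(policy, prompt, response)
        - response_logprob(reference, prompt, response)
    )

# Example: rank two candidate responses to the same prompt; the one with the
# higher reward is treated as the preferred ("chosen") answer.
prompt = "Explain why the sky is blue to a 6-year-old.\n"
print(dpo_reward(prompt, "Sunlight bounces off the air, and the blue part scatters the most."))
print(dpo_reward(prompt, "Because it just is."))
```

Ranking candidate completions by this score is what lets an off-the-shelf instruct/base pair act as a reward model without any preference-model training, at the cost of the policy/reference mismatch noted above.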