Update README.md
README.md
CHANGED
@@ -53,7 +53,6 @@ The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistr

<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">

Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs the log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1].

-[^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.

We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO policy and the base policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
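As the two paragraphs above describe, the reward is the log-probability ratio of a response under Mixtral-8x7B-Instruct-v0.1 versus Mixtral-8x7B-v0.1. Below is a minimal sketch of that scoring, assuming the standard Hugging Face `transformers` API and the usual DPO form r(x, y) = β · (log π_instruct(y | x) − log π_base(y | x)); the helper names, the β scaling, and the prompt formatting are illustrative assumptions, not the released code:

```python
# Sketch only: DPO-style log-ratio reward with Mixtral-8x7B-Instruct-v0.1 as the
# "policy" and Mixtral-8x7B-v0.1 as the base model, as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities assigned to the response tokens given the prompt."""
    # Approximation: assumes the prompt tokenization is a prefix of the full tokenization.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                    # [1, seq, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()       # keep response tokens only

def dpo_reward(policy, base, tokenizer, prompt: str, response: str, beta: float = 1.0) -> float:
    """Log-ratio (DPO-style) reward used as a proxy reward signal. beta is an assumption."""
    return beta * (response_logprob(policy, tokenizer, prompt, response)
                   - response_logprob(base, tokenizer, prompt, response))

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
policy = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

print(dpo_reward(policy, base, tokenizer,
                 "What is preference tuning?",
                 " It aligns a model with human preferences."))
```

In practice the prompt would be wrapped in the Instruct model's chat template and the scoring batched; the sketch keeps only the log-ratio itself.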
@@ -61,6 +60,7 @@ Having Mixtral log-ratio as reward model, we then choose iterative rejection sam

The prompt space for preference tuning was uniformly sampled by source from the [LAB](https://arxiv.org/abs/2403.01081) SFT data distribution, which has extensive coverage of knowledge, domains, and tasks.
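The second hunk's context line refers to iterative rejection sampling with this Mixtral log-ratio as the reward. Below is a rough sketch of a single round, under the assumption that the highest- and lowest-scoring candidates become the chosen/rejected pair for the next preference-tuning iteration; `generate_fn`, `reward_fn`, and `num_candidates` are hypothetical names, not the released pipeline:

```python
# Sketch of one rejection-sampling round: sample several candidates per prompt,
# score them with the log-ratio reward, and keep the best/worst as a preference
# pair. The helper callables and the best-vs-worst pairing are assumptions.
def rejection_sample_pairs(prompts, generate_fn, reward_fn, num_candidates=8):
    """generate_fn(prompt) -> response text; reward_fn(prompt, response) -> float."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        scored = sorted(candidates, key=lambda resp: reward_fn(prompt, resp))
        pairs.append({"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]})
    return pairs

# e.g. pairs = rejection_sample_pairs(lab_prompts, policy_generate,
#                  lambda p, r: dpo_reward(policy, base, tokenizer, p, r))
```

Repeating this loop with the updated policy gives the iterative variant referred to above.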

+[^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.

### Discussion