Update README.md
README.md
CHANGED
@@ -53,7 +53,6 @@ The official **Merlinite-7B-pt** achieves **7.96** on MT-Bench, surpassing Mistr

<img src="https://cdn-uploads.huggingface.co/production/uploads/66104696134c832243bde60d/YVrrGg2bTll1wDclBqxPZ.png" width="650">

Instead of training preference models or prompting large language models (LLMs) as a judge, we took an alternate approach to reward modeling that uses readily available LLMs and employs the log-ratio calculation (DPO reward) as a proxy for reward assessments, as outlined in Lambert (2024) [^1].

-[^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.

We chose Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x7B-v0.1 as the basis for computing rewards; while this choice does not conform precisely to the relationship between the DPO policy and the base policy, it nevertheless yields strong performance, with an average score of 74.7 on the [RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
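As the two paragraphs above describe, the reward is the log-probability ratio of a response under Mixtral-8x7B-Instruct-v0.1 versus Mixtral-8x7B-v0.1. Below is a minimal sketch of that scoring, assuming the standard Hugging Face `transformers` API and the usual DPO form r(x, y) = β · (log π_instruct(y | x) − log π_base(y | x)); the helper names, the β scaling, and the prompt formatting are illustrative assumptions, not the released code:

```python
# Sketch only: DPO-style log-ratio reward with Mixtral-8x7B-Instruct-v0.1 as the
# "policy" and Mixtral-8x7B-v0.1 as the base model, as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities assigned to the response tokens given the prompt."""
    # Approximation: assumes the prompt tokenization is a prefix of the full tokenization.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                    # [1, seq, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()       # keep response tokens only

def dpo_reward(policy, base, tokenizer, prompt: str, response: str, beta: float = 1.0) -> float:
    """Log-ratio (DPO-style) reward used as a proxy reward signal. beta is an assumption."""
    return beta * (response_logprob(policy, tokenizer, prompt, response)
                   - response_logprob(base, tokenizer, prompt, response))

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
policy = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

print(dpo_reward(policy, base, tokenizer,
                 "What is preference tuning?",
                 " It aligns a model with human preferences."))
```

In practice the prompt would be wrapped in the Instruct model's chat template and the scoring batched; the sketch keeps only the log-ratio itself.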
@@ -61,6 +60,7 @@ Having Mixtral log-ratio as reward model, we then choose iterative rejection sam

The prompt space for preference tuning was uniformly sampled by source from the [LAB](https://arxiv.org/abs/2403.01081) SFT data distribution, which has extensive coverage of knowledge, domains, and tasks.
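The second hunk's context line refers to iterative rejection sampling with this Mixtral log-ratio as the reward. Below is a rough sketch of a single round, under the assumption that the highest- and lowest-scoring candidates become the chosen/rejected pair for the next preference-tuning iteration; `generate_fn`, `reward_fn`, and `num_candidates` are hypothetical names, not the released pipeline:

```python
# Sketch of one rejection-sampling round: sample several candidates per prompt,
# score them with the log-ratio reward, and keep the best/worst as a preference
# pair. The helper callables and the best-vs-worst pairing are assumptions.
def rejection_sample_pairs(prompts, generate_fn, reward_fn, num_candidates=8):
    """generate_fn(prompt) -> response text; reward_fn(prompt, response) -> float."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        scored = sorted(candidates, key=lambda resp: reward_fn(prompt, resp))
        pairs.append({"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]})
    return pairs

# e.g. pairs = rejection_sample_pairs(lab_prompts, policy_generate,
#                  lambda p, r: dpo_reward(policy, base, tokenizer, p, r))
```

Repeating this loop with the updated policy gives the iterative variant referred to above.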

+[^1]: Lambert, 2024. *RewardBench: Evaluating Reward Models for Language Modeling*.

### Discussion