AALF committed on
Commit
7735ee1
1 Parent(s): 9015944

Update README.md

Files changed (1)
  1. README.md +21 -20
README.md CHANGED
@@ -19,7 +19,7 @@ FuseChat-3.0: Preference Optimization for Implicit Model Fusion
19
  <img src="FuseChat-3.0.png" width=70%/>
20
  </div>
21
 
22
- We present FuseChat-3.0, a series of models crafted to enhance performance by integrating the strengths of multiple source LLMs into more compact target LLMs. To achieve this fusion, we utilized four powerful source LLMs: Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For the target LLMs, we employed three widely used smaller models (Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct) along with two even more compact models (Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct). The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs. The resulting FuseChat-3.0 models demonstrated substantial improvements in tasks related to general conversation, instruction following, mathematics, and coding. Notably, when Llama-3.1-8B-Instruct served as the target LLM, our fusion approach achieved an average improvement of 6.8 points across 14 benchmarks. Moreover, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively. We have released the [FuseChat-3.0](https://huggingface.co/FuseAI) models on Hugging Face; stay tuned for the forthcoming dataset and code.
23
 
24
 
25
 
@@ -32,7 +32,7 @@ FuseChat-3.0, however, takes a different approach by enhancing a single LLM thro
32
 
33
  Our IMF method follows a three-stage process aimed at effectively transferring capabilities from source LLMs to a target LLM. First, during **dataset construction**, we sample N responses from each of the source LLMs and annotate these responses using an external reward model. Second, in the **supervised fine-tuning (SFT)** stage, we fine-tune the target model using the best responses, which not only enhances the target model's capabilities but also helps mitigate the distributional gap between the source and target models. Finally, in the **direct preference optimization (DPO)** stage, we optimize the target model by using the best and worst responses from the source models as preference pairs, further enhancing the target model's performance. The complete pipeline is detailed in the following sections.
34
 
35
- ## Dataset Construction
36
  ### Prompt Selection
37
  Our datasets were designed to enhance the model's instruction following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:
38
 
@@ -41,8 +41,8 @@ Our datasets were designed to enhance model's instruction following, general con
41
  - **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
42
  - **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
43
 
44
- ### Sampling
45
- For each dataset's prompts, we synthesized responses mainly from four different series of source models, specifically [Gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it), [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407), [Qwen-2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct), and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct).
46
 
47
  - **Instruction Following & General Conversation**: We sampled each prompt five times from all the source models.
48
  - **Mathematics**: We retained the responses generated by Llama-3.1-405B-Instruct from the original dataset (OpenMathInstruct-2) and additionally sampled responses using [Qwen-2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen-2.5-Math-72B-Instruct).
@@ -58,7 +58,7 @@ The sampling parameters for different models are detailed in Table below.
58
  </tr>
59
 
60
  <tr>
61
- <td>Gemma-2-27b-it</td>
62
  <td>Temp 0.8 Top-p 0.95</td>
63
  </tr>
64
 
@@ -80,7 +80,7 @@ The sampling parameters for different models are detailed in Table below.
80
 
81
  </table>
82
 
83
- ### Filtering
84
  - **Instruction Following**: To assign RM scores to the five responses generated by each source model, we employed [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) for annotation. We then divided the annotated data into SFT and DPO datasets using a 4:6 ratio. For the SFT phase, we selected the responses with the highest RM scores. During the DPO phase, we paired responses from the same source model, designating those with the highest RM scores as positive samples and those with the lowest RM scores as negative samples. We ensured that the RM score difference between the positive and negative samples in each pair ranged from 0.01 to 0.1.
85
  - **Mathematics**: We first annotated the responses from all source models for correctness by comparing them with the gold labels, and scored them using the RM scores provided by ArmoRM. We then divided the dataset into an SFT phase and a DPO phase. In the SFT phase, we incorporated responses that were correct and had the highest RM scores. This selection ensured that the fine-tuning process was based on high-quality responses that aligned closely with the desired outcomes. For the DPO phase, we constructed paired samples from the same source model. The positive samples consisted of correct answers with the highest RM scores, while the negative samples were incorrect answers with the lowest RM scores. To ensure meaningful comparisons during optimization, we maintained an RM score differential between positive and negative pairs within the range of 0.01 to 0.1.
86
  - **Coding**: We employed a dual-scoring system comprising correctness scores and RM scores for coding evaluation. The correctness scores assessed whether the code passed both static analysis and test cases, ensuring functional accuracy. The RM scores were used for preference evaluation, gauging the quality of responses based on predefined criteria. During the SFT phase, we included responses that not only passed all test cases but also achieved the highest RM scores. This selection ensured that the model was fine-tuned on exemplary code that met both correctness and preference standards. In the DPO phase, we contrasted positive samples—high-scoring responses that passed the tests—with negative samples—low-scoring responses that failed the tests. This comparison aimed to optimize the model's ability to prefer higher-quality code during training. We excluded any instances where all model responses failed to meet the testing criteria. This exclusion was necessary to maintain the integrity of the evaluation process, as such cases did not provide meaningful data for assessing and improving the model's performance.
@@ -172,7 +172,7 @@ Our final dataset comprised 158,784 total entries, with 94,539 entries for the S
172
 
173
  </table>
174
 
175
- ## Training Pipeline
176
  The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs.
177
 
178
  ### SFT
@@ -196,7 +196,7 @@ We used [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) as our fine-tu
196
  </tr>
197
 
198
  <tr>
199
- <td>Gemma-2-9B-it</td>
200
  <td>2e-6</td>
201
  </tr>
202
 
@@ -238,7 +238,7 @@ Different models' hyperparameters are shown in the table below.
238
  </tr>
239
 
240
  <tr>
241
- <td>FuseChat-Gemma-2-9b-SFT</td>
242
  <td>5e-7</td>
243
  <td>0.01</td>
244
  <td>No</td>
@@ -267,6 +267,7 @@ We include more details and release our evaluation code at [FuseEval](https://gi
267
 
268
  The evaluation results of the five series of fused models are as follows, showing that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. With Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. Together, these results demonstrate the effectiveness of FuseChat-3.0.
269
 
 
270
  ### FuseChat-Qwen-2.5-7B-Instruct Performance
271
 
272
  <table class="js-sort-table table hidden">
@@ -293,16 +294,16 @@ The evaluation results of five series fused models are as follows, showing that
293
 
294
  <tr>
295
  <td>MT-Bench</td>
296
- <td>8.42</td>
297
- <td>8.46</td>
298
- <td><strong>8.98</strong></td>
299
  </tr>
300
 
301
  <tr>
302
  <td>AlignBench v1.1</td>
303
- <td>7.49</td>
304
- <td>7.35</td>
305
- <td><strong>7.61</strong></td>
306
  </tr>
307
 
308
  <tr>
@@ -314,7 +315,7 @@ The evaluation results of five series fused models are as follows, showing that
314
 
315
  <tr>
316
  <td>MATH</td>
317
- <td><strong>75</strong></td>
318
  <td>72.7</td>
319
  <td>73.6</td>
320
  </tr>
@@ -322,7 +323,7 @@ The evaluation results of five series fused models are as follows, showing that
322
  <tr>
323
  <td>AMC 23</td>
324
  <td>52.5</td>
325
- <td>45</td>
326
  <td><strong>57.5</strong></td>
327
  </tr>
328
 
@@ -337,7 +338,7 @@ The evaluation results of five series fused models are as follows, showing that
337
  <td>MMLU-Pro</td>
338
  <td><strong>54.1</strong></td>
339
  <td>51.7</td>
340
- <td>53</td>
341
  </tr>
342
 
343
  <tr>
@@ -377,13 +378,13 @@ The evaluation results of five series fused models are as follows, showing that
377
 
378
  <tr>
379
  <td>Average</td>
380
- <td>50</td>
381
  <td>48.9</td>
382
  <td><strong>52.9</strong></td>
383
  </tr>
384
  </table>
385
 
386
- ## BibTeX
387
  ```
388
  @article{yang2024wrpo,
389
  title={Weighted-Reward Preference Optimization for Implicit Model Fusion},
 
19
  <img src="FuseChat-3.0.png" width=70%/>
20
  </div>
21
 
22
+ We present FuseChat-3.0, a series of models crafted to enhance performance by integrating the strengths of multiple source LLMs into more compact target LLMs. To achieve this fusion, we utilized four powerful source LLMs: Gemma-2-27B-It, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For the target LLMs, we employed three widely used smaller models (Llama-3.1-8B-Instruct, Gemma-2-9B-It, and Qwen-2.5-7B-Instruct) along with two even more compact models (Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct). The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs. The resulting FuseChat-3.0 models demonstrated substantial improvements in tasks related to general conversation, instruction following, mathematics, and coding. Notably, when Llama-3.1-8B-Instruct served as the target LLM, our fusion approach achieved an average improvement of 6.8 points across 14 benchmarks. Moreover, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively. We have released the [FuseChat-3.0](https://huggingface.co/FuseAI) models on Hugging Face; stay tuned for the forthcoming dataset and code.
23
 
24
 
25
 
 
32
 
33
  Our IMF method follows a three-stage process aimed at effectively transferring capabilities from source LLMs to a target LLM. First, during **dataset construction**, we sample N responses from each of the source LLMs and annotate these responses using an external reward model. Second, in the **supervised fine-tuning (SFT)** stage, we fine-tune the target model using the best responses, which not only enhances the target model's capabilities but also helps mitigate the distributional gap between the source and target models. Finally, in the **direct preference optimization (DPO)** stage, we optimize the target model by using the best and worst responses from the source models as preference pairs, further enhancing the target model's performance. The complete pipeline is detailed in the following sections.
34
 
35
+ ## Dataset
36
  ### Prompt Selection
37
  Our datasets were designed to enhance the model's instruction following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:
38
 
 
41
  - **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
42
  - **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
43
 
44
+ ### Response Sampling
45
+ For each dataset's prompts, we synthesized responses mainly from four different series of source models, specifically [Gemma-2-27b-It](https://huggingface.co/google/gemma-2-27b-it), [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407), [Qwen-2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct), and [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct).
46
 
47
  - **Instruction Following & General Conversation**: We sampled each prompt five times from all the source models.
48
  - **Mathematics**: We retained the responses generated by Llama-3.1-405B-Instruct from the original dataset (OpenMathInstruct-2) and additionally sampled responses using [Qwen-2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen-2.5-Math-72B-Instruct).
 
58
  </tr>
59
 
60
  <tr>
61
+ <td>Gemma-2-27b-It</td>
62
  <td>Temp 0.8 Top-p 0.95</td>
63
  </tr>
64
 
 
80
 
81
  </table>
82
 
83
+ ### Data Construction
84
  - **Instruction Following**: To assign RM scores to the five responses generated by each source model, we employed [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) for annotation. We then divided the annotated data into SFT and DPO datasets using a 4:6 ratio. For the SFT phase, we selected the responses with the highest RM scores. During the DPO phase, we paired responses from the same source model, designating those with the highest RM scores as positive samples and those with the lowest RM scores as negative samples. We ensured that the RM score difference between the positive and negative samples in each pair ranged from 0.01 to 0.1.
85
  - **Mathematics**: We first annotated the responses from all source models for correctness by comparing them with the gold labels, and scored them using the RM scores provided by ArmoRM. We then divided the dataset into an SFT phase and a DPO phase. In the SFT phase, we incorporated responses that were correct and had the highest RM scores. This selection ensured that the fine-tuning process was based on high-quality responses that aligned closely with the desired outcomes. For the DPO phase, we constructed paired samples from the same source model. The positive samples consisted of correct answers with the highest RM scores, while the negative samples were incorrect answers with the lowest RM scores. To ensure meaningful comparisons during optimization, we maintained an RM score differential between positive and negative pairs within the range of 0.01 to 0.1.
86
  - **Coding**: We employed a dual-scoring system comprising correctness scores and RM scores for coding evaluation. The correctness scores assessed whether the code passed both static analysis and test cases, ensuring functional accuracy. The RM scores were used for preference evaluation, gauging the quality of responses based on predefined criteria. During the SFT phase, we included responses that not only passed all test cases but also achieved the highest RM scores. This selection ensured that the model was fine-tuned on exemplary code that met both correctness and preference standards. In the DPO phase, we contrasted positive samples—high-scoring responses that passed the tests—with negative samples—low-scoring responses that failed the tests. This comparison aimed to optimize the model's ability to prefer higher-quality code during training. We excluded any instances where all model responses failed to meet the testing criteria. This exclusion was necessary to maintain the integrity of the evaluation process, as such cases did not provide meaningful data for assessing and improving the model's performance.
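
The selection rules above can be illustrated with a minimal Python sketch. It is not the authors' released code: the `Response` record, the function names, and the choice to drop pairs whose score gap falls outside the 0.01 to 0.1 range are all assumptions made for illustration, and correctness checks (gold labels, test cases) are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical record for one sampled response to a single prompt.
    source_model: str  # which source LLM produced the response
    text: str
    rm_score: float    # reward-model score, e.g. from ArmoRM


def select_sft_response(responses):
    """Pick the response with the highest RM score across all source models."""
    return max(responses, key=lambda r: r.rm_score)


def build_preference_pairs(responses, min_gap=0.01, max_gap=0.1):
    """Build (chosen, rejected) pairs, one per source model at most.

    Each pair takes the highest- and lowest-scored responses from the same
    source model. Assumption: pairs whose score gap lies outside
    [min_gap, max_gap] are simply discarded.
    """
    by_model = defaultdict(list)
    for r in responses:
        by_model[r.source_model].append(r)

    pairs = []
    for model_responses in by_model.values():
        chosen = max(model_responses, key=lambda r: r.rm_score)
        rejected = min(model_responses, key=lambda r: r.rm_score)
        gap = chosen.rm_score - rejected.rm_score
        if min_gap <= gap <= max_gap:
            pairs.append((chosen, rejected))
    return pairs
```

Pairing within a single source model, as the text describes, keeps the chosen and rejected responses stylistically similar, so the preference signal comes mainly from the quality difference captured by the RM score.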
 
172
 
173
  </table>
174
 
175
+ ## Training
176
  The implicit model fusion process involves a two-stage training pipeline comprising Supervised Fine-Tuning (SFT) to mitigate distribution discrepancies between target and source LLMs, and Direct Preference Optimization (DPO) for learning preferences from multiple source LLMs.
177
 
178
  ### SFT
 
196
  </tr>
197
 
198
  <tr>
199
+ <td>Gemma-2-9B-It</td>
200
  <td>2e-6</td>
201
  </tr>
202
 
 
238
  </tr>
239
 
240
  <tr>
241
+ <td>FuseChat-Gemma-2-9B-SFT</td>
242
  <td>5e-7</td>
243
  <td>0.01</td>
244
  <td>No</td>
 
267
 
268
  The evaluation results of the five series of fused models are as follows, showing that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. With Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. Together, these results demonstrate the effectiveness of FuseChat-3.0.
269
 
270
+
271
  ### FuseChat-Qwen-2.5-7B-Instruct Performance
272
 
273
  <table class="js-sort-table table hidden">
 
294
 
295
  <tr>
296
  <td>MT-Bench</td>
297
+ <td>8.4</td>
298
+ <td>8.5</td>
299
+ <td><strong>9.0</strong></td>
300
  </tr>
301
 
302
  <tr>
303
  <td>AlignBench v1.1</td>
304
+ <td>7.5</td>
305
+ <td>7.4</td>
306
+ <td><strong>7.6</strong></td>
307
  </tr>
308
 
309
  <tr>
 
315
 
316
  <tr>
317
  <td>MATH</td>
318
+ <td><strong>75.0</strong></td>
319
  <td>72.7</td>
320
  <td>73.6</td>
321
  </tr>
 
323
  <tr>
324
  <td>AMC 23</td>
325
  <td>52.5</td>
326
+ <td>45.0</td>
327
  <td><strong>57.5</strong></td>
328
  </tr>
329
 
 
338
  <td>MMLU-Pro</td>
339
  <td><strong>54.1</strong></td>
340
  <td>51.7</td>
341
+ <td>53.0</td>
342
  </tr>
343
 
344
  <tr>
 
378
 
379
  <tr>
380
  <td>Average</td>
381
+ <td>50.0</td>
382
  <td>48.9</td>
383
  <td><strong>52.9</strong></td>
384
  </tr>
385
  </table>
386
 
387
+ ## Citation
388
  ```
389
  @article{yang2024wrpo,
390
  title={Weighted-Reward Preference Optimization for Implicit Model Fusion},