AALF committed
Commit 6d890a6 · verified · 1 Parent(s): 33c7aa8

Update README.md

Files changed (1)
  1. README.md +69 -51
README.md CHANGED
@@ -267,118 +267,136 @@ We include more details and release our evaluation code at [FuseEval](https://gi
 
  The evaluation results of five series fused models are as follows, showing that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. When selecting Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on instruction-following test sets AlpacaEval-2 and Arena-Hard respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. All these results demonstrate the effectiveness and success of FuseChat-3.0.
 
- ### FuseChat-Llama-3.2-3B-Instruct Performance
+
+ ### FuseChat-Llama-3.1-8B-Instruct Performance
+
  <table class="js-sort-table table hidden">
  <tr>
  <td class="js-sort-string"><strong>Benchmarks</strong></td>
- <td class="js-sort-string"><strong>Llama-3.2-3B-Instruct</strong></td>
- <td class="js-sort-string"><strong>FuseChat-Llama-3.2-3B-SFT</strong></td>
- <td class="js-sort-string"><strong>FuseChat-Llama-3.2-3B-Instruct</strong></td>
+ <td class="js-sort-string"><strong>Llama-3.1-8B-Instruct</strong></td>
+ <td class="js-sort-string"><strong>Llama-3.1-Tulu-3-8B</strong></td>
+ <td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-SFT</strong></td>
+ <td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-Instruct</strong></td>
  </tr>
 
  <tr>
  <td style="white-space: nowrap;">AlpacaEval-2 (LC %)</td>
- <td>21.4</td>
- <td>31.1</td>
- <td><strong>54</strong></td>
+ <td>28.3</td>
+ <td>33.4</td>
+ <td>41.3</td>
+ <td><strong>65.4</strong></td>
  </tr>
 
  <tr>
  <td>Arena-Hard (WR %)</td>
- <td>16.6</td>
- <td>21.3</td>
- <td><strong>30.2</strong></td>
+ <td>28.1</td>
+ <td>45.6</td>
+ <td>38.7</td>
+ <td><strong>58.2</strong></td>
  </tr>
 
  <tr>
  <td>MT-Bench</td>
- <td>6.87</td>
- <td>7.33</td>
- <td><strong>7.66</strong></td>
+ <td>8.38</td>
+ <td>8.34</td>
+ <td>8.54</td>
+ <td><strong>9</strong></td>
  </tr>
 
  <tr>
  <td>AlignBench v1.1</td>
- <td>3.83</td>
- <td>5.5</td>
- <td><strong>5.91</strong></td>
+ <td>4.61</td>
+ <td>6.2</td>
+ <td>6.25</td>
+ <td><strong>6.69</strong></td>
  </tr>
 
  <tr>
  <td>GSM8K</td>
- <td>82</td>
- <td><strong>82.8</strong></td>
- <td>82</td>
+ <td>85.9</td>
+ <td><strong>88.6</strong></td>
+ <td>87</td>
+ <td>88</td>
  </tr>
 
  <tr>
  <td>MATH</td>
- <td>51.4</td>
- <td>52.9</td>
- <td><strong>53.1</strong></td>
+ <td>50.7</td>
+ <td>47.5</td>
+ <td>54.7</td>
+ <td><strong>55.2</strong></td>
  </tr>
 
  <tr>
- <td>AMC23</td>
- <td>22.5</td>
- <td>20</td>
- <td><strong>35</strong></td>
+ <td>AMC 23</td>
+ <td>25</td>
+ <td>25</td>
+ <td>30</td>
+ <td><strong>37.5</strong></td>
  </tr>
 
  <tr>
  <td>LiveBench 0831</td>
- <td>23.4</td>
- <td>24.5</td>
- <td><strong>24.9</strong></td>
+ <td>27.6</td>
+ <td>30.1</td>
+ <td>30.2</td>
+ <td><strong>32</strong></td>
  </tr>
-
+
  <tr>
  <td>MMLU-Pro</td>
- <td>39.3</td>
- <td><strong>40.3</strong></td>
- <td>40.3</td>
+ <td><strong>50</strong></td>
+ <td>42.9</td>
+ <td>47.8</td>
+ <td>49.2</td>
  </tr>
 
  <tr>
  <td>MMLU-redux</td>
- <td>58.5</td>
- <td>58.2</td>
- <td><strong>59</strong></td>
+ <td>67.2</td>
+ <td>66.3</td>
+ <td>68.4</td>
+ <td><strong>69.2</strong></td>
  </tr>
 
  <tr>
  <td>GPQA-Diamond</td>
- <td>29.8</td>
- <td>33.3</td>
- <td><strong>33.8</strong></td>
+ <td>33.8</td>
+ <td>35.9</td>
+ <td><strong>37.9</strong></td>
+ <td>34.9</td>
  </tr>
 
  <tr>
  <td>HumanEval</td>
- <td>61</td>
- <td><strong>62.8</strong></td>
- <td>60.4</td>
+ <td>69.5</td>
+ <td>66.5</td>
+ <td>69.5</td>
+ <td><strong>71.3</strong></td>
  </tr>
 
  <tr>
  <td>MBPP</td>
- <td><strong>68.5</strong></td>
- <td>67.5</td>
- <td>67.5</td>
+ <td><strong>75.4</strong></td>
+ <td>56.3</td>
+ <td>71.4</td>
+ <td>72</td>
  </tr>
 
  <tr>
  <td>LiveCodeBench<br>2408-2411</td>
- <td>8.3</td>
- <td>7.1</td>
- <td><strong>9</strong></td>
+ <td>12.3</td>
+ <td>10.6</td>
+ <td>12.6</td>
+ <td><strong>13.1</strong></td>
  </tr>
 
  <tr>
  <td>Average</td>
- <td>35.2</td>
- <td>36.8</td>
- <td><strong>40.2</strong></td>
+ <td>40.5</td>
+ <td>40.2</td>
+ <td>43.2</td>
+ <td><strong>47.3</strong></td>
  </tr>
  </table>
 
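The headline numbers in the paragraph above can be cross-checked against the added table: the Average row is consistent with an unweighted mean of the 14 listed benchmark scores (with MT-Bench and AlignBench kept on their raw 0-10 scales), and the reported 6.8-point gain is the gap between the two Average entries, 47.3 vs. 40.5. Below is a minimal sketch of that check; the script and its variable names are illustrative only and not part of the repository, and the averaging rule is an assumption inferred from the table.

```python
# Cross-check the Average row of the FuseChat-Llama-3.1-8B-Instruct table.
# Scores are copied verbatim from the table above, in row order:
# AlpacaEval-2, Arena-Hard, MT-Bench, AlignBench, GSM8K, MATH, AMC 23,
# LiveBench, MMLU-Pro, MMLU-redux, GPQA-Diamond, HumanEval, MBPP, LiveCodeBench.
llama_31_8b_instruct = [28.3, 28.1, 8.38, 4.61, 85.9, 50.7, 25, 27.6,
                        50, 67.2, 33.8, 69.5, 75.4, 12.3]
fusechat_llama_31_8b_instruct = [65.4, 58.2, 9, 6.69, 88, 55.2, 37.5, 32,
                                 49.2, 69.2, 34.9, 71.3, 72, 13.1]

def mean(scores):
    """Unweighted mean over the 14 benchmarks, rounded to one decimal (assumed)."""
    return round(sum(scores) / len(scores), 1)

baseline = mean(llama_31_8b_instruct)        # 40.5, matches the Average row
fused = mean(fusechat_llama_31_8b_instruct)  # 47.3, matches the Average row
print(baseline, fused, round(fused - baseline, 1))  # 40.5 47.3 6.8
```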