Update README.md
README.md

The evaluation results of the five series of fused models are shown below, demonstrating that our FuseChat-3.0 models achieved varying degrees of improvement across different target models. When selecting Llama-3.1-8B-Instruct as the target model, our fusion model **FuseChat-Llama-3.1-8B-Instruct achieved an average performance improvement of 6.8 points across 14 benchmarks. Notably, it showed significant improvements of 37.1 and 30.1 points on the instruction-following test sets AlpacaEval-2 and Arena-Hard, respectively**. Additionally, FuseChat-Llama-3.1-8B-Instruct outperformed AllenAI's recently released Llama-3.1-Tulu-3-8B model on all benchmarks except GSM8K and GPQA-Diamond. All these results demonstrate the effectiveness and success of FuseChat-3.0.

### FuseChat-Llama-3.1-8B-Instruct Performance

<table class="js-sort-table table hidden">
<tr>
<td class="js-sort-string"><strong>Benchmarks</strong></td>
<td class="js-sort-string"><strong>Llama-3.1-8B-Instruct</strong></td>
<td class="js-sort-string"><strong>Llama-3.1-Tulu-3-8B</strong></td>
<td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-SFT</strong></td>
<td class="js-sort-string"><strong>FuseChat-Llama-3.1-8B-Instruct</strong></td>
</tr>
<tr>
<td style="white-space: nowrap;">AlpacaEval-2 (LC %)</td>
<td>28.3</td>
<td>33.4</td>
<td>41.3</td>
<td><strong>65.4</strong></td>
</tr>
<tr>
<td>Arena-Hard (WR %)</td>
<td>28.1</td>
<td>45.6</td>
<td>38.7</td>
<td><strong>58.2</strong></td>
</tr>
<tr>
<td>MT-Bench</td>
<td>8.38</td>
<td>8.34</td>
<td>8.54</td>
<td><strong>9</strong></td>
</tr>
<tr>
<td>AlignBench v1.1</td>
<td>4.61</td>
<td>6.2</td>
<td>6.25</td>
<td><strong>6.69</strong></td>
</tr>
<tr>
<td>GSM8K</td>
<td>85.9</td>
<td><strong>88.6</strong></td>
<td>87</td>
<td>88</td>
</tr>
<tr>
<td>MATH</td>
<td>50.7</td>
<td>47.5</td>
<td>54.7</td>
<td><strong>55.2</strong></td>
</tr>
<tr>
<td>AMC 23</td>
<td>25</td>
<td>25</td>
<td>30</td>
<td><strong>37.5</strong></td>
</tr>
<tr>
<td>LiveBench 0831</td>
<td>27.6</td>
<td>30.1</td>
<td>30.2</td>
<td><strong>32</strong></td>
</tr>
<tr>
<td>MMLU-Pro</td>
<td><strong>50</strong></td>
<td>42.9</td>
<td>47.8</td>
<td>49.2</td>
</tr>
<tr>
<td>MMLU-redux</td>
<td>67.2</td>
<td>66.3</td>
<td>68.4</td>
<td><strong>69.2</strong></td>
</tr>
<tr>
<td>GPQA-Diamond</td>
<td>33.8</td>
<td>35.9</td>
<td><strong>37.9</strong></td>
<td>34.9</td>
</tr>
<tr>
<td>HumanEval</td>
<td>69.5</td>
<td>66.5</td>
<td>69.5</td>
<td><strong>71.3</strong></td>
</tr>
<tr>
<td>MBPP</td>
<td><strong>75.4</strong></td>
<td>56.3</td>
<td>71.4</td>
<td>72</td>
</tr>
<tr>
<td>LiveCodeBench<br>2408-2411</td>
<td>12.3</td>
<td>10.6</td>
<td>12.6</td>
<td><strong>13.1</strong></td>
</tr>
<tr>
<td>Average</td>
<td>40.5</td>
<td>40.2</td>
<td>43.2</td>
<td><strong>47.3</strong></td>
</tr>
</table>
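
As a sanity check, the headline numbers in the paragraph above can be recomputed directly from the table. The short Python sketch below is only an illustration, not part of the evaluation code: the two score lists are hand-copied from the table, and it assumes the Average row is a plain unweighted mean over the 14 benchmark scores (mixing percentage-scale and 10-point-scale metrics), which matches the values shown.

```python
# Recompute the Average row and the reported gains from the table values.
# Scores copied by hand from the table above, in row order: AlpacaEval-2,
# Arena-Hard, MT-Bench, AlignBench v1.1, GSM8K, MATH, AMC 23, LiveBench 0831,
# MMLU-Pro, MMLU-redux, GPQA-Diamond, HumanEval, MBPP, LiveCodeBench.
llama_3_1_8b_instruct = [28.3, 28.1, 8.38, 4.61, 85.9, 50.7, 25.0,
                         27.6, 50.0, 67.2, 33.8, 69.5, 75.4, 12.3]
fusechat_llama_3_1_8b = [65.4, 58.2, 9.00, 6.69, 88.0, 55.2, 37.5,
                         32.0, 49.2, 69.2, 34.9, 71.3, 72.0, 13.1]

avg_target = sum(llama_3_1_8b_instruct) / len(llama_3_1_8b_instruct)
avg_fused = sum(fusechat_llama_3_1_8b) / len(fusechat_llama_3_1_8b)

print(f"Average (Llama-3.1-8B-Instruct): {avg_target:.1f}")   # ~40.5
print(f"Average (FuseChat-Llama-3.1-8B): {avg_fused:.1f}")    # ~47.3
print(f"Average gain over target model:  {avg_fused - avg_target:.1f}")  # ~6.8
print(f"AlpacaEval-2 gain: {65.4 - 28.3:.1f}")  # 37.1
print(f"Arena-Hard gain:   {58.2 - 28.1:.1f}")  # 30.1
```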