puneeshkhanna committed
Add HF leaderboard eval comparison

README.md
<br>

## Benchmarks
We report the official normalized evaluation results from the HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
  <colgroup>
    <col style="width: 10%;">
    <col style="width: 7%;">
    <col style="width: 7%;">
    <col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
  </colgroup>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Llama-3.1-8B-Instruct</th>
      <th>Qwen2.5-7B-Instruct</th>
      <th>Falcon3-7B-Instruct</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>IFEval</td>
      <td><b>78.56</b></td>
      <td>75.85</td>
      <td>76.12</td>
    </tr>
    <tr>
      <td>BBH (3-shot)</td>
      <td>29.89</td>
      <td>34.89</td>
      <td><b>37.92</b></td>
    </tr>
    <tr>
      <td>MATH Lvl-5 (4-shot)</td>
      <td>19.34</td>
      <td>0.00</td>
      <td><b>31.87</b></td>
    </tr>
    <tr>
      <td>GPQA (0-shot)</td>
      <td>2.35</td>
      <td>5.48</td>
      <td><b>8.05</b></td>
    </tr>
    <tr>
      <td>MUSR (0-shot)</td>
      <td>8.41</td>
      <td>8.45</td>
      <td><b>21.17</b></td>
    </tr>
    <tr>
      <td>MMLU-PRO (5-shot)</td>
      <td>30.68</td>
      <td><b>36.52</b></td>
      <td>34.30</td>
    </tr>
  </tbody>
</table>

Also, we report our internal pipeline benchmarks in the following table.
- We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
- We use the same batch size across all models.
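The settings above map onto lm-evaluation-harness flags roughly as follows. This is a sketch rather than the exact pipeline command: the task, few-shot count, batch size, and model id (`tiiuae/Falcon3-7B-Instruct`) are illustrative assumptions.

```shell
# Sketch of an lm-evaluation-harness run mirroring the settings above:
# chat template applied, few-shot examples sent as multi-turn messages,
# and a fixed batch size. Task, shot count, batch size, and model id
# are illustrative, not the exact internal configuration.
lm_eval --model hf \
    --model_args pretrained=tiiuae/Falcon3-7B-Instruct,dtype=bfloat16 \
    --tasks mmlu_pro \
    --num_fewshot 5 \
    --batch_size 8 \
    --apply_chat_template \
    --fewshot_as_multiturn
```

Note that `--fewshot_as_multiturn` requires `--apply_chat_template`, since the few-shot examples are delivered as prior conversation turns.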
</tbody>
</table>

## Useful links
- View our [release blogpost](https://huggingface.co/blog/falcon3).
- Feel free to join [our Discord server](https://discord.gg/fwXpMyGc) if you have questions or want to interact with our researchers and developers.

## Technical Report
Coming soon...