puneeshkhanna committed
Commit a46dd97 · verified · 1 Parent(s): 2d89dec

Add HF leaderboard eval comparison

Files changed (1): README.md (+60, -1)
README.md CHANGED
@@ -91,7 +91,63 @@ print(response)
 <br>
 
 ## Benchmarks
-We report in the following table our internal pipeline benchmarks.
+We report the official HuggingFace leaderboard normalized evaluations [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
+<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+<colgroup>
+<col style="width: 10%;">
+<col style="width: 7%;">
+<col style="width: 7%;">
+<col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+</colgroup>
+<thead>
+<tr>
+<th>Benchmark</th>
+<th>Llama-3.1-8B-Instruct</th>
+<th>Qwen2.5-7B-Instruct</th>
+<th>Falcon3-7B-Instruct</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>IFEval</td>
+<td><b>78.56</b></td>
+<td>75.85</td>
+<td>76.12</td>
+</tr>
+<tr>
+<td>BBH (3-shot)</td>
+<td>29.89</td>
+<td>34.89</td>
+<td><b>37.92</b></td>
+</tr>
+<tr>
+<td>MATH Lvl-5 (4-shot)</td>
+<td>19.34</td>
+<td>0.00</td>
+<td><b>31.87</b></td>
+</tr>
+<tr>
+<td>GPQA (0-shot)</td>
+<td>2.35</td>
+<td>5.48</td>
+<td><b>8.05</b></td>
+</tr>
+<tr>
+<td>MUSR (0-shot)</td>
+<td>8.41</td>
+<td>8.45</td>
+<td><b>21.17</b></td>
+</tr>
+<tr>
+<td>MMLU-PRO (5-shot)</td>
+<td>30.68</td>
+<td><b>36.52</b></td>
+<td>34.30</td>
+</tr>
+</tbody>
+</table>
+
+Also, we report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
 - We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.
@@ -231,6 +287,9 @@ We report in the following table our internal pipeline benchmarks.
 </tbody>
 </table>
 
+## Useful links
+- View our [release blogpost](https://huggingface.co/blog/falcon3).
+- Feel free to join [our discord server](https://discord.gg/fwXpMyGc) if you have any questions or to interact with our researchers and developers.
 
 ## Technical Report
 Coming soon....
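The leaderboard table added by this commit lists the six normalized Open LLM Leaderboard scores per model. As a sanity check on the comparison, the per-model averages can be computed directly from the table; note the assumption (hedged, not stated in the diff) that the leaderboard's headline number is the plain mean of these six scores:

```python
# Normalized scores copied verbatim from the table added in this commit,
# in row order: IFEval, BBH, MATH Lvl-5, GPQA, MUSR, MMLU-PRO.
# Assumption: the leaderboard's headline "Average" is the plain mean
# of these six normalized scores.
scores = {
    "Llama-3.1-8B-Instruct": [78.56, 29.89, 19.34, 2.35, 8.41, 30.68],
    "Qwen2.5-7B-Instruct":   [75.85, 34.89, 0.00, 5.48, 8.45, 36.52],
    "Falcon3-7B-Instruct":   [76.12, 37.92, 31.87, 8.05, 21.17, 34.30],
}

# Mean of the six normalized scores for each model.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {avg:.2f}")
```

Under that assumption, Falcon3-7B-Instruct comes out highest on average despite trailing on IFEval and MMLU-PRO, which is presumably the point of the comparison this commit adds.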
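The internal-pipeline bullets retained in the diff describe an lm-evaluation-harness run with the chat template applied and few-shot examples as multi-turn. A minimal sketch of such an invocation is below; the task list, dtype, and batch size are illustrative placeholders, not the authors' actual configuration:

```shell
# Hedged sketch of an lm-evaluation-harness run matching the README bullets:
# chat template applied, few-shot as multi-turn, one fixed batch size.
# Tasks, dtype, and batch size here are assumptions for illustration only.
lm_eval \
  --model hf \
  --model_args pretrained=tiiuae/Falcon3-7B-Instruct,dtype=bfloat16 \
  --tasks mmlu_pro,gpqa \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size 8
```

`--apply_chat_template` and `--fewshot_as_multiturn` are the harness flags corresponding to the two reporting bullets; using the same `--batch_size` for every model matches the third.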