puneeshkhanna committed
Add HF leaderboard eval comparison

README.md
<br>

## Benchmarks
We report the official normalized evaluation results from the HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) in the following table.
<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
  <colgroup>
    <col style="width: 10%;">
    <col style="width: 7%;">
    <col style="width: 7%;">
    <col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
  </colgroup>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Llama-3.1-8B-Instruct</th>
      <th>Qwen2.5-7B-Instruct</th>
      <th>Falcon3-7B-Instruct</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>IFEval</td>
      <td><b>78.56</b></td>
      <td>75.85</td>
      <td>76.12</td>
    </tr>
    <tr>
      <td>BBH (3-shot)</td>
      <td>29.89</td>
      <td>34.89</td>
      <td><b>37.92</b></td>
    </tr>
    <tr>
      <td>MATH Lvl-5 (4-shot)</td>
      <td>19.34</td>
      <td>0.00</td>
      <td><b>31.87</b></td>
    </tr>
    <tr>
      <td>GPQA (0-shot)</td>
      <td>2.35</td>
      <td>5.48</td>
      <td><b>8.05</b></td>
    </tr>
    <tr>
      <td>MUSR (0-shot)</td>
      <td>8.41</td>
      <td>8.45</td>
      <td><b>21.17</b></td>
    </tr>
    <tr>
      <td>MMLU-PRO (5-shot)</td>
      <td>30.68</td>
      <td><b>36.52</b></td>
      <td>34.30</td>
    </tr>
  </tbody>
</table>

Also, we report our internal pipeline benchmarks in the following table.
- We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
- We use the same batch size across all models.
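The settings above map onto lm-evaluation-harness flags roughly as follows. This is a sketch rather than the exact pipeline command: the task, few-shot count, batch size, and model id (`tiiuae/Falcon3-7B-Instruct`) are illustrative assumptions.

```shell
# Sketch of an lm-evaluation-harness run mirroring the settings above:
# chat template applied, few-shot examples sent as multi-turn messages,
# and a fixed batch size. Task, shot count, batch size, and model id
# are illustrative, not the exact internal configuration.
lm_eval --model hf \
    --model_args pretrained=tiiuae/Falcon3-7B-Instruct,dtype=bfloat16 \
    --tasks mmlu_pro \
    --num_fewshot 5 \
    --batch_size 8 \
    --apply_chat_template \
    --fewshot_as_multiturn
```

Note that `--fewshot_as_multiturn` requires `--apply_chat_template`, since the few-shot examples are delivered as prior conversation turns.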
</tbody>
</table>

## Useful links
- View our [release blogpost](https://huggingface.co/blog/falcon3).
- Feel free to join [our Discord server](https://discord.gg/fwXpMyGc) if you have questions or want to interact with our researchers and developers.

## Technical Report
Coming soon...