pankajmathur committed
Commit bb8e17e
Parent(s): 8fbc0b2

Update README.md

Files changed (1):
  1. README.md +53 -0
README.md CHANGED
@@ -158,6 +158,59 @@ After that, you can `generate()` again to let the model use the tool result in t
  see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).
 
+ ## Open LLM Leaderboard Evals
+
+ ```
+
+ |Tasks|Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |-----|------:|------|-----:|------|---|-----:|---|-----:|
+ |leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|↑|0.904|±|0.0187|
+ |leaderboard_bbh_causal_judgement|1|none|3|acc_norm|↑|0.6524|±|0.0349|
+ |leaderboard_bbh_date_understanding|1|none|3|acc_norm|↑|0.692|±|0.0293|
+ |leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|↑|0.788|±|0.0259|
+ |leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|↑|0.744|±|0.0277|
+ |leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|↑|0.532|±|0.0316|
+ |leaderboard_bbh_hyperbaton|1|none|3|acc_norm|↑|0.696|±|0.0292|
+ |leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|↑|0.6|±|0.031|
+ |leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|↑|0.58|±|0.0313|
+ |leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|↑|0.944|±|0.0146|
+ |leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|↑|0.808|±|0.025|
+ |leaderboard_bbh_navigate|1|none|3|acc_norm|↑|0.704|±|0.0289|
+ |leaderboard_bbh_object_counting|1|none|3|acc_norm|↑|0.676|±|0.0297|
+ |leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|↑|0.7055|±|0.0379|
+ |leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|↑|0.756|±|0.0272|
+ |leaderboard_bbh_ruin_names|1|none|3|acc_norm|↑|0.872|±|0.0212|
+ |leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|↑|0.656|±|0.0301|
+ |leaderboard_bbh_snarks|1|none|3|acc_norm|↑|0.8034|±|0.0299|
+ |leaderboard_bbh_sports_understanding|1|none|3|acc_norm|↑|0.916|±|0.0176|
+ |leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|↑|0.996|±|0.004|
+ |leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|↑|0.316|±|0.0295|
+ |leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|↑|0.288|±|0.0287|
+ |leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|↑|0.364|±|0.0305|
+ |leaderboard_bbh_web_of_lies|1|none|3|acc_norm|↑|0.628|±|0.0306|
+ |leaderboard_gpqa_diamond|1|none|0|acc_norm|↑|0.3586|±|0.0342|
+ |leaderboard_gpqa_extended|1|none|0|acc_norm|↑|0.4542|±|0.0213|
+ |leaderboard_gpqa_main|1|none|0|acc_norm|↑|0.4799|±|0.0236|
+ |leaderboard_ifeval|3|none|0|inst_level_loose_acc|↑|0.693|±|N/A|
+ |leaderboard_ifeval|3|none|0|inst_level_strict_acc|↑|0.6259|±|N/A|
+ |leaderboard_ifeval|3|none|0|prompt_level_loose_acc|↑|0.5767|±|0.0213|
+ |leaderboard_ifeval|3|none|0|prompt_level_strict_acc|↑|0.4954|±|0.0215|
+ |leaderboard_math_algebra_hard|2|none|4|exact_match|↑|0.5798|±|0.0282|
+ |leaderboard_math_counting_and_prob_hard|2|none|4|exact_match|↑|0.3008|±|0.0415|
+ |leaderboard_math_geometry_hard|2|none|4|exact_match|↑|0.1288|±|0.0293|
+ |leaderboard_math_intermediate_algebra_hard|2|none|4|exact_match|↑|0.1143|±|0.019|
+ |leaderboard_math_num_theory_hard|2|none|4|exact_match|↑|0.3182|±|0.0377|
+ |leaderboard_math_prealgebra_hard|2|none|4|exact_match|↑|0.5492|±|0.0359|
+ |leaderboard_math_precalculus_hard|2|none|4|exact_match|↑|0.2|±|0.0346|
+ |leaderboard_mmlu_pro|0.1|none|5|acc|↑|0.5082|±|0.0046|
+ |leaderboard_musr_murder_mysteries|1|none|0|acc_norm|↑|0.612|±|0.0309|
+ |leaderboard_musr_object_placements|1|none|0|acc_norm|↑|0.2812|±|0.0282|
+ |leaderboard_musr_team_allocation|1|none|0|acc_norm|↑|0.588|±|0.0312|
+ |Average|--|--|--|--|↑|0.5840|±|0.0274|
+
+ ```
+
+
  ## Llama 3.3 Responsibility & Safety
 
  As part of our Responsible release approach, we followed a three-pronged strategy to manage trust & safety risks:
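
For readers who want to regenerate a table in the format added by this commit, below is a minimal sketch using EleutherAI's lm-evaluation-harness, whose `leaderboard` task group bundles the BBH, GPQA, IFEval, MATH-hard, MMLU-Pro, and MuSR tasks listed above. The model id, dtype, and batch size are placeholder assumptions, not values taken from this commit.

```python
# Minimal sketch, assuming EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The model id, dtype, and batch size are placeholders, not taken from this commit.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=<your-model-id>,dtype=bfloat16",  # placeholder model id
    tasks=["leaderboard"],  # Open LLM Leaderboard v2 task group
    batch_size=8,
)

# make_table() renders the same |Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
# layout as the table in the diff above.
print(make_table(results))
```

Note that the n-shot values shown in the table (3 for BBH, 4 for MATH-hard, 5 for MMLU-Pro, 0 elsewhere) come from each leaderboard task's built-in config, so no `num_fewshot` override is needed.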