---
license: llama3.1
datasets:
- DebateLabKIT/deepa2-conversations
- DebateLabKIT/deep-argmap-conversations
- allenai/tulu-3-sft-mixture
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- logic
- argumentation
- critical-thinking
- argument-mapping
- trl
- sft
model-index:
- name: Llama-3.1-Argunaut-1-8B-SFT
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 55.19
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 27.19
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 11.18
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.47
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 15.85
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 27.47
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT
      name: Open LLM Leaderboard
---


# Model Card for Llama-3.1-Argunaut-1-8B-SFT

This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
It has been trained using [TRL](https://github.com/huggingface/trl).

📘 [HF Blog Article](https://huggingface.co/blog/ggbetz/argunauts-phase-1)

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/ggbetz/Argunauts-1/runs/922q79gp/overview)


## Quick start

```python
from transformers import pipeline

question = "Are you familiar with Argdown syntax? What's its purpose?"
generator = pipeline("text-generation", model="DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
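As the chat transcripts in the Evaluation section show, the model usually wraps Argdown snippets in `argdown` code fences. The following is a minimal sketch for pulling those snippets out of generated text for further processing. The helper name `extract_argdown` is hypothetical and not part of any library, and the fence convention is assumed from the example chats rather than guaranteed.

```python
import re

# Build the fence marker programmatically so this example can itself
# live inside a Markdown code fence.
FENCE = "`" * 3

# Matches an opening fence tagged "argdown", captures everything up to
# the next closing fence (non-greedy, across newlines).
ARGDOWN_BLOCK = re.compile(
    re.escape(FENCE) + r"argdown[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_argdown(text: str) -> list[str]:
    """Return the body of every argdown-fenced block found in `text`.

    Hypothetical helper: indentation inside a block is preserved because
    Argdown uses it to express the map hierarchy.
    """
    return [body.strip("\n") for body in ARGDOWN_BLOCK.findall(text)]


if __name__ == "__main__":
    reply = (
        "The argument can be represented as follows:\n\n"
        + FENCE + "argdown\n"
        + "[Conclusion]: School uniforms increase student safety.\n"
        + "    <+ [Premise]: Crime reports dropped after the policy.\n"
        + FENCE + "\n"
    )
    for block in extract_argdown(reply):
        print(block)
```

Applied to `output["generated_text"]` from the snippet above, this returns an empty list when the model answers in plain prose.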

## Evaluation


### Chat Experience

<details>

<code>

👤 **You**  
Hi, are you familiar with Argdown?

🤖 **Assistant**  
Yes, I'm familiar with Argdown. It's a markup language for reasoning that allows you to write structured text in a natural and readable way. With Argdown, you can create argument maps where sentences are connected by lines that represent different relations (e.g., support, reject, reason). This makes it easy to visualize complex argumentative structures.

Would you like to create an Argdown argument map? Feel free to provide a topic and/or some sentences, and I'll help you set up the map!


👤 **You**  
Can you please map the following text?

School uniforms deter crime and increase student safety.
In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]

One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]

School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]

Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🤖 **Assistant**  
```argdown
School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
```

👤 **You**  
Please summarize, in your own words, the main argument presented in the text. No more than 2-3 sentemces, please.

🤖 **Assistant**  
School uniforms are an effective solution to crime and safety issues in schools, as evidenced by the reduction of incidents in various institutions that implemented uniform policies.


👤 **You**  
Let's focus on this argument. Just for the protocol, what's the conclusion?

🤖 **Assistant**  
The text argues that:  
School uniforms deter crime and increase student safety.

By examining this argument, we can now proceed to reconstruct it. Please provide the next step in this inference guide:



👤 **You**  
Can you reconstruct this argument as premise conclusion structure, using argdown syntax?

🤖 **Assistant**  
The argument can be reconstructed as follows:

```argdown
School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
```

> 🛑 Poor answer
>
> That's NOT a reconstruction in standard form, but just a repetition of the previously given map. 

👤 **You**  
No, I really want you to reconstruct the argument in standard form, listing premises and conclusions.

🤖 **Assistant**    
```argdown
School uniforms deter crime and increase student safety.
    <+ <Subargument 1>
        <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
        <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
        <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
        <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]
```

> 🛑 Poor answer
>
> The assistant fails to correct the answer, despite the user's directions. 

</code>

Another chat:

<code>

👤 **You**  
Hi, are you familiar with Argdown and rendering arguments as premise conclusion structures?

🤖 **Assistant**    
Yes, I am familiar with Argdown and can render arguments as premise-conclusion structures. What would you like to discuss or have me demonstrate?


👤 **You**  
Can you please describe the main argument presented in the following text?

School uniforms deter crime and increase student safety.
In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]

One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]

School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]

Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🤖 **Assistant**    
School uniforms are not only a traditional and respectful way for institutions to create an environment of unity, but they also benefit students' safety and performance. The implementation of school uniforms can lead to a significant reduction in crime rates, improved student behavior, and enhanced safety on school campuses.


👤 **You**  
Can you please reconstruct this argument in standard form (premise-conclusion structure)?

🤖 **Assistant**

```argdown
(1) Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
(2) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
(3) The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
(4) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
-----
(5) Implementing school uniforms is an effective way to deter crime and increase student safety.
```


👤 **You**  
Can you please add intermediary conclusions and add inference rule?

🤖 **Assistant**    
Sure, here are the intermediate steps leading to the final conclusion:

```argdown
Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
The reduction in crime rates is a result of the uniform policy and not just a coincidence.
The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
Implementing school uniforms is an effective way to deter crime and increase student safety.
```


👤 **You**  
Please render the refined reconstruction in Argdown.

🤖 **Assistant**    
The argument can be represented as follows:

```argdown
[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
```

👤 **You**  
And as premise-conclusion structure, with inference info added?

🤖 **Assistant**    
```argdown
[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
```

</code>
</details>



### Metrics


|               |Llama-3.1-8B-Instruct|Argunaut-1-8B-SFT|
|:--|--:|--:|
| ⬇️ **Argdown Bench** | | |
| pass@1        |  80.8 |  98.9 |
| pass@5        |  98.9 |  99.8 |
| graph sim     |   8.4 |  65.5 |
| isomorphic    |   1.3 |  44.0 |
| 🤗 **HF Leaderboard** | | |
|MMLU pro       |  37.6 |  34.5 |
|MUSR           |  40.1 |  41.6 |
|GPQA Diamond   |  32.8 |  30.8 |
|GPQA Main      |  28.5 |  32.1 |
|MATH           |  12.5 |   9.1 |
|BBH            |  54.7 |  48.2 |
| ⛓️ **COT Leaderboard** | | |
| LogiQA        |   5.9 |   1.4 |
| LogiQA2       |  15.5 |   0.8 |
| LSAT-ar       |  11.7 |   3.0 |
| LSAT-lr       |  20.8 |   3.9 | 
| LSAT-rc       |  27.5 |  13.8 |


## SFT dataset mixture

|Dataset|Weight (examples)|Weight (tokens)|
|:------|:----:|:----:|
|DebateLabKIT/deepa2-conversations|25%|49%|
|DebateLabKIT/deep-argmap-conversations|25%|18%|
|allenai/tulu-3-sft-mixture|50%|33%|
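The two weight columns differ because the datasets contain examples of different lengths. The sketch below works through that arithmetic: it converts the example weights into approximate example counts, taking the "1M examples" figure from the training procedure as a round number, and divides token share by example share to estimate how long each dataset's examples are relative to the mixture-wide average. The variable and function names are illustrative only, and the results inherit the rounding of the percentages in the table.

```python
# Approximate total, taken from the "1M examples" statement in this card.
TOTAL_EXAMPLES = 1_000_000

# Shares copied from the table above (fractions of examples / of tokens).
MIXTURE = {
    "DebateLabKIT/deepa2-conversations": {"examples": 0.25, "tokens": 0.49},
    "DebateLabKIT/deep-argmap-conversations": {"examples": 0.25, "tokens": 0.18},
    "allenai/tulu-3-sft-mixture": {"examples": 0.50, "tokens": 0.33},
}


def describe_mixture(mixture: dict, total_examples: int) -> dict:
    """Derive per-dataset example counts and relative example length.

    relative_length = token share / example share, i.e. the dataset's
    average tokens per example divided by the mixture-wide average.
    """
    summary = {}
    for name, share in mixture.items():
        summary[name] = {
            "n_examples": round(total_examples * share["examples"]),
            "relative_length": share["tokens"] / share["examples"],
        }
    return summary


if __name__ == "__main__":
    for name, row in describe_mixture(MIXTURE, TOTAL_EXAMPLES).items():
        print(f"{name}: ~{row['n_examples']:,} examples, "
              f"{row['relative_length']:.2f}x average length")
```

Under these assumptions, deepa2-conversations examples come out at roughly twice the average length, which is why a quarter of the examples accounts for about half of the tokens.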


## Training procedure

Trained with SFT on **1M examples** for 1 epoch, with:

* context length 8196
* packing (trl implementation)
* *spectrum* (top 30 percent)

```yaml
# Training parameters
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-6  # following _Tülu 3_ recipe
lr_scheduler_type: cosine
warmup_ratio: 0.1
```

Hardware: 2 x H100 GPUs.
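For orientation, the batch settings and the hardware can be combined into an effective batch size. This is a back-of-the-envelope sketch that assumes plain data-parallel training with one process per GPU; the launch configuration is not stated in this card, so treat the result as an estimate. With packing, each sequence is filled up to the context length, so the token figure is an upper bound per optimizer step.

```python
# Values copied from the training parameters and hardware stated above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 2            # assumption: one data-parallel worker per H100
context_length = 8196   # as stated in this card

# Sequences contributing to a single optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Upper bound on tokens per optimizer step when sequences are fully packed.
max_tokens_per_step = effective_batch_size * context_length

if __name__ == "__main__":
    print(f"effective batch size: {effective_batch_size} sequences")
    print(f"max tokens per step:  {max_tokens_per_step:,}")
```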

_This work was performed on the HoreKa supercomputer funded by the
Ministry of Science, Research and the Arts Baden-Württemberg and by
the Federal Ministry of Education and Research._

### Framework versions

- TRL: 0.12.1
- Transformers: 4.46.3
- Pytorch: 2.4.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3

## Credits 

This work wouldn't be possible without all the **great contributions from the open LLM community**. Thank you! Special kudos go to 

- @philschmid for his latest [fine-tuning boilerplate](https://www.philschmid.de/fine-tune-llms-in-2025)
- @lvwerra, @lewtun et al. for building and maintaining [trl](https://github.com/huggingface/trl)
- @cognitivecomputations for sharing [spectrum](https://github.com/cognitivecomputations/spectrum/tree/main)



## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/DebateLabKIT__Llama-3.1-Argunaut-1-8B-SFT-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=DebateLabKIT%2FLlama-3.1-Argunaut-1-8B-SFT&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!

|      Metric       |Value (%)|
|-------------------|--------:|
|**Average**        |    23.56|
|IFEval (0-Shot)    |    55.19|
|BBH (3-Shot)       |    27.19|
|MATH Lvl 5 (4-Shot)|    11.18|
|GPQA (0-shot)      |     4.47|
|MuSR (0-shot)      |    15.85|
|MMLU-PRO (5-shot)  |    27.47|
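The **Average** row is consistent with an unweighted mean of the six benchmark scores. A quick check, using only the values from the table above (the variable names are illustrative):

```python
# Scores copied from the leaderboard table above, in percent.
scores = {
    "IFEval (0-Shot)": 55.19,
    "BBH (3-Shot)": 27.19,
    "MATH Lvl 5 (4-Shot)": 11.18,
    "GPQA (0-shot)": 4.47,
    "MuSR (0-shot)": 15.85,
    "MMLU-PRO (5-shot)": 27.47,
}

# Unweighted mean over all benchmarks, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)

if __name__ == "__main__":
    print(f"Average: {average}")
```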