lewtun (HF staff) committed
Commit d4faf91 · verified · 1 Parent(s): 33b7bab

Upload eval_results/Qwen/Qwen1.5-0.5B-Chat/main/mmlu/results_2024-02-28T07-45-42.141224.json with huggingface_hub
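Commits like this one are typically produced with huggingface_hub's file-upload API. A minimal sketch of how such a results file could be pushed, assuming the repo layout implied by the path above (`eval_results/<model>/<revision>/<task>/results_<timestamp>.json`) and a dataset-type target repo; the repo ID and helper names here are illustrative, not taken from the commit itself:

```python
def results_path_in_repo(model_name: str, revision: str, task: str, timestamp: str) -> str:
    # Mirror the layout seen in this repo:
    # eval_results/<model>/<revision>/<task>/results_<timestamp>.json
    return f"eval_results/{model_name}/{revision}/{task}/results_{timestamp}.json"

def push_results(local_file: str, repo_id: str, model_name: str,
                 revision: str, task: str, timestamp: str) -> None:
    # Imported lazily so the path helper above stays usable without the library.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token from the local HF credentials
    api.upload_file(
        path_or_fileobj=local_file,
        path_in_repo=results_path_in_repo(model_name, revision, task, timestamp),
        repo_id=repo_id,          # e.g. the eval-results dataset repo (assumed)
        repo_type="dataset",
    )
```

`upload_file` creates a single commit per call, which matches the one-file commit shown here.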

eval_results/Qwen/Qwen1.5-0.5B-Chat/main/mmlu/results_2024-02-28T07-45-42.141224.json ADDED
@@ -0,0 +1,2835 @@
+ {
+ "config_general": {
+ "lighteval_sha": "?",
+ "num_fewshot_seeds": 1,
+ "override_batch_size": 1,
+ "max_samples": null,
+ "job_id": "",
+ "start_time": 2507345.062565505,
+ "end_time": 2508008.844386519,
+ "total_evaluation_time_secondes": "663.7818210138939",
+ "model_name": "Qwen/Qwen1.5-0.5B-Chat",
+ "model_sha": "7f630fd18dccab574ab1b78411a8753f989a55ac",
+ "model_dtype": "torch.bfloat16",
+ "model_size": "1.05 GB",
+ "config": null
+ },
+ "results": {
+ "lighteval|mmlu:abstract_algebra|5": {
+ "acc": 0.26,
+ "acc_stderr": 0.0440844002276808
+ },
+ "lighteval|mmlu:anatomy|5": {
+ "acc": 0.31851851851851853,
+ "acc_stderr": 0.04024778401977111
+ },
+ "lighteval|mmlu:astronomy|5": {
+ "acc": 0.34210526315789475,
+ "acc_stderr": 0.038607315993160925
+ },
+ "lighteval|mmlu:business_ethics|5": {
+ "acc": 0.5,
+ "acc_stderr": 0.050251890762960605
+ },
+ "lighteval|mmlu:clinical_knowledge|5": {
+ "acc": 0.3622641509433962,
+ "acc_stderr": 0.0295822451283843
+ },
+ "lighteval|mmlu:college_biology|5": {
+ "acc": 0.3194444444444444,
+ "acc_stderr": 0.038990736873573344
+ },
+ "lighteval|mmlu:college_chemistry|5": {
+ "acc": 0.36,
+ "acc_stderr": 0.048241815132442176
+ },
+ "lighteval|mmlu:college_computer_science|5": {
+ "acc": 0.37,
+ "acc_stderr": 0.048523658709391
+ },
+ "lighteval|mmlu:college_mathematics|5": {
+ "acc": 0.33,
+ "acc_stderr": 0.047258156262526045
+ },
+ "lighteval|mmlu:college_medicine|5": {
+ "acc": 0.3468208092485549,
+ "acc_stderr": 0.036291466701596636
+ },
+ "lighteval|mmlu:college_physics|5": {
+ "acc": 0.29411764705882354,
+ "acc_stderr": 0.04533838195929775
+ },
+ "lighteval|mmlu:computer_security|5": {
+ "acc": 0.41,
+ "acc_stderr": 0.049431107042371025
+ },
+ "lighteval|mmlu:conceptual_physics|5": {
+ "acc": 0.2680851063829787,
+ "acc_stderr": 0.028957342788342347
+ },
+ "lighteval|mmlu:econometrics|5": {
+ "acc": 0.2894736842105263,
+ "acc_stderr": 0.04266339443159393
+ },
+ "lighteval|mmlu:electrical_engineering|5": {
+ "acc": 0.4,
+ "acc_stderr": 0.04082482904638628
+ },
+ "lighteval|mmlu:elementary_mathematics|5": {
+ "acc": 0.2830687830687831,
+ "acc_stderr": 0.023201392938194978
+ },
+ "lighteval|mmlu:formal_logic|5": {
+ "acc": 0.25396825396825395,
+ "acc_stderr": 0.03893259610604671
+ },
+ "lighteval|mmlu:global_facts|5": {
+ "acc": 0.31,
+ "acc_stderr": 0.04648231987117316
+ },
+ "lighteval|mmlu:high_school_biology|5": {
+ "acc": 0.3870967741935484,
+ "acc_stderr": 0.02770935967503249
+ },
+ "lighteval|mmlu:high_school_chemistry|5": {
+ "acc": 0.3103448275862069,
+ "acc_stderr": 0.03255086769970103
+ },
+ "lighteval|mmlu:high_school_computer_science|5": {
+ "acc": 0.33,
+ "acc_stderr": 0.04725815626252605
+ },
+ "lighteval|mmlu:high_school_european_history|5": {
+ "acc": 0.48484848484848486,
+ "acc_stderr": 0.03902551007374449
+ },
+ "lighteval|mmlu:high_school_geography|5": {
+ "acc": 0.5050505050505051,
+ "acc_stderr": 0.035621707606254015
+ },
+ "lighteval|mmlu:high_school_government_and_politics|5": {
+ "acc": 0.41968911917098445,
+ "acc_stderr": 0.035615873276858834
+ },
+ "lighteval|mmlu:high_school_macroeconomics|5": {
+ "acc": 0.3717948717948718,
+ "acc_stderr": 0.02450347255711094
+ },
+ "lighteval|mmlu:high_school_mathematics|5": {
+ "acc": 0.25925925925925924,
+ "acc_stderr": 0.026719240783712166
+ },
+ "lighteval|mmlu:high_school_microeconomics|5": {
+ "acc": 0.3739495798319328,
+ "acc_stderr": 0.031429466378837076
+ },
+ "lighteval|mmlu:high_school_physics|5": {
+ "acc": 0.2847682119205298,
+ "acc_stderr": 0.03684881521389023
+ },
+ "lighteval|mmlu:high_school_psychology|5": {
+ "acc": 0.3889908256880734,
+ "acc_stderr": 0.020902300887392866
+ },
+ "lighteval|mmlu:high_school_statistics|5": {
+ "acc": 0.3194444444444444,
+ "acc_stderr": 0.0317987634217685
+ },
+ "lighteval|mmlu:high_school_us_history|5": {
+ "acc": 0.4411764705882353,
+ "acc_stderr": 0.034849415144292316
+ },
+ "lighteval|mmlu:high_school_world_history|5": {
+ "acc": 0.5021097046413502,
+ "acc_stderr": 0.032546938018020076
+ },
+ "lighteval|mmlu:human_aging|5": {
+ "acc": 0.3273542600896861,
+ "acc_stderr": 0.03149384670994131
+ },
+ "lighteval|mmlu:human_sexuality|5": {
+ "acc": 0.37404580152671757,
+ "acc_stderr": 0.042438692422305246
+ },
+ "lighteval|mmlu:international_law|5": {
+ "acc": 0.5289256198347108,
+ "acc_stderr": 0.04556710331269498
+ },
+ "lighteval|mmlu:jurisprudence|5": {
+ "acc": 0.4074074074074074,
+ "acc_stderr": 0.04750077341199985
+ },
+ "lighteval|mmlu:logical_fallacies|5": {
+ "acc": 0.34355828220858897,
+ "acc_stderr": 0.03731133519673893
+ },
+ "lighteval|mmlu:machine_learning|5": {
+ "acc": 0.2767857142857143,
+ "acc_stderr": 0.042466243366976256
+ },
+ "lighteval|mmlu:management|5": {
+ "acc": 0.4563106796116505,
+ "acc_stderr": 0.04931801994220414
+ },
+ "lighteval|mmlu:marketing|5": {
+ "acc": 0.5,
+ "acc_stderr": 0.03275608910402091
+ },
+ "lighteval|mmlu:medical_genetics|5": {
+ "acc": 0.37,
+ "acc_stderr": 0.04852365870939098
+ },
+ "lighteval|mmlu:miscellaneous|5": {
+ "acc": 0.38697318007662834,
+ "acc_stderr": 0.017417138059440132
+ },
+ "lighteval|mmlu:moral_disputes|5": {
+ "acc": 0.43641618497109824,
+ "acc_stderr": 0.026700545424943687
+ },
+ "lighteval|mmlu:moral_scenarios|5": {
+ "acc": 0.24134078212290502,
+ "acc_stderr": 0.014310999547961441
+ },
+ "lighteval|mmlu:nutrition|5": {
+ "acc": 0.42483660130718953,
+ "acc_stderr": 0.028304576673141124
+ },
+ "lighteval|mmlu:philosophy|5": {
+ "acc": 0.3987138263665595,
+ "acc_stderr": 0.0278093225857745
+ },
+ "lighteval|mmlu:prehistory|5": {
+ "acc": 0.3734567901234568,
+ "acc_stderr": 0.026915003011380157
+ },
+ "lighteval|mmlu:professional_accounting|5": {
+ "acc": 0.2907801418439716,
+ "acc_stderr": 0.027090664368353178
+ },
+ "lighteval|mmlu:professional_law|5": {
+ "acc": 0.33116036505867014,
+ "acc_stderr": 0.01202012819598575
+ },
+ "lighteval|mmlu:professional_medicine|5": {
+ "acc": 0.3860294117647059,
+ "acc_stderr": 0.029573269134411124
+ },
+ "lighteval|mmlu:professional_psychology|5": {
+ "acc": 0.3562091503267974,
+ "acc_stderr": 0.019373332420724493
+ },
+ "lighteval|mmlu:public_relations|5": {
+ "acc": 0.42727272727272725,
+ "acc_stderr": 0.04738198703545483
+ },
+ "lighteval|mmlu:security_studies|5": {
+ "acc": 0.2693877551020408,
+ "acc_stderr": 0.02840125202902294
+ },
+ "lighteval|mmlu:sociology|5": {
+ "acc": 0.5174129353233831,
+ "acc_stderr": 0.03533389234739245
+ },
+ "lighteval|mmlu:us_foreign_policy|5": {
+ "acc": 0.54,
+ "acc_stderr": 0.05009082659620333
+ },
+ "lighteval|mmlu:virology|5": {
+ "acc": 0.3433734939759036,
+ "acc_stderr": 0.036965843170106004
+ },
+ "lighteval|mmlu:world_religions|5": {
+ "acc": 0.2807017543859649,
+ "acc_stderr": 0.034462962170884265
+ },
+ "lighteval|mmlu:_average|5": {
+ "acc": 0.3681551334211768,
+ "acc_stderr": 0.035698565367394484
+ }
+ },
+ "versions": {
+ "lighteval|mmlu:abstract_algebra|5": 0,
+ "lighteval|mmlu:anatomy|5": 0,
+ "lighteval|mmlu:astronomy|5": 0,
+ "lighteval|mmlu:business_ethics|5": 0,
+ "lighteval|mmlu:clinical_knowledge|5": 0,
+ "lighteval|mmlu:college_biology|5": 0,
+ "lighteval|mmlu:college_chemistry|5": 0,
+ "lighteval|mmlu:college_computer_science|5": 0,
+ "lighteval|mmlu:college_mathematics|5": 0,
+ "lighteval|mmlu:college_medicine|5": 0,
+ "lighteval|mmlu:college_physics|5": 0,
+ "lighteval|mmlu:computer_security|5": 0,
+ "lighteval|mmlu:conceptual_physics|5": 0,
+ "lighteval|mmlu:econometrics|5": 0,
+ "lighteval|mmlu:electrical_engineering|5": 0,
+ "lighteval|mmlu:elementary_mathematics|5": 0,
+ "lighteval|mmlu:formal_logic|5": 0,
+ "lighteval|mmlu:global_facts|5": 0,
+ "lighteval|mmlu:high_school_biology|5": 0,
+ "lighteval|mmlu:high_school_chemistry|5": 0,
+ "lighteval|mmlu:high_school_computer_science|5": 0,
+ "lighteval|mmlu:high_school_european_history|5": 0,
+ "lighteval|mmlu:high_school_geography|5": 0,
+ "lighteval|mmlu:high_school_government_and_politics|5": 0,
+ "lighteval|mmlu:high_school_macroeconomics|5": 0,
+ "lighteval|mmlu:high_school_mathematics|5": 0,
+ "lighteval|mmlu:high_school_microeconomics|5": 0,
+ "lighteval|mmlu:high_school_physics|5": 0,
+ "lighteval|mmlu:high_school_psychology|5": 0,
+ "lighteval|mmlu:high_school_statistics|5": 0,
+ "lighteval|mmlu:high_school_us_history|5": 0,
+ "lighteval|mmlu:high_school_world_history|5": 0,
+ "lighteval|mmlu:human_aging|5": 0,
+ "lighteval|mmlu:human_sexuality|5": 0,
+ "lighteval|mmlu:international_law|5": 0,
+ "lighteval|mmlu:jurisprudence|5": 0,
+ "lighteval|mmlu:logical_fallacies|5": 0,
+ "lighteval|mmlu:machine_learning|5": 0,
+ "lighteval|mmlu:management|5": 0,
+ "lighteval|mmlu:marketing|5": 0,
+ "lighteval|mmlu:medical_genetics|5": 0,
+ "lighteval|mmlu:miscellaneous|5": 0,
+ "lighteval|mmlu:moral_disputes|5": 0,
+ "lighteval|mmlu:moral_scenarios|5": 0,
+ "lighteval|mmlu:nutrition|5": 0,
+ "lighteval|mmlu:philosophy|5": 0,
+ "lighteval|mmlu:prehistory|5": 0,
+ "lighteval|mmlu:professional_accounting|5": 0,
+ "lighteval|mmlu:professional_law|5": 0,
+ "lighteval|mmlu:professional_medicine|5": 0,
+ "lighteval|mmlu:professional_psychology|5": 0,
+ "lighteval|mmlu:public_relations|5": 0,
+ "lighteval|mmlu:security_studies|5": 0,
+ "lighteval|mmlu:sociology|5": 0,
+ "lighteval|mmlu:us_foreign_policy|5": 0,
+ "lighteval|mmlu:virology|5": 0,
+ "lighteval|mmlu:world_religions|5": 0
+ },
+ "config_tasks": {
+ "lighteval|mmlu:abstract_algebra": {
+ "name": "mmlu:abstract_algebra",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "abstract_algebra",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:anatomy": {
+ "name": "mmlu:anatomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "anatomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:astronomy": {
+ "name": "mmlu:astronomy",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "astronomy",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:business_ethics": {
+ "name": "mmlu:business_ethics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "business_ethics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:clinical_knowledge": {
+ "name": "mmlu:clinical_knowledge",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "clinical_knowledge",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_biology": {
+ "name": "mmlu:college_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_chemistry": {
+ "name": "mmlu:college_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_computer_science": {
+ "name": "mmlu:college_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_mathematics": {
+ "name": "mmlu:college_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_medicine": {
+ "name": "mmlu:college_medicine",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_medicine",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:college_physics": {
+ "name": "mmlu:college_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "college_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:computer_security": {
+ "name": "mmlu:computer_security",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "computer_security",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:conceptual_physics": {
+ "name": "mmlu:conceptual_physics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "conceptual_physics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:econometrics": {
+ "name": "mmlu:econometrics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "econometrics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:electrical_engineering": {
+ "name": "mmlu:electrical_engineering",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "electrical_engineering",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:elementary_mathematics": {
+ "name": "mmlu:elementary_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "elementary_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:formal_logic": {
+ "name": "mmlu:formal_logic",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "formal_logic",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:global_facts": {
+ "name": "mmlu:global_facts",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "global_facts",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_biology": {
+ "name": "mmlu:high_school_biology",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_biology",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_chemistry": {
+ "name": "mmlu:high_school_chemistry",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_chemistry",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_computer_science": {
+ "name": "mmlu:high_school_computer_science",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_computer_science",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_european_history": {
+ "name": "mmlu:high_school_european_history",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_european_history",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_geography": {
+ "name": "mmlu:high_school_geography",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_geography",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_government_and_politics": {
+ "name": "mmlu:high_school_government_and_politics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_government_and_politics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_macroeconomics": {
+ "name": "mmlu:high_school_macroeconomics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_macroeconomics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
+ ],
+ "output_regex": null,
+ "frozen": false,
+ "suite": [
+ "lighteval",
+ "mmlu"
+ ]
+ },
+ "lighteval|mmlu:high_school_mathematics": {
+ "name": "mmlu:high_school_mathematics",
+ "prompt_function": "mmlu_harness",
+ "hf_repo": "lighteval/mmlu",
+ "hf_subset": "high_school_mathematics",
+ "metric": [
+ "loglikelihood_acc"
+ ],
+ "hf_avail_splits": [
+ "auxiliary_train",
+ "test",
+ "validation",
+ "dev"
+ ],
+ "evaluation_splits": [
+ "test"
+ ],
+ "few_shots_split": "dev",
+ "few_shots_select": "sequential",
+ "generation_size": 1,
+ "stop_sequence": [
+ "\n"
1083
+ ],
1084
+ "output_regex": null,
1085
+ "frozen": false,
1086
+ "suite": [
1087
+ "lighteval",
1088
+ "mmlu"
1089
+ ]
1090
+ },
1091
+ "lighteval|mmlu:high_school_microeconomics": {
1092
+ "name": "mmlu:high_school_microeconomics",
1093
+ "prompt_function": "mmlu_harness",
1094
+ "hf_repo": "lighteval/mmlu",
1095
+ "hf_subset": "high_school_microeconomics",
1096
+ "metric": [
1097
+ "loglikelihood_acc"
1098
+ ],
1099
+ "hf_avail_splits": [
1100
+ "auxiliary_train",
1101
+ "test",
1102
+ "validation",
1103
+ "dev"
1104
+ ],
1105
+ "evaluation_splits": [
1106
+ "test"
1107
+ ],
1108
+ "few_shots_split": "dev",
1109
+ "few_shots_select": "sequential",
1110
+ "generation_size": 1,
1111
+ "stop_sequence": [
1112
+ "\n"
1113
+ ],
1114
+ "output_regex": null,
1115
+ "frozen": false,
1116
+ "suite": [
1117
+ "lighteval",
1118
+ "mmlu"
1119
+ ]
1120
+ },
1121
+ "lighteval|mmlu:high_school_physics": {
1122
+ "name": "mmlu:high_school_physics",
1123
+ "prompt_function": "mmlu_harness",
1124
+ "hf_repo": "lighteval/mmlu",
1125
+ "hf_subset": "high_school_physics",
1126
+ "metric": [
1127
+ "loglikelihood_acc"
1128
+ ],
1129
+ "hf_avail_splits": [
1130
+ "auxiliary_train",
1131
+ "test",
1132
+ "validation",
1133
+ "dev"
1134
+ ],
1135
+ "evaluation_splits": [
1136
+ "test"
1137
+ ],
1138
+ "few_shots_split": "dev",
1139
+ "few_shots_select": "sequential",
1140
+ "generation_size": 1,
1141
+ "stop_sequence": [
1142
+ "\n"
1143
+ ],
1144
+ "output_regex": null,
1145
+ "frozen": false,
1146
+ "suite": [
1147
+ "lighteval",
1148
+ "mmlu"
1149
+ ]
1150
+ },
1151
+ "lighteval|mmlu:high_school_psychology": {
1152
+ "name": "mmlu:high_school_psychology",
1153
+ "prompt_function": "mmlu_harness",
1154
+ "hf_repo": "lighteval/mmlu",
1155
+ "hf_subset": "high_school_psychology",
1156
+ "metric": [
1157
+ "loglikelihood_acc"
1158
+ ],
1159
+ "hf_avail_splits": [
1160
+ "auxiliary_train",
1161
+ "test",
1162
+ "validation",
1163
+ "dev"
1164
+ ],
1165
+ "evaluation_splits": [
1166
+ "test"
1167
+ ],
1168
+ "few_shots_split": "dev",
1169
+ "few_shots_select": "sequential",
1170
+ "generation_size": 1,
1171
+ "stop_sequence": [
1172
+ "\n"
1173
+ ],
1174
+ "output_regex": null,
1175
+ "frozen": false,
1176
+ "suite": [
1177
+ "lighteval",
1178
+ "mmlu"
1179
+ ]
1180
+ },
1181
+ "lighteval|mmlu:high_school_statistics": {
1182
+ "name": "mmlu:high_school_statistics",
1183
+ "prompt_function": "mmlu_harness",
1184
+ "hf_repo": "lighteval/mmlu",
1185
+ "hf_subset": "high_school_statistics",
1186
+ "metric": [
1187
+ "loglikelihood_acc"
1188
+ ],
1189
+ "hf_avail_splits": [
1190
+ "auxiliary_train",
1191
+ "test",
1192
+ "validation",
1193
+ "dev"
1194
+ ],
1195
+ "evaluation_splits": [
1196
+ "test"
1197
+ ],
1198
+ "few_shots_split": "dev",
1199
+ "few_shots_select": "sequential",
1200
+ "generation_size": 1,
1201
+ "stop_sequence": [
1202
+ "\n"
1203
+ ],
1204
+ "output_regex": null,
1205
+ "frozen": false,
1206
+ "suite": [
1207
+ "lighteval",
1208
+ "mmlu"
1209
+ ]
1210
+ },
1211
+ "lighteval|mmlu:high_school_us_history": {
1212
+ "name": "mmlu:high_school_us_history",
1213
+ "prompt_function": "mmlu_harness",
1214
+ "hf_repo": "lighteval/mmlu",
1215
+ "hf_subset": "high_school_us_history",
1216
+ "metric": [
1217
+ "loglikelihood_acc"
1218
+ ],
1219
+ "hf_avail_splits": [
1220
+ "auxiliary_train",
1221
+ "test",
1222
+ "validation",
1223
+ "dev"
1224
+ ],
1225
+ "evaluation_splits": [
1226
+ "test"
1227
+ ],
1228
+ "few_shots_split": "dev",
1229
+ "few_shots_select": "sequential",
1230
+ "generation_size": 1,
1231
+ "stop_sequence": [
1232
+ "\n"
1233
+ ],
1234
+ "output_regex": null,
1235
+ "frozen": false,
1236
+ "suite": [
1237
+ "lighteval",
1238
+ "mmlu"
1239
+ ]
1240
+ },
1241
+ "lighteval|mmlu:high_school_world_history": {
1242
+ "name": "mmlu:high_school_world_history",
1243
+ "prompt_function": "mmlu_harness",
1244
+ "hf_repo": "lighteval/mmlu",
1245
+ "hf_subset": "high_school_world_history",
1246
+ "metric": [
1247
+ "loglikelihood_acc"
1248
+ ],
1249
+ "hf_avail_splits": [
1250
+ "auxiliary_train",
1251
+ "test",
1252
+ "validation",
1253
+ "dev"
1254
+ ],
1255
+ "evaluation_splits": [
1256
+ "test"
1257
+ ],
1258
+ "few_shots_split": "dev",
1259
+ "few_shots_select": "sequential",
1260
+ "generation_size": 1,
1261
+ "stop_sequence": [
1262
+ "\n"
1263
+ ],
1264
+ "output_regex": null,
1265
+ "frozen": false,
1266
+ "suite": [
1267
+ "lighteval",
1268
+ "mmlu"
1269
+ ]
1270
+ },
1271
+ "lighteval|mmlu:human_aging": {
1272
+ "name": "mmlu:human_aging",
1273
+ "prompt_function": "mmlu_harness",
1274
+ "hf_repo": "lighteval/mmlu",
1275
+ "hf_subset": "human_aging",
1276
+ "metric": [
1277
+ "loglikelihood_acc"
1278
+ ],
1279
+ "hf_avail_splits": [
1280
+ "auxiliary_train",
1281
+ "test",
1282
+ "validation",
1283
+ "dev"
1284
+ ],
1285
+ "evaluation_splits": [
1286
+ "test"
1287
+ ],
1288
+ "few_shots_split": "dev",
1289
+ "few_shots_select": "sequential",
1290
+ "generation_size": 1,
1291
+ "stop_sequence": [
1292
+ "\n"
1293
+ ],
1294
+ "output_regex": null,
1295
+ "frozen": false,
1296
+ "suite": [
1297
+ "lighteval",
1298
+ "mmlu"
1299
+ ]
1300
+ },
1301
+ "lighteval|mmlu:human_sexuality": {
1302
+ "name": "mmlu:human_sexuality",
1303
+ "prompt_function": "mmlu_harness",
1304
+ "hf_repo": "lighteval/mmlu",
1305
+ "hf_subset": "human_sexuality",
1306
+ "metric": [
1307
+ "loglikelihood_acc"
1308
+ ],
1309
+ "hf_avail_splits": [
1310
+ "auxiliary_train",
1311
+ "test",
1312
+ "validation",
1313
+ "dev"
1314
+ ],
1315
+ "evaluation_splits": [
1316
+ "test"
1317
+ ],
1318
+ "few_shots_split": "dev",
1319
+ "few_shots_select": "sequential",
1320
+ "generation_size": 1,
1321
+ "stop_sequence": [
1322
+ "\n"
1323
+ ],
1324
+ "output_regex": null,
1325
+ "frozen": false,
1326
+ "suite": [
1327
+ "lighteval",
1328
+ "mmlu"
1329
+ ]
1330
+ },
1331
+ "lighteval|mmlu:international_law": {
1332
+ "name": "mmlu:international_law",
1333
+ "prompt_function": "mmlu_harness",
1334
+ "hf_repo": "lighteval/mmlu",
1335
+ "hf_subset": "international_law",
1336
+ "metric": [
1337
+ "loglikelihood_acc"
1338
+ ],
1339
+ "hf_avail_splits": [
1340
+ "auxiliary_train",
1341
+ "test",
1342
+ "validation",
1343
+ "dev"
1344
+ ],
1345
+ "evaluation_splits": [
1346
+ "test"
1347
+ ],
1348
+ "few_shots_split": "dev",
1349
+ "few_shots_select": "sequential",
1350
+ "generation_size": 1,
1351
+ "stop_sequence": [
1352
+ "\n"
1353
+ ],
1354
+ "output_regex": null,
1355
+ "frozen": false,
1356
+ "suite": [
1357
+ "lighteval",
1358
+ "mmlu"
1359
+ ]
1360
+ },
1361
+ "lighteval|mmlu:jurisprudence": {
1362
+ "name": "mmlu:jurisprudence",
1363
+ "prompt_function": "mmlu_harness",
1364
+ "hf_repo": "lighteval/mmlu",
1365
+ "hf_subset": "jurisprudence",
1366
+ "metric": [
1367
+ "loglikelihood_acc"
1368
+ ],
1369
+ "hf_avail_splits": [
1370
+ "auxiliary_train",
1371
+ "test",
1372
+ "validation",
1373
+ "dev"
1374
+ ],
1375
+ "evaluation_splits": [
1376
+ "test"
1377
+ ],
1378
+ "few_shots_split": "dev",
1379
+ "few_shots_select": "sequential",
1380
+ "generation_size": 1,
1381
+ "stop_sequence": [
1382
+ "\n"
1383
+ ],
1384
+ "output_regex": null,
1385
+ "frozen": false,
1386
+ "suite": [
1387
+ "lighteval",
1388
+ "mmlu"
1389
+ ]
1390
+ },
1391
+ "lighteval|mmlu:logical_fallacies": {
1392
+ "name": "mmlu:logical_fallacies",
1393
+ "prompt_function": "mmlu_harness",
1394
+ "hf_repo": "lighteval/mmlu",
1395
+ "hf_subset": "logical_fallacies",
1396
+ "metric": [
1397
+ "loglikelihood_acc"
1398
+ ],
1399
+ "hf_avail_splits": [
1400
+ "auxiliary_train",
1401
+ "test",
1402
+ "validation",
1403
+ "dev"
1404
+ ],
1405
+ "evaluation_splits": [
1406
+ "test"
1407
+ ],
1408
+ "few_shots_split": "dev",
1409
+ "few_shots_select": "sequential",
1410
+ "generation_size": 1,
1411
+ "stop_sequence": [
1412
+ "\n"
1413
+ ],
1414
+ "output_regex": null,
1415
+ "frozen": false,
1416
+ "suite": [
1417
+ "lighteval",
1418
+ "mmlu"
1419
+ ]
1420
+ },
1421
+ "lighteval|mmlu:machine_learning": {
1422
+ "name": "mmlu:machine_learning",
1423
+ "prompt_function": "mmlu_harness",
1424
+ "hf_repo": "lighteval/mmlu",
1425
+ "hf_subset": "machine_learning",
1426
+ "metric": [
1427
+ "loglikelihood_acc"
1428
+ ],
1429
+ "hf_avail_splits": [
1430
+ "auxiliary_train",
1431
+ "test",
1432
+ "validation",
1433
+ "dev"
1434
+ ],
1435
+ "evaluation_splits": [
1436
+ "test"
1437
+ ],
1438
+ "few_shots_split": "dev",
1439
+ "few_shots_select": "sequential",
1440
+ "generation_size": 1,
1441
+ "stop_sequence": [
1442
+ "\n"
1443
+ ],
1444
+ "output_regex": null,
1445
+ "frozen": false,
1446
+ "suite": [
1447
+ "lighteval",
1448
+ "mmlu"
1449
+ ]
1450
+ },
1451
+ "lighteval|mmlu:management": {
1452
+ "name": "mmlu:management",
1453
+ "prompt_function": "mmlu_harness",
1454
+ "hf_repo": "lighteval/mmlu",
1455
+ "hf_subset": "management",
1456
+ "metric": [
1457
+ "loglikelihood_acc"
1458
+ ],
1459
+ "hf_avail_splits": [
1460
+ "auxiliary_train",
1461
+ "test",
1462
+ "validation",
1463
+ "dev"
1464
+ ],
1465
+ "evaluation_splits": [
1466
+ "test"
1467
+ ],
1468
+ "few_shots_split": "dev",
1469
+ "few_shots_select": "sequential",
1470
+ "generation_size": 1,
1471
+ "stop_sequence": [
1472
+ "\n"
1473
+ ],
1474
+ "output_regex": null,
1475
+ "frozen": false,
1476
+ "suite": [
1477
+ "lighteval",
1478
+ "mmlu"
1479
+ ]
1480
+ },
1481
+ "lighteval|mmlu:marketing": {
1482
+ "name": "mmlu:marketing",
1483
+ "prompt_function": "mmlu_harness",
1484
+ "hf_repo": "lighteval/mmlu",
1485
+ "hf_subset": "marketing",
1486
+ "metric": [
1487
+ "loglikelihood_acc"
1488
+ ],
1489
+ "hf_avail_splits": [
1490
+ "auxiliary_train",
1491
+ "test",
1492
+ "validation",
1493
+ "dev"
1494
+ ],
1495
+ "evaluation_splits": [
1496
+ "test"
1497
+ ],
1498
+ "few_shots_split": "dev",
1499
+ "few_shots_select": "sequential",
1500
+ "generation_size": 1,
1501
+ "stop_sequence": [
1502
+ "\n"
1503
+ ],
1504
+ "output_regex": null,
1505
+ "frozen": false,
1506
+ "suite": [
1507
+ "lighteval",
1508
+ "mmlu"
1509
+ ]
1510
+ },
1511
+ "lighteval|mmlu:medical_genetics": {
1512
+ "name": "mmlu:medical_genetics",
1513
+ "prompt_function": "mmlu_harness",
1514
+ "hf_repo": "lighteval/mmlu",
1515
+ "hf_subset": "medical_genetics",
1516
+ "metric": [
1517
+ "loglikelihood_acc"
1518
+ ],
1519
+ "hf_avail_splits": [
1520
+ "auxiliary_train",
1521
+ "test",
1522
+ "validation",
1523
+ "dev"
1524
+ ],
1525
+ "evaluation_splits": [
1526
+ "test"
1527
+ ],
1528
+ "few_shots_split": "dev",
1529
+ "few_shots_select": "sequential",
1530
+ "generation_size": 1,
1531
+ "stop_sequence": [
1532
+ "\n"
1533
+ ],
1534
+ "output_regex": null,
1535
+ "frozen": false,
1536
+ "suite": [
1537
+ "lighteval",
1538
+ "mmlu"
1539
+ ]
1540
+ },
1541
+ "lighteval|mmlu:miscellaneous": {
1542
+ "name": "mmlu:miscellaneous",
1543
+ "prompt_function": "mmlu_harness",
1544
+ "hf_repo": "lighteval/mmlu",
1545
+ "hf_subset": "miscellaneous",
1546
+ "metric": [
1547
+ "loglikelihood_acc"
1548
+ ],
1549
+ "hf_avail_splits": [
1550
+ "auxiliary_train",
1551
+ "test",
1552
+ "validation",
1553
+ "dev"
1554
+ ],
1555
+ "evaluation_splits": [
1556
+ "test"
1557
+ ],
1558
+ "few_shots_split": "dev",
1559
+ "few_shots_select": "sequential",
1560
+ "generation_size": 1,
1561
+ "stop_sequence": [
1562
+ "\n"
1563
+ ],
1564
+ "output_regex": null,
1565
+ "frozen": false,
1566
+ "suite": [
1567
+ "lighteval",
1568
+ "mmlu"
1569
+ ]
1570
+ },
1571
+ "lighteval|mmlu:moral_disputes": {
1572
+ "name": "mmlu:moral_disputes",
1573
+ "prompt_function": "mmlu_harness",
1574
+ "hf_repo": "lighteval/mmlu",
1575
+ "hf_subset": "moral_disputes",
1576
+ "metric": [
1577
+ "loglikelihood_acc"
1578
+ ],
1579
+ "hf_avail_splits": [
1580
+ "auxiliary_train",
1581
+ "test",
1582
+ "validation",
1583
+ "dev"
1584
+ ],
1585
+ "evaluation_splits": [
1586
+ "test"
1587
+ ],
1588
+ "few_shots_split": "dev",
1589
+ "few_shots_select": "sequential",
1590
+ "generation_size": 1,
1591
+ "stop_sequence": [
1592
+ "\n"
1593
+ ],
1594
+ "output_regex": null,
1595
+ "frozen": false,
1596
+ "suite": [
1597
+ "lighteval",
1598
+ "mmlu"
1599
+ ]
1600
+ },
1601
+ "lighteval|mmlu:moral_scenarios": {
1602
+ "name": "mmlu:moral_scenarios",
1603
+ "prompt_function": "mmlu_harness",
1604
+ "hf_repo": "lighteval/mmlu",
1605
+ "hf_subset": "moral_scenarios",
1606
+ "metric": [
1607
+ "loglikelihood_acc"
1608
+ ],
1609
+ "hf_avail_splits": [
1610
+ "auxiliary_train",
1611
+ "test",
1612
+ "validation",
1613
+ "dev"
1614
+ ],
1615
+ "evaluation_splits": [
1616
+ "test"
1617
+ ],
1618
+ "few_shots_split": "dev",
1619
+ "few_shots_select": "sequential",
1620
+ "generation_size": 1,
1621
+ "stop_sequence": [
1622
+ "\n"
1623
+ ],
1624
+ "output_regex": null,
1625
+ "frozen": false,
1626
+ "suite": [
1627
+ "lighteval",
1628
+ "mmlu"
1629
+ ]
1630
+ },
1631
+ "lighteval|mmlu:nutrition": {
1632
+ "name": "mmlu:nutrition",
1633
+ "prompt_function": "mmlu_harness",
1634
+ "hf_repo": "lighteval/mmlu",
1635
+ "hf_subset": "nutrition",
1636
+ "metric": [
1637
+ "loglikelihood_acc"
1638
+ ],
1639
+ "hf_avail_splits": [
1640
+ "auxiliary_train",
1641
+ "test",
1642
+ "validation",
1643
+ "dev"
1644
+ ],
1645
+ "evaluation_splits": [
1646
+ "test"
1647
+ ],
1648
+ "few_shots_split": "dev",
1649
+ "few_shots_select": "sequential",
1650
+ "generation_size": 1,
1651
+ "stop_sequence": [
1652
+ "\n"
1653
+ ],
1654
+ "output_regex": null,
1655
+ "frozen": false,
1656
+ "suite": [
1657
+ "lighteval",
1658
+ "mmlu"
1659
+ ]
1660
+ },
1661
+ "lighteval|mmlu:philosophy": {
1662
+ "name": "mmlu:philosophy",
1663
+ "prompt_function": "mmlu_harness",
1664
+ "hf_repo": "lighteval/mmlu",
1665
+ "hf_subset": "philosophy",
1666
+ "metric": [
1667
+ "loglikelihood_acc"
1668
+ ],
1669
+ "hf_avail_splits": [
1670
+ "auxiliary_train",
1671
+ "test",
1672
+ "validation",
1673
+ "dev"
1674
+ ],
1675
+ "evaluation_splits": [
1676
+ "test"
1677
+ ],
1678
+ "few_shots_split": "dev",
1679
+ "few_shots_select": "sequential",
1680
+ "generation_size": 1,
1681
+ "stop_sequence": [
1682
+ "\n"
1683
+ ],
1684
+ "output_regex": null,
1685
+ "frozen": false,
1686
+ "suite": [
1687
+ "lighteval",
1688
+ "mmlu"
1689
+ ]
1690
+ },
1691
+ "lighteval|mmlu:prehistory": {
1692
+ "name": "mmlu:prehistory",
1693
+ "prompt_function": "mmlu_harness",
1694
+ "hf_repo": "lighteval/mmlu",
1695
+ "hf_subset": "prehistory",
1696
+ "metric": [
1697
+ "loglikelihood_acc"
1698
+ ],
1699
+ "hf_avail_splits": [
1700
+ "auxiliary_train",
1701
+ "test",
1702
+ "validation",
1703
+ "dev"
1704
+ ],
1705
+ "evaluation_splits": [
1706
+ "test"
1707
+ ],
1708
+ "few_shots_split": "dev",
1709
+ "few_shots_select": "sequential",
1710
+ "generation_size": 1,
1711
+ "stop_sequence": [
1712
+ "\n"
1713
+ ],
1714
+ "output_regex": null,
1715
+ "frozen": false,
1716
+ "suite": [
1717
+ "lighteval",
1718
+ "mmlu"
1719
+ ]
1720
+ },
1721
+ "lighteval|mmlu:professional_accounting": {
1722
+ "name": "mmlu:professional_accounting",
1723
+ "prompt_function": "mmlu_harness",
1724
+ "hf_repo": "lighteval/mmlu",
1725
+ "hf_subset": "professional_accounting",
1726
+ "metric": [
1727
+ "loglikelihood_acc"
1728
+ ],
1729
+ "hf_avail_splits": [
1730
+ "auxiliary_train",
1731
+ "test",
1732
+ "validation",
1733
+ "dev"
1734
+ ],
1735
+ "evaluation_splits": [
1736
+ "test"
1737
+ ],
1738
+ "few_shots_split": "dev",
1739
+ "few_shots_select": "sequential",
1740
+ "generation_size": 1,
1741
+ "stop_sequence": [
1742
+ "\n"
1743
+ ],
1744
+ "output_regex": null,
1745
+ "frozen": false,
1746
+ "suite": [
1747
+ "lighteval",
1748
+ "mmlu"
1749
+ ]
1750
+ },
1751
+ "lighteval|mmlu:professional_law": {
1752
+ "name": "mmlu:professional_law",
1753
+ "prompt_function": "mmlu_harness",
1754
+ "hf_repo": "lighteval/mmlu",
1755
+ "hf_subset": "professional_law",
1756
+ "metric": [
1757
+ "loglikelihood_acc"
1758
+ ],
1759
+ "hf_avail_splits": [
1760
+ "auxiliary_train",
1761
+ "test",
1762
+ "validation",
1763
+ "dev"
1764
+ ],
1765
+ "evaluation_splits": [
1766
+ "test"
1767
+ ],
1768
+ "few_shots_split": "dev",
1769
+ "few_shots_select": "sequential",
1770
+ "generation_size": 1,
1771
+ "stop_sequence": [
1772
+ "\n"
1773
+ ],
1774
+ "output_regex": null,
1775
+ "frozen": false,
1776
+ "suite": [
1777
+ "lighteval",
1778
+ "mmlu"
1779
+ ]
1780
+ },
1781
+ "lighteval|mmlu:professional_medicine": {
1782
+ "name": "mmlu:professional_medicine",
1783
+ "prompt_function": "mmlu_harness",
1784
+ "hf_repo": "lighteval/mmlu",
1785
+ "hf_subset": "professional_medicine",
1786
+ "metric": [
1787
+ "loglikelihood_acc"
1788
+ ],
1789
+ "hf_avail_splits": [
1790
+ "auxiliary_train",
1791
+ "test",
1792
+ "validation",
1793
+ "dev"
1794
+ ],
1795
+ "evaluation_splits": [
1796
+ "test"
1797
+ ],
1798
+ "few_shots_split": "dev",
1799
+ "few_shots_select": "sequential",
1800
+ "generation_size": 1,
1801
+ "stop_sequence": [
1802
+ "\n"
1803
+ ],
1804
+ "output_regex": null,
1805
+ "frozen": false,
1806
+ "suite": [
1807
+ "lighteval",
1808
+ "mmlu"
1809
+ ]
1810
+ },
1811
+ "lighteval|mmlu:professional_psychology": {
1812
+ "name": "mmlu:professional_psychology",
1813
+ "prompt_function": "mmlu_harness",
1814
+ "hf_repo": "lighteval/mmlu",
1815
+ "hf_subset": "professional_psychology",
1816
+ "metric": [
1817
+ "loglikelihood_acc"
1818
+ ],
1819
+ "hf_avail_splits": [
1820
+ "auxiliary_train",
1821
+ "test",
1822
+ "validation",
1823
+ "dev"
1824
+ ],
1825
+ "evaluation_splits": [
1826
+ "test"
1827
+ ],
1828
+ "few_shots_split": "dev",
1829
+ "few_shots_select": "sequential",
1830
+ "generation_size": 1,
1831
+ "stop_sequence": [
1832
+ "\n"
1833
+ ],
1834
+ "output_regex": null,
1835
+ "frozen": false,
1836
+ "suite": [
1837
+ "lighteval",
1838
+ "mmlu"
1839
+ ]
1840
+ },
1841
+ "lighteval|mmlu:public_relations": {
1842
+ "name": "mmlu:public_relations",
1843
+ "prompt_function": "mmlu_harness",
1844
+ "hf_repo": "lighteval/mmlu",
1845
+ "hf_subset": "public_relations",
1846
+ "metric": [
1847
+ "loglikelihood_acc"
1848
+ ],
1849
+ "hf_avail_splits": [
1850
+ "auxiliary_train",
1851
+ "test",
1852
+ "validation",
1853
+ "dev"
1854
+ ],
1855
+ "evaluation_splits": [
1856
+ "test"
1857
+ ],
1858
+ "few_shots_split": "dev",
1859
+ "few_shots_select": "sequential",
1860
+ "generation_size": 1,
1861
+ "stop_sequence": [
1862
+ "\n"
1863
+ ],
1864
+ "output_regex": null,
1865
+ "frozen": false,
1866
+ "suite": [
1867
+ "lighteval",
1868
+ "mmlu"
1869
+ ]
1870
+ },
1871
+ "lighteval|mmlu:security_studies": {
1872
+ "name": "mmlu:security_studies",
1873
+ "prompt_function": "mmlu_harness",
1874
+ "hf_repo": "lighteval/mmlu",
1875
+ "hf_subset": "security_studies",
1876
+ "metric": [
1877
+ "loglikelihood_acc"
1878
+ ],
1879
+ "hf_avail_splits": [
1880
+ "auxiliary_train",
1881
+ "test",
1882
+ "validation",
1883
+ "dev"
1884
+ ],
1885
+ "evaluation_splits": [
1886
+ "test"
1887
+ ],
1888
+ "few_shots_split": "dev",
1889
+ "few_shots_select": "sequential",
1890
+ "generation_size": 1,
1891
+ "stop_sequence": [
1892
+ "\n"
1893
+ ],
1894
+ "output_regex": null,
1895
+ "frozen": false,
1896
+ "suite": [
1897
+ "lighteval",
1898
+ "mmlu"
1899
+ ]
1900
+ },
1901
+ "lighteval|mmlu:sociology": {
1902
+ "name": "mmlu:sociology",
1903
+ "prompt_function": "mmlu_harness",
1904
+ "hf_repo": "lighteval/mmlu",
1905
+ "hf_subset": "sociology",
1906
+ "metric": [
1907
+ "loglikelihood_acc"
1908
+ ],
1909
+ "hf_avail_splits": [
1910
+ "auxiliary_train",
1911
+ "test",
1912
+ "validation",
1913
+ "dev"
1914
+ ],
1915
+ "evaluation_splits": [
1916
+ "test"
1917
+ ],
1918
+ "few_shots_split": "dev",
1919
+ "few_shots_select": "sequential",
1920
+ "generation_size": 1,
1921
+ "stop_sequence": [
1922
+ "\n"
1923
+ ],
1924
+ "output_regex": null,
1925
+ "frozen": false,
1926
+ "suite": [
1927
+ "lighteval",
1928
+ "mmlu"
1929
+ ]
1930
+ },
1931
+ "lighteval|mmlu:us_foreign_policy": {
1932
+ "name": "mmlu:us_foreign_policy",
1933
+ "prompt_function": "mmlu_harness",
1934
+ "hf_repo": "lighteval/mmlu",
1935
+ "hf_subset": "us_foreign_policy",
1936
+ "metric": [
1937
+ "loglikelihood_acc"
1938
+ ],
1939
+ "hf_avail_splits": [
1940
+ "auxiliary_train",
1941
+ "test",
1942
+ "validation",
1943
+ "dev"
1944
+ ],
1945
+ "evaluation_splits": [
1946
+ "test"
1947
+ ],
1948
+ "few_shots_split": "dev",
1949
+ "few_shots_select": "sequential",
1950
+ "generation_size": 1,
1951
+ "stop_sequence": [
1952
+ "\n"
1953
+ ],
1954
+ "output_regex": null,
1955
+ "frozen": false,
1956
+ "suite": [
1957
+ "lighteval",
1958
+ "mmlu"
1959
+ ]
1960
+ },
1961
+ "lighteval|mmlu:virology": {
1962
+ "name": "mmlu:virology",
1963
+ "prompt_function": "mmlu_harness",
1964
+ "hf_repo": "lighteval/mmlu",
1965
+ "hf_subset": "virology",
1966
+ "metric": [
1967
+ "loglikelihood_acc"
1968
+ ],
1969
+ "hf_avail_splits": [
1970
+ "auxiliary_train",
1971
+ "test",
1972
+ "validation",
1973
+ "dev"
1974
+ ],
1975
+ "evaluation_splits": [
1976
+ "test"
1977
+ ],
1978
+ "few_shots_split": "dev",
1979
+ "few_shots_select": "sequential",
1980
+ "generation_size": 1,
1981
+ "stop_sequence": [
1982
+ "\n"
1983
+ ],
1984
+ "output_regex": null,
1985
+ "frozen": false,
1986
+ "suite": [
1987
+ "lighteval",
1988
+ "mmlu"
1989
+ ]
1990
+ },
1991
+ "lighteval|mmlu:world_religions": {
1992
+ "name": "mmlu:world_religions",
1993
+ "prompt_function": "mmlu_harness",
1994
+ "hf_repo": "lighteval/mmlu",
1995
+ "hf_subset": "world_religions",
1996
+ "metric": [
1997
+ "loglikelihood_acc"
1998
+ ],
1999
+ "hf_avail_splits": [
2000
+ "auxiliary_train",
2001
+ "test",
2002
+ "validation",
2003
+ "dev"
2004
+ ],
2005
+ "evaluation_splits": [
2006
+ "test"
2007
+ ],
2008
+ "few_shots_split": "dev",
2009
+ "few_shots_select": "sequential",
2010
+ "generation_size": 1,
2011
+ "stop_sequence": [
2012
+ "\n"
2013
+ ],
2014
+ "output_regex": null,
2015
+ "frozen": false,
2016
+ "suite": [
2017
+ "lighteval",
2018
+ "mmlu"
2019
+ ]
2020
+ }
2021
+ },
2022
+ "summary_tasks": {
2023
+ "lighteval|mmlu:abstract_algebra|5": {
2024
+ "hashes": {
2025
+ "hash_examples": "4c76229e00c9c0e9",
2026
+ "hash_full_prompts": "273278cb9fb5ac01",
2027
+ "hash_input_tokens": "caf9777ccf71eab5",
2028
+ "hash_cont_tokens": "00520b0ec06da34f"
2029
+ },
2030
+ "truncated": 0,
2031
+ "non_truncated": 100,
2032
+ "padded": 400,
2033
+ "non_padded": 0,
2034
+ "effective_few_shots": 5.0,
2035
+ "num_truncated_few_shots": 0
2036
+ },
2037
+ "lighteval|mmlu:anatomy|5": {
2038
+ "hashes": {
2039
+ "hash_examples": "6a1f8104dccbd33b",
2040
+ "hash_full_prompts": "e77b5ebe030aabba",
2041
+ "hash_input_tokens": "d192cd7584fda4dc",
2042
+ "hash_cont_tokens": "263324e6ce7f9b36"
2043
+ },
2044
+ "truncated": 0,
2045
+ "non_truncated": 135,
2046
+ "padded": 540,
2047
+ "non_padded": 0,
2048
+ "effective_few_shots": 5.0,
2049
+ "num_truncated_few_shots": 0
2050
+ },
2051
+ "lighteval|mmlu:astronomy|5": {
2052
+ "hashes": {
2053
+ "hash_examples": "1302effa3a76ce4c",
2054
+ "hash_full_prompts": "0ff37ef4519e63f9",
2055
+ "hash_input_tokens": "d241783f0bfdf860",
2056
+ "hash_cont_tokens": "18ba399c6801138e"
2057
+ },
2058
+ "truncated": 0,
2059
+ "non_truncated": 152,
2060
+ "padded": 608,
2061
+ "non_padded": 0,
2062
+ "effective_few_shots": 5.0,
2063
+ "num_truncated_few_shots": 0
2064
+ },
2065
+ "lighteval|mmlu:business_ethics|5": {
2066
+ "hashes": {
2067
+ "hash_examples": "03cb8bce5336419a",
2068
+ "hash_full_prompts": "7c4d312a23bdd669",
2069
+ "hash_input_tokens": "0aee5ed969278926",
2070
+ "hash_cont_tokens": "00520b0ec06da34f"
2071
+ },
2072
+ "truncated": 0,
2073
+ "non_truncated": 100,
2074
+ "padded": 400,
2075
+ "non_padded": 0,
2076
+ "effective_few_shots": 5.0,
2077
+ "num_truncated_few_shots": 0
2078
+ },
2079
+ "lighteval|mmlu:clinical_knowledge|5": {
2080
+ "hashes": {
2081
+ "hash_examples": "ffbb9c7b2be257f9",
2082
+ "hash_full_prompts": "472d93369b1a8382",
2083
+ "hash_input_tokens": "aa05960be77863d3",
2084
+ "hash_cont_tokens": "9d7500060e0dd995"
2085
+ },
2086
+ "truncated": 0,
2087
+ "non_truncated": 265,
2088
+ "padded": 1060,
2089
+ "non_padded": 0,
2090
+ "effective_few_shots": 5.0,
2091
+ "num_truncated_few_shots": 0
2092
+ },
2093
+ "lighteval|mmlu:college_biology|5": {
2094
+ "hashes": {
2095
+ "hash_examples": "3ee77f176f38eb8e",
2096
+ "hash_full_prompts": "6853bf027b349083",
2097
+ "hash_input_tokens": "3843b5375a04262c",
2098
+ "hash_cont_tokens": "78a731af5d2f6472"
2099
+ },
2100
+ "truncated": 0,
2101
+ "non_truncated": 144,
2102
+ "padded": 576,
2103
+ "non_padded": 0,
2104
+ "effective_few_shots": 5.0,
2105
+ "num_truncated_few_shots": 0
2106
+ },
2107
+ "lighteval|mmlu:college_chemistry|5": {
2108
+ "hashes": {
2109
+ "hash_examples": "ce61a69c46d47aeb",
2110
+ "hash_full_prompts": "e0f8624971f7af71",
2111
+ "hash_input_tokens": "2096d1652e232764",
2112
+ "hash_cont_tokens": "00520b0ec06da34f"
2113
+ },
2114
+ "truncated": 0,
2115
+ "non_truncated": 100,
2116
+ "padded": 400,
2117
+ "non_padded": 0,
2118
+ "effective_few_shots": 5.0,
2119
+ "num_truncated_few_shots": 0
2120
+ },
2121
+ "lighteval|mmlu:college_computer_science|5": {
2122
+ "hashes": {
2123
+ "hash_examples": "32805b52d7d5daab",
2124
+ "hash_full_prompts": "841e9d2ecfbb104d",
2125
+ "hash_input_tokens": "1e007ac047722e9b",
2126
+ "hash_cont_tokens": "00520b0ec06da34f"
2127
+ },
2128
+ "truncated": 0,
2129
+ "non_truncated": 100,
2130
+ "padded": 400,
2131
+ "non_padded": 0,
2132
+ "effective_few_shots": 5.0,
2133
+ "num_truncated_few_shots": 0
2134
+ },
2135
+ "lighteval|mmlu:college_mathematics|5": {
2136
+ "hashes": {
2137
+ "hash_examples": "55da1a0a0bd33722",
2138
+ "hash_full_prompts": "696c5f73522b8706",
2139
+ "hash_input_tokens": "c3061d57b5a4ad7e",
2140
+ "hash_cont_tokens": "00520b0ec06da34f"
2141
+ },
2142
+ "truncated": 0,
2143
+ "non_truncated": 100,
2144
+ "padded": 400,
2145
+ "non_padded": 0,
2146
+ "effective_few_shots": 5.0,
2147
+ "num_truncated_few_shots": 0
2148
+ },
2149
+ "lighteval|mmlu:college_medicine|5": {
2150
+ "hashes": {
2151
+ "hash_examples": "c33e143163049176",
2152
+ "hash_full_prompts": "7d2530816f672426",
2153
+ "hash_input_tokens": "4cddd091001776d7",
2154
+ "hash_cont_tokens": "699c8eb24e3e446b"
2155
+ },
2156
+ "truncated": 0,
2157
+ "non_truncated": 173,
2158
+ "padded": 692,
2159
+ "non_padded": 0,
2160
+ "effective_few_shots": 5.0,
2161
+ "num_truncated_few_shots": 0
2162
+ },
2163
+ "lighteval|mmlu:college_physics|5": {
2164
+ "hashes": {
2165
+ "hash_examples": "ebdab1cdb7e555df",
2166
+ "hash_full_prompts": "66b3a61507c4c92b",
2167
+ "hash_input_tokens": "821b169941167548",
2168
+ "hash_cont_tokens": "075997110cbe055e"
2169
+ },
2170
+ "truncated": 0,
2171
+ "non_truncated": 102,
2172
+ "padded": 408,
2173
+ "non_padded": 0,
2174
+ "effective_few_shots": 5.0,
2175
+ "num_truncated_few_shots": 0
2176
+ },
2177
+ "lighteval|mmlu:computer_security|5": {
2178
+ "hashes": {
2179
+ "hash_examples": "a24fd7d08a560921",
2180
+ "hash_full_prompts": "f1143da88158bf03",
2181
+ "hash_input_tokens": "02e64465d74344b4",
2182
+ "hash_cont_tokens": "00520b0ec06da34f"
2183
+ },
2184
+ "truncated": 0,
2185
+ "non_truncated": 100,
2186
+ "padded": 400,
2187
+ "non_padded": 0,
2188
+ "effective_few_shots": 5.0,
2189
+ "num_truncated_few_shots": 0
2190
+ },
2191
+ "lighteval|mmlu:conceptual_physics|5": {
2192
+ "hashes": {
2193
+ "hash_examples": "8300977a79386993",
2194
+ "hash_full_prompts": "d2b4c706b65a71d9",
2195
+ "hash_input_tokens": "5c7a2235529d2821",
2196
+ "hash_cont_tokens": "f22daa6d4818086f"
2197
+ },
2198
+ "truncated": 0,
2199
+ "non_truncated": 235,
2200
+ "padded": 940,
2201
+ "non_padded": 0,
2202
+ "effective_few_shots": 5.0,
2203
+ "num_truncated_few_shots": 0
2204
+ },
2205
+ "lighteval|mmlu:econometrics|5": {
2206
+ "hashes": {
2207
+ "hash_examples": "ddde36788a04a46f",
2208
+ "hash_full_prompts": "aa5255d923b0e3a3",
2209
+ "hash_input_tokens": "e0a79ea9e037599d",
2210
+ "hash_cont_tokens": "26791a0b1941b4c4"
2211
+ },
2212
+ "truncated": 0,
2213
+ "non_truncated": 114,
2214
+ "padded": 456,
2215
+ "non_padded": 0,
2216
+ "effective_few_shots": 5.0,
2217
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:electrical_engineering|5": {
+ "hashes": {
+ "hash_examples": "acbc5def98c19b3f",
+ "hash_full_prompts": "c1f9a9087987d1d7",
+ "hash_input_tokens": "e48ddb58b2efa8e3",
+ "hash_cont_tokens": "3e336577994f6c0d"
+ },
+ "truncated": 0,
+ "non_truncated": 145,
+ "padded": 580,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:elementary_mathematics|5": {
+ "hashes": {
+ "hash_examples": "146e61d07497a9bd",
+ "hash_full_prompts": "57fb9ddf2f814bb5",
+ "hash_input_tokens": "9e81373b5265da10",
+ "hash_cont_tokens": "1d6bbfa8a67327c8"
+ },
+ "truncated": 0,
+ "non_truncated": 378,
+ "padded": 1512,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:formal_logic|5": {
+ "hashes": {
+ "hash_examples": "8635216e1909a03f",
+ "hash_full_prompts": "dc7e34e04346adfd",
+ "hash_input_tokens": "0378ed1f1a9bb3f6",
+ "hash_cont_tokens": "60508d85eb7693a4"
+ },
+ "truncated": 0,
+ "non_truncated": 126,
+ "padded": 504,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:global_facts|5": {
+ "hashes": {
+ "hash_examples": "30b315aa6353ee47",
+ "hash_full_prompts": "7dedb5baa45f3a38",
+ "hash_input_tokens": "d20db9bd82fb76c1",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_biology|5": {
+ "hashes": {
+ "hash_examples": "c9136373af2180de",
+ "hash_full_prompts": "15157813fc668acf",
+ "hash_input_tokens": "c3c10eef8c477c93",
+ "hash_cont_tokens": "d236ce982144e65f"
+ },
+ "truncated": 0,
+ "non_truncated": 310,
+ "padded": 1240,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_chemistry|5": {
+ "hashes": {
+ "hash_examples": "b0661bfa1add6404",
+ "hash_full_prompts": "f51dfd92a2d6fdba",
+ "hash_input_tokens": "dc53c87961ef4ab5",
+ "hash_cont_tokens": "59f93238ec5aead6"
+ },
+ "truncated": 0,
+ "non_truncated": 203,
+ "padded": 812,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_computer_science|5": {
+ "hashes": {
+ "hash_examples": "80fc1d623a3d665f",
+ "hash_full_prompts": "fe432a03fe8cc766",
+ "hash_input_tokens": "61fa356c3ea98372",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_european_history|5": {
+ "hashes": {
+ "hash_examples": "854da6e5af0fe1a1",
+ "hash_full_prompts": "09a62e1560fb1171",
+ "hash_input_tokens": "272f8d31300ef0af",
+ "hash_cont_tokens": "7b7414d6a5da3d91"
+ },
+ "truncated": 0,
+ "non_truncated": 165,
+ "padded": 656,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_geography|5": {
+ "hashes": {
+ "hash_examples": "7dc963c7acd19ad8",
+ "hash_full_prompts": "8284151c76cee4d8",
+ "hash_input_tokens": "12624aed9bf6356b",
+ "hash_cont_tokens": "1b66289e10988f84"
+ },
+ "truncated": 0,
+ "non_truncated": 198,
+ "padded": 792,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_government_and_politics|5": {
+ "hashes": {
+ "hash_examples": "1f675dcdebc9758f",
+ "hash_full_prompts": "083339a69a8bfafa",
+ "hash_input_tokens": "32e30c43a4a5347e",
+ "hash_cont_tokens": "5ab3c3415b1d3a55"
+ },
+ "truncated": 0,
+ "non_truncated": 193,
+ "padded": 772,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_macroeconomics|5": {
+ "hashes": {
+ "hash_examples": "2fb32cf2d80f0b35",
+ "hash_full_prompts": "ececedb0c4a4ffcd",
+ "hash_input_tokens": "dc2cd6b398f5f86e",
+ "hash_cont_tokens": "2f5457058d187374"
+ },
+ "truncated": 0,
+ "non_truncated": 390,
+ "padded": 1557,
+ "non_padded": 3,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_mathematics|5": {
+ "hashes": {
+ "hash_examples": "fd6646fdb5d58a1f",
+ "hash_full_prompts": "d58a3ca5c8ed6780",
+ "hash_input_tokens": "6f9c5ce6428dd87d",
+ "hash_cont_tokens": "e35137cb972e1918"
+ },
+ "truncated": 0,
+ "non_truncated": 270,
+ "padded": 1080,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_microeconomics|5": {
+ "hashes": {
+ "hash_examples": "2118f21f71d87d84",
+ "hash_full_prompts": "bd49ce8a930e3e78",
+ "hash_input_tokens": "44722cbe1d85e636",
+ "hash_cont_tokens": "f756093278ebb83e"
+ },
+ "truncated": 0,
+ "non_truncated": 238,
+ "padded": 908,
+ "non_padded": 44,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_physics|5": {
+ "hashes": {
+ "hash_examples": "dc3ce06378548565",
+ "hash_full_prompts": "3904af994b32b959",
+ "hash_input_tokens": "2132f616c2587937",
+ "hash_cont_tokens": "9cf883ebf1c82176"
+ },
+ "truncated": 0,
+ "non_truncated": 151,
+ "padded": 604,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_psychology|5": {
+ "hashes": {
+ "hash_examples": "c8d1d98a40e11f2f",
+ "hash_full_prompts": "d3a4d5dd3f3513f8",
+ "hash_input_tokens": "6cc69cf1a89e4a88",
+ "hash_cont_tokens": "bda0f77331ebb21a"
+ },
+ "truncated": 0,
+ "non_truncated": 545,
+ "padded": 2178,
+ "non_padded": 2,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_statistics|5": {
+ "hashes": {
+ "hash_examples": "666c8759b98ee4ff",
+ "hash_full_prompts": "1b5599f9d4edc7de",
+ "hash_input_tokens": "60af7a873b579818",
+ "hash_cont_tokens": "4d04f014105a0bad"
+ },
+ "truncated": 0,
+ "non_truncated": 216,
+ "padded": 864,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_us_history|5": {
+ "hashes": {
+ "hash_examples": "95fef1c4b7d3f81e",
+ "hash_full_prompts": "001f7e7cc8185618",
+ "hash_input_tokens": "8c2d01a0f291db69",
+ "hash_cont_tokens": "f4590c58f12f2766"
+ },
+ "truncated": 0,
+ "non_truncated": 204,
+ "padded": 816,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:high_school_world_history|5": {
+ "hashes": {
+ "hash_examples": "7e5085b6184b0322",
+ "hash_full_prompts": "6a5c2a43cf7c6cb1",
+ "hash_input_tokens": "612ed95e43bc21b5",
+ "hash_cont_tokens": "db6bcddd891df5d9"
+ },
+ "truncated": 0,
+ "non_truncated": 237,
+ "padded": 948,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:human_aging|5": {
+ "hashes": {
+ "hash_examples": "c17333e7c7c10797",
+ "hash_full_prompts": "a3ad8e679fe07bef",
+ "hash_input_tokens": "4c948b081b40ba31",
+ "hash_cont_tokens": "25cec8d640319105"
+ },
+ "truncated": 0,
+ "non_truncated": 223,
+ "padded": 892,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:human_sexuality|5": {
+ "hashes": {
+ "hash_examples": "4edd1e9045df5e3d",
+ "hash_full_prompts": "3389ffb95929a661",
+ "hash_input_tokens": "9e649cc80ef9f2fe",
+ "hash_cont_tokens": "6778302b4a10b645"
+ },
+ "truncated": 0,
+ "non_truncated": 131,
+ "padded": 524,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:international_law|5": {
+ "hashes": {
+ "hash_examples": "db2fa00d771a062a",
+ "hash_full_prompts": "104f48c64f6f9622",
+ "hash_input_tokens": "c51db1d4a2a87eed",
+ "hash_cont_tokens": "9eb54e1a46032749"
+ },
+ "truncated": 0,
+ "non_truncated": 121,
+ "padded": 484,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:jurisprudence|5": {
+ "hashes": {
+ "hash_examples": "e956f86b124076fe",
+ "hash_full_prompts": "49295d36462ddc97",
+ "hash_input_tokens": "a779a1b30bc13f30",
+ "hash_cont_tokens": "f17d9a372cfd66b1"
+ },
+ "truncated": 0,
+ "non_truncated": 108,
+ "padded": 420,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:logical_fallacies|5": {
+ "hashes": {
+ "hash_examples": "956e0e6365ab79f1",
+ "hash_full_prompts": "b64f452752d5cd23",
+ "hash_input_tokens": "61d99e8d4d4d8652",
+ "hash_cont_tokens": "cf44a68f5bca9a96"
+ },
+ "truncated": 0,
+ "non_truncated": 163,
+ "padded": 648,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:machine_learning|5": {
+ "hashes": {
+ "hash_examples": "397997cc6f4d581e",
+ "hash_full_prompts": "54da136ebd708042",
+ "hash_input_tokens": "11e6731506fcf366",
+ "hash_cont_tokens": "eace00d420f4f32c"
+ },
+ "truncated": 0,
+ "non_truncated": 112,
+ "padded": 448,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:management|5": {
+ "hashes": {
+ "hash_examples": "2bcbe6f6ca63d740",
+ "hash_full_prompts": "a4b864ff27598ba3",
+ "hash_input_tokens": "caffa6e4e80cbd5e",
+ "hash_cont_tokens": "b7c51d0250c252d8"
+ },
+ "truncated": 0,
+ "non_truncated": 103,
+ "padded": 412,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:marketing|5": {
+ "hashes": {
+ "hash_examples": "8ddb20d964a1b065",
+ "hash_full_prompts": "c7183ac32f36104d",
+ "hash_input_tokens": "5cd238ac5e8f19f4",
+ "hash_cont_tokens": "086fb63f8b1d1339"
+ },
+ "truncated": 0,
+ "non_truncated": 234,
+ "padded": 924,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:medical_genetics|5": {
+ "hashes": {
+ "hash_examples": "182a71f4763d2cea",
+ "hash_full_prompts": "c17b0a66e3027303",
+ "hash_input_tokens": "46c0c8a573b43089",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:miscellaneous|5": {
+ "hashes": {
+ "hash_examples": "4c404fdbb4ca57fc",
+ "hash_full_prompts": "bc5fa37ce20a2503",
+ "hash_input_tokens": "5327cd4585062ac2",
+ "hash_cont_tokens": "1827274fa6537077"
+ },
+ "truncated": 0,
+ "non_truncated": 783,
+ "padded": 3132,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:moral_disputes|5": {
+ "hashes": {
+ "hash_examples": "60cbd2baa3fea5c9",
+ "hash_full_prompts": "075742051236078f",
+ "hash_input_tokens": "a2c9da202f686839",
+ "hash_cont_tokens": "472c223f6f28cfc7"
+ },
+ "truncated": 0,
+ "non_truncated": 346,
+ "padded": 1384,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:moral_scenarios|5": {
+ "hashes": {
+ "hash_examples": "fd8b0431fbdd75ef",
+ "hash_full_prompts": "533c4700637599a2",
+ "hash_input_tokens": "9a1a9f3900b372e6",
+ "hash_cont_tokens": "e90dade00a092f9e"
+ },
+ "truncated": 0,
+ "non_truncated": 895,
+ "padded": 3567,
+ "non_padded": 13,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:nutrition|5": {
+ "hashes": {
+ "hash_examples": "71e55e2b829b6528",
+ "hash_full_prompts": "02b6877dc5a603a6",
+ "hash_input_tokens": "dd91fec063272e23",
+ "hash_cont_tokens": "128e0ec97d96b165"
+ },
+ "truncated": 0,
+ "non_truncated": 306,
+ "padded": 1224,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:philosophy|5": {
+ "hashes": {
+ "hash_examples": "a6d489a8d208fa4b",
+ "hash_full_prompts": "0e65b5f40a9ceb20",
+ "hash_input_tokens": "2255e15265a7d96a",
+ "hash_cont_tokens": "cbfd7829a3e0f082"
+ },
+ "truncated": 0,
+ "non_truncated": 311,
+ "padded": 1244,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:prehistory|5": {
+ "hashes": {
+ "hash_examples": "6cc50f032a19acaa",
+ "hash_full_prompts": "e838e60749e4a598",
+ "hash_input_tokens": "1b9b906efbcc97fd",
+ "hash_cont_tokens": "9c0cf5a2f71afa7e"
+ },
+ "truncated": 0,
+ "non_truncated": 324,
+ "padded": 1284,
+ "non_padded": 12,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:professional_accounting|5": {
+ "hashes": {
+ "hash_examples": "50f57ab32f5f6cea",
+ "hash_full_prompts": "9abf7319f68b7ba8",
+ "hash_input_tokens": "d42c8275cd4e10e1",
+ "hash_cont_tokens": "50f011c2453517ee"
+ },
+ "truncated": 0,
+ "non_truncated": 282,
+ "padded": 1128,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:professional_law|5": {
+ "hashes": {
+ "hash_examples": "a8fdc85c64f4b215",
+ "hash_full_prompts": "4074faf1eaedcfda",
+ "hash_input_tokens": "215c854d27e741b8",
+ "hash_cont_tokens": "73527e852c24186c"
+ },
+ "truncated": 0,
+ "non_truncated": 1534,
+ "padded": 6136,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:professional_medicine|5": {
+ "hashes": {
+ "hash_examples": "c373a28a3050a73a",
+ "hash_full_prompts": "e72202fc20fcab70",
+ "hash_input_tokens": "5a6e9aaaaea83544",
+ "hash_cont_tokens": "ceb7af5e2e789abc"
+ },
+ "truncated": 0,
+ "non_truncated": 272,
+ "padded": 1088,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:professional_psychology|5": {
+ "hashes": {
+ "hash_examples": "bf5254fe818356af",
+ "hash_full_prompts": "4dcb71c9ef602791",
+ "hash_input_tokens": "316d0ba731b0de4f",
+ "hash_cont_tokens": "8cfdced8a9667380"
+ },
+ "truncated": 0,
+ "non_truncated": 612,
+ "padded": 2428,
+ "non_padded": 20,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:public_relations|5": {
+ "hashes": {
+ "hash_examples": "b66d52e28e7d14e0",
+ "hash_full_prompts": "c6050b1748185950",
+ "hash_input_tokens": "2ba1d90c95e19dce",
+ "hash_cont_tokens": "f8327461a9cc5123"
+ },
+ "truncated": 0,
+ "non_truncated": 110,
+ "padded": 436,
+ "non_padded": 4,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:security_studies|5": {
+ "hashes": {
+ "hash_examples": "514c14feaf000ad9",
+ "hash_full_prompts": "4c6786915b670d03",
+ "hash_input_tokens": "b92f71eccf4f89bf",
+ "hash_cont_tokens": "c30b0c4d52c2875d"
+ },
+ "truncated": 0,
+ "non_truncated": 245,
+ "padded": 980,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:sociology|5": {
+ "hashes": {
+ "hash_examples": "f6c9bc9d18c80870",
+ "hash_full_prompts": "a2e9a27e985a4e9b",
+ "hash_input_tokens": "e821334ab55c0d44",
+ "hash_cont_tokens": "eef4bd16d536fbd6"
+ },
+ "truncated": 0,
+ "non_truncated": 201,
+ "padded": 804,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:us_foreign_policy|5": {
+ "hashes": {
+ "hash_examples": "ed7b78629db6678f",
+ "hash_full_prompts": "46d0986398662d59",
+ "hash_input_tokens": "9f6b40a7b6b8a3b2",
+ "hash_cont_tokens": "00520b0ec06da34f"
+ },
+ "truncated": 0,
+ "non_truncated": 100,
+ "padded": 400,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:virology|5": {
+ "hashes": {
+ "hash_examples": "bc52ffdc3f9b994a",
+ "hash_full_prompts": "6b591e3983159283",
+ "hash_input_tokens": "d7c6d39e149defc9",
+ "hash_cont_tokens": "f5fc195e049353c0"
+ },
+ "truncated": 0,
+ "non_truncated": 166,
+ "padded": 664,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ },
+ "lighteval|mmlu:world_religions|5": {
+ "hashes": {
+ "hash_examples": "ecdb4a4f94f62930",
+ "hash_full_prompts": "8c2e37a02519af15",
+ "hash_input_tokens": "80b87b6e634441d6",
+ "hash_cont_tokens": "ada548665e87b1e0"
+ },
+ "truncated": 0,
+ "non_truncated": 171,
+ "padded": 684,
+ "non_padded": 0,
+ "effective_few_shots": 5.0,
+ "num_truncated_few_shots": 0
+ }
+ },
+ "summary_general": {
+ "hashes": {
+ "hash_examples": "341a076d0beb7048",
+ "hash_full_prompts": "7c1eeddf962b8fc9",
+ "hash_input_tokens": "98bef9715b6ebf74",
+ "hash_cont_tokens": "3672212ca582e2d0"
+ },
+ "truncated": 0,
+ "non_truncated": 14042,
+ "padded": 56038,
+ "non_padded": 130,
+ "num_truncated_few_shots": 0
+ }
+ }