wdevazelhes committed on
Commit ec089c6 · verified · 1 Parent(s): d2c553b

few formatting and unifying MATH Lvl-5 label

  1. README.md +16 -16
README.md CHANGED
@@ -91,44 +91,44 @@ We report in the following table our internal pipeline benchmarks:
  <td rowspan="3">General</td>
  <td>MMLU (5-shot)</td>
  <td>56.1</td>
- <td>65.6</td>
+ <td><b>65.6</b></td>
  <td>58.6</td>
  <td>55.5</td>
  </tr>
  <tr>
  <td>MMLU-PRO (5-shot)</td>
  <td>24.9</td>
- <td>31.99</td>
+ <td><b>31.99</b></td>
  <td>26.21</td>
  <td>28.77</td>
  </tr>
  <tr>
  <td>IFEval</td>
  <td>12.83</td>
- <td>27</td>
+ <td>27.0</td>
  <td>22.81</td>
- <td>27.67</td>
+ <td><b>27.67</b></td>
  </tr>
  <tr>
  <td rowspan="2">Math</td>
  <td>GSM8K (5-shot)</td>
  <td>26.68</td>
- <td>68.99</td>
+ <td><b>68.99</b></td>
  <td>25.7</td>
  <td>63.91</td>
  </tr>
  <tr>
- <td>MATH(4-shot)</td>
+ <td>MATH Lvl-5 (4-shot)</td>
  <td>1.39</td>
  <td>8.43</td>
  <td>1.73</td>
- <td>9.38</td>
+ <td><b>9.38</b></td>
  </tr>
  <tr>
  <td rowspan="4">Reasoning</td>
  <td>Arc Challenge (25-shot)</td>
  <td>50.76</td>
- <td>55.54</td>
+ <td><b>55.54</b></td>
  <td>50.34</td>
  <td>54.86</td>
  </tr>
@@ -136,20 +136,20 @@ We report in the following table our internal pipeline benchmarks:
  <td>GPQA (0-shot)</td>
  <td>27.49</td>
  <td>27.53</td>
- <td>38.6</td>
+ <td><b>38.6</b></td>
  <td>31.15</td>
  </tr>
  <tr>
  <td>MUSR (0-shot)</td>
  <td>35.24</td>
- <td>43.03</td>
+ <td><b>43.03</b></td>
  <td>42.13</td>
  <td>37.5</td>
  </tr>
  <tr>
  <td>BBH (3-shot)</td>
  <td>38.59</td>
- <td>46.12</td>
+ <td><b>46.12</b></td>
  <td>40.85</td>
  <td>44.23</td>
  </tr>
@@ -157,7 +157,7 @@ We report in the following table our internal pipeline benchmarks:
  <td rowspan="4">CommonSense Understanding</td>
  <td>PIQA (0-shot)</td>
  <td>77.42</td>
- <td>78.89</td>
+ <td><b>78.89</b></td>
  <td>78.29</td>
  <td>75.62</td>
  </tr>
@@ -165,21 +165,21 @@ We report in the following table our internal pipeline benchmarks:
  <td>SciQ (0-shot)</td>
  <td>92.7</td>
  <td>95.6</td>
- <td>96.1</td>
+ <td><b>96.1</b></td>
  <td>93.1</td>
  </tr>
  <tr>
  <td>Winogrande (0-shot)</td>
- <td>69.69</td>
+ <td><b>69.69</b></td>
  <td>68.82</td>
  <td>68.35</td>
  <td>64.64</td>
  </tr>
  <tr>
  <td>OpenbookQA (0-shot)</td>
- <td>43.2</td>
+ <td><b>43.2</b></td>
  <td>42.2</td>
- <td>43</td>
+ <td>43.0</td>
  <td>39.4</td>
  </tr>
  </tbody>