alexmarques committed on
Commit
218c80f
1 Parent(s): bb6fae1

Update README.md

Files changed (1): README.md (+29 -27)
README.md CHANGED
@@ -33,7 +33,7 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
 - **Model Developers:** Neural Magic
 
 Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
-It achieves scores within 1% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
+It achieves scores within 1.2% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
 
 ### Model Optimizations
 
@@ -136,6 +136,8 @@ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande an
 Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
 This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
 
+**Note:** Results have been updated after Meta modified the chat template.
+
 ### Accuracy
 
 #### Open LLM Leaderboard evaluation scores
@@ -153,9 +155,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <tr>
   <td>MMLU (5-shot)
   </td>
-  <td>83.88
+  <td>83.94
   </td>
-  <td>83.65
+  <td>83.71
   </td>
   <td>99.7%
   </td>
@@ -163,71 +165,71 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <tr>
   <td>MMLU (CoT, 0-shot)
   </td>
-  <td>85.74
+  <td>86.23
   </td>
-  <td>85.41
+  <td>85.81
   </td>
-  <td>99.6%
+  <td>99.5%
   </td>
  </tr>
  <tr>
   <td>ARC Challenge (0-shot)
   </td>
-  <td>93.26
+  <td>93.34
   </td>
-  <td>93.26
+  <td>93.09
   </td>
-  <td>100.0%
+  <td>99.7%
   </td>
  </tr>
  <tr>
   <td>GSM-8K (CoT, 8-shot, strict-match)
   </td>
-  <td>93.10
+  <td>95.38
   </td>
-  <td>93.25
+  <td>94.24
   </td>
-  <td>100.2%
+  <td>98.8%
   </td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)
   </td>
-  <td>86.40
+  <td>86.66
   </td>
-  <td>86.28
+  <td>86.19
   </td>
-  <td>99.9%
+  <td>99.5%
   </td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)
   </td>
-  <td>85.00
+  <td>85.32
   </td>
   <td>85.00
   </td>
-  <td>100.0%
+  <td>99.6%
   </td>
  </tr>
  <tr>
   <td>TruthfulQA (0-shot, mc2)
   </td>
-  <td>59.83
+  <td>60.65
   </td>
-  <td>60.88
+  <td>60.69
   </td>
-  <td>101.8%
+  <td>100.1%
   </td>
  </tr>
  <tr>
   <td><strong>Average</strong>
   </td>
-  <td><strong>83.89</strong>
+  <td><strong>84.50</strong>
   </td>
-  <td><strong>83.96</strong>
+  <td><strong>84.10</strong>
   </td>
-  <td><strong>100.2%</strong>
+  <td><strong>99.6%</strong>
   </td>
  </tr>
 </table>
@@ -240,7 +242,7 @@ The results were obtained using the following commands:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
   --tasks mmlu_llama_3.1_instruct \
   --fewshot_as_multiturn \
   --apply_chat_template \
@@ -252,7 +254,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
   --tasks mmlu_cot_0shot_llama_3.1_instruct \
   --apply_chat_template \
   --num_fewshot 0 \
@@ -263,7 +265,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
   --tasks arc_challenge_llama_3.1_instruct \
   --apply_chat_template \
   --num_fewshot 0 \
@@ -274,7 +276,7 @@ lm_eval \
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
   --tasks gsm8k_cot_llama_3.1_instruct \
   --fewshot_as_multiturn \
   --apply_chat_template \
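The updated recovery column is the ratio of the quantized score to the unquantized baseline. A minimal Python sketch that reproduces the new percentages from the table values above; the column order (baseline first, quantized second) is an assumption, since the table header row falls outside the diff context:

```python
# Sketch only: reproduces the recovery column of the updated table.
# Assumes each pair is (unquantized baseline, quantized w8a8) -- the
# table header is outside the diff context, so this ordering is inferred.
scores = {
    "MMLU (5-shot)": (83.94, 83.71),
    "MMLU (CoT, 0-shot)": (86.23, 85.81),
    "ARC Challenge (0-shot)": (93.34, 93.09),
    "GSM-8K (CoT, 8-shot, strict-match)": (95.38, 94.24),
    "Hellaswag (10-shot)": (86.66, 86.19),
    "Winogrande (5-shot)": (85.32, 85.00),
    "TruthfulQA (0-shot, mc2)": (60.65, 60.69),
}

for name, (baseline, quantized) in scores.items():
    # e.g. GSM-8K: 94.24 / 95.38 * 100 = 98.8%, the largest gap in the table
    print(f"{name}: {quantized / baseline * 100:.1f}%")

# Averages match the table's final row: 84.50 vs 84.10.
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)
print(f"Average: {baseline_avg:.2f} vs {quantized_avg:.2f}")
```

The worst case, GSM-8K at 98.8% recovery, is what motivates loosening the headline claim from "within 1%" to "within 1.2%".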
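The evaluation commands above run through vLLM. For context, a minimal inference sketch for the quantized checkpoint, not part of this commit; `tensor_parallel_size`, `max_model_len`, and the sampling settings are illustrative and should be adapted to the available hardware:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8"

# Build the prompt with the model's chat template -- the same template
# whose update by Meta prompted the re-evaluation in this commit.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# A 70B W8A8 model does not fit on a single GPU; tensor_parallel_size=4
# is illustrative, not a recommendation.
llm = LLM(model=model_id, tensor_parallel_size=4, max_model_len=4096)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

outputs = llm.generate(prompt, params)
print(outputs[0].outputs[0].text)
```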