Lin-K76 committed
Commit 87fc45f
1 parent: 91ae9af

Update README.md

Files changed (1): README.md (+70 -11)
README.md CHANGED
@@ -161,16 +161,9 @@ oneshot(
 
  ## Evaluation
 
- The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command.
- A modified version of ARC-C and GSM8k-cot was used for evaluations, in line with Llama 3.1's prompting. It can be accessed on the [Neural Magic fork of the lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct).
- Additional evaluations that were collected for the original Llama 3.1 models will be added in the future.
- ```
- lm_eval \
- --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
- --tasks openllm \
- --batch_size auto
- ```
+ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
+ Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+ This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
  ### Accuracy
 
@@ -256,4 +249,70 @@ lm_eval \
  <td><strong>99.33%</strong>
  </td>
  </tr>
- </table>
+ </table>
+
+ ### Reproduction
+
+ The results were obtained using the following commands:
+
+ #### MMLU
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks mmlu \
+ --num_fewshot 5 \
+ --batch_size auto
+ ```
+
+ #### ARC-Challenge
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks arc_challenge_llama_3.1_instruct \
+ --apply_chat_template \
+ --num_fewshot 0 \
+ --batch_size auto
+ ```
+
+ #### GSM-8K
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks gsm8k_cot_llama_3.1_instruct \
+ --apply_chat_template \
+ --num_fewshot 8 \
+ --batch_size auto
+ ```
+
+ #### Hellaswag
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks hellaswag \
+ --num_fewshot 10 \
+ --batch_size auto
+ ```
+
+ #### Winogrande
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks winogrande \
+ --num_fewshot 5 \
+ --batch_size auto
+ ```
+
+ #### TruthfulQA
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+ --tasks truthfulqa_mc \
+ --num_fewshot 0 \
+ --batch_size auto
+ ```
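The six reproduction commands added by this commit differ only in the task name, the few-shot count, and whether `--apply_chat_template` is passed; everything else (model, `--model_args`, batch size) is shared. A minimal Python sketch that generates all six commands from one table — a hypothetical helper, not part of the commit; task names and flags are copied verbatim from the README diff above:

```python
# Generate the lm_eval reproduction commands from the README's Reproduction
# section. Hypothetical helper -- not part of this commit; task names and
# flags are taken verbatim from the diff above.

MODEL_ARGS = (
    'pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",'
    "dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1"
)

# (task, num_fewshot, apply_chat_template) per the README
TASKS = [
    ("mmlu", 5, False),
    ("arc_challenge_llama_3.1_instruct", 0, True),
    ("gsm8k_cot_llama_3.1_instruct", 8, True),
    ("hellaswag", 10, False),
    ("winogrande", 5, False),
    ("truthfulqa_mc", 0, False),
]

def build_command(task: str, num_fewshot: int, chat_template: bool) -> str:
    """Assemble one lm_eval invocation as a shell-ready, line-continued string."""
    parts = [
        "lm_eval",
        "--model vllm",
        f"--model_args {MODEL_ARGS}",
        f"--tasks {task}",
    ]
    if chat_template:
        parts.append("--apply_chat_template")
    parts += [f"--num_fewshot {num_fewshot}", "--batch_size auto"]
    return " \\\n  ".join(parts)

if __name__ == "__main__":
    for task, shots, chat in TASKS:
        print(build_command(task, shots, chat), end="\n\n")
```

Running it prints each command with backslash line continuations, ready to paste into a shell.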