This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
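
As a hedged illustration of this setup, the snippet below generates a chat completion with vLLM's offline Python API; the hub model ID, sampling parameters, and GPU count are assumptions, not the exact evaluation configuration.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face hub ID for this checkpoint; adjust as needed.
MODEL_ID = "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8"

# Illustrative sampling settings; each benchmark harness configures its own.
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=1024)

# A 70B model typically spans multiple GPUs; tensor_parallel_size=4 is an assumption.
llm = LLM(model=MODEL_ID, tensor_parallel_size=4)

# vLLM applies the model's chat template to the message list.
outputs = llm.chat(
    [{"role": "user", "content": "What is INT8 weight and activation quantization?"}],
    params,
)
print(outputs[0].outputs[0].text)
```
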
Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
We report below the scores obtained in each judgement and the average.
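
The typical Arena-Hard-Auto flow is sketched below in Python; the script names are taken from that repository's README, while paths and configuration (model endpoint, judge settings) are assumptions that live in the repo's YAML files.

```python
# Hedged sketch: run from a checkout of lmarena/arena-hard-auto with its
# YAML configs pointing at the model under test and the GPT-4 judge.
import subprocess

# 1. Generate a single answer for every Arena-Hard prompt.
subprocess.run(["python", "gen_answer.py"], check=True)

# 2. Have GPT-4 judge the answers; each answer is judged twice, and the
#    card reports both per-judgement scores and their average.
subprocess.run(["python", "gen_judgment.py"], check=True)

# 3. Aggregate and print the final Arena-Hard scores.
subprocess.run(["python", "show_result.py"], check=True)
```
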
OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
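
A hedged sketch of driving the harness from Python follows; `lm_eval.simple_evaluate` is the harness's standard entry point, while the model ID, vLLM arguments, and task selection are illustrative assumptions (the fork's Llama-3.1-style task variants have their own names).

```python
import lm_eval

# Hedged sketch: evaluate the quantized checkpoint through the harness's
# vLLM backend. Model ID, parallelism, task name, and shot count are
# assumptions; see the llama_3.1_instruct branch for the exact task definitions.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8,"
        "tensor_parallel_size=4"
    ),
    tasks=["gsm8k"],  # illustrative; the fork ships Llama-3.1-style variants
    num_fewshot=8,
    batch_size="auto",
)
print(results["results"])
```
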
HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
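
The sketch below follows EvalPlus's generate-then-evaluate pattern; the generation function is a stand-in for a real model call (such as the vLLM setup sketched earlier), and scoring is done afterwards with the EvalPlus CLI.

```python
# Hedged sketch of the EvalPlus flow: generate one completion per HumanEval+
# problem, write them to JSONL, then score with the CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
from evalplus.data import get_human_eval_plus, write_jsonl

def generate(prompt: str) -> str:
    # Stand-in for a real model call; returning the prompt unmodified keeps
    # the sketch runnable but will of course score zero.
    return prompt

samples = [
    dict(task_id=task_id, solution=generate(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```
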
Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
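
These outputs can be pulled with the `datasets` library, as sketched below; the configuration and split layout are assumptions, so consult each dataset card for the exact structure.

```python
from datasets import load_dataset

# Hedged sketch: load the raw Arena-Hard generations for this model family.
# Config/split names are assumptions; check the dataset card for the layout.
ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
print(ds)
```
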
**Note:** Results have been updated after Meta modified the chat template.