---
language:
- en
pipeline_tag: text-generation
---
# Meta-Llama-3-70B-Instruct-quantized.w8a16

## Model Overview
- Model Architecture: Meta-Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Quantized: INT8 weights
- Release Date: 7/2/2024
- Version: 1.0
- Model Developers: Neural Magic
Quantized version of Meta-Llama-3-70B-Instruct. It achieves an average score of 77.90% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 79.18%.
## Model Optimizations
This model was obtained by quantizing the weights of Meta-Llama-3-70B-Instruct to the INT8 data type. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating-point representations of the quantized weights. AutoGPTQ is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
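To illustrate the scheme described above, here is a minimal NumPy sketch of symmetric per-channel INT8 weight quantization: one scale per output row maps the row's largest-magnitude weight to ±127. The function names are illustrative only; the released model was produced with AutoGPTQ, which additionally uses calibration data to minimize quantization error, not this naive rounding.

```python
import numpy as np

def quantize_w8a16(weight):
    """Symmetric per-channel INT8 quantization of a linear layer's weight.

    weight: (out_features, in_features) float array.
    Returns (q, scale) with q in INT8 and one scale per output channel,
    such that weight ~= q * scale.
    """
    # One scale per output row, chosen so the row's max |weight| maps to 127.
    max_abs = np.abs(weight).max(axis=1, keepdims=True)   # (out_features, 1)
    scale = max_abs / 127.0
    scale = np.where(scale == 0, 1.0, scale)              # guard all-zero rows
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover a floating-point approximation of the original weight."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_w8a16(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step per row.
print(np.abs(w - w_hat).max())
```

Because the scale is linear per output dimension, dequantization is a single element-wise multiply, which is why the memory savings come at little accuracy cost.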
## Evaluation
The model was evaluated with the lm-evaluation-harness using the vLLM engine.
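A command of the shape below shows how such an evaluation can be launched with lm-evaluation-harness on top of vLLM (the tensor-parallel size is illustrative, and exact task names and flags depend on the installed harness version):

```shell
# Example: ARC-Challenge, 25-shot, with the model served through vLLM
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",tensor_parallel_size=4 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size auto
```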
### Accuracy
**Open LLM Leaderboard evaluation scores**

| Benchmark | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model) |
| :-- | :-- | :-- |
| ARC-c (25-shot) | 72.44% | 71.59% |
| HellaSwag (10-shot) | 85.54% | 85.65% |
| MMLU (5-shot) | 80.18% | 78.69% |
| TruthfulQA (0-shot) | 62.92% | 61.94% |
| Winogrande (5-shot) | 83.19% | 83.11% |
| GSM8K (5-shot) | 90.83% | 86.43% |
| **Average Accuracy** | **79.18%** | **77.90%** |
| **Recovery** | 100% | 98.38% |