---
language:
- en
pipeline_tag: text-generation
---
# Meta-Llama-3-8B-Instruct-quantized.w8a16
## Model Overview
- **Model Architecture:** Meta-Llama-3
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Quantized:** INT8 weights
- **Release Date:** 7/2/2024
- **Version:** 1.0
- **Model Developers:** Neural Magic
Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
It achieves an average score of 68.54% on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 68.69%.
## Model Optimizations
This model was obtained by quantizing the weights of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to INT8 data type.
Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a single linear scale per output channel maps between the INT8 and floating-point representations of the quantized weights.
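For intuition, here is a minimal PyTorch sketch of round-to-nearest symmetric per-channel INT8 quantization. This is an illustration only, not the code used to produce this checkpoint; GPTQ additionally updates the remaining weights to minimize layer-wise reconstruction error rather than rounding naively:

```python
import torch

def quantize_per_channel_symmetric(w: torch.Tensor):
    # One scale per output channel (row): s = max(|w|) / 127, so that w ≈ s * q.
    scales = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scales), min=-127, max=127).to(torch.int8)
    return q, scales

w = torch.randn(4096, 4096)           # a linear layer's [out, in] weight matrix
q, scales = quantize_per_channel_symmetric(w)
w_hat = q.float() * scales            # dequantized weights used at inference
max_err = (w - w_hat).abs().max()     # worst-case per-element rounding error
```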
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
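For reference, a quantization run along these lines could be set up with AutoGPTQ as sketched below. The calibration text, `group_size=-1` (one scale per output channel), and the output path are illustrative assumptions, not the exact recipe used for this model:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# bits=8 gives INT8 weights; group_size=-1 uses one scale per output channel;
# sym=True selects symmetric quantization, matching the scheme described above.
quantize_config = BaseQuantizeConfig(bits=8, group_size=-1, sym=True, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ is calibration-based: sample prompts drive the layer-wise weight
# reconstruction. One sentence is enough for a demo, not for real quality.
examples = [tokenizer("Large language models can be compressed with GPTQ.")]

model.quantize(examples)
model.save_quantized("Meta-Llama-3-8B-Instruct-quantized.w8a16")
```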
## Evaluation
The model was evaluated with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
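A run along these lines can be reproduced through the harness's Python API, as sketched below. The `openllm` task group (available in recent harness versions) bundles the six OpenLLM v1 benchmarks with their standard few-shot settings; the repo id and engine arguments are assumptions for illustration:

```python
from lm_eval import simple_evaluate

# Assumed repo id for these quantized weights; adjust to the actual location.
results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16,"
        "dtype=auto,gpu_memory_utilization=0.8,max_model_len=4096"
    ),
    tasks=["openllm"],
    batch_size="auto",
)
print(results["results"])
```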
### Accuracy
#### Open LLM Leaderboard evaluation scores
| Benchmark | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | Meta-Llama-3-8B-Instruct-quantized.w8a16 (this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
| arc-c (25-shot) | 62.63% | 61.52% |
| hellaswag (10-shot) | 78.81% | 78.69% |
| mmlu (5-shot) | 66.54% | 66.55% |
| truthfulqa (0-shot) | 52.49% | 52.60% |
| winogrande (5-shot) | 76.48% | 76.01% |
| gsm8k (5-shot) | 75.21% | 75.89% |
| **Average Accuracy** | **68.69%** | **68.54%** |
| **Recovery** | **100%** | **99.78%** |