---
tags:
- vllm
- sparsity
- quantization
- int4
pipeline_tag: text-generation
license: llama3.1
base_model: neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4
datasets:
- theblackcat102/evol-codealpaca-v1
language:
- en
---

# Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16

## Model Overview
- **Model Architecture:** Llama-3.1-8B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Sparsity:** 2:4
  - **Weight quantization:** INT4
- **Release Date:** 11/21/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

This is a code completion AI model obtained by fine-tuning the 2:4 sparse [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) on the [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) dataset, followed by quantization.
On the [HumanEval](https://arxiv.org/abs/2107.03374) benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model [Llama-3.1-8B-evolcodealpaca](https://huggingface.co/neuralmagic/Llama-3.1-8B-evolcodealpaca), demonstrating over **100% accuracy recovery**.

### Model Optimizations

This model was obtained by quantizing the weights of [Sparse-Llama-3.1-8B-evolcodealpaca-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
That is on top of the 50% reduction in weights from the 2:4 pruning applied to [Sparse-Llama-3.1-8B-evolcodealpaca-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4).

Only the weights of the linear operators within transformer blocks are quantized.
Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps between the INT4 and floating-point representations of the quantized weights.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Deployment with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Evaluation

This model was evaluated on Neural Magic's fork of [EvalPlus](https://github.com/neuralmagic/evalplus).

### Accuracy
#### HumanEval Benchmark

| Metric | Llama-3.1-8B-evolcodealpaca | Sparse-Llama-3.1-8B-evolcodealpaca-2of4 | Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16 |
|---|---|---|---|
| HumanEval pass@1 | 48.5 | 49.1 | 50.6 |
| HumanEval+ pass@1 | 44.2 | 46.3 | 48.0 |
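
The "accuracy recovery" figure quoted in the overview can be checked against the table above. The sketch below assumes recovery means the optimized model's pass@1 as a percentage of the dense fine-tuned model's pass@1; the card does not define the term, and the names `scores` and `recovery` are illustrative only.

```python
# Pass@1 scores copied from the table above.
scores = {
    "HumanEval": {"dense": 48.5, "sparse": 49.1, "sparse_w4a16": 50.6},
    "HumanEval+": {"dense": 44.2, "sparse": 46.3, "sparse_w4a16": 48.0},
}


def recovery(optimized: float, baseline: float) -> float:
    """Return the optimized score as a percentage of the baseline score.

    Assumption: this ratio is what "accuracy recovery" refers to.
    """
    return 100.0 * optimized / baseline


# Print the recovery of each optimized model relative to the dense model.
for bench, row in scores.items():
    for name in ("sparse", "sparse_w4a16"):
        print(f"{bench:<11} {name:<13} {recovery(row[name], row['dense']):.1f}%")
```

Under that assumption, every optimized variant in the table scores above 100% of the dense baseline on both HumanEval and HumanEval+.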