Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16

Model Overview

  • Model Architecture: Llama-3.1-8B
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Sparsity: 2:4
    • Weight quantization: INT4
  • Release Date: 11/21/2024
  • Version: 1.0
  • License(s): llama3.1
  • Model Developers: Neural Magic

This is a code-completion AI model obtained by fine-tuning the 2:4-sparse Sparse-Llama-3.1-8B-2of4 on the evol-codealpaca-v1 dataset, followed by quantization. On the HumanEval benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model Llama-3.1-8B-evolcodealpaca, demonstrating over 100% accuracy recovery.

Model Optimizations

This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-evolcodealpaca-2of4 to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%. That saving comes on top of the 50% reduction in weights already achieved through the 2:4 pruning applied in Sparse-Llama-3.1-8B-evolcodealpaca-2of4.
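As a rough back-of-the-envelope illustration (the numbers below are assumptions; real checkpoint sizes also include embeddings, quantization scales, and 2:4 sparsity metadata), the combined effect of pruning and quantization on weight storage can be estimated as follows:

```python
# Back-of-the-envelope weight-storage estimate (illustrative only; real
# sizes also include embeddings, scales, and 2:4 sparsity metadata).
params = 8e9              # ~8B parameters in Llama-3.1-8B
fp16_bytes = params * 2   # FP16: 16 bits = 2 bytes per parameter

sparse_params = params * 0.5       # 2:4 pruning keeps 2 of every 4 weights
w4a16_bytes = sparse_params * 0.5  # INT4: 4 bits = 0.5 bytes per weight

print(f"Dense FP16:      ~{fp16_bytes / 1e9:.0f} GB")   # ~16 GB
print(f"2:4 sparse INT4: ~{w4a16_bytes / 1e9:.0f} GB")  # ~2 GB
print(f"Total reduction: ~{1 - w4a16_bytes / fp16_bytes:.0%}")  # ~88%
```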

Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps between the INT4 and floating-point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
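A minimal sketch of such a quantization run with llm-compressor is shown below. The exact recipe Neural Magic used is not published in this card, so the calibration dataset, sample budget, and modifier arguments here are illustrative assumptions:

```python
# Illustrative llm-compressor sketch (not the exact recipe used for this
# model): apply GPTQ W4A16 quantization to the linear operators of the
# already-pruned checkpoint, leaving lm_head unquantized.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",    # linear operators within transformer blocks
    scheme="W4A16",      # INT4 weights, 16-bit activations
    ignore=["lm_head"],  # skip the output head
)

oneshot(
    model="neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4",
    dataset="open_platypus",      # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration settings
    num_calibration_samples=512,
    output_dir="Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16",
)
```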

Deployment with vLLM

This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving; see the documentation for more details.
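A minimal offline-generation sketch with vLLM follows; the prompt and sampling parameters are illustrative:

```python
# Minimal vLLM offline-inference sketch; prompt and sampling settings
# are illustrative.
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16"
llm = LLM(model=model_id)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = "Write a Python function that returns the n-th Fibonacci number."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, the same model can be launched with `vllm serve neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16`.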

Evaluation

This model was evaluated on Neural Magic's fork of EvalPlus.
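Neural Magic's exact harness configuration is not reproduced in this card; as a rough sketch using upstream EvalPlus's documented Python interface, sample generation might look like the following (the fork's interface may differ, and decoding settings are assumed):

```python
# Illustrative sketch: generate HumanEval+ samples with vLLM and collect
# them in the JSONL format EvalPlus expects.
from evalplus.data import get_human_eval_plus, write_jsonl
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16")
params = SamplingParams(temperature=0.0, max_tokens=512)

samples = []
for task_id, problem in get_human_eval_plus().items():
    completion = llm.generate([problem["prompt"]], params)[0].outputs[0].text
    samples.append(dict(task_id=task_id, solution=problem["prompt"] + completion))

write_jsonl("samples.jsonl", samples)
```

Scoring pass@1 on HumanEval and HumanEval+ is then a matter of running EvalPlus's evaluator over `samples.jsonl`.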

Accuracy

HumanEval Benchmark

| Metric | Llama-3.1-8B-evolcodealpaca | Sparse-Llama-3.1-8B-evolcodealpaca-2of4 | Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16 |
| ------ | --------------------------- | ---------------------------------------- | -------------------------------------------------------- |
| HumanEval pass@1 | 48.5 | 49.1 | 50.6 |
| HumanEval+ pass@1 | 44.2 | 46.3 | 48.0 |
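The accuracy-recovery figures quoted above follow directly from these numbers:

```python
# Recovery of the quantized sparse model relative to the dense fine-tuned
# baseline, computed from the table above.
dense = {"HumanEval pass@1": 48.5, "HumanEval+ pass@1": 44.2}
w4a16 = {"HumanEval pass@1": 50.6, "HumanEval+ pass@1": 48.0}

for metric in dense:
    print(f"{metric}: {w4a16[metric] / dense[metric]:.1%} recovery")
# HumanEval pass@1: 104.3% recovery
# HumanEval+ pass@1: 108.6% recovery
```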