Meta-Llama-3.1-405B-Instruct-GGUF

Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, quantized from the Ollama q4_0 GGUF.

Quantized with llama.cpp b3449.

Quant    Notes
BF16     Brain floating point, very high quality, same size as F16
Q8_0     8-bit quantization, high quality, larger size
Q6_K     6-bit quantization, very good quality-to-size ratio
Q5_K     5-bit quantization, good balance of quality and size
Q5_0     Alternative 5-bit quantization, slightly different balance
Q4_K_M   4-bit quantization, good for production use
Q4_K_S   4-bit quantization, faster inference, efficient for scaling
Q4_0     Basic 4-bit quantization, good for experimentation
Q3_K_L   3-bit quantization, high quality, needs more VRAM
Q3_K_M   3-bit quantization, good balance between speed and accuracy
Q3_K_S   3-bit quantization, faster inference with minor quality loss
Q2_K     2-bit quantization, suitable for general inference tasks
IQ2_S    2-bit i-quant, optimized for small VRAM environments
IQ2_XXS  2-bit i-quant, best for ultra-low memory footprint
IQ1_M    1-bit i-quant, usable
IQ1_S    1-bit i-quant, not recommended
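
As a rough guide to what these formats mean at this scale, file size can be estimated as parameters × bits per weight / 8. The sketch below is not part of the original card: the function name is hypothetical, and the bits-per-weight figures are nominal assumptions (Q8_0, Q5_0 and Q4_0 follow from their 32-weight block layout with one 16-bit scale; the IQ figures are the ones given in llama.cpp's quantize help text). Real files will differ somewhat, because the quantization mixes keep some tensors at higher precision.

```python
# Parameter count as reported for these GGUF files.
N_PARAMS = 410e9

# Nominal bits per weight. These are assumed, approximate values.
NOMINAL_BPW = {
    "BF16": 16.0,
    "Q8_0": 8.5,     # 32 x 8-bit weights + 16-bit scale per block
    "Q5_0": 5.5,     # 32 x 5-bit weights + 16-bit scale per block
    "Q4_0": 4.5,     # 32 x 4-bit weights + 16-bit scale per block
    "IQ2_S": 2.5,
    "IQ2_XXS": 2.06,
    "IQ1_M": 1.75,
    "IQ1_S": 1.56,
}


def estimate_size_gb(bpw, n_params=N_PARAMS):
    """Estimated file size in decimal gigabytes: params * bits / 8 bytes."""
    return n_params * bpw / 8 / 1e9


if __name__ == "__main__":
    for name, bpw in NOMINAL_BPW.items():
        print(f"{name:8s} {bpw:5.2f} bpw  ~{estimate_size_gb(bpw):6.1f} GB")
```

The estimate covers weights only; the KV cache and runtime buffers need additional memory on top of it.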

For higher quality quantizations (q4+), please refer to nisten/meta-405b-instruct-cpu-optimized-gguf.

Regarding the smaug-bpe tokenizer: this makes no difference, since smaug-bpe and llama-bpe are identical. If you are still concerned, you can set the pre-tokenizer to llama-bpe with the following command:

./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
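
To see which pre-tokenizer a file currently declares, before or after running the command above, the GGUF metadata header can be read directly. The following is a minimal sketch rather than an official tool: it assumes a GGUF v2 or v3 file with little-endian layout and assumes the pre-tokenizer is stored under the `tokenizer.ggml.pre` key. The function names are hypothetical.

```python
import struct

# GGUF scalar value types mapped to struct format characters.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format character."""
    size = struct.calcsize("<" + fmt)
    return struct.unpack("<" + fmt, f.read(size))[0]


def _read_string(f):
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF value type."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_pre_tokenizer(path, key="tokenizer.ggml.pre"):
    """Return the value stored under `key`, or None if the key is absent."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "I")
        if version < 2:
            raise ValueError("GGUF v1 uses a different layout")
        _read(f, "Q")                 # tensor count (unused here)
        kv_count = _read(f, "Q")      # number of metadata key/value pairs
        for _ in range(kv_count):
            name = _read_string(f)
            value = _read_value(f, _read(f, "I"))
            if name == key:
                return value
    return None
```

Calling `read_pre_tokenizer("Llama-3.1-405B-Instruct-old.gguf")` should return a string such as `smaug-bpe` or `llama-bpe`. The scan may take a while if the key comes after the token list, since every earlier entry is parsed in pure Python.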

imatrix

The imatrix was generated from the Q2_K quant.

imatrix calibration data: groups_merged.txt

Format: GGUF
Model size: 410B params
Architecture: llama