πŸš€ CPU optimized quantizations of Meta-Llama-3.1-405B-Instruct πŸ–₯️

This repository contains CPU-optimized GGUF quantizations of the Meta-Llama-3.1-405B-Instruct model. They are designed to run efficiently on CPU hardware while preserving as much output quality as possible.

Available Quantizations

  1. Q4_0_4_8 (CPU FMA-optimized): ~246 GB
  2. IQ4_XS (fastest for CPU/GPU): ~212 GB
  3. Q2K-Q8 mixed quant with iMatrix: ~154 GB
  4. Q2K-Q8 mixed quant without iMatrix (for testing): ~165 GB
  5. 1-bit custom per-weight coherent quant: ~103 GB
  6. BF16: ~811 GB (full-precision original)
  7. Q8_0: ~406 GB (near-lossless reference)
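As a sanity check on these sizes, you can estimate the effective bits per weight of each quant from its file size (size in bytes × 8 / 405e9 parameters). A rough sketch using the approximate GB totals listed above:

```shell
# Effective bits per weight ~= file_size_bytes * 8 / parameter_count.
# Sizes are the approximate GB totals from the list above.
awk 'BEGIN {
  params = 405e9
  m = split("Q4_0_4_8:246 IQ4_XS:212 Q2K-Q8-imat:154 Q2K-Q8:165 1-bit:103", q, " ")
  for (i = 1; i <= m; i++) {
    split(q[i], a, ":")
    printf "%-12s ~%.2f bits/weight\n", a[1], a[2] * 1e9 * 8 / params
  }
}'
```

Note that by this arithmetic the "1-bit" quant averages out to roughly 2 bits per weight, which is consistent with it being a per-weight mixed quant rather than uniformly 1-bit.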

Use aria2 for parallelized downloads; with 16 connections per file, the links will download roughly 9x faster.

🐧 On Linux: sudo apt install -y aria2

🍎 On macOS: brew install aria2

Feel free to paste the commands below all at once or one at a time.
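Alternatively, you can generate the commands in a loop rather than pasting each one. A minimal sketch for the six Q4_0_4_8 shards below (it echoes the commands so you can review them first; drop the `echo` to actually download):

```shell
#!/bin/sh
# Prints the aria2c command for each Q4_0_4_8 shard; remove `echo` to run them.
BASE=https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main
for n in 1 2 3 4 5 6; do
  idx=$(printf '%05d' "$n")                     # zero-pad: 1 -> 00001
  f="meta-405b-inst-cpu-optimized-q4048-${idx}-of-00006.gguf"
  echo aria2c -x 16 -s 16 -k 1M -o "$f" "$BASE/$f"
done
```

The same pattern works for the other quants; just change the filename stem and shard count.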

Q4_0_4_8 (CPU FMA-optimized, built specifically for ARM server chips; NOT TESTED on x86)

aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf

IQ4_XS Version (fastest for CPU/GPU, should work everywhere; Size: ~212 GB)

aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00003-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00003-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00004-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00004-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00005-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00005-of-00005.gguf

1-bit Custom Per Weight Quantization (Size: ~103 GB)

aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00001-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00001-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00002-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00002-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00003-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00003-of-00003.gguf

Q2K-Q8 mixed 2-bit/8-bit quant I wrote myself; the smallest coherent one I could make WITHOUT an iMatrix (Size: ~165 GB)

aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf

Same as above but with a higher-quality iMatrix (Q2K-Q8, Size: ~154 GB). USE THIS ONE.

aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00004-of-00004.gguf
[Image: Q4_0_4_8 (CPU-optimized, 246 GB) example response to a 20,000-token prompt]

BF16 Version

aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00001-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00001-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00002-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00002-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00003-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00003-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00004-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00004-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00005-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00005-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00006-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00006-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00007-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00007-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00008-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00008-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00009-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00009-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00010-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00010-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00011-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00011-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00012-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00012-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00013-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00013-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00014-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00014-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00015-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00015-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00016-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00016-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00017-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00017-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00018-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00018-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00019-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00019-of-00019.gguf

Q8_0 Version

aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00001-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00001-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00002-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00002-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00003-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00003-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00004-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00004-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00005-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00005-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00006-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00006-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00007-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00007-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00008-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00008-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00009-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00009-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00010-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00010-of-00010.gguf

Usage

After downloading, you can use these models with llama.cpp: point it at the first shard and it will pick up the remaining split files automatically. Here's a basic example:

 ./llama-cli -t 32 --temp 0.4 -fa -m ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf -b 512 -c 9000 -p "Adopt the persona of a NASA JPL mathematician and friendly helpful programmer." -cnv -co -i
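Before loading, it's worth checking that the shards fit in RAM: llama.cpp memory-maps the files, but sustained CPU inference speed needs most of the model resident in memory. A rough pre-flight sketch, assuming the shards live in `~/meow` as in the example above (`free` is Linux-only):

```shell
# Total size of the downloaded shards vs. currently available memory (Linux).
du -ch ~/meow/meta-405b-inst-cpu-optimized-q4048-*.gguf 2>/dev/null | tail -1
free -h | awk '/^Mem:/ {print "available RAM:", $7}'
```

If available RAM is well below the total shard size, expect heavy paging and much lower tokens/second.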

Model Information

This model is based on the Meta-Llama-3.1-405B-Instruct model. It's an instruction-tuned version of the 405B parameter Llama 3.1 model, designed for assistant-like chat and various natural language generation tasks.

Key features:

  • 405 billion parameters
  • Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
  • 128k context length
  • Uses Grouped-Query Attention (GQA) for improved inference scalability

For more detailed information about the base model, please refer to the original model card.

License

The use of this model is subject to the Llama 3.1 Community License. Please ensure you comply with the license terms when using this model.

Acknowledgements

Special thanks to the Meta AI team for creating and releasing the Llama 3.1 model series.

Enjoy; more quants and perplexity benchmarks coming.
