Importance-Matrix quantizations of Mixtral-8x22B-v0.1 💫

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included).

I wrote a bit of custom C++ to avoid quantizing certain layers. Tested fully compatible with llama.cpp as of 10 April 2024.

To combine the parts into a single file (not needed with llama.cpp, which autodetects the chunks, but it can help when troubleshooting ollama):

cat mix4ns-0000* > mix4ns.gguf

Careful: this can take 5 minutes, or up to 10-15 on slow instances. Check progress with ls -la.
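Since the merge can run for several minutes, a small wrapper that checks the result may be useful. The following is only a sketch: `merge_parts` is a hypothetical helper name, not something shipped with llama.cpp or this repo. It concatenates the parts in glob order and confirms that the output size equals the sum of the part sizes.

```shell
# Hypothetical helper: merge split parts and verify the merged size.
# The output name must not match the part prefix, or it would be picked up by the glob.
merge_parts() {
    local prefix="$1" out="$2"
    local expected=0 actual part
    # Globs expand in sorted order, so zero-padded part numbers stay in sequence.
    for part in "$prefix"*; do
        [ -f "$part" ] || { echo "no parts match $prefix*" >&2; return 1; }
        expected=$(( expected + $(wc -c < "$part") ))
    done
    cat "$prefix"* > "$out" || return 1
    actual=$(( $(wc -c < "$out") ))
    if [ "$actual" -ne "$expected" ]; then
        echo "size mismatch: got $actual bytes, expected $expected" >&2
        return 1
    fi
    echo "merged $actual bytes into $out"
}

# Example: merge_parts mix4ns-0000 mix4ns.gguf
```

A size check only catches truncated output (for example a full disk); it does not validate the GGUF contents.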

Run with llama.cpp

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m ~/mix4ns-00001-of-00005.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbits?"

Perplexity benchmarks

This is the command I used to run these on a 48-core, CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to the GPU.

./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48

The results are interesting. Converting from the hf-bf16 folder to an f16 gguf adds a bit of loss (increases perplexity). I've noticed on smaller models that going straight from the Hugging Face repo folder to 8-bit with python convert.py --outtype q8_0 produces lower perplexity than going hf-f16-q8_0. What's even more interesting is that quantizing TWICE (hf-q8_0 and then q8_0-imatrix) also produces better perplexity than the regular f16 gguf to imatrix route.

All you need to pay attention to is the final value: PPL = 2.2585 in this case, which is that of a regular 8-bit quant.
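If the perplexity output is saved to a file, the final value can be pulled out with a one-line filter. This is a minimal sketch: `final_ppl` and the `ppl.log` file name are hypothetical, and the pattern assumes the "Final estimate: PPL = ..." line format shown in the results on this page.

```shell
# Hypothetical helper: print only the final perplexity value from saved output.
# Reads the named files, or stdin when no file is given.
final_ppl() {
    # Matches lines such as "Final estimate: PPL = 2.2585 +/- 0.06534"
    sed -n 's/^Final estimate: PPL = \([0-9.]*\).*/\1/p' "$@"
}

# Example:
#   ./perplexity -m ~/mix4xs.gguf -f wiki.test.raw --chunks 12 -t 48 2>&1 | tee ppl.log
#   final_ppl ppl.log
```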

NOT ALL 8 BIT ARE CREATED EQUAL: this one took 9 hours to convert to 8-bit on a 64-core CPU with 256GB RAM (8-channel DDR5).


Even though the file is a tiny bit slower, it gets a tiny bit lower perplexity. Over 12 chunks the difference looks like nothing (2.2584 for mix8ns vs 2.2585 for regular q8_0), but past testing on smaller models with 100+ chunks has shown this difference to be a bit more pronounced.
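To see how small that gap is relative to the reported +/- values in the two q8_0 runs listed below, the numbers can be compared directly. This is a rough sketch only: `ppl_gap` is a hypothetical helper, and combining the two error estimates as if they were independent is a simplifying assumption (both runs use the same test text).

```shell
# Hypothetical helper: compare two "PPL +/- err" results.
# Usage: ppl_gap PPL_A ERR_A PPL_B ERR_B
ppl_gap() {
    awk -v a="$1" -v ea="$2" -v b="$3" -v eb="$4" 'BEGIN {
        diff = a - b; if (diff < 0) diff = -diff
        err = sqrt(ea * ea + eb * eb)   # combined error, ASSUMING independent estimates
        printf "diff=%.4f combined_err=%.4f ratio=%.4f\n", diff, err, diff / err
    }'
}

# Regular q8_0 (from f16) vs q8_0 converted straight from hf:
ppl_gap 2.2585 0.06534 2.2584 0.06514
```

At 12 chunks the gap is far below the reported uncertainty, which is why runs with more chunks are needed before reading much into it.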

perplexity regular q8_0 (from f16): 126.35 seconds per pass - ETA 6.32 minutes
[1]2.6256,[2]3.1043,[3]3.6463,[4]3.2092,[5]2.6847,[6]2.4791,[7]2.3112,[8]2.2502,[9]2.2858,[10]2.2690,[11]2.2693,[12]2.2585,
Final estimate: PPL = 2.2585 +/- 0.06534

perplexity q8_0 (slow convert.py from hf): 96.86 seconds per pass - ETA 4.83 minutes
[1]2.6191,[2]3.1045,[3]3.6551,[4]3.2302,[5]2.6990,[6]2.4908,[7]2.3167,[8]2.2541,[9]2.2877,[10]2.2682,[11]2.2685,[12]2.2584,
Final estimate: PPL = 2.2584 +/- 0.06514

perplexity regular iq4_xs (no imatrix): 91.53 seconds per pass 
[1]2.6966,[2]3.1749,[3]3.6972,[4]3.2577,[5]2.7905,[6]2.6097,[7]2.4536,[8]2.4001,[9]2.4469,[10]2.4219,[11]2.4366,[12]2.4367,
Final estimate: PPL = 2.4367 +/- 0.07218

perplexity regular q4_km (no imatrix): 108.59 seconds per pass 
[1]2.6100,[2]3.1304,[3]3.6897,[4]3.3500,[5]2.8118,[6]2.5992,[7]2.4349,[8]2.3816,[9]2.4174,[10]2.3959,[11]2.3988,[12]2.3976,
Final estimate: PPL = 2.3976 +/- 0.07111

perplexity EdgeQuant iq4-ns (no imatrix) 84.45 seconds per pass - FILESIZE 77258 MB 
[1]2.7195,[2]3.1821,[3]3.7177,[4]3.3017,[5]2.8012,[6]2.6034,[7]2.4318,[8]2.3747,[9]2.4160,[10]2.3931,[11]2.4023,[12]2.4013,
Final estimate: PPL = 2.4013 +/- 0.07116

perplexity EdgeQuant iq4-ns (WITH imatrix) 82.76 seconds per pass - FILESIZE 73636 MB ( mix4ns.gguf ) //BEST ONE FOR 80GB CARD
[1]2.7166,[2]3.1720,[3]3.6988,[4]3.3195,[5]2.7949,[6]2.5862,[7]2.4186,[8]2.3621,[9]2.3981,[10]2.3876,[11]2.3971,[12]2.3973,
Final estimate: PPL = 2.3973 +/- 0.07080

perplexity EdgeQuant mix3ns (WITH imatrix) FILESIZE 60826 MB //BEST ONE FOR 64GB MACHINE
[1]2.7921,[2]3.2356,[3]3.8254,[4]3.3874,[5]2.9992,[6]2.8053,[7]2.7000,[8]2.6565,[9]2.7085,[10]2.7248,[11]2.7627,[12]2.7589,
Final estimate: PPL = 2.7589 +/- 0.08399

perplexity 2K (no imatrix) 207.70 seconds per pass - FILESIZE 47564MB (mix2k-noimatrix-but-usable-reference.gguf)
[1]2.9401,[2]3.4224,[3]4.0174,[4]3.8503,[5]3.5607,[6]3.4449,[7]3[9]3.5589,[10]3.6546,[11]3.7810,[12]3.7733,
Final estimate: PPL = 3.7733 +/- 0.13299

perplexity EdgeQuant mix2ns (WITH imatrix) FILESIZE 44024 MB //BEST ONE FOR 48GB LIMIT
[1]2.9890,[2]3.4809,[3]4.0181,[4]4.1660,[5]4.0785,[6]3.9915,[7]4.0004,[8]3.9970,[9]4.0762,[10]4.1886,[11]4.3717,[12]4.3661,
Final estimate: PPL = 4.3661 +/- 0.16065
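The "BEST ONE FOR ..." notes above can be turned into a small lookup. This is a sketch under stated assumptions: `pick_quant` and `HEADROOM_MB` are hypothetical, the 2048 MB headroom is an illustrative allowance rather than a measured figure, and only the three imatrix builds with a recommendation above are listed.

```shell
# Sizes in MB are copied from the EdgeQuant imatrix results above.
# HEADROOM_MB is a hypothetical allowance for context/KV cache, not a measured value.
HEADROOM_MB=2048

pick_quant() {
    # $1: memory budget in GB (1 GB is taken as 1024 MB here)
    awk -v budget="$(( $1 * 1024 - HEADROOM_MB ))" '
        $2 <= budget && $2 > best { best = $2; name = $1 }
        END { if (name != "") print name, best; else print "none" }
    ' <<'EOF'
mix4ns 73636
mix3ns 60826
mix2ns 44024
EOF
}

pick_quant 64   # prints: mix3ns 60826
```

Real memory use also depends on context length and on how many layers are offloaded, so treat the result as a starting point.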
    


The command to run these was:

./main -m mix4ns.gguf -n 256 -t 48 --temp 0.5 --color -p "How to build a city on mars via shipping through aldrin cycler orbits?"

Format: GGUF. Model size: 141B params. Architecture: llama.