My alternate quantizations.

#3 by ZeroWw - opened

These are my own quantizations (updated almost daily).

The difference from normal quantizations is that I quantize the output and embedding tensors to f16,
and the other tensors to q5_k, q6_k or q8_0.
This creates models with little or no degradation and a smaller size.
They run at about 3-6 t/s on CPU only using llama.cpp,
and obviously faster on computers with potent GPUs.
ALL the models were quantized in this way:
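As a rough sketch, a llama-quantize invocation of this shape matches the recipe described above; the file names are placeholders, and you would swap q6_k for q5_k or q8_0 to get the other variants.

```sh
# Sketch of the described recipe, assuming a recent llama.cpp build and an
# f16 GGUF conversion of the model (file names here are placeholders).
# Output and token-embedding tensors stay at f16; all other tensors go to q6_k.
./llama-quantize --allow-requantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model.f16.gguf model.f16.q6_k.gguf q6_k
```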

Fascinating. Llama.cpp can't even load the 12B for me:

llama_model_load: error loading model: error loading model hyperparameters: invalid n_rot: 128, expected 160

No idea. I checked them before posting them; they were even working inside Colab.
If you have problems, try the 8B. If that works, I will check the 12B again (but I am sure it worked).
NOTE: You must update llama.cpp to the very latest version for these to work.
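For reference, a typical way to update a source build of llama.cpp is to pull the latest code and rebuild with CMake, roughly as below (add a backend flag such as -DGGML_CUDA=ON if you want a GPU build; exact flags depend on your setup).

```sh
# Update and rebuild llama.cpp from source (standard CMake flow).
git pull
cmake -B build
cmake --build build --config Release
```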

Yes, sorry, I forgot to update this. I ran a pre-check before doing the imatrix calculations, and that pre-check used an outdated llama.cpp version.

Do I have to use llama.cpp for this, or can I use SillyTavern with Tabby or Ooba to load the 12B one fine? (I have a 24GB card)

Any updated version that supports GGUF.
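If you do go through llama.cpp itself, one option is its built-in server (llama-server), which exposes an API that frontends such as SillyTavern can connect to; the model path, context size and GPU-layer count below are placeholders for your own setup.

```sh
# Example llama-server launch: -m picks the GGUF file, -ngl offloads
# layers to the GPU, -c sets the context size. Values are placeholders.
./llama-server -m model.f16.q6_k.gguf -ngl 99 -c 8192 --port 8080
```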
