Please check these quantizations.
I don't have enough resources to run all tests, but I came up with a slightly different way to quantize models.
As you will see, the f16.q6 and f16.q5 are smaller than the q8_0 and very similar to the pure f16.
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF/tree/main
These are my own quantizations (updated almost daily).
This is how I did it:
echo Quantizing f16/q5
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q5.gguf q5_k $(nproc) &>/dev/null
echo Quantizing f16/q6
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q6.gguf q6_k $(nproc) &>/dev/null
echo Quantizing q8_0
./build/bin/llama-quantize --allow-requantize --pure ${model_name}.f16.gguf ${model_name}.q8.gguf q8_0 $(nproc) &>/dev/null
I quantized the output and embedding tensors to f16 and the remaining tensors to q5_k or q6_k.
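For anyone who wants to apply the same recipe to another model, here is a sketch of the commands above wrapped in a small shell function (the function name is mine; it assumes llama-quantize is built under ./build/bin and that ${model_name}.f16.gguf already exists):
# Hypothetical helper wrapping the commands above; base_quant is e.g. q5_k or q6_k.
quantize_mixed() {
    local base_quant=$1
    ./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 "${model_name}.f16.gguf" "${model_name}.f16.${base_quant%_k}.gguf" "${base_quant}" "$(nproc)" &>/dev/null
}
quantize_mixed q5_k   # produces ${model_name}.f16.q5.gguf
quantize_mixed q6_k   # produces ${model_name}.f16.q6.gguf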
If someone could test them more thoroughly, that would be great.
P.S.
Even the f16/q5 is not that different from the pure f16, and it's way better than the q8_0.
Please start posting some side-by-side comparisons. We really need to see how the model output differs; there's no sense asking for it everywhere without proof that there's a difference.
@bartowski as I said, I test the models by chatting with them. I have no equipment (not even a decent GPU) to do any kind of testing... but many people here can...
@ZeroWw Sure, but can you do a chat with one and then a chat with the other with the exact same prompt and show the results? Otherwise just saying "this chat is better" is a bit useless. Not to take anything away from it: I've been releasing with the new --output-tensor-type f16 --token-embedding-type f16 and have a bunch of models up with those quants, but no concrete feedback yet that they're better.
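For reference, a minimal sketch of such a same-prompt comparison (the file names, prompt, and sampling settings below are only examples; it assumes llama-cli is built under ./build/bin and both GGUF files are present):
# Same prompt, same seed, greedy sampling, two different quants; then compare the outputs.
PROMPT="Explain the difference between TCP and UDP in two sentences."
./build/bin/llama-cli -m Mistral-7B-Instruct-v0.3.f16.q6.gguf -p "$PROMPT" -n 256 --temp 0 --seed 42 > out.f16q6.txt
./build/bin/llama-cli -m Mistral-7B-Instruct-v0.3.q8.gguf -p "$PROMPT" -n 256 --temp 0 --seed 42 > out.q8.txt
diff out.f16q6.txt out.q8.txt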
I have very, very limited resources... imagine that I made all those quants on Google Colab :D
Just test any of the models in my profile (they are all quantized this way) and you will notice that f16/q6 is (imho) almost indistinguishable from the pure f16 at almost half the size.
Also with f16/q5 I don't notice any particular degradation... I only ran a few perplexity tests (a rough sketch of such a run is below).
Most people on this site have better compute resources than I do.
I just try to optimize things until the trade-off is fair.
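For anyone who wants to repeat that perplexity check, here is a rough sketch, assuming llama-perplexity is built under ./build/bin and a wikitext-style test file such as wiki.test.raw is available locally:
# Compare perplexity across the quants; lower is better, fewer chunks is faster but noisier.
for m in "${model_name}.f16.gguf" "${model_name}.f16.q6.gguf" "${model_name}.f16.q5.gguf" "${model_name}.q8.gguf"; do
    echo "== $m =="
    ./build/bin/llama-perplexity -m "$m" -f wiki.test.raw --chunks 32
done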