I have a few questions about quantized model quality.
Thank you so much for this amazing model! I'd like to ask some questions.
Is your 'Meta-Llama-3-8B-Instruct-GGUF' model simply converted into GGUF format from Meta's original 'Meta-Llama-3-8B-Instruct', or is there any additional tuning or extra processing?
I wonder:
Will the model quality be exactly the same as yours if I convert 'Meta-Llama-3-8B-Instruct' into GGUF format myself?
I've heard that quantized models often show better quality than the original model. If so, is the 'Meta-Llama-3-8B-Instruct-GGUF' model better than the original 'Meta-Llama-3-8B-Instruct', or is it just the same?
Thank you so much for your effort.
If anyone knows the answer, help me out, plz!
The regular GGUF quants are just static quantizations. However, the IQuants are not; they are much more aggressive and require more work and more complex techniques to achieve the same levels of coherence and capability that the static quants achieve. For this, IQuants use an imatrix. An imatrix is a “map” of all of a model’s activations over a text corpus (such as wikitext-raw). During the quantization process, a pretrained imatrix can be used to help guide the quantization so that the model retains its coherence and abilities. However, it’s important to note and stress: the smaller the model becomes due to aggressive quantization, the more likely it is that the model will be less coherent and less capable overall. I hope this helps. 😁
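For concreteness, here is a minimal sketch of how an imatrix is typically generated and then applied during quantization with llama.cpp's CLI tools. The binary names (`imatrix`, `quantize`, renamed `llama-imatrix`/`llama-quantize` in newer builds), file names, and paths are assumptions, not exactly what was used for this repo:

```python
# Sketch of the imatrix -> quantize workflow using llama.cpp's CLI tools.
# Assumptions: the binaries have been built from llama.cpp and an f16 GGUF
# of the model already exists. Tool names/flags may differ by version.
import subprocess

F16_GGUF = "Meta-Llama-3-8B-Instruct-f16.gguf"   # unquantized GGUF conversion
CORPUS   = "groups_merged.txt"                    # calibration text (or wikitext-raw)
IMATRIX  = "imatrix.dat"

# 1) Run the model over the calibration corpus and record activation statistics.
subprocess.run(
    ["./imatrix", "-m", F16_GGUF, "-f", CORPUS, "-o", IMATRIX],
    check=True,
)

# 2) Quantize aggressively, letting the importance matrix guide which weights to preserve.
subprocess.run(
    ["./quantize", "--imatrix", IMATRIX, F16_GGUF,
     "Meta-Llama-3-8B-Instruct-IQ3_XS.gguf", "IQ3_XS"],
    check=True,
)
```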
Converting the model to GGUF and quantizing it yourself will yield exactly the same GGUF’ed model. There will be no difference between yours and this repo’s GGUF’ed model files. The only exception to this would be if you trained and used an imatrix for the model, e.g. one built from wikitext-raw vs. one built from groups_merged.txt, for the quantization process. 🤔
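If you want to check that yourself, comparing checksums is the simplest test. A small sketch (file names are placeholders; note that hashes can also differ just from metadata written by different llama.cpp/converter versions, even when the weights are equivalent):

```python
# Quick sanity check: two GGUFs quantized the same way should hash identically.
# Placeholders below; differing hashes can also come from metadata differences
# between llama.cpp versions, not only from a different imatrix corpus.
import hashlib

def sha256(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

mine = sha256("my-Meta-Llama-3-8B-Instruct-Q4_K_M.gguf")
repo = sha256("downloaded-Meta-Llama-3-8B-Instruct-Q4_K_M.gguf")
print("identical" if mine == repo else "different")
```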
An un-quantized model is always superior in quality and performance to a quantized model. If you want the maximum performance and the highest throughput from your model, you will always want to run it un-quantized. As a matter of fact, the only time you want to quantize a model is when you don’t have enough VRAM and RAM to run it un-quantized. 🤔
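As a rough back-of-the-envelope check of whether you need to quantize at all, you can estimate memory from parameter count times bits per weight. A sketch (the bits-per-weight figures are approximate, and it ignores KV cache and runtime overhead):

```python
# Rough memory estimate: parameters * bits-per-weight / 8.
# Ignores KV cache, activations, and runtime overhead, so treat as a lower bound.
PARAMS = 8e9  # Llama-3-8B

def est_gib(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / (1024 ** 3)

# Approximate bits per weight for a few common GGUF types.
for name, bpw in [("F16 (un-quantized GGUF)", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name:<24} ~{est_gib(bpw):.1f} GiB")
```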
@Joseph717171 is correct, with one tiny addition: for this model (and all other models on lmstudio-community besides the 70B, until it's reuploaded), all of the quant levels are made with an imatrix, not just the i-quants.
Can you clarify this for me, please: are you saying that Q4_K (as an example) is a 4-bit K-quant that uses an imatrix, but it is NOT an i-quant?
Also, thank you 😅😅😅
Yes, that is exactly right! What @bartowski is implying is that all the GGUF'ed quants are made using an imatrix, which means all the quantizations are now IQuants. (The imatrix is trained on groups_merged.txt.) 😁
It's actually, more than anything, an unfortunate naming convention and timing issue:
i-quants != imatrix
i-quants are just a newer SOTA quantization technique that borrows ideas from QuIP#, and can be made without an imatrix
https://github.com/ggerganov/llama.cpp/pull/4773
imatrix is an importance matrix that can be used with any quant level, though it originally only targeted i-quants
The feature was then expanded to target K-quants as well.
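To make that distinction concrete, the same imatrix file can be passed when producing either a K-quant or an i-quant; only the target type changes. A sketch, assuming llama.cpp's quantize binary (named `llama-quantize` in newer builds) and an already-trained imatrix.dat:

```python
# The imatrix is orthogonal to the quant type: the same file can guide
# a K-quant (Q4_K_M) or an i-quant (IQ3_XS). Binary/file names are assumptions.
import subprocess

F16_GGUF = "Meta-Llama-3-8B-Instruct-f16.gguf"
IMATRIX  = "imatrix.dat"

for qtype in ("Q4_K_M", "IQ3_XS"):
    out = f"Meta-Llama-3-8B-Instruct-{qtype}.gguf"
    subprocess.run(
        ["./quantize", "--imatrix", IMATRIX, F16_GGUF, out, qtype],
        check=True,
    )
```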