Question about your quantization method
Hello ZeroWw,
I've been using the method you posted about elsewhere (a couple of months ago) for my local quants: `--output-tensor-type f16 --token-embedding-type f16`
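Concretely, I've been running something like this (paths and the base quant type are just placeholders, and the binary is `quantize` in older llama.cpp builds, `llama-quantize` in newer ones):

```sh
# Keep output.weight and the token embeddings at f16, quantize everything else
./llama-quantize --output-tensor-type f16 --token-embedding-type f16 \
    model-f16.gguf model-Q5_K_M.gguf Q5_K_M
```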
Today I noticed that there are some other options as well:
- `--leave-output-tensor`: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
- `--pure`: Disable k-quant mixtures and quantize all tensors to the same type
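Going by that help text, I'd guess they're used like this (an untested sketch on my part):

```sh
# Quantize to Q5_K_M but leave output.weight un(re)quantized
./llama-quantize --leave-output-tensor model-f16.gguf model-Q5_K_M-lot.gguf Q5_K_M

# Disable the k-quant mixture: every tensor gets the base Q4_K type
# instead of the usual per-tensor mix of Q4_K/Q5_K/Q6_K
./llama-quantize --pure model-f16.gguf model-Q4_K-pure.gguf Q4_K_M
```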
I'm curious what your thoughts are on those options, if you don't mind me asking.
Thank you.
Well... in my quants there is a pure q8 for comparison.
The `--leave-output-tensor` option is quite useless because it leaves output.weight at f16 (so it's the same as doing `--output-tensor-type f16`).
I usually first convert to f16 using convert.py, then I quantize, so the base is always f16. The exception is the q8q4, which is quite cool: output and embed at q8 and the rest at q4... it works pretty well.
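Something like this (exact type names approximate; newer llama.cpp trees ship convert_hf_to_gguf.py instead of convert.py):

```sh
# Step 1: always convert to an f16 GGUF first, so the quantization base is f16
python convert.py ./my-model --outtype f16 --outfile my-model-f16.gguf

# Step 2 (the q8q4 variant): output and embeddings at q8, the rest at q4
./llama-quantize --output-tensor-type q8_0 --token-embedding-type q8_0 \
    my-model-f16.gguf my-model-q8q4.gguf Q4_K_M
```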