Please post f16 quantization.
Requantizing is better done from f16 or f32.
If you can, post them both.
I thought the original format was BF16.
yes.. but f16 (fp16) doesn't harm the model. bf16 is way bigger.
BF16 and F16 should be identical in size
If you need the f32 i uploaded it here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/blob/main/Qwen2-7B-Instruct-f32.gguf
hmm maybe I got confused... I thought bf16 was way bigger than f16 (I know they are both 16-bit); perhaps I was tired and read it wrong.
anyway, I've now posted my quantizations of Qwen1.5 and Qwen2...
Bf16 represents a larger range of values but is not bigger
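For anyone else confused by this: both formats are 16 bits wide, so the per-element storage is identical; bf16 just spends more of those bits on the exponent (8 bits, like fp32) and fewer on the mantissa. A quick illustrative sketch using PyTorch's `torch.finfo` (not from this thread, just to make the point) shows the difference in range at the same bit width:

```python
# Compare fp16 vs bf16: same bit width, very different exponent range.
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: bits={info.bits}, max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")

# Roughly expected output (values follow from the IEEE-style layouts):
#   torch.float16  -> bits=16, max ~6.550e+04
#   torch.bfloat16 -> bits=16, max ~3.389e+38 (same range as float32)
```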
Got it. thanks.
On second thought, I checked and I don't agree: if I quantize to bf16 using llama.cpp I get a much bigger file than if I quantize to f16.
Perhaps that's because llama.cpp does a mixed conversion and keeps some tensors at f32...
Anyway I see no degradation at pure f16.
That's llama.cpp doing it then; if you take a bf16 and convert it to fp16, the model size stays identical.
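One way to check whether the converter is keeping some tensors at f32 is to list the tensor types in the resulting GGUF. A rough sketch using the `gguf` Python package that ships with llama.cpp (gguf-py); the reader attribute names are my assumption of its API and the file path is just a placeholder:

```python
# Sketch: count tensor dtypes inside a GGUF file to see whether an "f16"
# or "bf16" conversion actually keeps some tensors (e.g. norms) in f32.
# Assumes gguf-py from the llama.cpp repo (`pip install gguf`).
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen2-7B-Instruct-f16.gguf")  # placeholder path
counts = Counter(t.tensor_type.name for t in reader.tensors)
for dtype, n in counts.most_common():
    print(f"{dtype}: {n} tensors")
```

If the bf16 and f16 outputs differ in size, that would normally show up here as a different mix of F32/F16/BF16 tensors rather than a per-element size difference.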