FP4 in attention proj

#9 by yoursmin

I noticed that in your weight files, the qkv proj weights in the transformer attention are in BF16. Does this indicate that you perform those computations in BF16, or do you quantize them to FP4 on the fly at runtime? I ask because in the llama-FP4 model, the corresponding weights are stored in FP4. Thank you very much for your excellent work, and I look forward to your reply.
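For reference, a minimal sketch of how the stored dtypes can be checked straight from a `.safetensors` shard's header, without loading any tensors. The file path and the `"attn"`/`"qkv"` name filter are only illustrative and depend on the checkpoint's naming scheme:

```python
import json
import struct
import sys

def tensor_dtypes(path):
    """Return {tensor_name: dtype} by reading only the safetensors header."""
    with open(path, "rb") as f:
        # A .safetensors file starts with a little-endian uint64 giving the
        # byte length of the JSON header that follows.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return {name: info["dtype"]
            for name, info in header.items() if name != "__metadata__"}

if __name__ == "__main__":
    # Usage: python check_dtypes.py model-00001-of-0000N.safetensors
    for name, dtype in sorted(tensor_dtypes(sys.argv[1]).items()):
        if "attn" in name or "qkv" in name:  # illustrative filter only
            print(f"{dtype:10s} {name}")
```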

Mentioning FP4 does not imply that all parameters are in FP4, just as the original DeepSeek model does not have all of its parameters in FP8.

As the README says: "Only the weights and activations of the linear operators within transformers blocks are quantized." Note that the linear layers in the attention modules are also part of these "linear operators within transformers blocks". Meanwhile, the corresponding parameters of the original DeepSeek model are stored in FP8, not BF16.
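To make that rule concrete, here is a toy PyTorch sketch (made-up module names, not the actual DeepSeek architecture) showing how one could separate the quantization targets, i.e. the `nn.Linear` modules inside the transformer blocks, from everything else that stays in high precision:

```python
import torch
from torch import nn

class Block(nn.Module):
    """Stand-in for a transformer block; its Linear layers are the targets."""
    def __init__(self, d=64):
        super().__init__()
        self.qkv_proj = nn.Linear(d, 3 * d)  # attention projection -> quantized
        self.o_proj = nn.Linear(d, d)        # attention output     -> quantized
        self.mlp_up = nn.Linear(d, 4 * d)    # MLP                  -> quantized
        self.mlp_down = nn.Linear(4 * d, d)  # MLP                  -> quantized
        self.norm = nn.LayerNorm(d)          # not a linear op      -> kept as-is

model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),         # outside the blocks   -> kept as-is
    "blocks": nn.ModuleList([Block() for _ in range(2)]),
    "lm_head": nn.Linear(64, 1000),          # outside the blocks   -> kept as-is
})

for name, module in model.named_modules():
    if isinstance(module, nn.Linear) and ".blocks." in f".{name}.":
        print("quantize :", name)
    elif len(list(module.parameters(recurse=False))) > 0:
        print("keep     :", name)
```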

