FP4 in attention proj

#9 by yoursmin

I noticed that in your weight files, the qkv proj weights in the transformer attention are in BF16. Does this indicate that you perform those computations in BF16, or do you quantize them to FP4 on the fly at runtime? I ask because in the llama-FP4 model, the corresponding weights are stored in FP4. Thank you very much for your excellent work, and I look forward to your reply.
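For reference, a minimal sketch of how the stored dtypes can be checked straight from a `.safetensors` shard's header, without loading any tensors. The file path and the `"attn"`/`"qkv"` name filter are only illustrative and depend on the checkpoint's naming scheme:

```python
import json
import struct
import sys

def tensor_dtypes(path):
    """Return {tensor_name: dtype} by reading only the safetensors header."""
    with open(path, "rb") as f:
        # A .safetensors file starts with a little-endian uint64 giving the
        # byte length of the JSON header that follows.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return {name: info["dtype"]
            for name, info in header.items() if name != "__metadata__"}

if __name__ == "__main__":
    # Usage: python check_dtypes.py model-00001-of-0000N.safetensors
    for name, dtype in sorted(tensor_dtypes(sys.argv[1]).items()):
        if "attn" in name or "qkv" in name:  # illustrative filter only
            print(f"{dtype:10s} {name}")
```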

Mentioning FP4 does not imply that all parameters are in FP4, just as the original DeepSeek model does not have all of its parameters in FP8.

As the README says: "Only the weights and activations of the linear operators within transformers blocks are quantized." Note that the linear layers in the attention modules are also part of these "linear operators within transformers blocks". Meanwhile, the corresponding parameters of the original DeepSeek model are stored in FP8, not BF16.
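To make that rule concrete, here is a toy PyTorch sketch (made-up module names, not the actual DeepSeek architecture) showing how one could separate the quantization targets, i.e. the `nn.Linear` modules inside the transformer blocks, from everything else that stays in high precision:

```python
import torch
from torch import nn

class Block(nn.Module):
    """Stand-in for a transformer block; its Linear layers are the targets."""
    def __init__(self, d=64):
        super().__init__()
        self.qkv_proj = nn.Linear(d, 3 * d)  # attention projection -> quantized
        self.o_proj = nn.Linear(d, d)        # attention output     -> quantized
        self.mlp_up = nn.Linear(d, 4 * d)    # MLP                  -> quantized
        self.mlp_down = nn.Linear(4 * d, d)  # MLP                  -> quantized
        self.norm = nn.LayerNorm(d)          # not a linear op      -> kept as-is

model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),         # outside the blocks   -> kept as-is
    "blocks": nn.ModuleList([Block() for _ in range(2)]),
    "lm_head": nn.Linear(64, 1000),          # outside the blocks   -> kept as-is
})

for name, module in model.named_modules():
    if isinstance(module, nn.Linear) and ".blocks." in f".{name}.":
        print("quantize :", name)
    elif len(list(module.parameters(recurse=False))) > 0:
        print("keep     :", name)
```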

