"head_dim": 80
#2 by rjmehta - opened
Can "head_dim": 128 match the qwen3 head_dim?
We would really appreciate it if you could retrain the model with head_dim set to a power of two.
Hi, our head_dim selection in training is based on hidden_size / num_attention_heads, which aligns with the config settings of Qwen3.
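For reference, a minimal sketch (assuming the Hugging Face transformers package) of the derivation described above:

```python
# Minimal sketch (assumes the `transformers` package) of the derivation
# described above: head_dim computed as hidden_size / num_attention_heads.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-32B")
derived_head_dim = cfg.hidden_size // cfg.num_attention_heads
# For Qwen3-32B this derivation yields 80 (the value in this thread's
# title), assuming hidden_size=5120 and num_attention_heads=64.
print(derived_head_dim)
```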
```
RuntimeError: Error in function 'VariableLengthMergeStates' at /opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/attention/cascade.cuh:692: Unsupported head_dim: 80
```

In Qwen/Qwen3-32B's config.json, the head_dim is 128.
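Qwen3 configs set head_dim explicitly, so the explicit value can differ from the derived one. As a hedged sketch, one can compare the two and check the effective head_dim before enabling a FlashInfer-backed path; the supported set below is an assumption inferred from the error message above, not FlashInfer's official list:

```python
# Sketch: compare the explicit head_dim in config.json with the derived
# value, and fail early on a head_dim FlashInfer rejects. SUPPORTED is an
# assumption based on the "Unsupported head_dim: 80" error above; check
# your flashinfer version for the actual list.
from transformers import AutoConfig

SUPPORTED = {64, 128, 256}  # assumed power-of-two head dims

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-32B")
derived = cfg.hidden_size // cfg.num_attention_heads
explicit = getattr(cfg, "head_dim", derived)  # fallback if head_dim is absent
print(f"derived={derived}, explicit={explicit}")

if explicit not in SUPPORTED:
    raise ValueError(f"head_dim={explicit} is not supported by FlashInfer")
```

Note that the getattr fallback only matters for models whose config.json omits head_dim; Qwen3 sets it explicitly.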
From the data you have shown, the speedup for Qwen/Qwen3-32B is much lower than for Qwen3-14B and Qwen3-8B. Could the head_dim be the reason?