Documentation about the linear attention used in some layers of this model?
#21 by ymcki
I am in the process of modifying llama.cpp to support the conversion of this model.
https://github.com/ggerganov/llama.cpp/issues/10028
I successfully converted DeciLM-7B.
https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF
However, this 51B model has some layers with linear attention, e.g.
INFO:hf-to-gguf:blk.11.attn_norm.weight, torch.bfloat16 --> F32, shape = {8192}
INFO:hf-to-gguf:blk.11.ffn_down.weight, torch.bfloat16 --> F16, shape = {14336, 8192}
INFO:hf-to-gguf:blk.11.ffn_gate.weight, torch.bfloat16 --> F16, shape = {8192, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight, torch.bfloat16 --> F16, shape = {8192, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight, torch.bfloat16 --> F32, shape = {8192}
INFO:hf-to-gguf:blk.11.self_attn.linear_attn.weight, torch.bfloat16 --> F16, shape = {8192, 8192}
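For context, in these blocks the dump shows only a single 8192x8192 `linear_attn` weight and no separate q/k/v/o projections, so my working guess is that the whole attention sub-layer is replaced by one plain per-token linear projection. Below is a minimal sketch of what such a block's forward pass might look like under that assumption; the tensor names and the 8192/14336 widths come from the log above, while the use of RMSNorm and the SiLU-gated MLP are my own guesses, not anything from official documentation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_SIZE = 8192   # from the shapes in the log above
FFN_DIM = 14336      # from blk.11.ffn_* shapes

class LinearAttnBlock(nn.Module):
    """Hypothetical decoder block whose attention sub-layer is a single
    linear projection (one 8192x8192 weight, no q/k/v/o tensors)."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE, ffn_dim: int = FFN_DIM):
        super().__init__()
        # torch.nn.RMSNorm needs PyTorch >= 2.4; assumed, not confirmed for this model.
        self.attn_norm = nn.RMSNorm(hidden_size)                              # blk.N.attn_norm.weight
        self.linear_attn = nn.Linear(hidden_size, hidden_size, bias=False)    # blk.N.self_attn.linear_attn.weight
        self.ffn_norm = nn.RMSNorm(hidden_size)                               # blk.N.ffn_norm.weight
        self.ffn_gate = nn.Linear(hidden_size, ffn_dim, bias=False)           # blk.N.ffn_gate.weight
        self.ffn_up = nn.Linear(hidden_size, ffn_dim, bias=False)             # blk.N.ffn_up.weight
        self.ffn_down = nn.Linear(ffn_dim, hidden_size, bias=False)           # blk.N.ffn_down.weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Attention" is just a token-wise linear map: no mixing across
        # positions and no KV cache would be needed for this block.
        x = x + self.linear_attn(self.attn_norm(x))
        # Standard Llama-style gated MLP.
        h = self.ffn_norm(x)
        x = x + self.ffn_down(F.silu(self.ffn_gate(h)) * self.ffn_up(h))
        return x
```

If that reading is right, such a block mixes no information across tokens, which would affect how it should be mapped during GGUF conversion; that is exactly the kind of detail I would like confirmed.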
Is it possible to provide some documentation about the linear attention implementation?
Is it the same thing as the kernelized linear attention described here (Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention")?
https://arxiv.org/pdf/2006.16236
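For comparison, the mechanism in that paper does mix information across positions: it replaces softmax(QK^T)V with phi(Q)(phi(K)^T V) using a feature map phi(x) = elu(x) + 1, so the cost is linear in sequence length. Here is a minimal non-causal sketch of that computation, purely my own illustration of the paper's formula and not code taken from this model:

```python
import torch
import torch.nn.functional as F

def kernel_linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention in the style of arXiv:2006.16236:
    softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V) with phi(x) = elu(x) + 1."""
    phi_q = F.elu(q) + 1                                        # (batch, seq, dim)
    phi_k = F.elu(k) + 1                                        # (batch, seq, dim)
    kv = torch.einsum("bsd,bse->bde", phi_k, v)                 # sum_s phi(k_s) v_s^T
    z = torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(dim=1))     # normalizer per query
    out = torch.einsum("bsd,bde->bse", phi_q, kv) / (z.unsqueeze(-1) + eps)
    return out

# Example: q = k = v = torch.randn(1, 16, 64); out = kernel_linear_attention(q, k, v)
```

If the 51B layers really are just a single projection as I guessed above, that would be a different mechanism from this one, so documentation on which of the two is meant would settle it.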