Documentation about the linear attention used in some layers of this model?

#21
by ymcki - opened

I am in the process of modifying llama.cpp to support the conversion of this model.
https://github.com/ggerganov/llama.cpp/issues/10028

I successfully converted DeciLM-7B.
https://huggingface.co/ymcki/DeciLM-7B-Instruct-GGUF

However, this 51B model has some layers that use linear attention. For example, block 11 exports a single self_attn.linear_attn.weight tensor instead of the usual attention projections:

INFO:hf-to-gguf:blk.11.attn_norm.weight,             torch.bfloat16 --> F32, shape = {8192}
INFO:hf-to-gguf:blk.11.ffn_down.weight,              torch.bfloat16 --> F16, shape = {14336, 8192}
INFO:hf-to-gguf:blk.11.ffn_gate.weight,              torch.bfloat16 --> F16, shape = {8192, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight,                torch.bfloat16 --> F16, shape = {8192, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,              torch.bfloat16 --> F32, shape = {8192}
INFO:hf-to-gguf:blk.11.self_attn.linear_attn.weight, torch.bfloat16 --> F16, shape = {8192, 8192}
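Going only by the tensor names and shapes above (this is my guess, not documented behavior), it looks like the whole attention sub-block in such layers is replaced by one 8192x8192 projection applied to the normalized hidden states, something like:

```python
# Hypothetical sketch inferred from the tensor shapes above; NOT the official implementation.
import torch
import torch.nn as nn

class LinearAttnBlockSketch(nn.Module):
    def __init__(self, hidden_size: int = 8192):
        super().__init__()
        self.attn_norm = nn.RMSNorm(hidden_size)                              # blk.11.attn_norm.weight
        self.linear_attn = nn.Linear(hidden_size, hidden_size, bias=False)    # blk.11.self_attn.linear_attn.weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # If the block really is just a linear replacement, there is no token mixing:
        # every position is projected independently and added back to the residual stream.
        return x + self.linear_attn(self.attn_norm(x))
```

If that reading is right, the GGUF conversion would only need to map this one extra tensor and skip the q/k/v/o heads for those blocks, but I would like to confirm before implementing it.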

Is it possible to provide some documentation about the linear attention implementation?

Is it the same thing as described here?
https://arxiv.org/pdf/2006.16236
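For reference, that paper (Katharopoulos et al., "Transformers are RNNs") replaces softmax attention with a kernel feature map, roughly like the following sketch (non-causal form, written by me for illustration, not taken from this model's code):

```python
# Minimal sketch of kernelized linear attention from arXiv:2006.16236.
import torch
import torch.nn.functional as F

def kernel_linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, seq_len, dim); feature map phi(x) = elu(x) + 1
    phi_q = F.elu(q) + 1.0
    phi_k = F.elu(k) + 1.0
    kv  = torch.einsum("bsd,bse->bde", phi_k, v)       # sum_s phi(k_s) v_s^T
    z   = phi_k.sum(dim=1)                             # sum_s phi(k_s)
    num = torch.einsum("bsd,bde->bse", phi_q, kv)      # phi(q_t)^T * kv
    den = torch.einsum("bsd,bd->bs", phi_q, z) + eps   # phi(q_t)^T * z
    return num / den.unsqueeze(-1)
```

The single linear_attn.weight tensor in the log above does not obviously match this formulation, which is why I am asking whether the mechanism here is the same or something simpler.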
