Mem usage
Can you estimate the VRAM needed for context lengths of 8k, 128k, 512k, and 1M? Thanks.
8k: 24 GB is enough;
128k: 60 GB is enough (we recommend glm-4-chat because it has a GQA group of 2 and a smaller KV cache; a rough KV-cache estimate is sketched after this list);
512k and 1M: 4 × 80 GB, and you MUST use vLLM with enable_chunked_prefill.
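For a rough sense of why the GQA group of 2 keeps the KV cache small, here is a back-of-envelope estimate. The config values below (40 layers, 2 KV heads, head dim 128, bf16) are assumptions taken from the published glm-4-9b config, not numbers stated in this thread, and the sketch counts only the KV cache, not the ~18 GB of bf16 weights or activation memory.

```python
# Back-of-envelope KV-cache size for glm-4-9b (assumed config: 40 layers,
# 2 KV heads from the GQA group of 2, head_dim 128, bf16 = 2 bytes).
def kv_cache_gib(num_tokens, num_layers=40, num_kv_heads=2, head_dim=128, dtype_bytes=2):
    # K and V each store num_layers * num_kv_heads * head_dim values per token
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * bytes_per_token / 1024**3

for ctx in (8 * 1024, 128 * 1024, 512 * 1024, 1024 * 1024):
    print(f"{ctx:>9} tokens -> {kv_cache_gib(ctx):5.1f} GiB KV cache")
# Roughly 0.3 GiB at 8k, 5 GiB at 128k, 20 GiB at 512k, 40 GiB at 1M,
# on top of model weights and (for long prefills) sizable activation memory.
```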
At present, mainstream open-source inference frameworks are not deeply optimized for 1M-length inputs. vLLM needs about 4 × 80 GB for 1M-length inference (with enable_chunked_prefill enabled, although this significantly slows down the prefill/encode stage). We believe that as mainstream open-source inference frameworks keep optimizing, 1M inference will get faster and faster.
In fact, 8 × 24 GB is sufficient for 1M inference with the 9B model, but current open-source inference frameworks have not yet done enough optimization and adaptation for it.
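For reference, a minimal vLLM launch matching the 4 × 80 GB recommendation above might look like the sketch below. The checkpoint name and sampling settings are illustrative assumptions; only enable_chunked_prefill and the 4-way parallelism come from this thread.

```python
# Hedged sketch: 1M-context inference on 4 x 80 GB GPUs with vLLM.
# Assumes the THUDM/glm-4-9b-chat-1m checkpoint; values other than
# enable_chunked_prefill / tensor_parallel_size are illustrative, not tuned.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/glm-4-9b-chat-1m",
    trust_remote_code=True,
    tensor_parallel_size=4,        # spread weights and KV cache over 4 GPUs
    max_model_len=1024 * 1024,     # 1M-token context window
    enable_chunked_prefill=True,   # slows prefill but bounds activation memory
)

prompt = "<very long document goes here>\n\nSummarize the key points."
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```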
Did you use standard attention while training? I guess not. Will a paper be released?
Yes, we use standard attention during the 1M training, with divide-and-conquer context parallelism to prevent OOM issues and balanced varlen training to reduce idle bubble time.
There are no plans for a paper at the moment, but there may be a technical blog post, e.g. on Notion.
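The thread does not spell out the balancing scheme, but as a hedged illustration of what "balanced varlen training" commonly means: variable-length samples are assigned greedily to whichever pack (rank or micro-batch) currently holds the fewest tokens, so every worker gets a similar load and idle bubbles shrink. A minimal sketch, with all names hypothetical:

```python
import heapq

def balance_varlen(sample_lengths, num_packs):
    """Greedy longest-first assignment: keeps the token count per pack roughly equal."""
    heap = [(0, pack_id, []) for pack_id in range(num_packs)]  # (tokens, id, sample indices)
    heapq.heapify(heap)
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        tokens, pack_id, samples = heapq.heappop(heap)   # pack with the fewest tokens so far
        samples.append(idx)
        heapq.heappush(heap, (tokens + length, pack_id, samples))
    return sorted(heap, key=lambda x: x[1])

for tokens, pack_id, samples in balance_varlen(
        [120_000, 30_000, 900_000, 5_000, 450_000, 200_000], num_packs=2):
    print(f"pack {pack_id}: {tokens} tokens from samples {samples}")
```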
Nice work. Based on your description of the attention mechanism, I think it is good but still not mathematically exact attention(?). I believe Ring Attention (arXiv:2310.01889) could help and give more accuracy: it is an exact attention with memory that scales linearly with the number of devices, via blockwise processing over a ring topology. The only downside is that it needs more time and FLOPs, so it is a tradeoff between memory and time.
It is exact full attention; the LongAlign paper can be used as a reference. We pack different training samples into 1M-token sequences for efficient training. This is supported by the Context Parallelism in Transformer Engine (with the THD format), which is what you refer to as Ring Attention.
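As a hedged illustration of the packed THD-style input (not the team's actual pipeline): samples are concatenated into one long sequence and the per-sample boundaries are passed as cumulative sequence lengths, so varlen attention kernels stay exact within each sample and never attend across sample boundaries. The lengths and vocab size below are made up.

```python
import torch

# Illustrative packing of three samples (~1M tokens total) plus the cu_seqlens
# boundary tensor that varlen/THD attention kernels consume.
samples = [torch.randint(0, 151_552, (n,)) for n in (3_000, 250_000, 771_000)]

packed = torch.cat(samples)                               # one [sum(lengths)] token stream
lengths = torch.tensor([s.numel() for s in samples])
cu_seqlens = torch.zeros(len(samples) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(lengths, dim=0)             # [0, 3000, 253000, 1024000]

print(packed.shape, cu_seqlens.tolist())
# With cu_seqlens, tokens attend only within their own sample, so the attention
# is still exact full attention per sample even though the batch is packed.
```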
Thanks a lot @davidlvxin, I have questions:
- Which method did you apply to extend to 1M?
- Are your models optimized for RAG, or at least good at it, compared with competitors like Llama 3, Qwen 2, or Command-R (which is RAG-optimized)? Thanks.
- We just use a transformer with full attention (with divide-and-conquer context parallelism to prevent OOM issues and balanced varlen training to reduce idle bubble time).
- We didn't optimize for RAG, but it should be good at it. Give it a try.
@davidlvxin I am also impressed by the GLM 9B vision model. Are the data details private? Thanks.
Thanks!!!
@davidlvxin I actually meant the data for the vision model lol