Question About VRAM Requirements for Full 256K Context Length
Dear CohereForAI Team,
First of all, thank you for your incredible work on the c4ai-command-a-03-2025 model. The advancements in context length and efficiency are truly impressive!
I am currently experimenting with the model using vLLM and have achieved a context length of approximately 110K tokens on 8 x RTX A6000 GPUs with the following settings:
#!/bin/bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_LAUNCH_BLOCKING=0

token=100000

python -m vllm.entrypoints.openai.api_server \
    --model CohereForAI/c4ai-command-a-03-2025 \
    --host 192.xxx.x.xx \
    --port 9000 \
    --trust-remote-code \
    --device cuda \
    --tensor-parallel-size 8 \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 1 \
    --swap-space 10 \
    --max-num-seqs 3 \
    --max-num-batched-tokens $token \
    --max-model-len $token
According to the model card, the model supports a context length of up to 256K tokens. However, I couldn't find an explicit statement of the VRAM required to actually serve the full context length.
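To frame my question, here is my rough back-of-envelope estimate of the KV cache alone at 256K tokens. The layer/head/head-dim values below are placeholders I am not certain about and should be replaced with the actual numbers from the model's config.json; the calculation also ignores any savings from sliding-window attention layers, so please correct me if this reasoning is off:

# Rough KV cache size per sequence (placeholder architecture values --
# replace with the real ones from the model's config.json):
NUM_LAYERS=64      # assumed number of hidden layers
NUM_KV_HEADS=8     # assumed number of KV heads (GQA)
HEAD_DIM=128       # assumed head dimension
DTYPE_BYTES=2      # fp16/bf16 KV cache
CONTEXT=256000
# factor of 2 for keys and values
BYTES=$(( 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES * CONTEXT ))
echo "KV cache: ~$(( BYTES / 1024 / 1024 / 1024 )) GiB per sequence at ${CONTEXT} tokens"

On top of that come the model weights themselves (roughly 2 bytes per parameter in bf16) plus activation and CUDA graph overhead, so the per-GPU headroom on 48 GB cards shrinks quickly.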
Could you provide any insights into how much GPU memory (VRAM) would be required to fully utilize 256K tokens? Would increasing the number of GPUs beyond 8x RTX A6000 significantly help, or is there another approach (e.g., CPU offloading, swap space tuning) that you would recommend?
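For context, these are the kinds of memory-saving knobs I was considering trying on my side. This is purely illustrative, and whether each flag is available (and worthwhile) depends on the vLLM version and hardware, so I would appreciate your view on which of them are sensible here:

# Possible memory-saving variations (illustrative only; check your vLLM version):
#   --kv-cache-dtype fp8        # quantize the KV cache to roughly halve its footprint
#   --cpu-offload-gb 32         # offload part of the weights to CPU RAM (slower)
#   --pipeline-parallel-size 2  # combine with tensor parallelism when scaling past 8 GPUs
#   --enforce-eager             # skip CUDA graphs to reclaim some VRAM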
Again, thank you for your fantastic work—this model is truly pushing boundaries! I appreciate any guidance you can share.
Best regards