Launch model with vLLM --tensor-parallel (Error: input_size_per_partition not divisible by min_thread_k)
Dear all,
thank you very much for providing the quantized version! When trying to launch the AWQ model on a multi-GPU instance (8x16GB) with "--tensor-parallel 8 or 4 or 2", I receive an error of the following format:
Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
or: Weight output_size_per_partition = 7392 is not divisible by min_thread_n = 64.
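For what it's worth, here is my reading of the arithmetic behind the check (a minimal sketch; the intermediate size of 29568 is an assumption I infer from 7392 × 4, and the 128/64 thresholds are taken verbatim from the error text):

```python
# Sketch of the shard-size check that appears to fail (assumed numbers).
INTERMEDIATE_SIZE = 29568   # assumption: 7392 * 4, inferred from the TP=4 error
MIN_THREAD_K = 128          # from the first error message
MIN_THREAD_N = 64           # from the second error message

for tp in (2, 4, 8):
    shard = INTERMEDIATE_SIZE // tp
    ok = shard % MIN_THREAD_K == 0 and shard % MIN_THREAD_N == 0
    print(f"tensor_parallel_size={tp}: shard={shard}, passes={ok}")
```

Every tensor-parallel size I tried produces a shard that fails at least one of the two divisibility checks, which matches the errors above.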
Was anyone able to launch the model on a multi-GPU instance using "--tensor-parallel" with vLLM? For me it is currently only possible to launch the model without "--tensor-parallel", using "--pipeline-parallel" instead (vLLM does not complain there), which is of course not the same thing. Or is this rather a vLLM-specific issue, and deploying the model with the provided transformers instructions is the way to go?
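For reference, this is roughly how I start the engine via the Python API (a sketch of my setup; the model id and parameter values are placeholders, and the commented-out pipeline_parallel_size line is the variant that does start):

```python
# Rough sketch of my launch (values are placeholders for my 8x16GB setup).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=4,       # 2, 4 and 8 all hit the divisibility error
    # pipeline_parallel_size=4,   # this variant starts without complaints
    max_model_len=4096,
)
```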
Thanks and kind regards!
Hi! When you load it, how much VRAM does it require? If it's a 4-bit quant, it should be something like 36 GB, right? 72B params × 4 bits ≈ 36 GB. Of course, performing inference will then require more VRAM, but I just wanted to confirm the memory occupation of the loaded model.
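Quick back-of-the-envelope check of that number (a sketch; it counts only the quantized weights and ignores KV cache, activations and the vision tower overhead):

```python
# Weight-only memory estimate for a 4-bit quantized 72B model.
params = 72e9                  # 72B parameters
bits_per_param = 4             # AWQ 4-bit
weight_gb = params * bits_per_param / 8 / 1e9
print(f"~{weight_gb:.0f} GB")  # ~36 GB
```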
Dear all,
Please fix the intermediate model size so that it can be run on 2, 4 or 8 GPUs. Thanks!
https://github.com/QwenLM/Qwen2.5-VL/issues/231
Qwen2-VL-72B-Instruct had the same issue.
Great to know! Thanks for pointing it out.
So we'll probably just wait until it is re-quantized (which was done last time) to get it working with multiple GPUs. Perfect, thanks :)
Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel on 2, 4 and 8 GPUs:
https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
@Mitke15 what should the intermediate model size be in those cases? I'm receiving a similar error:
The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
It also happens with the model suggested by @imjliao.
I'm trying to run the model on A100 40GB GPUs.
Hi, I just finished uploading the updated weights, feel free to try again.
I didn't manage to get the PointerHQ version running, but it is working fine now with the official version. Thanks!