Launch model with vLLM --tensor-parallel (Error: input_size_per_partition not divisible by min_thread_k)
Dear all,
thank you very much for providing the quantized version! When trying to launch the AWQ model on a multi-GPU instance (8x16GB) with "--tensor-parallel 8 or 4 or 2", I receive an error of the following format:
Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
or: Weight output_size_per_partition = 7392 is not divisible by min_thread_n = 64.
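For what it's worth, here is my reading of the arithmetic behind the check (a minimal sketch; the intermediate size of 29568 is an assumption I infer from 7392 × 4, and the 128/64 thresholds are taken verbatim from the error text):

```python
# Sketch of the shard-size check that appears to fail (assumed numbers).
INTERMEDIATE_SIZE = 29568   # assumption: 7392 * 4, inferred from the TP=4 error
MIN_THREAD_K = 128          # from the first error message
MIN_THREAD_N = 64           # from the second error message

for tp in (2, 4, 8):
    shard = INTERMEDIATE_SIZE // tp
    ok = shard % MIN_THREAD_K == 0 and shard % MIN_THREAD_N == 0
    print(f"tensor_parallel_size={tp}: shard={shard}, passes={ok}")
```

Every tensor-parallel size I tried produces a shard that fails at least one of the two divisibility checks, which matches the errors above.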
Was anyone able to launch the model on a multi-GPU instance using "--tensor-parallel" with vLLM? For me it is currently only possible to launch the model without "--tensor-parallel", using "--pipeline-parallel" instead (vLLM does not complain there), which is of course not the same thing. Or is this rather a vLLM-specific issue, and deploying the model with the provided transformers instructions is the way to go?
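For reference, this is roughly how I start the engine via the Python API (a sketch of my setup; the model id and parameter values are placeholders, and the commented-out pipeline_parallel_size line is the variant that does start):

```python
# Rough sketch of my launch (values are placeholders for my 8x16GB setup).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=4,       # 2, 4 and 8 all hit the divisibility error
    # pipeline_parallel_size=4,   # this variant starts without complaints
    max_model_len=4096,
)
```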
Thanks and kind regards!
Hi! When you load it, how much VRAM does it require? If it's a 4-bit quant, it should be something like 36 GB, right? 72B params × 4 bits ≈ 36 GB. Of course, performing inference will then require more VRAM, but I just wanted to confirm the memory occupation of the loaded model.
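Quick back-of-the-envelope check of that number (a sketch; it counts only the quantized weights and ignores KV cache, activations and the vision tower overhead):

```python
# Weight-only memory estimate for a 4-bit quantized 72B model.
params = 72e9                  # 72B parameters
bits_per_param = 4             # AWQ 4-bit
weight_gb = params * bits_per_param / 8 / 1e9
print(f"~{weight_gb:.0f} GB")  # ~36 GB
```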
Dear all,
Please fix the intermediate model size so that it can be run on 2, 4 or 8 GPUs. Thanks!
https://github.com/QwenLM/Qwen2.5-VL/issues/231
Qwen2-VL-72B-Instruct had the same issue.
Great to know! Thanks for pointing it out.
So we'll probably just wait until it is re-quantized (which was done last time) to get it working with multiple GPUs. Perfect, thanks :)
Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel on 2, 4 and 8 GPUs:
https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
@Mitke15 what should the intermediate model size be in those cases? I'm receiving a similar error:
The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
It also happens with the model suggested by @imjliao.
I'm trying to run the model on A100 40GB GPUs.
Hi, I just finished uploading the updated weights, feel free to try again.
I didn't manage to get the PointerHQ version running, but it is working fine now with the official version. Thanks!