Inference time and memory usage of quantized versions
I've tested Qwen2-VL-2B-Instruct and its quantized versions (e.g., GPTQ-Int8, 4-bit, etc.) on different GPUs, including V100, RTX 4090, RTX 3060, and T4 (Colab). Despite the reduced GPU memory usage of the quantized models compared to the original 16-bit version, I observed an unexpected increase in inference time:
Int8 inference is slower than 16-bit.
4-bit inference time falls between 16-bit and Int8.
I would appreciate insights into why this increase in inference time occurs with quantized models and suggestions on how to optimize inference speed.
Additional Testing: This behavior was consistent across other models like Qwen2.5-Coder with different quantization methods (e.g., GPTQ, AWQ, and custom linear quantization). A sketch of the benchmark harness I used is below.
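For reference, this is roughly the timing/memory harness I use to compare checkpoints on a single GPU. The model IDs are placeholders for the FP16 and quantized variants being tested, and loading GPTQ/AWQ checkpoints assumes the corresponding backend packages (e.g., optimum with a GPTQ backend, or autoawq) are installed:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names -- substitute the FP16 and quantized variants you are testing.
MODEL_IDS = [
    "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int8",
]

def benchmark(model_id, prompt="Write a quicksort in Python.", new_tokens=128, runs=3):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="cuda"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up run so one-time kernel/setup costs are not counted.
    model.generate(**inputs, max_new_tokens=new_tokens)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        # Fix the output length (greedy decoding, forced token count) for a fair comparison.
        model.generate(
            **inputs,
            do_sample=False,
            min_new_tokens=new_tokens,
            max_new_tokens=new_tokens,
        )
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / runs

    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{model_id}: {elapsed:.2f} s/run, peak memory {peak_gb:.2f} GiB")
    del model
    torch.cuda.empty_cache()

for model_id in MODEL_IDS:
    benchmark(model_id)
```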
I notice the same thing. I know quantization reduces memory, but I'm not sure how it affects the actual computation during inference.
yes, "In the paper on GPTQ, the authors claimed that quantization reduces memory and improves inference speed, achieving up to 3.25x speedup on A100 and 4.5x on A6000. Similarly, AWQ preserves only 1% of the most important weights to minimize quantization error and, with optimizations in TinyChat, achieves more than 3× speedup over FP16 implementations on both desktop and mobile GPUs[1][2]."
[1] Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint arXiv:2210.17323 (2022).
[2] Lin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." Proceedings of Machine Learning and Systems 6 (2024): 87-100.
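The catch is that those reported speedups come from fused low-bit kernels (e.g., ExLlama for GPTQ, TinyChat for AWQ); with a plain dequantize-then-matmul path, the extra dequantization work can easily make int8/int4 slower than FP16. If I remember correctly, you can select the kernel when loading a pre-quantized GPTQ checkpoint in transformers roughly like this (the model ID is a placeholder, and it assumes optimum plus a GPTQ backend are installed):

```python
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

# Placeholder model ID -- substitute your 4-bit GPTQ checkpoint.
model_id = "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4"

# Request the ExLlamaV2 kernels for the 4-bit GPTQ linear layers.
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    # Overrides the kernel choice stored in the checkpoint's quantization config.
    quantization_config=gptq_config,
)
```

Note that, as far as I know, the ExLlama kernels only cover 4-bit weights; 8-bit GPTQ falls back to slower paths, which may be part of why Int8 was the slowest in your tests.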
"I haven’t tried it yet, but there is another quantization method, GWQ, which further optimizes quantization by leveraging gradients to detect and retain outlier weights at FP16, achieving 1.2× speedup while maintaining higher accuracy in zero-shot tasks[3].
[3] Shao, Yihua, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu, Minxi Yan, Haiyang Liu et al. "GWQ: Gradient-Aware Weight Quantization for Large Language Models." arXiv preprint arXiv:2411.00850 (2024).