arxiv:2410.09426

FlatQuant: Flatness Matters for LLM Quantization

Published on Oct 12
· Submitted by lianlio on Oct 18
Abstract

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach to enhance the flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments show that FlatQuant sets a new state-of-the-art quantization benchmark. For instance, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x with QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding. Code is available at: https://github.com/ruikangliu/FlatQuant.
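
To make the Kronecker trick in the abstract concrete, below is a minimal PyTorch sketch (not the released implementation; shapes, names, and the plain einsum in place of the paper's fused kernel are assumptions for illustration). It applies x @ (P1 ⊗ P2) by reshaping the activations rather than materializing the full d×d transform; the matching inverse transform would be folded into the layer's weights offline so the matmul output is unchanged before quantization.

```python
# Minimal sketch: apply a Kronecker-factored affine transform to activations
# before quantization, without building the full (d1*d2) x (d1*d2) matrix.
import torch

def kron_transform(x, P1, P2):
    """Compute x @ (P1 kron P2) for x of shape (..., d1 * d2).

    Reshaping x row-major to X of shape (..., d1, d2), the product
    x @ (P1 kron P2) equals P1^T @ X @ P2, reshaped back to (..., d1 * d2).
    """
    d1, d2 = P1.shape[0], P2.shape[0]
    X = x.reshape(*x.shape[:-1], d1, d2)
    Y = torch.einsum("...ij,ik,jl->...kl", X, P1, P2)
    return Y.reshape(*x.shape[:-1], d1 * d2)

# Equivalence check against the explicit Kronecker product (small sizes).
torch.manual_seed(0)
d1, d2 = 4, 8
x = torch.randn(2, d1 * d2)
P1, P2 = torch.randn(d1, d1), torch.randn(d2, d2)
ref = x @ torch.kron(P1, P2)
assert torch.allclose(kron_transform(x, P1, P2), ref, atol=1e-4)
```

Since P1 and P2 are d1×d1 and d2×d2 with d = d1·d2, the per-token cost of the online transform drops from O(d²) to O(d·(d1 + d2)), which is why the decomposition keeps the runtime overhead small.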

Community

Comment from the paper author and submitter:

The contributions of this work are summarized below:

  • We highlight the significance of achieving flatness for LLM quantization, demonstrating that flat distributions of weights and activations facilitate quantization and reduce error propagation across Transformer layers.
  • We introduce FlatQuant, a new post-training quantization method with fast and learnable affine transformations optimized for each linear layer. The approach is empirically demonstrated to enhance the flatness of both weights and activations in LLMs.
  • Extensive experiments demonstrate that FlatQuant sets new state-of-the-art results for quantization. To the best of our knowledge, we are the first to achieve ≤ 1% accuracy drop with simple round-to-nearest W4A4 quantization on the LLaMA-3-70B model (a minimal round-to-nearest sketch follows this list).
  • We have designed an efficient kernel that fuses affine transformation and quantization, reducing the additional latency caused by transformation from a 0.26x slowdown with QuaRot to only 0.07x. This gives up to 2.3x speedup for prefill and 1.7x speedup for decoding compared to the FP16 baseline.
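
As a point of reference for the W4A4 claim above, here is a minimal sketch of symmetric round-to-nearest (RTN) fake-quantization in PyTorch. The per-row (weights) and per-token (activations) granularity and the clipping range are illustrative assumptions, not necessarily the paper's exact quantizer configuration.

```python
# Minimal sketch of symmetric round-to-nearest (RTN) 4-bit fake-quantization.
import torch

def rtn_quantize(t, n_bits=4, dim=-1):
    """Symmetric RTN quantize-dequantize along `dim` (per-row / per-token)."""
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for signed 4-bit
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale                                   # back to float ("fake" quant)

# W4A4: both weights and activations pass through the 4-bit quantizer.
W = torch.randn(4096, 4096)   # hypothetical linear-layer weight
x = torch.randn(8, 4096)      # hypothetical activations (8 tokens)
y_q = rtn_quantize(x) @ rtn_quantize(W).T
```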

The code is available at https://github.com/ruikangliu/FlatQuant.
