# Some theory

## A (very) short intro to INT-8 quantization

The basic idea behind model quantization is to replace tensors of float numbers (usually encoded on 32 bits) with a lower precision representation (integers encoded on 8 bits for Nvidia GPUs).
Computation gets faster and the model memory footprint shrinks. Smaller tensor storage also means faster memory transfers... which is another source of acceleration.
The trade-off makes this approach very attractive: you reduce inference time significantly, and it costs close to nothing in accuracy.

Replacing float numbers with integers is done through a mapping.
Computing this mapping is called `calibration`: for each tensor, or for each channel of a tensor (one of its dimensions), it finds a range covering most of the values, then derives a scale and a zero point used to map float numbers to 8-bit integers.
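
To make the mapping concrete, below is a minimal sketch (not taken from any library) of symmetric per-tensor quantization; the tensor and values are made up for illustration:

```python
import torch

# Toy tensor standing in for a layer weight (made-up values).
x = torch.randn(4, 8)

# Calibration: find a range covering most values (here, the max absolute value),
# then derive a scale mapping floats onto the int8 range [-127, 127].
amax = x.abs().max()
scale = amax / 127.0

# Quantize: map floats to 8-bit integers...
x_int8 = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)

# ...and dequantize to recover an approximation of the original tensor.
x_approx = x_int8.float() * scale

# The reconstruction error is bounded by roughly half the scale.
print((x - x_approx).abs().max())
```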

There are several ways to perform quantization, depending on how and when the `calibration` is performed:

* dynamically: the mapping is computed online, during inference. There is some overhead, but it is usually the easiest approach to leverage, as the end user has very little configuration to set (see the sketch below),
* statically, after training (`post training quantization` or `PTQ`): efficient because the calibration is done offline, before inference, but it may cost some accuracy,
* statically, with an additional training step (`quantization aware training` or `QAT`): like a `PTQ` followed by a second fine tuning. Same inference efficiency, but usually slightly better accuracy.

Nvidia GPUs don't support dynamic quantization; CPUs support all types of quantization.  
Compared to `PTQ`, `QAT` better preserves accuracy and should be preferred in most cases.
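
As an illustration of the dynamic case, here is a minimal sketch using stock `Pytorch` on CPU; the toy model is made up for the example:

```python
import torch

# Tiny model standing in for a real network (hypothetical architecture).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Dynamic quantization: weights are converted to int8 ahead of time,
# activations are quantized on the fly at inference time. CPU only,
# and almost no configuration is required.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    output = quantized_model(torch.randn(1, 128))
```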

During the *quantization aware training*:

* on the inside, Pytorch trains with high precision float numbers,
* on the outside, Pytorch simulates that quantization has already been applied and outputs results accordingly (for loss computation, for instance).

The simulation is performed by adding quantization / dequantization nodes, most often called `QDQ`; it's an abbreviation you will see often in the quantization world.
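
As a rough illustration of what a `QDQ` node does, here is a sketch built on Pytorch's fake quantization operator; the tensor and scale are made up for the example:

```python
import torch

x = torch.randn(4, 8, requires_grad=True)

# A QDQ node quantizes to int8 and immediately dequantizes back to float:
# the output stays in float precision but carries the quantization error,
# so the loss "sees" the effect of int8 while training still runs in float.
scale = (x.detach().abs().max() / 127.0).item()
x_qdq = torch.fake_quantize_per_tensor_affine(x, scale, 0, -127, 127)

# Gradients flow through the QDQ node (straight-through estimator),
# which is what makes quantization aware training possible.
x_qdq.sum().backward()
```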

!!! info "Want to learn more about quantization?"

    * You can check this [high quality blog post](https://leimao.github.io/article/Neural-Networks-Quantization/) for more information.
    * The process is well described in this [Nvidia presentation](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf).

## Why does it matter?

CPU quantization is supported out of the box by `Pytorch` and `ONNX Runtime`.
**GPU quantization, on the other hand, requires specific tools and processes to be applied**.

In the specific case of `Transformer` models, a few demos from Nvidia and Microsoft exist, but they all target the old vanilla Bert architecture.

Modern architectures like `Albert`, `Roberta`, `Deberta` or `Electra` are not supported out of the box.

--8<-- "resources/abbreviations.md"