# Some theory
## A (very) short intro to INT-8 quantization
The basic idea behind model quantization is to replace tensors of float numbers (usually encoded on 32 bits) by a lower precision representation (integers encoded on 8 bits in the case of Nvidia GPUs).
Computation on 8-bit integers is faster and the model memory footprint is lower. Smaller tensor storage also makes memory transfers faster... which is another source of acceleration.
The trade-off is what makes this approach so interesting: inference time is reduced significantly, and it costs close to nothing in accuracy.
Replacing float numbers by integers is done through a mapping.
Building this mapping is called `calibration`: for each tensor, or each channel of a tensor (one of its dimensions), a range covering most of the values is computed, then a scale and a distribution center (zero point) are derived to map float numbers to 8-bit integers.
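As an illustration, here is a minimal sketch of such a mapping. The function name and the naive min/max calibration are illustrative only; real calibrators use smarter range estimation (histogram or entropy based, for instance):

```python
import torch

def calibrate_and_quantize(t: torch.Tensor):
    # 1. calibration: find a range covering the observed values (naive min/max here)
    t_min, t_max = t.min(), t.max()
    # 2. derive a scale and zero point mapping that range onto int8 ([-128, 127])
    qmin, qmax = -128, 127
    scale = (t_max - t_min) / (qmax - qmin)
    zero_point = qmin - torch.round(t_min / scale)
    # 3. map float values to 8-bit integers
    q = torch.clamp(torch.round(t / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

# dequantization recovers an approximation of the original float tensor
t = torch.randn(4, 4)
q, scale, zero_point = calibrate_and_quantize(t)
t_approx = (q.float() - zero_point) * scale
```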
There are several ways to perform quantization, depending on how and when the `calibration` is performed:
* dynamically: the mapping is computed online, during inference; there is some overhead, but it's usually the easiest approach to leverage, with very little configuration left to the end user,
* statically, after training (`post training quantization` or `PTQ`): this way is efficient because quantization is done offline, before inference, but it may have an accuracy cost,
* statically, with an additional training step (`quantization aware training` or `QAT`): like a `PTQ` followed by a second fine-tuning. Same efficiency at inference time, but usually slightly better accuracy.
Nvidia GPUs don't support dynamic quantization; CPU supports all types of quantization.
Compared to `PTQ`, `QAT` better preserves accuracy and should be preferred in most cases.
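As an illustration of the dynamic approach on CPU, below is a minimal sketch using Pytorch's built-in dynamic quantization API (the toy model is only a placeholder; in practice you would pass your own trained model):

```python
import torch
from torch import nn

# toy model, for illustration only
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,            # model to quantize
    {nn.Linear},      # layer types whose weights are converted to int8
    dtype=torch.qint8 # target integer type
)

# weights are stored in int8, activations are quantized on the fly during inference
output = quantized_model(torch.randn(1, 128))
```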
During the *quantization aware training*:
* internally, Pytorch trains with high precision float numbers,
* externally, Pytorch simulates that quantization has already been applied and outputs results accordingly (for loss computation for instance).
The simulation is done by adding quantization / dequantization nodes, most often called `QDQ`, an abbreviation you will see often in the quantization world.
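Here is a minimal sketch of what a `QDQ` node does, using Pytorch's fake quantization operator (the scale and zero point values below are arbitrary, for illustration only):

```python
import torch

x = torch.randn(2, 3)
# quantize then dequantize: values stay in float, but they are rounded to the
# grid that int8 would impose, so training "sees" the quantization error
# args: input, scale, zero_point, quant_min, quant_max
x_qdq = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
# x_qdq is still a float tensor, but it only takes values representable in int8
print(x_qdq)
```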
!!! info "Want to learn more about quantization?"
    * You can check this [high quality blog post](https://leimao.github.io/article/Neural-Networks-Quantization/) for more information.
    * The process is well described in this [Nvidia presentation](https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf).
## Why does it matter?
CPU quantization is supported out of the box by `Pytorch` and `ONNX Runtime`.
**GPU quantization, on the other hand, requires specific tools and processes to be applied**.
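To give a sense of how simple the CPU path is, here is a minimal sketch using ONNX Runtime's built-in dynamic quantization API (file paths are placeholders):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",        # path to the FP32 ONNX model
    model_output="model-int8.onnx",  # path where the int8 model is written
    weight_type=QuantType.QInt8,     # store weights as signed 8-bit integers
)
```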
In the specific case of `Transformer` models, the few demos from Nvidia and Microsoft that exist all target the old vanilla Bert architecture.
Modern architectures like `Albert`, `Roberta`, `Deberta` or `Electra` are not supported out of the box.
--8<-- "resources/abbreviations.md"