|
# Nvidia GPU INT-8 quantization |
|
|
|
## What is it about? |
|
|
|
Quantization is one of the most effective and generic approaches to make model inference faster. |
|
Basically, it replaces high precision float numbers in model tensors encoded in 32 or 16 bits by lower precision ones encoded in 8 bits or less: |
|
|
|
* it takes less memory |
|
* computation is easier / faster |
|
|
|
It can be applied to any model in theory, and, if done well, it should maintain accuracy. |
|
|
|
The purpose of this notebook is to show a process to perform quantization on any `Transformer` architectures. |
|
|
|
Moreover, the library is designed to offer a simple API and still let advanced users tweak the algorithm. |
|
|
|
## Benchmark |
|
|
|
!!! tip "TL;DR" |
|
|
|
We benchmarked Pytorch and Nvidia TensorRT, on both CPU and GPU, with/without quantization, our methods provide the fastest inference by large margin. |
|
|
|
| Framework | Precision | Latency (ms) | Accuracy | Speedup | Hardware | |
|
|:--------------------------|-----------|--------------|----------|:-----------|:--------:| |
|
| Pytorch | FP32 | 4267 | 86.6 % | X 0.02 | CPU | |
|
| Pytorch | FP16 | 4428 | 86.6 % | X 0.02 | CPU | |
|
| Pytorch | INT-8 | 3300 | 85.9 % | X 0.02 | CPU | |
|
| Pytorch | FP32 | 77 | 86.6 % | X 1 | GPU | |
|
| Pytorch | FP16 | 56 | 86.6 % | X 1.38 | GPU | |
|
| ONNX Runtime | FP32 | 76 | 86.6 % | X 1.01 | GPU | |
|
| ONNX Runtime | FP16 | 34 | 86.6 % | X 2.26 | GPU | |
|
| ONNX Runtime | FP32 | 4023 | 86.6 % | X 0.02 | CPU | |
|
| ONNX Runtime | FP16 | 3957 | 86.6 % | X 0.02 | CPU | |
|
| ONNX Runtime | INT-8 | 3336 | 86.5 % | X 0.02 | CPU | |
|
| TensorRT | FP16 | 30 | 86.6 % | X 2.57 | GPU | |
|
| TensorRT (**our method**) | **INT-8** | **17** | 86.2 % | **X 4.53** | **GPU** | |
|
|
|
|
|
!!! note |
|
|
|
measures done on a Nvidia RTX 3090 GPU + 12 cores i7 Intel CPU (support AVX-2 instruction) |
|
Roberta `base` architecture flavor with batch of size 32 / seq len 256, similar results obtained for other sizes/seq len not included in the table. |
|
Accuracy obtained after a single epoch, no LR search or any hyper parameter optimization |
|
|
|
Check the end to end demo to see where these numbers are from. |
|
|
|
--8<-- "resources/abbreviations.md" |