	# Nvidia GPU INT-8 quantization

	## What is it about?

	Quantization is one of the most effective and generic approaches to make model inference faster.
	Basically, it replaces the high-precision floating-point numbers in model tensors, encoded in 32 or 16 bits, with lower-precision ones encoded in 8 bits or less:

	* it takes less memory
	* computation is easier / faster

	It can be applied to any model in theory, and, if done well, it should maintain accuracy.
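To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT-8 quantization using only the Python standard library. It is an illustration of the general principle, not the calibration procedure used by this library or by TensorRT; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # A single float scale is shared by the whole tensor; guard against all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, stored as 32-bit floats (4 bytes each).
weights = array("f", [0.12, -0.5, 0.33, 1.27, -1.0, 0.0])
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = q_weights.itemsize * len(q_weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", list(q_weights))
print("storage in bytes, FP32 vs INT-8:", fp32_bytes, "vs", int8_bytes)
print("largest rounding error:", max_error)
```

The 8-bit copy takes a quarter of the storage, and each value is off by at most half a quantization step (`scale / 2`). Keeping that rounding error from hurting accuracy is what "done well" refers to above.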

	The purpose of this notebook is to show a process to perform quantization on any `Transformer` architecture.

	Moreover, the library is designed to offer a simple API and still let advanced users tweak the algorithm.

	## Benchmark

	!!! tip "TL;DR"

	    We benchmarked PyTorch, ONNX Runtime and Nvidia TensorRT on CPU and GPU, with and without quantization. Our method provides the fastest inference by a large margin.

	| Framework             | Precision | Latency (ms) | Accuracy | Speedup | Hardware |
	|:----------------------|-----------|--------------|----------|:--------|:--------:|
	| PyTorch               | FP32      | 4267         | 86.6 %   | X 0.02  | CPU      |
	| PyTorch               | FP16      | 4428         | 86.6 %   | X 0.02  | CPU      |
	| PyTorch               | INT-8     | 3300         | 85.9 %   | X 0.02  | CPU      |
	| PyTorch               | FP32      | 77           | 86.6 %   | X 1     | GPU      |
	| PyTorch               | FP16      | 56           | 86.6 %   | X 1.38  | GPU      |
	| ONNX Runtime          | FP32      | 76           | 86.6 %   | X 1.01  | GPU      |
	| ONNX Runtime          | FP16      | 34           | 86.6 %   | X 2.26  | GPU      |
	| ONNX Runtime          | FP32      | 4023         | 86.6 %   | X 0.02  | CPU      |
	| ONNX Runtime          | FP16      | 3957         | 86.6 %   | X 0.02  | CPU      |
	| ONNX Runtime          | INT-8     | 3336         | 86.5 %   | X 0.02  | CPU      |
	| TensorRT              | FP16      | 30           | 86.6 %   | X 2.57  | GPU      |
	| TensorRT (our method) | INT-8     | 17           | 86.2 %   | X 4.53  | GPU      |


	!!! note

	    Measurements were taken on an Nvidia RTX 3090 GPU and a 12-core Intel i7 CPU (with AVX-2 support).
	    The model is a Roberta `base` architecture, run with a batch size of 32 and a sequence length of 256. Similar results were obtained for other batch sizes and sequence lengths not included in the table.
	    Accuracy was measured after a single epoch, with no learning rate search or other hyperparameter optimization.
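The Speedup column in the table can be reproduced from the latencies alone. The sketch below assumes the baseline is the PyTorch FP32 GPU row (the one listed as X 1), which the table implies but does not state. The dictionary layout and the `speedup` helper are illustrative only.

```python
# Latencies in milliseconds, copied from the GPU rows of the benchmark table.
latencies_ms = {
    ("PyTorch", "FP32"): 77,
    ("PyTorch", "FP16"): 56,
    ("ONNX Runtime", "FP32"): 76,
    ("ONNX Runtime", "FP16"): 34,
    ("TensorRT", "FP16"): 30,
    ("TensorRT (our method)", "INT-8"): 17,
}

# Assumption: the reference point is the row whose speedup is listed as X 1.
baseline_ms = latencies_ms[("PyTorch", "FP32")]


def speedup(latency_ms, baseline=baseline_ms):
    """Return how many times faster a setup is than the baseline."""
    return baseline / latency_ms


for (framework, precision), latency in latencies_ms.items():
    print(f"{framework:<22} {precision:<6} X {speedup(latency):.2f}")
```

The same formula applied to the CPU rows gives values around 0.02, for example 77 / 4267 for PyTorch FP32 on CPU.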

	Check the end-to-end demo to see where these numbers come from.

	--8<-- "resources/abbreviations.md"