|
# Optimizing a model for inference |
|
|
|
There are a few ways to optimize a model for inference; some of them boil down to a simplification of its graph:
|
|
|
* :material-delete: `find and remove redundant operations`: for instance, dropout has no use outside the training loop and can be removed without any impact on inference results;
|
* :material-find-replace: `perform constant folding`: find the parts of the graph made of constant expressions and compute their results at compile time instead of runtime (similar to what most programming language compilers do);
|
* :material-merge: `kernel fusion`: merge several operations into a single kernel to 1/ avoid kernel loading time, 2/ share memory and avoid back and forth transfers with the global memory, and 3/ use an optimal implementation of the whole series of operations. It mainly benefits memory bound operations (like multiply and add operations, a very common pattern in deep learning). A sketch of how these optimizations can be enabled follows this list.
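
As an illustration only (this is not this library's API), ONNX Runtime exposes the three graph simplifications above through its graph optimization levels; `"model.onnx"` and the output path are placeholder names:

```python
import onnxruntime as ort

options = ort.SessionOptions()
# BASIC: remove redundant nodes (identity, dropout, ...) and apply constant folding
# EXTENDED: additionally apply kernel fusions (e.g. GELU or attention fusion)
# ALL: EXTENDED plus memory layout optimizations
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# optionally dump the optimized graph to disk to inspect which nodes were removed or fused
options.optimized_model_filepath = "model-optimized.onnx"

session = ort.InferenceSession(
    "model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)
```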
|
|
|
Another, orthogonal, approach is to use lower precision tensors, either FP16 floating point numbers or INT-8 quantization.
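
Here is a minimal sketch of both options using plain PyTorch (again, not this library's API); `model` and `inputs` are stand-ins for a real network and its input batch:

```python
import torch

model = torch.nn.Linear(768, 768).eval().cuda()  # placeholder for a real model
inputs = torch.randn(1, 768, device="cuda")

# FP16 / mixed precision: matmul-heavy operators run in half precision
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    fp16_output = model(inputs)

# INT-8 dynamic quantization: weights of linear layers are stored as INT-8,
# activations are quantized on the fly at runtime (CPU backend)
int8_model = torch.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
)
int8_output = int8_model(inputs.cpu())
```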
|
|
|
|
|
|
!!! attention |
|
|
|
    Mixed precision and INT-8 quantization may have an accuracy cost.
|
    The reason is that you can't encode as much information in an FP16 or INT-8 tensor as you can in an FP32 tensor.
|
    Sometimes you do not have enough granularity; other times the range is not wide enough.
|
    When that happens, you need to modify the computation graph to keep some operators in full precision.
|
    This library does it for mixed precision (for most models) and provides you with a simple way to do it for INT-8 quantization.
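
For illustration, here is one way to keep a numerically sensitive operator in full precision while the rest of the network runs in FP16, using PyTorch autocast regions (a sketch, not necessarily this library's exact mechanism; `encoder` is a placeholder module):

```python
import torch

def forward_mixed(encoder: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        hidden = encoder(x)  # runs in FP16 where it is safe to do so
        with torch.autocast(device_type="cuda", enabled=False):
            # softmax can lose granularity or overflow in FP16, so force FP32 here
            return torch.softmax(hidden.float(), dim=-1)
```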
|
|
|
--8<-- "resources/abbreviations.md" |