# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is a good enough API for most inference jobs (see the sketch below)
    * if you need the best performance, use TensorRT
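
Below is a minimal sketch of what "good enough" looks like in practice: running an already exported ONNX transformer on GPU with ONNX Runtime. The file name `model.onnx` and the input tensor names are assumptions for illustration, not files shipped by this project.

```python
import numpy as np
import onnxruntime as ort

# CUDAExecutionProvider is tried first, with a CPU fallback.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# In a real pipeline the token ids come from a tokenizer; random ids
# are used here only to keep the sketch self-contained.
inputs = {
    "input_ids": np.random.randint(0, 1000, size=(1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
outputs = session.run(None, inputs)  # None -> return all model outputs
print(outputs[0].shape)
```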

|                                                           | Nvidia TensorRT                                         | :material-microsoft: Microsoft ONNX Runtime                | :material-facebook: Meta Pytorch              | comments                                                                                                                        |
|:----------------------------------------------------------|:--------------------------------------------------------|:-----------------------------------------------------------|:----------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support           | :material-check:                                        | :material-check:                                           | :material-cancel:                             |                                                                                                                                 |
| :material-license: Licence                                | Apache 2, optimization engine is closed source          | MIT                                                        | Modified BSD                                  |                                                                                                                                 |
| :material-api: ease of use (API)                          | :fontawesome-regular-angry:                             | :material-check-all:                                       | :material-check-all:                          | Nvidia chose not to hide technical details; moreover, a built model is specific to a single `hardware + model + data shapes` combination |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: <br/> (spread out, incomplete) | :material-check: <br/> (improving)                         | :material-check-all: <br/> (strong community) |                                                                                                                                 |
| :octicons-cpu-16: Hardware support                        | :material-check: <br/> GPU + Jetson                     | :material-check-all: <br/> CPU + GPU + IoT + Edge + Mobile | :material-check: <br/> CPU + GPU              |                                                                                                                                 |
| :octicons-stopwatch-16: Performance                       | :material-speedometer:                                  | :material-speedometer-medium:                              | :material-speedometer-slow:                   | TensorRT is usually 5 to 10X faster than Pytorch when you use quantization, etc.                                                | 
| :material-target: Accuracy                                | :material-speedometer-medium:                           | :material-speedometer:                                     | :material-speedometer:                        | TensorRT optimizations may be a bit too aggressive and decrease model accuracy. Manual modifications are required to recover it. |
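
The `hardware + model + data shapes` constraint mentioned above comes from the engine build step. The sketch below is a rough illustration of the TensorRT 8.x Python API; the ONNX file name, input names, and shape ranges are assumptions for illustration. The resulting `.plan` file is only valid on the GPU it was built on and for the shape ranges declared here.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX graph into a TensorRT network definition.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

# The engine is optimized for the (min, opt, max) shapes declared here;
# requests outside these ranges are rejected at inference time.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 16), (1, 128), (8, 512))
profile.set_shape("attention_mask", (1, 16), (1, 128), (8, 512))
config.add_optimization_profile(profile)

# Serialized engine, tied to this GPU architecture and TensorRT version.
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```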

## Inference HTTP/GRPC server

|                                                           | Nvidia Triton          | :material-facebook: Meta TorchServe | FastAPI                     | comments                                                                               |
|:----------------------------------------------------------|:-----------------------|:------------------------------------|:----------------------------|:---------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support           | :material-check:       | :material-cancel:                   | :material-cancel:           |                                                                                        |
| :material-license: Licence                                | Modified BSD           | Apache 2                            | MIT                         |                                                                                        |
| :material-api: ease of use (API)                          | :material-check:       | :material-check:                    | :material-check-all:        | As a classic HTTP server, FastAPI may appear easier to use                             |
| :material-file-document-edit: ease of use (documentation) | :material-check:       | :material-check:                    | :material-check-all:        | FastAPI has some of the most beautiful documentation ever!                             |
| :octicons-stopwatch-16:  Performance                      | :material-speedometer: | :material-speedometer-medium:       | :material-speedometer-slow: | FastAPI is 6-10X slower than Triton at managing user queries                           |
| **Support**                                               |                        |                                     |                             |                                                                                        |
| :octicons-cpu-16: CPU                                     | :material-check:       | :material-check:                    | :material-check:            |                                                                                        |
| :octicons-cpu-16: GPU                                     | :material-check:       | :material-check:                    | :material-check:            |                                                                                        |
| dynamic batching                                          | :material-check:       | :material-check:                    | :material-cancel:           | combine individual inference requests server-side to improve inference throughput      |
| concurrent model execution                                | :material-check:       | :material-check:                    | :material-cancel:           | run multiple models (or multiple instances of the same model) simultaneously           |
| pipeline                                                  | :material-check:       | :material-cancel:                   | :material-cancel:           | chain one or more models by connecting the output tensors of one model to the input tensors of the next |
| native multiple backends* support                         | :material-check:       | :material-cancel:                   | :material-check:            | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch                       |
| REST API                                                  | :material-check:       | :material-check:                    | :material-check:            |                                                                                        |
| GRPC API                                                  | :material-check:       | :material-check:                    | :material-cancel:           |                                                                                        |
| Inference metrics                                         | :material-check:       | :material-check:                    | :material-cancel:           | GPU utilization, server throughput, and server latency                                 |
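
For completeness, here is a rough sketch of querying a model hosted on Triton over HTTP with the `tritonclient` Python package; the model name, tensor names, and shapes are assumptions that depend on how the model repository is configured.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Declare the input tensors expected by the deployed model.
input_ids = httpclient.InferInput("input_ids", [1, 16], "INT64")
input_ids.set_data_from_numpy(np.random.randint(0, 1000, (1, 16), dtype=np.int64))
attention_mask = httpclient.InferInput("attention_mask", [1, 16], "INT64")
attention_mask.set_data_from_numpy(np.ones((1, 16), dtype=np.int64))

# Triton handles dynamic batching and scheduling server side.
result = client.infer(
    model_name="transformer_onnx_model",
    inputs=[input_ids, attention_mask],
)
print(result.as_numpy("output").shape)
```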



--8<-- "resources/abbreviations.md"