# High-level comparison
## Inference engine
The inference engine performs the computation only; it does not manage the communication layer (HTTP/GRPC API, etc.).
!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is good enough for most inference jobs
    * if you need the best performance, use TensorRT
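To make the ONNX Runtime point concrete, below is a minimal inference sketch in Python; the model path and tensor names are placeholder assumptions, not artifacts of this repository:

```python
# minimal sketch: running an exported transformer with ONNX Runtime
import numpy as np
import onnxruntime as ort

# the CUDA provider is tried first, with a CPU fallback
session = ort.InferenceSession(
    "model.onnx",  # placeholder: any ONNX export of your model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# dummy batch: 2 sequences of 16 tokens (tensor names are placeholders)
inputs = {
    "input_ids": np.ones((2, 16), dtype=np.int64),
    "attention_mask": np.ones((2, 16), dtype=np.int64),
}
logits = session.run(None, inputs)[0]  # None -> return all outputs
print(logits.shape)
```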
| | Nvidia TensorRT | :material-microsoft: Microsoft ONNX Runtime | :material-facebook: Meta Pytorch | comments |
|:----------------------------------------------------------|:--------------------------------------------------------|:-----------------------------------------------------------|:----------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-check: | :material-cancel: | |
| :material-license: Licence | Apache 2, optimization engine is closed source | MIT | Modified BSD | |
| :material-api: ease of use (API) | :fontawesome-regular-angry: | :material-check-all: | :material-check-all: | Nvidia chose not to hide technical details; moreover, each optimized model is tied to a single `hardware + model + data shapes` combination |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: <br/> (spread out, incomplete) | :material-check: <br/> (improving) | :material-check-all: <br/> (strong community) | |
| :octicons-cpu-16: Hardware support | :material-check: <br/> GPU + Jetson | :material-check-all: <br/> CPU + GPU + IoT + Edge + Mobile | :material-check: <br/> CPU + GPU | |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | TensorRT is usually 5 to 10X faster than Pytorch when combined with quantization, etc. |
| :material-target: Accuracy | :material-speedometer-medium: | :material-speedometer: | :material-speedometer: | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; recovering it requires manual modifications |
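The `hardware + model + data shapes` specificity noted above shows up directly when building a TensorRT engine: shape ranges are baked in at build time, and the serialized plan only runs on the GPU it was built for. A minimal build sketch, assuming an ONNX export named `model.onnx` with an `input_ids` input (both placeholders):

```python
# minimal sketch: compiling an ONNX model into a TensorRT engine
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision for speed

# the engine is specialized for this (min, optimal, max) shape range
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 16), (8, 128), (32, 512))
config.add_optimization_profile(profile)

# the serialized plan is tied to the local GPU + TensorRT version
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```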
## Inference HTTP/GRPC server
| | Nvidia Triton | :material-facebook: Meta TorchServe | FastAPI | comments |
|:----------------------------------------------------------|:-----------------------|:------------------------------------|:----------------------------|:---------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-cancel: | :material-cancel: | |
| :material-license: Licence | Modified BSD | Apache 2 | MIT | |
| :material-api: ease of use (API) | :material-check: | :material-check: | :material-check-all: | As a classic HTTP server, FastAPI may appear easier to use |
| :material-file-document-edit: ease of use (documentation) | :material-check: | :material-check: | :material-check-all: | FastAPI has some of the most beautiful documentation out there! |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | FastAPI is 6-10X slower than Triton at handling user queries |
| **Support** | | | | |
| :octicons-cpu-16: CPU | :material-check: | :material-check: | :material-check: | |
| :octicons-cpu-16: GPU | :material-check: | :material-check: | :material-check: | |
| dynamic batching | :material-check: | :material-check: | :material-cancel: | combine individual inference requests together to improve inference throughput |
| concurrent model execution | :material-check: | :material-check: | :material-cancel: | run multiple models (or multiple instances of the same model) in parallel on the same device |
| pipeline | :material-check: | :material-cancel: | :material-cancel: | chain one or more models, connecting the output tensors of one model to the input tensors of the next |
| native multiple backends* support | :material-check: | :material-cancel: | :material-check: | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API | :material-check: | :material-check: | :material-check: | |
| GRPC API | :material-check: | :material-check: | :material-cancel: | |
| Inference metrics | :material-check: | :material-check: | :material-cancel: | GPU utilization, server throughput, and server latency |
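As an illustration of the REST API row, a minimal Triton HTTP client sketch in Python; the server URL, model name and tensor names are placeholder assumptions:

```python
# minimal sketch: querying a model served by Nvidia Triton over HTTP
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# dummy batch of 1 sequence of 16 tokens (names are placeholders)
input_ids = np.ones((1, 16), dtype=np.int64)
tensor = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
tensor.set_data_from_numpy(input_ids)

response = client.infer(model_name="transformer_onnx", inputs=[tensor])
print(response.as_numpy("output"))  # "output" is a placeholder name
```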
--8<-- "resources/abbreviations.md"