# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is good enough for most inference jobs
    * if you need the best performance, use TensorRT

|                                                            | Nvidia TensorRT                                       | :material-microsoft: Microsoft ONNX Runtime             | :material-facebook: Meta Pytorch            | comments                                                                                                                              |
|:-----------------------------------------------------------|:------------------------------------------------------|:---------------------------------------------------------|:----------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support            | :material-check:                                       | :material-check:                                          | :material-cancel:                              |                                                                                                                                            |
| :material-license: Licence                                 | Apache 2, optimization engine is closed source         | MIT                                                       | Modified BSD                                   |                                                                                                                                            |
| :material-api: ease of use (API)                           | :fontawesome-regular-angry:                            | :material-check-all:                                      | :material-check-all:                           | Nvidia has chosen not to hide technical details + an optimized model is tied to a single `hardware + model + data shapes` combination      |
| :material-file-document-edit: ease of use (documentation)  | :material-spider-thread:<br>(spread out, incomplete)   | :material-check:<br>(improving)                           | :material-check-all:<br>(strong community)     |                                                                                                                                            |
| :octicons-cpu-16: Hardware support                         | :material-check:<br>GPU + Jetson                       | :material-check-all:<br>CPU + GPU + IoT + Edge + Mobile   | :material-check:<br>CPU + GPU                  |                                                                                                                                            |
| :octicons-stopwatch-16: Performance                        | :material-speedometer:                                 | :material-speedometer-medium:                             | :material-speedometer-slow:                    | TensorRT is usually 5 to 10X faster than Pytorch when quantization, etc., is used                                                          |
| :material-target: Accuracy                                 | :material-speedometer-medium:                          | :material-speedometer:                                    | :material-speedometer:                         | TensorRT optimizations may be too aggressive and decrease model accuracy; manual tuning is required to recover it                          |
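To make the "computation only" role concrete, here is a minimal sketch of running a transformer with the ONNX Runtime Python API. The model path `model.onnx` and the tensor names `input_ids` / `attention_mask` are illustrative assumptions, not values taken from this project:

```python
# Minimal ONNX Runtime inference sketch (illustration, not transformer-deploy code).
# Assumes an exported transformer at "model.onnx" with hypothetical
# input names "input_ids" / "attention_mask".
import numpy as np
import onnxruntime as ort

# The engine only executes the graph; falls back to CPU if no GPU is available.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# A single pre-tokenized sequence (dummy token ids).
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# None -> return all model outputs.
outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
print(outputs[0].shape)
```

Everything around this single call — request handling, batching, monitoring — is the job of the inference server, which is the subject of the next section.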
## Inference HTTP/GRPC server

|                                                            | Nvidia Triton           | :material-facebook: Meta TorchServe   | FastAPI                       | comments                                                                                        |
|:-----------------------------------------------------------|:-------------------------|:----------------------------------------|:--------------------------------|:----------------------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support            | :material-check:          | :material-cancel:                        | :material-cancel:                |                                                                                                       |
| :material-license: Licence                                 | Modified BSD              | Apache 2                                 | MIT                              |                                                                                                       |
| :material-api: ease of use (API)                           | :material-check:          | :material-check:                         | :material-check-all:             | as a classic HTTP server, FastAPI may appear easier to use                                            |
| :material-file-document-edit: ease of use (documentation)  | :material-check:          | :material-check:                         | :material-check-all:             | FastAPI's documentation is among the most beautiful ever written!                                     |
| :octicons-stopwatch-16: Performance                        | :material-speedometer:    | :material-speedometer-medium:            | :material-speedometer-slow:      | FastAPI is 6 to 10X slower than Triton at handling user queries                                       |
| **Support**                                                |                           |                                          |                                  |                                                                                                       |
| :octicons-cpu-16: CPU                                      | :material-check:          | :material-check:                         | :material-check:                 |                                                                                                       |
| :octicons-cpu-16: GPU                                      | :material-check:          | :material-check:                         | :material-check:                 |                                                                                                       |
| dynamic batching                                           | :material-check:          | :material-check:                         | :material-cancel:                | combine individual inference requests to improve inference throughput                                 |
| concurrent model execution                                 | :material-check:          | :material-check:                         | :material-cancel:                | run multiple models (or multiple instances of the same model) in parallel                             |
| pipeline                                                   | :material-check:          | :material-cancel:                        | :material-cancel:                | chain one or more models, connecting the output tensors of one to the input tensors of the next       |
| native multiple backends* support                          | :material-check:          | :material-cancel:                        | :material-check:                 | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch                                      |
| REST API                                                   | :material-check:          | :material-check:                         | :material-check:                 |                                                                                                       |
| GRPC API                                                   | :material-check:          | :material-check:                         | :material-cancel:                |                                                                                                       |
| Inference metrics                                          | :material-check:          | :material-check:                         | :material-cancel:                | GPU utilization, server throughput, and server latency                                                |

--8<-- "resources/abbreviations.md"
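For contrast, below is a minimal FastAPI serving sketch (an illustration, not code from this project), reusing the same hypothetical `model.onnx` and tensor names as above. Each request is computed on its own: there is no dynamic batching, no GRPC endpoint, and no built-in inference metrics, which is part of why Triton handles user queries so much faster.

```python
# Minimal FastAPI serving sketch (illustration, not transformer-deploy code).
# Every request runs the model alone: no dynamic batching, no GRPC,
# no inference metrics -- the features Triton provides out of the box.
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Hypothetical exported model, same assumption as in the engine sketch above.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])


class Query(BaseModel):
    input_ids: list[int]  # a single pre-tokenized sequence


@app.post("/predict")
def predict(query: Query) -> list[float]:
    ids = np.array([query.input_ids], dtype=np.int64)
    mask = np.ones_like(ids)
    logits = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
    return logits[0].tolist()
```

The simplicity is the appeal: a few lines give you a documented REST endpoint. The cost is that every capability in the support rows above (batching requests together, running model instances in parallel, chaining models) has to be built by hand.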