
# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is good enough for most inference jobs (see the sketch below)
    * if you need the best performance, use TensorRT
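To make the ONNX Runtime recommendation concrete, below is a minimal sketch of running an already-exported transformer with ONNX Runtime instead of Pytorch. The file name, tokenizer and input tensor names are illustrative assumptions, not part of transformer-deploy's API; the input names must match the ones chosen at export time.

```python
# Minimal sketch: run an already-exported ONNX model with ONNX Runtime.
# Assumes "model.onnx" was produced beforehand (e.g. with torch.onnx.export)
# and that the onnxruntime-gpu package is installed.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU
)

encoded = tokenizer("ONNX Runtime is good enough for most inference jobs", return_tensors="np")
outputs = session.run(
    output_names=None,  # None returns every model output
    input_feed={
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
print(outputs[0].shape)
```

Everything here is standard ONNX Runtime usage; swapping the provider list is how the same file targets CPU, GPU, or edge hardware.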
|  | Nvidia TensorRT | :material-microsoft: Microsoft ONNX Runtime | :material-facebook: Meta Pytorch | comments |
|---|---|---|---|---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-check: | :material-cancel: | |
| :material-license: Licence | Apache 2 (the optimization engine is closed source) | MIT | Modified BSD | |
| :material-api: ease of use (API) | :fontawesome-regular-angry: | :material-check-all: | :material-check-all: | Nvidia has chosen not to hide technical details; a TensorRT model is tied to a single hardware + model + data shapes association |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: (spread out, incomplete) | :material-check: (improving) | :material-check-all: (strong community) | |
| :octicons-cpu-16: Hardware support | :material-check: GPU + Jetson | :material-check-all: CPU + GPU + IoT + Edge + Mobile | :material-check: CPU + GPU | |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | TensorRT is usually 5 to 10X faster than Pytorch when quantization, etc. are used (sketch below) |
| :material-target: Accuracy | :material-speedometer-medium: | :material-speedometer: | :material-speedometer: | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; manual changes are required to recover it |
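The "single hardware + model + data shapes association" noted above is what makes TensorRT both fast and rigid: an engine is built for one GPU model and one range of input shapes. Below is a hedged sketch of that build step using the TensorRT 8 Python API; the file names, input names and shape ranges are assumptions for illustration, and transformer-deploy hides this kind of logic behind its own tooling.

```python
# Minimal sketch: build a TensorRT engine from an ONNX file with FP16 enabled.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, a main source of the speed-up

# The engine is specialized for a range of (batch size, sequence length) shapes.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(32, 256))
profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(32, 256))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting `model.plan` file is only valid for the GPU architecture it was built on and for inputs inside the declared shape range, which is the trade-off behind the performance column above.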

## Inference HTTP/GRPC server

|  | Nvidia Triton | :material-facebook: Meta TorchServe | FastAPI | comments |
|---|---|---|---|---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-cancel: | :material-cancel: | |
| :material-license: Licence | Modified BSD | Apache 2 | MIT | |
| :material-api: ease of use (API) | :material-check: | :material-check: | :material-check-all: | as a classic HTTP server, FastAPI may appear easier to use |
| :material-file-document-edit: ease of use (documentation) | :material-check: | :material-check: | :material-check-all: | FastAPI has one of the most beautiful documentation sites ever! |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | FastAPI is 6 to 10X slower than Triton at managing user queries |
| **Support** | | | | |
| :octicons-cpu-16: CPU | :material-check: | :material-check: | :material-check: | |
| :octicons-cpu-16: GPU | :material-check: | :material-check: | :material-check: | |
| dynamic batching | :material-check: | :material-check: | :material-cancel: | combines individual inference requests to improve inference throughput (see the client sketch below) |
| concurrent model execution | :material-check: | :material-check: | :material-cancel: | runs multiple models (or multiple instances of the same model) |
| pipeline | :material-check: | :material-cancel: | :material-cancel: | chains one or more models, connecting the output tensors of one to the input tensors of the next |
| native multiple backends* support | :material-check: | :material-cancel: | :material-check: | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API | :material-check: | :material-check: | :material-check: | |
| GRPC API | :material-check: | :material-check: | :material-cancel: | |
| Inference metrics | :material-check: | :material-check: | :material-cancel: | GPU utilization, server throughput, and server latency |
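To give a feel for the client side, here is a hedged sketch of querying a model served by Nvidia Triton over HTTP with the tritonclient package. The model name, input/output tensor names and shapes are assumptions; dynamic batching itself is enabled server-side in the model's `config.pbtxt`, so the client code stays unchanged whether it is on or off.

```python
# Minimal sketch: query a model deployed on Nvidia Triton over HTTP.
# Assumes a Triton server on localhost:8000 serving a hypothetical model named
# "transformer_onnx_model" that takes input_ids / attention_mask tensors.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy tokenized batch (batch_size=1, seq_len=16); real code would use a tokenizer.
input_ids = np.random.randint(0, 30000, size=(1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer(model_name="transformer_onnx_model", inputs=inputs, outputs=outputs)
print(response.as_numpy("output").shape)
```

The same call can be issued over GRPC with the `tritonclient.grpc` module.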

--8<-- "resources/abbreviations.md"