
# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is good enough for most inference jobs (see the sketch below)
    * if you need the best performance, use TensorRT
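To make the ONNX Runtime recommendation concrete, below is a minimal sketch of running an already-exported transformer with ONNX Runtime instead of Pytorch. The file name, tokenizer and input tensor names are illustrative assumptions, not part of transformer-deploy's API; the input names must match the ones chosen at export time.

```python
# Minimal sketch: run an already-exported ONNX model with ONNX Runtime.
# Assumes "model.onnx" was produced beforehand (e.g. with torch.onnx.export)
# and that the onnxruntime-gpu package is installed.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU
)

encoded = tokenizer("ONNX Runtime is good enough for most inference jobs", return_tensors="np")
outputs = session.run(
    output_names=None,  # None returns every model output
    input_feed={
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
print(outputs[0].shape)
```

Everything here is standard ONNX Runtime usage; swapping the provider list is how the same file targets CPU, GPU, or edge hardware.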
|  | Nvidia TensorRT | :material-microsoft: Microsoft ONNX Runtime | :material-facebook: Meta Pytorch | comments |
|---|---|---|---|---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-check: | :material-cancel: | |
| :material-license: Licence | Apache 2 (the optimization engine is closed source) | MIT | Modified BSD | |
| :material-api: ease of use (API) | :fontawesome-regular-angry: | :material-check-all: | :material-check-all: | Nvidia has chosen not to hide technical details; a TensorRT model is tied to a single hardware + model + data shapes association |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: (spread out, incomplete) | :material-check: (improving) | :material-check-all: (strong community) | |
| :octicons-cpu-16: Hardware support | :material-check: GPU + Jetson | :material-check-all: CPU + GPU + IoT + Edge + Mobile | :material-check: CPU + GPU | |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | TensorRT is usually 5 to 10X faster than Pytorch when quantization, etc. are used (sketch below) |
| :material-target: Accuracy | :material-speedometer-medium: | :material-speedometer: | :material-speedometer: | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; manual changes are required to recover it |
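The "single hardware + model + data shapes association" noted above is what makes TensorRT both fast and rigid: an engine is built for one GPU model and one range of input shapes. Below is a hedged sketch of that build step using the TensorRT 8 Python API; the file names, input names and shape ranges are assumptions for illustration, and transformer-deploy hides this kind of logic behind its own tooling.

```python
# Minimal sketch: build a TensorRT engine from an ONNX file with FP16 enabled.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, a main source of the speed-up

# The engine is specialized for a range of (batch size, sequence length) shapes.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 16), opt=(8, 128), max=(32, 256))
profile.set_shape("attention_mask", min=(1, 16), opt=(8, 128), max=(32, 256))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting `model.plan` file is only valid for the GPU architecture it was built on and for inputs inside the declared shape range, which is the trade-off behind the performance column above.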

## Inference HTTP/GRPC server

|  | Nvidia Triton | :material-facebook: Meta TorchServe | FastAPI | comments |
|---|---|---|---|---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-cancel: | :material-cancel: | |
| :material-license: Licence | Modified BSD | Apache 2 | MIT | |
| :material-api: ease of use (API) | :material-check: | :material-check: | :material-check-all: | as a classic HTTP server, FastAPI may appear easier to use |
| :material-file-document-edit: ease of use (documentation) | :material-check: | :material-check: | :material-check-all: | FastAPI has one of the most beautiful documentation sites ever! |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | FastAPI is 6 to 10X slower than Triton at managing user queries |
| **Support** | | | | |
| :octicons-cpu-16: CPU | :material-check: | :material-check: | :material-check: | |
| :octicons-cpu-16: GPU | :material-check: | :material-check: | :material-check: | |
| dynamic batching | :material-check: | :material-check: | :material-cancel: | combines individual inference requests to improve inference throughput (see the client sketch below) |
| concurrent model execution | :material-check: | :material-check: | :material-cancel: | runs multiple models (or multiple instances of the same model) |
| pipeline | :material-check: | :material-cancel: | :material-cancel: | chains one or more models, connecting the output tensors of one to the input tensors of the next |
| native multiple backends* support | :material-check: | :material-cancel: | :material-check: | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API | :material-check: | :material-check: | :material-check: | |
| GRPC API | :material-check: | :material-check: | :material-cancel: | |
| Inference metrics | :material-check: | :material-check: | :material-cancel: | GPU utilization, server throughput, and server latency |
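To give a feel for the client side, here is a hedged sketch of querying a model served by Nvidia Triton over HTTP with the tritonclient package. The model name, input/output tensor names and shapes are assumptions; dynamic batching itself is enabled server-side in the model's `config.pbtxt`, so the client code stays unchanged whether it is on or off.

```python
# Minimal sketch: query a model deployed on Nvidia Triton over HTTP.
# Assumes a Triton server on localhost:8000 serving a hypothetical model named
# "transformer_onnx_model" that takes input_ids / attention_mask tensors.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy tokenized batch (batch_size=1, seq_len=16); real code would use a tokenizer.
input_ids = np.random.randint(0, 30000, size=(1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer(model_name="transformer_onnx_model", inputs=inputs, outputs=outputs)
print(response.as_numpy("output").shape)
```

The same call can be issued over GRPC with the `tritonclient.grpc` module.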

--8<-- "resources/abbreviations.md"