# High-level comparison
## Inference engine
The inference engine performs the computation only; it does not manage the communication layer (HTTP/GRPC API, etc.).
!!! summary

    * don't use Pytorch in production for inference
    * ONNX Runtime is good enough for most inference jobs
    * if you need the best performance, use TensorRT
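To make the ONNX Runtime point concrete, below is a minimal inference sketch in Python; the model path and tensor names are placeholder assumptions, not artifacts of this repository:

```python
# minimal sketch: running an exported transformer with ONNX Runtime
import numpy as np
import onnxruntime as ort

# the CUDA provider is tried first, with a CPU fallback
session = ort.InferenceSession(
    "model.onnx",  # placeholder: any ONNX export of your model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# dummy batch: 2 sequences of 16 tokens (tensor names are placeholders)
inputs = {
    "input_ids": np.ones((2, 16), dtype=np.int64),
    "attention_mask": np.ones((2, 16), dtype=np.int64),
}
logits = session.run(None, inputs)[0]  # None -> return all outputs
print(logits.shape)
```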
| | Nvidia TensorRT | :material-microsoft: Microsoft ONNX Runtime | :material-facebook: Meta Pytorch | comments |
|:----------------------------------------------------------|:--------------------------------------------------------|:-----------------------------------------------------------|:----------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-check: | :material-cancel: | |
| :material-license: Licence | Apache 2, optimization engine is closed source | MIT | Modified BSD | |
| :material-api: ease of use (API) | :fontawesome-regular-angry: | :material-check-all: | :material-check-all: | Nvidia chose not to hide technical details; moreover, each optimized model is tied to a single `hardware + model + data shapes` combination |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: <br/> (spread out, incomplete) | :material-check: <br/> (improving) | :material-check-all: <br/> (strong community) | |
| :octicons-cpu-16: Hardware support | :material-check: <br/> GPU + Jetson | :material-check-all: <br/> CPU + GPU + IoT + Edge + Mobile | :material-check: <br/> CPU + GPU | |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | TensorRT is usually 5 to 10X faster than Pytorch when combined with quantization, etc. |
| :material-target: Accuracy | :material-speedometer-medium: | :material-speedometer: | :material-speedometer: | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; recovering it requires manual modifications |
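The `hardware + model + data shapes` specificity noted above shows up directly when building a TensorRT engine: shape ranges are baked in at build time, and the serialized plan only runs on the GPU it was built for. A minimal build sketch, assuming an ONNX export named `model.onnx` with an `input_ids` input (both placeholders):

```python
# minimal sketch: compiling an ONNX model into a TensorRT engine
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision for speed

# the engine is specialized for this (min, optimal, max) shape range
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 16), (8, 128), (32, 512))
config.add_optimization_profile(profile)

# the serialized plan is tied to the local GPU + TensorRT version
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```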
## Inference HTTP/GRPC server
| | Nvidia Triton | :material-facebook: Meta TorchServe | FastAPI | comments |
|:----------------------------------------------------------|:-----------------------|:------------------------------------|:----------------------------|:---------------------------------------------------------------------------------------|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-cancel: | :material-cancel: | |
| :material-license: Licence | Modified BSD | Apache 2 | MIT | |
| :material-api: ease of use (API) | :material-check: | :material-check: | :material-check-all: | As a classic HTTP server, FastAPI may appear easier to use |
| :material-file-document-edit: ease of use (documentation) | :material-check: | :material-check: | :material-check-all: | FastAPI has some of the most beautiful documentation out there! |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | FastAPI is 6-10X slower than Triton at handling user queries |
| **Support** | | | | |
| :octicons-cpu-16: CPU | :material-check: | :material-check: | :material-check: | |
| :octicons-cpu-16: GPU | :material-check: | :material-check: | :material-check: | |
| dynamic batching | :material-check: | :material-check: | :material-cancel: | combine individual inference requests together to improve inference throughput |
| concurrent model execution | :material-check: | :material-check: | :material-cancel: | run multiple models (or multiple instances of the same model) in parallel on the same device |
| pipeline | :material-check: | :material-cancel: | :material-cancel: | chain one or more models, connecting the output tensors of one model to the input tensors of the next |
| native multiple backends* support | :material-check: | :material-cancel: | :material-check: | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API | :material-check: | :material-check: | :material-check: | |
| GRPC API | :material-check: | :material-check: | :material-cancel: | |
| Inference metrics | :material-check: | :material-check: | :material-cancel: | GPU utilization, server throughput, and server latency |
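As an illustration of the REST API row, a minimal Triton HTTP client sketch in Python; the server URL, model name and tensor names are placeholder assumptions:

```python
# minimal sketch: querying a model served by Nvidia Triton over HTTP
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# dummy batch of 1 sequence of 16 tokens (names are placeholders)
input_ids = np.ones((1, 16), dtype=np.int64)
tensor = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
tensor.set_data_from_numpy(input_ids)

response = client.infer(model_name="transformer_onnx", inputs=[tensor])
print(response.as_numpy("output"))  # "output" is a placeholder name
```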
--8<-- "resources/abbreviations.md"