|
# TensorRT usage in a Python script
|
|
|
There are two ways to use a TensorRT-optimized model:
|
|
|
* deploy it on a Triton inference server
* use it directly in Python
|
|
|
This document is about the second option. |
|
|
|
## High-level explanations |
|
|
|
* call `load_engine()` to parse an existing TensorRT engine, or `build_engine()` to convert an ONNX file
* set up a CUDA `stream` (for async calls), a TensorRT `runtime`, and an execution `context`
* load your optimization `profile`(s)
* call `infer_tensorrt()`
|
|
|
## Build engine |
|
|
|
We assume that you have already exported your model to an ONNX file.
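
If you have not, a plain PyTorch export could look like the following (a hedged sketch rather than transformer-deploy's own conversion code; the `roberta-base` checkpoint, input names, and dynamic axes are illustrative assumptions):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# illustrative model; use your own fine-tuned checkpoint
model = AutoModelForSequenceClassification.from_pretrained("roberta-base").eval()
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoded = tokenizer("dummy input", return_tensors="pt")

torch.onnx.export(
    model,
    args=(encoded["input_ids"], encoded["attention_mask"]),
    f="model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    # mark batch size and sequence length as dynamic so TensorRT profiles can vary them
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "output": {0: "batch"},
    },
    opset_version=13,
)
```

Note that the build step below uses a quantization-aware-trained model (`model_qat.onnx`); with a plain export like this one, you would typically set `int8=False`.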
|
Now we need to convert it to a TensorRT engine:
|
|
|
```python
import tensorrt as trt
from tensorrt.tensorrt import Logger, Runtime

from transformer_deploy.backends.trt_utils import build_engine

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)  # only log errors during the build
runtime: Runtime = trt.Runtime(trt_logger)
profile_index = 0

max_seq_len = 256
batch_size = 32

engine = build_engine(
    runtime=runtime,
    onnx_file_path="model_qat.onnx",
    logger=trt_logger,
    # (batch size, sequence length) ranges covered by the optimization profile
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,  # ~10 GB of GPU memory for the builder
    fp16=True,  # enable mixed precision
    int8=True,  # enable INT8 quantization (requires a QAT or calibrated model)
)
```
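
Building an engine can take several minutes, so you may want to serialize it to disk and reload it on the next run. Here is a minimal sketch using the raw TensorRT API (the `model_qat.plan` file name is an arbitrary choice; `load_engine()` from `transformer_deploy.backends.trt_utils`, mentioned above, covers the loading side):

```python
# serialize the engine to disk so the slow build step runs only once
with open("model_qat.plan", "wb") as f:
    f.write(bytearray(engine.serialize()))

# later, or in another process: deserialize it back into an engine
with open("model_qat.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```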
|
|
|
## Prepare inference |
|
|
|
Now that the engine is ready, we can prepare the inference:
|
|
|
```python
import torch
from tensorrt.tensorrt import IExecutionContext

from transformer_deploy.backends.trt_utils import get_binding_idxs

# create the execution context and bind it to the optimization profile
# on the current CUDA stream
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(
    profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
)
# retrieve the input and output binding indexes of the chosen profile
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]
```
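
If you are unsure what the engine expects, you can list its bindings (a sketch based on the binding API used above; note that these binding-based calls are deprecated in recent TensorRT releases):

```python
# print each binding (input or output) with its name, shape and dtype
for binding_index in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(binding_index) else "output"
    name = engine.get_binding_name(binding_index)
    shape = engine.get_binding_shape(binding_index)  # -1 marks a dynamic axis
    dtype = engine.get_binding_dtype(binding_index)
    print(f"{kind} #{binding_index}: {name}, shape={shape}, dtype={dtype}")
```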
|
|
|
## Inference |
|
|
|
```python
from transformer_deploy.backends.trt_utils import infer_tensorrt

input_np = ...  # your model inputs, e.g. the output of a tokenizer

tensorrt_output = infer_tensorrt(
    context=context,
    host_inputs=input_np,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
)
print(tensorrt_output)
```
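
The `input_np` placeholder above must map input names to host tensors. With a Hugging Face tokenizer it could be built like this (a hedged sketch; the `roberta-base` checkpoint is an illustrative assumption, and the exact array or tensor type expected by `infer_tensorrt` may depend on your transformer-deploy version):

```python
from transformers import AutoTokenizer

# use the tokenizer matching the model you exported to ONNX
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
input_np = dict(
    tokenizer(
        "This is a test sentence.",
        max_length=max_seq_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",  # numpy arrays on host, matching the `input_np` naming
    )
)
```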
|
|
|
... and you are done! 🎉
|
|
|
!!! tip

    To go deeper, check these pages of the API documentation:

    * `Convert`
    * `Backends/Trt utils`
|
|
|
... and if you are looking for inspiration, check [onnx-tensorrt](https://github.com/onnx/onnx-tensorrt) |
|
|
|
--8<-- "resources/abbreviations.md" |