|
## Command line |
|
|
|
!!! tip |
|
|
|
You can run the commands below directly from Docker (no need to install the Nvidia dependencies outside of the Docker image), like:
|
```shell
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:latest \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend onnx \
    --seq-len 128 128 128"
```
|
|
|
With the single command below, you will: |
|
|
|
* **download** the model and its tokenizer from the :hugging: Hugging Face hub,
|
* **convert** the model to an ONNX graph,
|
* **optimize** |
|
* the model with ONNX Runtime and save artefact (`model.onnx`), |
|
* the model with TensorRT and save artefact (`model.plan`), |
|
* **benchmark** each backend (including PyTorch),
|
* **generate** configuration files for the Nvidia Triton inference server
|
|
|
```shell
convert_model -m philschmid/MiniLM-L6-H384-uncased-sst2 --backend onnx --seq-len 128 128 128 --batch-size 1 32 32
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=8.75ms, sd=0.30ms, min=8.60ms, max=11.20ms, median=8.68ms, 95p=9.15ms, 99p=10.77ms
# [Pytorch (FP16)] mean=6.75ms, sd=0.22ms, min=6.66ms, max=8.99ms, median=6.71ms, 95p=6.88ms, 99p=7.95ms
# [ONNX Runtime (FP32)] mean=8.10ms, sd=0.43ms, min=7.93ms, max=11.76ms, median=8.02ms, 95p=8.39ms, 99p=11.30ms
# [ONNX Runtime (optimized)] mean=3.66ms, sd=0.23ms, min=3.57ms, max=6.46ms, median=3.62ms, 95p=3.70ms, 99p=4.95ms
```
|
|
|
!!! info |
|
|
|
**128 128 128** -> minimum, optimal, maximum sequence length, to help TensorRT better optimize your model. |
|
It is better to use the same value for all three sequence lengths to get the best performance from TensorRT (ONNX Runtime does not have this limitation).
|
|
|
**1 32 32** -> minimum, optimal, maximum batch size, same principle as above. It is a good idea to keep 1 as the minimum value. Unlike the sequence length, it has no impact on TensorRT performance.
|
|
|
* Launch the Nvidia Triton inference server to play with both the ONNX and TensorRT models:
|
|
|
```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers sentencepiece && tritonserver --model-repository=/models"
```
|
|
|
> As you can see, we install Transformers and then launch the server itself. |
|
> This is, of course, bad practice; you should build your own two-line Dockerfile with Transformers preinstalled.
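
Once the container is up, you can check that the server and the deployed models are ready before sending any request. A minimal sketch, assuming the Python `tritonclient[http]` package is installed on the host and that the ONNX model is deployed under the `transformer_onnx_inference` name used further below:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Both calls return True once the model repository has been loaded.
print(client.is_server_ready())
print(client.is_model_ready("transformer_onnx_inference"))
```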
|
|
|
## Query the inference server |
|
|
|
To query your inference server, you first need to convert your string to a binary file following the same format as `demo/infinity/query_body.bin`.
|
|
|
The `query_body.bin` file is composed of three parts: a header describing the inputs/outputs, a binary value representing the length of the binary input data, and the binary input data itself.
|
|
|
The header part is straightforward, as it is simply a JSON document describing the inputs and outputs:
|
```json |
|
{"inputs":[{"name":"TEXT","shape":[1],"datatype":"BYTES","parameters":{"binary_data_size":59}}],"outputs":[{"name":"output","parameters":{"binary_data":false}}]} |
|
``` |
|
> Note that we provide the `binary_data_size` value describing the length of the content following the header.
|
|
|
The length of the input data is then written as a 4-byte little-endian integer, which shows up as `7` if you open the file in a text editor.

> Note that the length of the content is not 7 but 55: the byte value 55 is the ASCII code of the character `7`.
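
You can reproduce this length prefix in a Python shell to see where the `7` comes from:

```python
import struct

# 55 packed as a 4-byte little-endian unsigned integer
struct.pack("<I", 55)
# b'7\x00\x00\x00' -> the byte 0x37 (decimal 55) displays as the character "7"
```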
|
|
|
The binary input data is simply the text encoded in `UTF-8` (here with a trailing newline, hence the 55-byte length): `This live event is great. I will sign-up for Infinity.`
|
(If you have several strings, just concatenate the length-prefixed values; see the multi-string sketch after the recipe below.)
|
|
|
The code below shows the recipe for a single string. It also prints the value to pass in the `Inference-Header-Content-Length` header.
|
```python
import struct

# The text to classify; the trailing newline is part of the payload (55 bytes in total).
text_b: bytes = "This live event is great. I will sign-up for Infinity.\n".encode("UTF-8")

# binary_data_size = 4-byte length prefix + UTF-8 payload.
binary_data_size: int = len(struct.pack("<I", len(text_b))) + len(text_b)

prefix: bytes = (
    b'{"inputs":[{"name":"TEXT","shape":[1],"datatype":"BYTES","parameters":{"binary_data_size":'
    + str(binary_data_size).encode("UTF-8")
    + b'}}],"outputs":[{"name":"output","parameters":{"binary_data":false}}]}'
)

# Value to pass in the Inference-Header-Content-Length HTTP header.
print('--header "Inference-Header-Content-Length: ', len(prefix), '"', sep="")

with open("body.bin", "wb+") as f:
    # "<I" = little-endian unsigned 32-bit integer (the 4-byte length prefix of the BYTES element)
    f.write(prefix + struct.pack("<I", len(text_b)) + text_b)
```
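
To send several strings in one request, each BYTES element is serialized the same way (4-byte little-endian length followed by the UTF-8 bytes) and the results are concatenated; `binary_data_size` then covers the whole concatenation and `shape` becomes the number of strings. A minimal sketch of this multi-string variant (the second sentence and the `body_multi.bin` filename are made up for illustration):

```python
import json
import struct

texts = [
    "This live event is great. I will sign-up for Infinity.\n",
    "Another sentence to classify.\n",  # hypothetical second input
]

# Each BYTES element = 4-byte little-endian length + UTF-8 payload, all concatenated.
payload = b""
for text in texts:
    encoded = text.encode("UTF-8")
    payload += struct.pack("<I", len(encoded)) + encoded

# Same header as before, with shape and binary_data_size updated.
header = json.dumps(
    {
        "inputs": [
            {
                "name": "TEXT",
                "shape": [len(texts)],
                "datatype": "BYTES",
                "parameters": {"binary_data_size": len(payload)},
            }
        ],
        "outputs": [{"name": "output", "parameters": {"binary_data": False}}],
    },
    separators=(",", ":"),
).encode("UTF-8")

print('--header "Inference-Header-Content-Length: ', len(header), '"', sep="")

with open("body_multi.bin", "wb") as f:
    f.write(header + payload)
```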
|
|
|
!!! tip |
|
|
|
Check the Nvidia implementation at [https://github.com/triton-inference-server/client/blob/530bcac5f1574aa2222930076200544eb274245c/src/python/library/tritonclient/utils/__init__.py#L187](https://github.com/triton-inference-server/client/blob/530bcac5f1574aa2222930076200544eb274245c/src/python/library/tritonclient/utils/__init__.py#L187) for more information.
|
|
|
You can now query your model using your input file:
|
```shell
# https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md
# @ means no data conversion (curl feature)
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"
```
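
The same request can also be sent from Python without crafting the binary body by hand. A minimal sketch, assuming the `tritonclient[http]` and `numpy` packages are installed and reusing the model and tensor names from above:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# BYTES tensors are passed as numpy object arrays; tritonclient builds the
# length-prefixed binary payload and the HTTP headers for us.
text = np.array(["This live event is great. I will sign-up for Infinity."], dtype=object)
text_input = httpclient.InferInput("TEXT", [1], "BYTES")
text_input.set_data_from_numpy(text, binary_data=True)
output = httpclient.InferRequestedOutput("output", binary_data=False)

result = client.infer(
    model_name="transformer_onnx_inference", inputs=[text_input], outputs=[output]
)
print(result.as_numpy("output"))
```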
|
|
|
> Check the [`demo`](https://github.com/ELS-RD/transformer-deploy/tree/main/demo/infinity) folder to discover more performant ways to query the server from Python or elsewhere.
|
|
|
--8<-- "resources/abbreviations.md" |