## Command line

!!! tip

    You can run the commands below directly from Docker (no need to install Nvidia dependencies outside of Docker), like:
    ```shell
    docker run -it --rm --gpus all \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:latest \
    bash -c "cd /project && \
      convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
      --backend onnx \
      --seq-len 128 128 128"
    ```

With the single command below, you will:

* **download** the model and its tokenizer from the :hugging: Hugging Face hub,
* **convert** the model to an ONNX graph,
* **optimize**
  * the model with ONNX Runtime and save the artefact (`model.onnx`),
  * the model with TensorRT and save the artefact (`model.plan`),
* **benchmark** each backend (including Pytorch),
* **generate** configuration files for the Triton inference server.

```shell
convert_model -m philschmid/MiniLM-L6-H384-uncased-sst2 --backend onnx --seq-len 128 128 128 --batch-size 1 32 32
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=8.75ms, sd=0.30ms, min=8.60ms, max=11.20ms, median=8.68ms, 95p=9.15ms, 99p=10.77ms
# [Pytorch (FP16)] mean=6.75ms, sd=0.22ms, min=6.66ms, max=8.99ms, median=6.71ms, 95p=6.88ms, 99p=7.95ms
# [ONNX Runtime (FP32)] mean=8.10ms, sd=0.43ms, min=7.93ms, max=11.76ms, median=8.02ms, 95p=8.39ms, 99p=11.30ms
# [ONNX Runtime (optimized)] mean=3.66ms, sd=0.23ms, min=3.57ms, max=6.46ms, median=3.62ms, 95p=3.70ms, 99p=4.95ms
```

!!! info

    **128 128 128** -> minimum, optimal and maximum sequence lengths, used to help TensorRT better optimize your model.
    It's better to use the same value for all three to get the best performance out of TensorRT (ONNX Runtime does not have this limitation).

    **1 32 32** -> minimum, optimal and maximum batch sizes, same idea as above. It's a good idea to use 1 as the minimum value; it has no impact on TensorRT performance.

* Launch the Nvidia Triton inference server to play with both the ONNX and TensorRT models:

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers sentencepiece && tritonserver --model-repository=/models"
```

> As you can see, we install Transformers and then launch the server itself.  
> This is, of course, bad practice; you should build your own two-line Dockerfile with Transformers installed (see the sketch below).
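
A minimal sketch of such a Dockerfile, reusing the base image and pip packages from the `docker run` command above:

```dockerfile
# Hypothetical custom image: same Triton base image as above, with the Python dependencies baked in
FROM nvcr.io/nvidia/tritonserver:22.07-py3
RUN pip install transformers sentencepiece
```

Build it once (e.g. `docker build -t triton-transformers .`) and you can drop the `pip install` step from the `docker run` command.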

## Query the inference server

To query your inference server, you first need to convert your string to a binary file following the same format as `demo/infinity/query_body.bin`.

The `query_body.bin` file is composed of three parts: a JSON header describing the input/output values, a binary value encoding the length of the binary input data, and the binary input data itself.

The header part is straightforward: it's simply a JSON document describing the input and output values:
```json
{"inputs":[{"name":"TEXT","shape":[1],"datatype":"BYTES","parameters":{"binary_data_size":59}}],"outputs":[{"name":"output","parameters":{"binary_data":false}}]}
```
> Note that we provide the `binary_data_size` value describing the length of the content following the header.

The integer representing the length of the input data is written as a 4-byte little-endian value; in the file it appears as the character `7`.
> Note that the length of the content is not 7 but 55: the little-endian encoding of 55 starts with the byte `0x37`, which is the ASCII code of the character `7`.
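
A quick check in Python shows where that `7` comes from:

```python
import struct

# 55 packed as a 4-byte little-endian unsigned int: the first byte is 0x37, i.e. the character "7"
print(struct.pack("<I", 55))  # b'7\x00\x00\x00'
```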

The binary input data is simply text encoded in `UTF-8`: `This live event is great. I will sign-up for Infinity.`
(If you have several strings, just concatenate the results.)

You can follow the recipe in the code below for a single string. It also prints the value to set in the `Inference-Header-Content-Length` header.
```python
import struct

# the text to classify; the trailing "\n" is part of the 55-byte payload sent to the server
text_b: bytes = "This live event is great. I will sign-up for Infinity.\n".encode("UTF-8")

# binary_data_size = length of the text + 4 bytes for its little-endian length prefix
prefix: bytes = b'{"inputs":[{"name":"TEXT","shape":[1],"datatype":"BYTES","parameters":{"binary_data_size":' + bytes(str(len(text_b) + len(struct.pack("<I", len(text_b)))), encoding='utf8') + b'}}],"outputs":[{"name":"output","parameters":{"binary_data":false}}]}'

# the length of the JSON header goes in the Inference-Header-Content-Length HTTP header
print('--header "Inference-Header-Content-Length: ', len(prefix), '"', sep="")

with open("body.bin", "wb+") as f:
    # "<I" packs the text length as a 4-byte little-endian unsigned integer
    f.write(prefix + struct.pack("<I", len(text_b)) + text_b)
```

!!! tip

    Check Nvidia's implementation at [https://github.com/triton-inference-server/client/blob/530bcac5f1574aa2222930076200544eb274245c/src/python/library/tritonclient/utils/__init__.py#L187](https://github.com/triton-inference-server/client/blob/530bcac5f1574aa2222930076200544eb274245c/src/python/library/tritonclient/utils/__init__.py#L187)
    for more information.

You can now query your model using your input file:
```shell
# https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md
# @ means no data conversion (curl feature)
curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/infinity/query_body.bin" \
  --header "Inference-Header-Content-Length: 161"
```

> Check the [`demo`](https://github.com/ELS-RD/transformer-deploy/tree/main/demo/infinity) folder to discover more performant ways to query the server from Python or elsewhere.
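
As an example, a minimal Python client using Nvidia's `tritonclient` package (`pip install tritonclient[http]`) could look like the sketch below; the model, input and output names match the `curl` call above, but refer to the `demo` folder for the code actually used in this project:

```python
import numpy as np
import tritonclient.http as httpclient

# connect to the Triton HTTP endpoint exposed by the docker run command above
client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

# single string input, sent as a BYTES tensor of shape [1]
text_input = httpclient.InferInput(name="TEXT", shape=[1], datatype="BYTES")
text_input.set_data_from_numpy(
    np.asarray(["This live event is great. I will sign-up for Infinity."], dtype=object)
)

# ask for the raw (non-binary) output tensor
output = httpclient.InferRequestedOutput(name="output", binary_data=False)

result = client.infer(
    model_name="transformer_onnx_inference",
    model_version="1",
    inputs=[text_input],
    outputs=[output],
)
print(result.as_numpy("output"))
```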

--8<-- "resources/abbreviations.md"