|
# Hugging Face Transformer submillisecond inference and deployment to production: 🤗 → 🤯
|
|
|
[](https://els-rd.github.io/transformer-deploy/) [](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [](https://www.python.org/downloads/release/python-380/) [](https://twitter.com/pommedeterre33) |
|
|
|
### Optimize and deploy 🤗 Hugging Face Transformer models in **production**, in a single command line.
|
|
|
=> Up to 10X faster inference! <= |
|
|
|
#### Why this tool? |
|
|
|
<!--why-start--> |
|
|
|
At [Lefebvre Dalloz](https://www.lefebvre-dalloz.fr/) we run *semantic search engines* in production in the legal domain,

in non-marketing language it's a re-ranker, and we based ours on `Transformer`.

In that setup, latency is key to a good user experience, and relevancy inference is done online for hundreds of snippets per user query.

We have tested many solutions, and below is what we found:
|
|
|
[`Pytorch`](https://pytorch.org/) + [`FastAPI`](https://fastapi.tiangolo.com/) = 🐢 |
|
Most tutorials on `Transformer` deployment in production are built over Pytorch and FastAPI. |
|
Both are great tools, but not very fast for inference (actual measurements below).
|
|
|
[`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ️🏃💨 |
|
Then, if you spend some time, you can build something on top of ONNX Runtime and Triton inference server.
|
You will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool! |
|
|
|
[`Nvidia TensorRT`](https://github.com/NVIDIA/TensorRT/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ⚡️🏃💨💨 |
|
However, if you want best-in-class performance on GPU, there is only a single possible combination: Nvidia TensorRT and Triton.

You will usually get 5X faster inference compared to vanilla Pytorch.

Sometimes it can reach up to **10X faster inference**.

Buuuuttt... TensorRT takes some effort to master; it requires tricks that are not easy to come up with, so we implemented them for you!
|
|
|
[Detailed tool comparison table](https://els-rd.github.io/transformer-deploy/compare/) |
|
|
|
## Features |
|
|
|
* Heavily optimize transformer models for inference (CPU and GPU) -> between 5X and 10X speedup |
|
* deploy models on `Nvidia Triton` inference servers (enterprise grade), 6X faster than `FastAPI` |
|
* add quantization support for both CPU and GPU |
|
* simple to use: optimization done in a single command line! |
|
* supported models: any model that can be exported to ONNX (-> most of them)
|
* supported tasks: document classification, token classification (NER), feature extraction (aka sentence-transformers dense embeddings), text generation |
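
All the tasks above go through the same `convert_model` command line, detailed per task in the sections below. As a rough sketch of the pattern (placeholders only; based on the examples below, `--seq-len` appears to take the smallest, typical and largest sequence lengths to optimize for):

```shell
# generic pattern used throughout this README (values are placeholders, see the task sections for real commands)
# --task values used in this README: token-classification, question-answering, embedding, text-generation
# (classification/reranking is the default when --task is omitted)
convert_model -m "<model-name-or-path>" \
  --backend tensorrt onnx \
  --seq-len <min> <typical> <max> \
  --task <task-name>
```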
|
|
|
> Want to understand how it works under the hood? |
|
> read [🤗 Hugging Face Transformer inference UNDER 1 millisecond latency 📖](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915) |
|
> <img src="resources/rabbit.jpg" width="120"> |
|
|
|
## Want to check by yourself in 3 minutes? |
|
|
|
To get a rough idea of the acceleration you will get on your own model, you can try the Docker-only run below.

For GPU runs, you need Nvidia drivers and the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) installed on your machine.
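
A quick sanity check that Docker can access your GPU (reusing the project image pulled below; `nvidia-smi` is made available inside the container by the NVIDIA Container Toolkit when `--gpus all` is passed):

```shell
# should print your GPU(s); if it fails, check the Nvidia driver / NVIDIA Container Toolkit installation
docker run --rm --gpus all ghcr.io/els-rd/transformer-deploy:0.6.0 nvidia-smi
```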
|
|
|
**Several tasks are covered** below:

* classification / reranking,

* token classification (NER),

* question answering,

* feature extraction (text to dense embeddings),

* text generation (GPT-2 and T5 style).
|
|
|
Moreover, we have added a GPU `quantization` notebook that you can open directly in `Docker` to play with.
|
|
|
First, clone the repo as some commands below expect to find the `demo` folder: |
|
|
|
```shell |
|
git clone [email protected]:ELS-RD/transformer-deploy.git |
|
cd transformer-deploy |
|
# pulling the docker image may take a few minutes

docker pull ghcr.io/els-rd/transformer-deploy:0.6.0
```
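
To check that the image works and to list all `convert_model` options (assuming the CLI exposes the standard `--help` flag):

```shell
docker run --rm ghcr.io/els-rd/transformer-deploy:0.6.0 convert_model --help
```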
|
|
|
|
|
### Classification/reranking (encoder model) |
|
|
|
Classification is a common task in NLP, and large language models have shown great results. |
|
This task is also used by search engines to provide Google-like relevancy (cf. [arXiv](https://arxiv.org/abs/1901.04085)).
|
|
|
#### Optimize existing model |
|
|
|
This will optimize the model and generate the Triton configuration and folder layout in a single command:
|
|
|
```shell |
|
docker run -it --rm --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \ |
|
--backend tensorrt onnx \ |
|
--seq-len 16 128 128" |
|
|
|
# output: |
|
# ... |
|
# Inference done on NVIDIA GeForce RTX 3090 |
|
# latencies: |
|
# [Pytorch (FP32)] mean=5.43ms, sd=0.70ms, min=4.88ms, max=7.81ms, median=5.09ms, 95p=7.01ms, 99p=7.53ms |
|
# [Pytorch (FP16)] mean=6.55ms, sd=1.00ms, min=5.75ms, max=10.38ms, median=6.01ms, 95p=8.57ms, 99p=9.21ms |
|
# [TensorRT (FP16)] mean=0.53ms, sd=0.03ms, min=0.49ms, max=0.61ms, median=0.52ms, 95p=0.57ms, 99p=0.58ms |
|
# [ONNX Runtime (FP32)] mean=1.57ms, sd=0.05ms, min=1.49ms, max=1.90ms, median=1.57ms, 95p=1.63ms, 99p=1.76ms |
|
# [ONNX Runtime (optimized)] mean=0.90ms, sd=0.03ms, min=0.88ms, max=1.23ms, median=0.89ms, 95p=0.95ms, 99p=0.97ms |
|
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
|
``` |
|
|
|
It will output the mean latency and other statistics.

`Nvidia TensorRT` is usually the fastest option, with `ONNX Runtime` a strong second.

On ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.

`Pytorch` is never competitive on transformer inference, even with mixed precision, whatever the model size.
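
Before starting the server, you can inspect the generated Triton model repository. A minimal sketch (the folder names shown are the ones queried by the `curl` examples below; the exact set of sub-folders depends on the chosen backends):

```shell
# list the generated model repository (mounted as /models in the Triton container below)
find triton_models -maxdepth 2
# expect folders such as triton_models/transformer_onnx_inference and triton_models/transformer_tensorrt_inference,
# each with a config.pbtxt and a numbered version folder
```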
|
|
|
#### Run Nvidia Triton inference server |
|
|
|
Note that we install `transformers` at run time. |
|
For production, it's advised to build your own 3-line Docker image with `transformers` pre-installed. |
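
A minimal sketch of such an image (the file name and tag below are placeholders, not part of this project):

```shell
# build a Triton image with transformers baked in, so nothing is installed at container start
cat > Dockerfile.triton <<'EOF'
FROM nvcr.io/nvidia/tritonserver:22.07-py3
RUN pip install transformers
ENTRYPOINT ["tritonserver", "--model-repository=/models"]
EOF
docker build -t triton-transformers:22.07 -f Dockerfile.triton .
```

With such an image you can drop the `pip install` part from the run command below.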
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \ |
|
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install transformers && tritonserver --model-repository=/models" |
|
|
|
# output: |
|
# ... |
|
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001 |
|
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000 |
|
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 |
|
``` |
|
|
|
#### Query inference |
|
|
|
Query the ONNX model (replace `transformer_onnx_inference` with `transformer_tensorrt_inference` to query the TensorRT engine):
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \ |
|
--data-binary "@demo/infinity/query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 161" |
|
|
|
# output: |
|
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,2],"data":[-3.431640625,3.271484375]}]} |
|
``` |
|
|
|
The model output is at the end of the JSON (`data` field).
|
[More information about how to query the server from `Python`, and other languages](https://els-rd.github.io/transformer-deploy/run/). |
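
If you wonder what is inside `query_body.bin`: it follows Triton's binary tensor data extension, i.e. a JSON inference request followed by the raw tensor bytes, and `Inference-Header-Content-Length` gives the size of the JSON part. A quick way to peek at it from the repository root (a sketch, relying on the 161-byte header length used above):

```shell
# print only the JSON header of the request body (the remaining bytes are the binary tensor payload)
head -c 161 demo/infinity/query_body.bin; echo
```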
|
|
|
To get very low latency inference in your Python code (no inference server): [click here](https://els-rd.github.io/transformer-deploy/python/) |
|
|
|
### Token-classification (NER) (encoder model) |
|
|
|
Token classification assigns a label to individual tokens in a sentence. |
|
One of the most common token classification tasks is Named Entity Recognition (NER). |
|
NER attempts to find a label for each entity in a sentence, such as a person, location, or organization. |
|
|
|
#### Optimize existing model |
|
|
|
This will optimize the model and generate the Triton configuration and folder layout in a single command:
|
|
|
```shell |
|
docker run -it --rm --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \ |
|
--backend tensorrt onnx \ |
|
--seq-len 16 128 128 \ |
|
--task token-classification" |
|
|
|
# output: |
|
# ... |
|
# Inference done on Tesla T4 |
|
# latencies: |
|
# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms |
|
# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms |
|
# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms |
|
# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms |
|
# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms |
|
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
|
``` |
|
|
|
It will output the mean latency and other statistics.

`Nvidia TensorRT` is usually the fastest option, with `ONNX Runtime` a strong second.

On ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.

`Pytorch` is never competitive on transformer inference, even with mixed precision, whatever the model size.
|
|
|
#### Run Nvidia Triton inference server |
|
|
|
Note that we install `transformers` at run time. |
|
For production, it's advised to build your own 3-line Docker image with `transformers` pre-installed. |
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \ |
|
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \ |
|
tritonserver --model-repository=/models" |
|
|
|
# output: |
|
# ... |
|
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001 |
|
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000 |
|
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 |
|
``` |
|
|
|
#### Query inference |
|
|
|
Query the ONNX model (replace `transformer_onnx_inference` with `transformer_tensorrt_inference` to query the TensorRT engine):
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \ |
|
--data-binary "@demo/infinity/query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 161" |
|
|
|
# output: |
|
# {"model_name":"transformer_onnx_inference","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["[{\"entity_group\": \"ORG\", \"score\": 0.9848777055740356, \"word\": \"Infinity\", \"start\": 45, \"end\": 53}]"]}]} |
|
``` |
|
|
|
### Question Answering (encoder model) |
|
|
|
Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. |
|
|
|
#### Optimize existing model |
|
|
|
This will optimize the model and generate the Triton configuration and folder layout in a single command:
|
|
|
```shell |
|
docker run -it --rm --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m \"distilbert-base-cased-distilled-squad\" \ |
|
--backend tensorrt onnx \ |
|
--seq-len 16 128 384 \ |
|
--task question-answering" |
|
|
|
# output: |
|
# ... |
|
# Inference done on Tesla T4 |
|
# latencies: |
|
# [Pytorch (FP32)] mean=8.24ms, sd=0.46ms, min=7.66ms, max=13.91ms, median=8.20ms, 95p=8.38ms, 99p=10.01ms |
|
# [Pytorch (FP16)] mean=6.87ms, sd=0.44ms, min=6.69ms, max=13.05ms, median=6.78ms, 95p=7.33ms, 99p=8.86ms |
|
# [TensorRT (FP16)] mean=2.33ms, sd=0.32ms, min=2.19ms, max=4.18ms, median=2.24ms, 95p=3.00ms, 99p=4.04ms |
|
# [ONNX Runtime (FP32)] mean=8.08ms, sd=0.33ms, min=7.78ms, max=10.61ms, median=8.06ms, 95p=8.18ms, 99p=10.55ms |
|
# [ONNX Runtime (optimized)] mean=2.57ms, sd=0.04ms, min=2.38ms, max=2.83ms, median=2.56ms, 95p=2.68ms, 99p=2.73ms |
|
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
|
``` |
|
|
|
It will output the mean latency and other statistics.

`Nvidia TensorRT` is usually the fastest option, with `ONNX Runtime` a strong second.

On ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.

`Pytorch` is never competitive on transformer inference, even with mixed precision, whatever the model size.
|
|
|
#### Run Nvidia Triton inference server |
|
|
|
Note that we install `transformers` at run time. |
|
For production, it's advised to build your own 3-line Docker image with `transformers` pre-installed. |
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 1024m \ |
|
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \ |
|
tritonserver --model-repository=/models" |
|
|
|
# output: |
|
# ... |
|
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001 |
|
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000 |
|
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 |
|
``` |
|
|
|
#### Query inference |
|
|
|
Query the ONNX model (replace `transformer_onnx_inference` with `transformer_tensorrt_inference` to query the TensorRT engine):
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \ |
|
--data-binary "@demo/question-answering/query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 276" |
|
|
|
# output: |
|
# {"model_name":"transformer_onnx_inference","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["{\"score\": 0.9925152659416199, \"start\": 34, \"end\": 40, \"answer\": \"Berlin\"}"]}]} |
|
``` |
|
Check out `demo/question-answering/query_bin_gen.ipynb` to see how to generate the `query_body.bin` file.

More inference examples can be found in `demo/question-answering/`.
|
|
|
|
|
### Feature extraction / dense embeddings |
|
|
|
Feature extraction in NLP is the task of converting text to dense embeddings.
|
It has gained some traction as a robust way to improve search engine relevancy (increase recall). |
|
This project supports models from [sentence-transformers](https://github.com/UKPLab/sentence-transformers) and requires version >= 2.2.0 of the sentence-transformers library.
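
If you export your model outside the provided Docker image, you can quickly check the locally installed version (assuming `sentence-transformers` is installed in the current environment):

```shell
# should print 2.2.0 or higher
python -c "import sentence_transformers; print(sentence_transformers.__version__)"
```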
|
#### Optimize existing model |
|
|
|
```shell |
|
docker run -it --rm --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \ |
|
--backend tensorrt onnx \ |
|
--task embedding \ |
|
--seq-len 16 128 128" |
|
|
|
# output: |
|
# ... |
|
# Inference done on NVIDIA GeForce RTX 3090 |
|
# latencies: |
|
# [Pytorch (FP32)] mean=5.19ms, sd=0.45ms, min=4.74ms, max=6.64ms, median=5.03ms, 95p=6.14ms, 99p=6.26ms |
|
# [Pytorch (FP16)] mean=5.41ms, sd=0.18ms, min=5.26ms, max=8.15ms, median=5.36ms, 95p=5.62ms, 99p=5.72ms |
|
# [TensorRT (FP16)] mean=0.72ms, sd=0.04ms, min=0.69ms, max=1.33ms, median=0.70ms, 95p=0.78ms, 99p=0.81ms |
|
# [ONNX Runtime (FP32)] mean=1.69ms, sd=0.18ms, min=1.62ms, max=4.07ms, median=1.64ms, 95p=1.86ms, 99p=2.44ms |
|
# [ONNX Runtime (optimized)] mean=1.03ms, sd=0.09ms, min=0.98ms, max=2.30ms, median=1.00ms, 95p=1.15ms, 99p=1.41ms |
|
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
|
``` |
|
|
|
#### Run Nvidia Triton inference server |
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \ |
|
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install transformers && tritonserver --model-repository=/models" |
|
|
|
# output: |
|
# ... |
|
# I0207 11:04:33.761517 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001 |
|
# I0207 11:04:33.761844 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000 |
|
# I0207 11:04:33.803373 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 |
|
|
|
``` |
|
|
|
#### Query inference |
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \ |
|
--data-binary "@demo/infinity/query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 161" |
|
|
|
# output: |
|
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,768],"data":[0.06549072265625,-0.04327392578125,0.1103515625,-0.007320404052734375,... |
|
``` |
|
|
|
### Generate text (decoder model) |
|
|
|
Text generation seems to be the way to go for NLP.

Unfortunately, these models are slow to run; below we will accelerate the most famous of them: GPT-2.
|
|
|
#### GPT example |
|
We will start with a GPT-2 example, then in a later section we will convert a T5 model.
|
|
|
#### Optimize existing model |
|
|
|
As before, the command below prepares everything the Triton inference server needs.

One point to keep in mind is that Triton runs:

- the inference engines (`ONNX Runtime` and `TensorRT`)

- `Python` code in charge of the `decoding` part; this `Python` code delegates model management to the Triton server.
|
|
|
`Python` code is in `./triton_models/transformer_tensorrt_generate/1/model.py` |
|
|
|
```shell |
|
docker run -it --rm --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m gpt2 \ |
|
--backend tensorrt onnx \ |
|
--seq-len 6 256 256 \ |
|
--task text-generation" |
|
|
|
# output: |
|
# ... |
|
# Inference done on NVIDIA GeForce RTX 3090 |
|
# latencies: |
|
# [Pytorch (FP32)] mean=9.43ms, sd=0.59ms, min=8.95ms, max=15.02ms, median=9.33ms, 95p=10.38ms, 99p=12.46ms |
|
# [Pytorch (FP16)] mean=9.92ms, sd=0.55ms, min=9.50ms, max=15.06ms, median=9.74ms, 95p=10.96ms, 99p=12.26ms |
|
# [TensorRT (FP16)] mean=2.19ms, sd=0.18ms, min=2.06ms, max=3.04ms, median=2.10ms, 95p=2.64ms, 99p=2.79ms |
|
# [ONNX Runtime (FP32)] mean=4.99ms, sd=0.38ms, min=4.68ms, max=9.09ms, median=4.78ms, 95p=5.72ms, 99p=5.95ms |
|
# [ONNX Runtime (optimized)] mean=3.93ms, sd=0.40ms, min=3.62ms, max=6.53ms, median=3.81ms, 95p=4.49ms, 99p=5.79ms |
|
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
|
``` |
|
|
|
Two detailed notebooks are available: |
|
|
|
* GPT-2: <https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb> |
|
* T5: <https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5.ipynb> |
|
|
|
#### Optimize existing large model |
|
|
|
To optimize models that typically don't fit twice onto a single GPU, run the script as follows:
|
|
|
```shell |
|
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m gpt2-medium \ |
|
--backend tensorrt onnx \ |
|
--seq-len 6 256 256 \ |
|
--fast \ |
|
--atol 3 \ |
|
--task text-generation" |
|
``` |
|
|
|
The larger the model, the more likely it is that you will also need to increase the absolute tolerance (`--atol`) passed to the script.
|
Additionally, some models may return a message similar to: `Converted FP32 value in weights (either FP32 infinity or FP32 value outside FP16 range) to corresponding FP16 infinity`. It is best to test and evaluate the model afterwards to understand the implications of this conversion. |
|
|
|
Depending on the model size, this may take a long time; converting GPT-Neo 2.7B can easily take an hour or more.
|
|
|
#### Run Nvidia Triton inference server |
|
|
|
To run the decoding algorithm server side, we need to install `Pytorch` in the `Triton` docker image.
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \ |
|
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \ |
|
tritonserver --model-repository=/models" |
|
|
|
# output: |
|
# ... |
|
# I0207 10:29:19.091191 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001 |
|
# I0207 10:29:19.091417 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000 |
|
# I0207 10:29:19.132902 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 |
|
``` |
|
|
|
#### Query inference |
|
|
|
Replace `transformer_onnx_generate` with `transformer_tensorrt_generate` to query the `TensorRT` engine.
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_generate/versions/1/infer \ |
|
--data-binary "@demo/infinity/query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 161" |
|
|
|
# output: |
|
# {"model_name":"transformer_onnx_generate","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["This live event is great. I will sign-up for Infinity.\n\nI'm going to be doing a live stream of the event.\n\nI"]}]} |
|
``` |
|
|
|
Ok, the output is not very interesting (💩 in -> 💩 out) but you get the idea. |
|
Source code of the generative model is in `./triton_models/transformer_tensorrt_generate/1/model.py`. |
|
You may want to tweak it to your needs (the default is greedy search with 64 output tokens).
|
|
|
#### Python code |
|
|
|
You may be interested in running optimized text generation directly from Python, without using any inference server:
|
|
|
```shell |
|
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root" |
|
``` |
|
|
|
#### T5-small example |
|
In this section we present the conversion of the t5-small model.
|
|
|
#### Optimize existing model
|
|
|
To optimize the model, run the script as follows:
|
|
|
```shell |
|
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ |
|
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && \ |
|
convert_model -m t5-small \ |
|
--backend onnx \ |
|
--seq-len 16 256 256 \ |
|
--task text-generation \ |
|
--nb-measures 100 \ |
|
--generative-model t5 \ |
|
--output triton_models" |
|
``` |
|
#### Run Nvidia Triton inference server |
|
|
|
To run the decoding algorithm server side, we need to install `Pytorch` in the `Triton` docker image.
|
|
|
```shell |
|
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \ |
|
-v $PWD/triton_models/:/models nvcr.io/nvidia/tritonserver:22.07-py3 \ |
|
bash -c "pip install onnx onnxruntime-gpu transformers==4.21.3 git+https://github.com/ELS-RD/transformer-deploy torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html onnx onnxruntime-gpu && \ |
|
tritonserver --model-repository=/models" |
|
``` |
|
To test text generation, you can try this request: |
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/t5_model_generate/versions/1/infer --data-binary "@demo/generative-model/t5_query_body.bin" --header "Inference-Header-Content-Length: 181" |
|
|
|
# output: |
|
# {"model_name":"t5_model_generate","model_version":"1","outputs":[{"name":"OUTPUT_TEXT","datatype":"BYTES","shape":[],"data":["Mein Name mein Wolfgang Wolfgang und ich wohne in Berlin."]}]} |
|
``` |
|
#### Query inference |
|
|
|
Replace `transformer_onnx_generate` with `transformer_tensorrt_generate` to query the `TensorRT` engine.
|
|
|
```shell |
|
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \ |
|
--data-binary "@demo/infinity/seq2seq_query_body.bin" \ |
|
--header "Inference-Header-Content-Length: 176" |
|
``` |
|
|
|
### Model quantization on GPU |
|
|
|
Quantization is a generic method to get a ~2X speedup on top of other inference optimizations.

GPU quantization on transformers is almost never used because it requires modifying the model source code.
|
|
|
We have implemented in this library a mechanism that patches the Hugging Face transformers library to support quantization.

It makes it easy to use.
|
|
|
To play with it, open this notebook: |
|
|
|
```shell |
|
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \ |
|
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root" |
|
``` |
|
|
|
<!--why-end--> |
|
|
|
## See our [documentation](https://els-rd.github.io/transformer-deploy/) for detailed instructions on how to use the package, including setup, GPU quantization support and Nvidia Triton inference server deployment. |
|
|