Demo
The subfolders contain experiments that we hope you will find interesting.
Infinity
This code is related to the article Hugging Face Transformer inference UNDER 1 millisecond latency.
It shows how, with only open source tools, you can easily get better performance than the commercial solution from Hugging Face.
You will get inference in the millisecond range on a cheap T4 GPU (the cheapest option from AWS).
It includes end-to-end code to reproduce the benchmarks published in the Medium article linked above.
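The heart of the open source path is a plain ONNX export served by ONNX Runtime. Below is a minimal sketch of that pipeline; the checkpoint name is a stand-in (the article benchmarks its own model) and the CUDA provider assumes an onnxruntime-gpu install:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import onnxruntime as ort

# Stand-in checkpoint: replace with the model benchmarked in the article.
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
encodings = tokenizer("This is a test", return_tensors="pt")

# Export to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    args=(encodings["input_ids"], encodings["attention_mask"]),
    f="model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# Run with ONNX Runtime on GPU (requires onnxruntime-gpu).
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
logits = session.run(
    None,
    {
        "input_ids": encodings["input_ids"].numpy(),
        "attention_mask": encodings["attention_mask"].numpy(),
    },
)[0]
```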
Quantization
A notebook explaining end to end how to apply GPU quantization to a transformer model. It also includes code to significantly improve accuracy by disabling quantization on sensitive nodes, as sketched below. With this technique, expect 4x-5x faster inference than vanilla PyTorch.
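As a rough illustration, here is a minimal sketch of the "disable sensitive nodes" idea, assuming NVIDIA's pytorch-quantization library. The layer names in sensitive_layers are hypothetical; in practice they are found by disabling quantizers one at a time and measuring the accuracy impact. Calibration and ONNX export are omitted; the notebook covers the full flow.

```python
from pytorch_quantization import quant_modules
from pytorch_quantization.nn import TensorQuantizer
from transformers import AutoModelForSequenceClassification

# Monkey-patch torch.nn layers with their quantized counterparts, so the
# model below is instantiated with TensorQuantizer nodes already inserted.
quant_modules.initialize()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Hypothetical list of accuracy-sensitive layers, found empirically by
# disabling quantizers one by one and checking the accuracy drop.
sensitive_layers = ["encoder.layer.3", "encoder.layer.11"]
for name, module in model.named_modules():
    if isinstance(module, TensorQuantizer) and any(s in name for s in sensitive_layers):
        module.disable()  # this quantizer now passes tensors through unchanged
```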
Generative model
Decoder-based models like GPT-2
have an architecture similar to BERT's, but they are definitely a different beast.
In the notebook, we show how much IO matters for these models, as sketched below. In the end, we get a 4x speedup compared to the Hugging Face code.
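A minimal sketch of the IO idea, using ONNX Runtime's IO binding so that inputs are copied to the GPU once and outputs stay there instead of crossing PCIe on every call. The file name and tensor names are assumptions; the model must have been exported to ONNX beforehand:

```python
import numpy as np
import onnxruntime as ort

# Assumed artifact: a GPT-2 model previously exported to ONNX.
session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = np.array([[464, 3290, 318]], dtype=np.int64)
binding.bind_cpu_input("input_ids", input_ids)  # copied to the GPU once
binding.bind_output("logits", device_type="cuda")  # result stays on the GPU

session.run_with_iobinding(binding)
logits = binding.get_outputs()[0]  # OrtValue on GPU, no device-to-host copy
```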
Question answering
An example of a question answering model server request using Triton. The notebook explains how to create the query_body.bin file used in the cURL request, as sketched below.
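For reference, a pure-JSON body is the simplest valid format for Triton's KServe v2 HTTP endpoint. The sketch below writes such a body to query_body.bin; the input/output names and the model name are assumptions and must match your model's config.pbtxt (the notebook builds the actual request used in the demo):

```python
import json

# Input and output names are assumptions: they must match config.pbtxt.
payload = {
    "inputs": [
        {
            "name": "TEXT",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Where does Clara live? My name is Clara and I live in Berkeley."],
        }
    ],
    "outputs": [{"name": "output"}],
}
with open("query_body.bin", "wb") as f:
    f.write(json.dumps(payload).encode("utf-8"))

# Then query the server (the model name is an assumption):
# curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/infer \
#      --data-binary "@query_body.bin" -H "Content-Type: application/json"
```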
TorchDynamo
TorchDynamo
is a promising system that offers the speedups of a model compiler with the flexibility of PyTorch.
In this experiment, we benchmark the tool against more traditional approaches.
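TorchDynamo now ships inside PyTorch core behind torch.compile. A minimal sketch of compiling a transformer with it (the checkpoint and input shapes are stand-ins, and a CUDA device is assumed):

```python
import torch
from transformers import AutoModel

# Stand-in model and shapes for benchmarking purposes.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
compiled_model = torch.compile(model)  # TorchDynamo captures graphs, the backend compiles them

inputs = {
    "input_ids": torch.randint(0, 30000, (1, 128), device="cuda"),
    "attention_mask": torch.ones((1, 128), dtype=torch.int64, device="cuda"),
}
with torch.inference_mode():
    compiled_model(**inputs)  # first call triggers compilation (slow)
    outputs = compiled_model(**inputs)  # later calls reuse the compiled graph
```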