Demo
The subfolders contain experiments that we hope you will find interesting.
Infinity
This code is related to the article Hugging Face Transformer inference UNDER 1 millisecond latency.
It shows how, with only open source tools, you can easily get better performance than the commercial solution from Hugging Face.
You will get inference in the millisecond range on a cheap T4 GPU (the cheapest option from AWS).
It includes end-to-end code to reproduce the benchmarks published in the Medium article linked above.
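The heart of the open source path is a plain ONNX export served by ONNX Runtime. Below is a minimal sketch of that pipeline; the checkpoint name is a stand-in (the article benchmarks its own model) and the CUDA provider assumes an onnxruntime-gpu install:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import onnxruntime as ort

# Stand-in checkpoint: replace with the model benchmarked in the article.
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
encodings = tokenizer("This is a test", return_tensors="pt")

# Export to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    args=(encodings["input_ids"], encodings["attention_mask"]),
    f="model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# Run with ONNX Runtime on GPU (requires onnxruntime-gpu).
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
logits = session.run(
    None,
    {
        "input_ids": encodings["input_ids"].numpy(),
        "attention_mask": encodings["attention_mask"].numpy(),
    },
)[0]
```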
Quantization
A notebook explaining end to end how to apply GPU quantization to a transformer model. It also includes code to significantly improve accuracy by disabling quantization on sensitive nodes, as sketched below. With this technique, expect 4x-5x faster inference than vanilla PyTorch.
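As a rough illustration, here is a minimal sketch of the "disable sensitive nodes" idea, assuming NVIDIA's pytorch-quantization library. The layer names in sensitive_layers are hypothetical; in practice they are found by disabling quantizers one at a time and measuring the accuracy impact. Calibration and ONNX export are omitted; the notebook covers the full flow.

```python
from pytorch_quantization import quant_modules
from pytorch_quantization.nn import TensorQuantizer
from transformers import AutoModelForSequenceClassification

# Monkey-patch torch.nn layers with their quantized counterparts, so the
# model below is instantiated with TensorQuantizer nodes already inserted.
quant_modules.initialize()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Hypothetical list of accuracy-sensitive layers, found empirically by
# disabling quantizers one by one and checking the accuracy drop.
sensitive_layers = ["encoder.layer.3", "encoder.layer.11"]
for name, module in model.named_modules():
    if isinstance(module, TensorQuantizer) and any(s in name for s in sensitive_layers):
        module.disable()  # this quantizer now passes tensors through unchanged
```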
Generative model
Decoder-based models like GPT-2
have an architecture similar to BERT's, but they are definitely a different beast.
In the notebook, we show how much IO matters for these models, as sketched below. In the end, we get a 4x speedup compared to the Hugging Face code.
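A minimal sketch of the IO idea, using ONNX Runtime's IO binding so that inputs are copied to the GPU once and outputs stay there instead of crossing PCIe on every call. The file name and tensor names are assumptions; the model must have been exported to ONNX beforehand:

```python
import numpy as np
import onnxruntime as ort

# Assumed artifact: a GPT-2 model previously exported to ONNX.
session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = np.array([[464, 3290, 318]], dtype=np.int64)
binding.bind_cpu_input("input_ids", input_ids)  # copied to the GPU once
binding.bind_output("logits", device_type="cuda")  # result stays on the GPU

session.run_with_iobinding(binding)
logits = binding.get_outputs()[0]  # OrtValue on GPU, no device-to-host copy
```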
Question answering
An example of a question answering model server request using Triton. The notebook explains how to create the query_body.bin file used in the cURL request, as sketched below.
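For reference, a pure-JSON body is the simplest valid format for Triton's KServe v2 HTTP endpoint. The sketch below writes such a body to query_body.bin; the input/output names and the model name are assumptions and must match your model's config.pbtxt (the notebook builds the actual request used in the demo):

```python
import json

# Input and output names are assumptions: they must match config.pbtxt.
payload = {
    "inputs": [
        {
            "name": "TEXT",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Where does Clara live? My name is Clara and I live in Berkeley."],
        }
    ],
    "outputs": [{"name": "output"}],
}
with open("query_body.bin", "wb") as f:
    f.write(json.dumps(payload).encode("utf-8"))

# Then query the server (the model name is an assumption):
# curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/infer \
#      --data-binary "@query_body.bin" -H "Content-Type: application/json"
```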
TorchDynamo
TorchDynamo
is a promising system that offers the speedups of a model compiler with the flexibility of PyTorch.
In this experiment, we benchmark the tool against more traditional approaches.
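TorchDynamo now ships inside PyTorch core behind torch.compile. A minimal sketch of compiling a transformer with it (the checkpoint and input shapes are stand-ins, and a CUDA device is assumed):

```python
import torch
from transformers import AutoModel

# Stand-in model and shapes for benchmarking purposes.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
compiled_model = torch.compile(model)  # TorchDynamo captures graphs, the backend compiles them

inputs = {
    "input_ids": torch.randint(0, 30000, (1, 128), device="cuda"),
    "attention_mask": torch.ones((1, 128), dtype=torch.int64, device="cuda"),
}
with torch.inference_mode():
    compiled_model(**inputs)  # first call triggers compilation (slow)
    outputs = compiled_model(**inputs)  # later calls reuse the compiled graph
```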