---
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
language:
  - en
tags:
  - stable-diffusion
  - stable-diffusion-xl
  - onnxruntime
  - onnx
  - text-to-image
---

# Stable Diffusion XL 1.0 for ONNX Runtime

## Introduction

This repository hosts optimized versions of Stable Diffusion XL 1.0 that accelerate inference with the ONNX Runtime CUDA execution provider.

The models were generated by Olive with a command like the following:

```bash
python stable_diffusion_xl.py --provider cuda --optimize --use_fp16_fixed_vae
```

See the usage example below for how to run the SDXL pipeline with the ONNX models hosted in this repository.

## Model Description

The VAE decoder is converted from sdxl-vae-fp16-fix. There are slight discrepancies between its output and that of the original VAE, but the decoded images should be close enough for most purposes.
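
To see how large the discrepancy is in practice, one can decode the same latent with both VAEs and compare the outputs. A minimal sketch, assuming the ONNX decoder lives at `vae_decoder/model.onnx` and takes an input named `latent_sample` (both assumptions based on common diffusers/Olive export layouts):

```python
# Sketch: compare the fp16-fix ONNX VAE decoder against the original fp32 VAE.
# The model path and the input name "latent_sample" are assumptions.
import numpy as np
import onnxruntime as ort
import torch
from diffusers import AutoencoderKL

latent = np.random.randn(1, 4, 128, 128).astype(np.float16)  # latent for a 1024x1024 image

onnx_vae = ort.InferenceSession(
    "stable-diffusion-xl-1.0-onnxruntime/vae_decoder/model.onnx",
    providers=["CUDAExecutionProvider"],
)
onnx_image = onnx_vae.run(None, {"latent_sample": latent})[0]

torch_vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
with torch.no_grad():
    torch_image = torch_vae.decode(torch.from_numpy(latent).float()).sample.numpy()

# Decoded images are in roughly [-1, 1]; small differences here become
# invisible after conversion to 8-bit pixels.
print("max abs diff:", np.abs(onnx_image.astype(np.float32) - torch_image).max())
```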

## Performance Comparison

### Latency for 30 base steps and 9 refiner steps

Below is the average latency of generating a 1024x1024 image on an NVIDIA A100-SXM4-80GB GPU:

| Batch Size | PyTorch 2.1 | ONNX Runtime CUDA |
|------------|-------------|-------------------|
| 1          | 3779 ms     | 3389 ms           |
| 4          | 13504 ms    | 12264 ms          |

In this test, CUDA graph was used to speed up both runs: PyTorch compiled the UNet with torch.compile, and ONNX Runtime used its CUDA graph support.
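
For reference, ONNX Runtime exposes CUDA graph capture as a CUDA execution provider option. A minimal sketch of enabling it (not the demo's exact configuration; the model path is a placeholder, and CUDA graphs additionally require fixed input shapes and buffer addresses, which the demo handles via I/O binding):

```python
import onnxruntime as ort

# Enable CUDA graph capture through the CUDA execution provider option.
# The path below is a placeholder for one of the models in this repo.
session = ort.InferenceSession(
    "stable-diffusion-xl-1.0-onnxruntime/unet/model.onnx",
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})],
)
```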

## Usage Example

The example steps below follow the demo instructions:

1. Install nvidia-docker using these instructions.

2. Clone the onnxruntime repository:

```bash
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

3. Download the SDXL ONNX files from this repository:

```bash
git lfs install
git clone https://huggingface.co/tlwu/stable-diffusion-xl-1.0-onnxruntime
```
4. Launch the docker container:

```bash
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.10-py3 /bin/bash
```
5. Build ONNX Runtime from source:

```bash
export CUDACXX=/usr/local/cuda-12.2/bin/nvcc
git config --global --add safe.directory '*'
sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 12.2 \
            --cuda_home /usr/local/cuda-12.2 --cudnn_home /usr/lib/x86_64-linux-gnu/ --build_wheel --skip_tests \
            --use_tensorrt --tensorrt_home /usr/src/tensorrt \
            --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --allow_running_as_root
python3 -m pip install build/Linux/Release/dist/onnxruntime_gpu-*-cp310-cp310-linux_x86_64.whl --force-reinstall
```
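
After the wheel is installed, a quick sanity check (generic onnxruntime API, not part of the demo) confirms the custom build exposes the GPU execution providers:

```python
import onnxruntime as ort

# The custom build should report both CUDAExecutionProvider and
# TensorrtExecutionProvider among the available providers.
print(ort.__version__)
print(ort.get_available_providers())
```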

If the GPU is not A100, change `CMAKE_CUDA_ARCHITECTURES=80` in the command line according to the GPU compute capability (for example, 89 for RTX 4090 or 86 for RTX 3090). If your machine has less than 64 GB of memory, replace `--parallel` with `--parallel 4 --nvcc_threads 1` to avoid running out of memory.
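
If you are unsure of the compute capability, PyTorch (already available in the NGC container) can report it; for example, `(8, 9)` maps to `CMAKE_CUDA_ARCHITECTURES=89`:

```python
import torch

# Prints a (major, minor) tuple, e.g. (8, 0) on A100 or (8, 9) on RTX 4090.
print(torch.cuda.get_device_capability())
```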

6. Install the required libraries:

```bash
python3 -m pip install --upgrade pip
cd /workspace/onnxruntime/python/tools/transformers/models/stable_diffusion
python3 -m pip install -r requirements-cuda12.txt
python3 -m pip install --upgrade polygraphy onnx-graphsurgeon --extra-index-url https://pypi.ngc.nvidia.com
```
7. Perform ONNX Runtime optimized inference:

```bash
python3 demo_txt2img_xl.py \
  "starry night over Golden Gate Bridge by van gogh" \
  --engine-dir /workspace/stable-diffusion-xl-1.0-onnxruntime
```
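
To inspect the downloaded models independently of the demo script, any of the ONNX files can be loaded directly; a minimal sketch (the `text_encoder` subfolder name is an assumption about the repo layout):

```python
import onnxruntime as ort

# Load one model from the downloaded repo and list its inputs; useful for
# verifying the files and the CUDA execution provider before a full run.
session = ort.InferenceSession(
    "/workspace/stable-diffusion-xl-1.0-onnxruntime/text_encoder/model.onnx",
    providers=["CUDAExecutionProvider"],
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```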