File size: 5,376 Bytes
ba205fe
001b1a7
 
3fd0370
ba205fe
 
 
585a602
 
af81976
585a602
af81976
9f41c55
af81976
9f41c55
 
 
 
af81976
 
 
9f41c55
af81976
 
9f41c55
af81976
 
 
 
 
7baaca1
af81976
 
9f41c55
5d1ddad
585a602
 
f10710d
585a602
af81976
 
 
f10710d
 
af81976
585a602
ac22e3d
af81976
f10710d
af81976
 
f10710d
 
 
ac22e3d
af81976
 
7baaca1
 
 
 
 
af81976
 
 
 
 
 
8b30da2
af81976
 
c3da414
 
 
 
 
 
 
af81976
 
 
 
 
 
 
 
 
 
 
001b1a7
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
---
license: mit
prior:
- warp-diffusion/wuerstchen-prior
tags:
- text-to-image
- wuerstchen
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>

## Würstchen - Overview
Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
computational costs for both training and inference by magnitudes. Training on 1024x1024 images, is way more expensive than training at 32x32. Usually, other works make 
use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial
compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a 
two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
also cheaper and faster inference. 

## Würstchen - Decoder
The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
decodes the latents into pixel space. Together, they achieve a spatial compression of 42. 

**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, which are especially noticeable to
us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!

### Image Sizes
Würstchen was trained on image resolutions between 1024x1024 & 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out.
We also observed that the Prior (Stage C) adapts extremely fast to new resolutions. So finetuning it at 2048x2048 should be computationally cheap.
<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg" width=1000>

## How to run
This pipeline should be run together with a prior https://huggingface.co/warp-ai/wuerstchen-prior:

```py
import torch
from diffusers import AutoPipelineForText2Image

device = "cuda"
dtype = torch.float16

pipeline =  AutoPipelineForText2Image.from_pretrained(
    "warp-diffusion/wuerstchen", torch_dtype=dtype
).to(device)

caption = "Anthropomorphic cat dressed as a fire fighter"

output = pipeline(
    prompt=caption,
    height=1024,
    width=1024,
    prior_guidance_scale=4.0,
    decoder_guidance_scale=0.0,
).images
```

### Image Sampling Times
The figure shows the inference times (on an A100) for different batch sizes (`num_images_per_prompt`) on Würstchen compared to [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without refiner).
The left figure shows inference times (using torch > 2.0), whereas the right figure applies `torch.compile` to both pipelines in advance.
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)

## Model Details
- **Developed by:** Pablo Pernias, Dominic Rampas
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** MIT
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
- **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
- **Cite as:**

      @inproceedings{
            pernias2024wrstchen,
            title={W\"urstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
            author={Pablo Pernias and Dominic Rampas and Mats Leon Richter and Christopher Pal and Marc Aubreville},
            booktitle={The Twelfth International Conference on Learning Representations},
            year={2024},
            url={https://openreview.net/forum?id=gU58d5QeGv}
      }

## Environmental Impact

**Würstchen v2** **Estimated Emissions**
Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.

- **Hardware Type:** A100 PCIe 40GB
- **Hours used:** 24602
- **Cloud Provider:** AWS
- **Compute Region:** US-east
- **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.