File size: 3,881 Bytes
fb21dd2
 
 
 
454615f
fb21dd2
ddd56e8
410c9af
d175329
ddd56e8
13645f4
 
 
ddd56e8
 
3f523df
ee34371
6487dd2
bb3264f
 
6487dd2
 
 
7cf8c81
6487dd2
 
 
 
 
 
4dff679
6487dd2
 
 
 
 
 
 
 
 
3f523df
581062e
 
 
 
 
bed2421
581062e
 
6262708
581062e
 
 
 
 
6262708
581062e
00bed5d
3f523df
 
 
ddd56e8
 
 
7cf8c81
ddd56e8
 
 
 
 
 
 
 
 
6109f69
b408920
0c7a11a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7ef0e6c
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
language:
- en
pipeline_tag: text-to-image
library_name: diffusers
---
# You Only Sample Once (YOSO)
![overview](overview.jpg)
The YOSO was proposed in "[You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs](https://www.arxiv.org/abs/2403.12931)" by *Yihong Luo, Xiaolong Chen, Jing Tang*. 

Official Repository of this paper: [YOSO](https://github.com/Luo-Yihong/YOSO).


## Usage

### 1-step inference
1-step inference is only allowed based on SD v1.5 for now. And you should prepare the informative initialization according to the paper for better results.
```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype = torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 1
bs = 1
latents = ... # maybe some latent codes of real images or SD generation
latent_mean = latent.mean(dim=0)
noise = torch.randn([1,bs,64,64])
input_latent = pipeline.scheduler.add_noise(latent_mean.repeat(bs,1,1,1),noise,T)
imgs= pipeline(prompt="A photo of a dog",
                    num_inference_steps=steps, 
                    num_images_per_prompt = 1,
                        generator = generator,
                        guidance_scale=1.5,
                    latents = input_latent,
                   )[0]
imgs
```

The simple inference without informative initialization, but worse quality:
```python
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype = torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 1
imgs = pipeline(prompt="A photo of a corgi in forest, highly detailed, 8k, XT3.",
                    num_inference_steps=1, 
                    num_images_per_prompt = 1,
                        generator = generator,
                        guidance_scale=1.,
                   )[0]
imgs[0]
```
![Corgi](corgi.jpg)
### 2-step inference
We note that a small CFG can be used to enhance the image quality.
```python
pipeline = DiffusionPipeline.from_pretrained("stablediffusionapi/realistic-vision-v51", torch_dtype = torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 2
imgs= pipeline(prompt="A photo of a man, XT3",
                    num_inference_steps=steps, 
                    num_images_per_prompt = 1,
                        generator = generator,
                        guidance_scale=1.5,
                   )[0]
imgs
```
![man](man.jpg)

You may try some interesting applications, like:
```python
generator = torch.manual_seed(318)
steps = 2
img_list = []
for age in [2,20,30,50,60,80]:
    imgs = pipeline(prompt=f"A photo of a cute girl, {age} yr old, XT3",
                        num_inference_steps=steps, 
                        num_images_per_prompt = 1,
                            generator = generator,
                            guidance_scale=1.1,
                       )[0]
    img_list.append(imgs[0])
make_image_grid(img_list,rows=1,cols=len(img_list))
```
![life](life.jpg)

You can increase the steps to improve sample quality.

## Bibtex
```
@misc{luo2024sample,
   title={You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs},
  author={Yihong Luo and Xiaolong Chen and Jing Tang},
  booktitle={arXiv preprint arxiv:2403.12931},
  year={2024},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```