---
language:
- en
pipeline_tag: text-to-image
library_name: diffusers
---
# You Only Sample Once (YOSO)
YOSO was proposed in ["You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs"](https://arxiv.org/abs/2403.12931) by Yihong Luo, Xiaolong Chen, and Jing Tang.

The official repository of this paper: YOSO.
## Usage
### 1-step inference
1-step inference is currently only supported for SD v1.5. For better results, you should prepare the informative initialization described in the paper.
```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 1
bs = 1

latents = ...  # maybe some latent codes of real images or SD generation, shape [N, 4, 64, 64]
latent_mean = latents.mean(dim=0)
# Noise the mean latent up to timestep T to build the informative initialization.
noise = torch.randn([bs, 4, 64, 64], dtype=latent_mean.dtype, device=latent_mean.device)
T = ...  # noising timestep used for the informative initialization (see the paper)
input_latent = pipeline.scheduler.add_noise(latent_mean.repeat(bs, 1, 1, 1), noise, T)

imgs = pipeline(prompt="A photo of a dog",
                num_inference_steps=steps,
                num_images_per_prompt=1,
                generator=generator,
                guidance_scale=1.5,
                latents=input_latent,
                )[0]
imgs
```
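The `latents` placeholder above is left to the user. As a rough sketch (not part of the original card), one way to obtain such latent codes is to encode reference images with the pipeline's VAE; the file name and preprocessing below are illustrative assumptions:

```python
from PIL import Image
import numpy as np

# Sketch: encode a reference image into SD latent space.
# "reference.png" is a hypothetical local 512x512 RGB image.
img = Image.open("reference.png").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0   # scale to [-1, 1]
pixels = pixels.permute(2, 0, 1).unsqueeze(0).to('cuda', torch.float16)
with torch.no_grad():
    posterior = pipeline.vae.encode(pixels).latent_dist
    latents = posterior.sample() * pipeline.vae.config.scaling_factor  # [1, 4, 64, 64]
```

With several reference latents stacked along the batch dimension, `latents.mean(dim=0)` in the snippet above yields the mean latent used for the informative initialization.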
Simple inference without the informative initialization also works, but with worse quality:
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype = torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 1
imgs = pipeline(prompt="A photo of a corgi in forest, highly detailed, 8k, XT3.",
num_inference_steps=1,
num_images_per_prompt = 1,
generator = generator,
guidance_scale=1.,
)[0]
imgs[0]
### 2-step inference
We note that a small CFG scale can be used to enhance image quality.
pipeline = DiffusionPipeline.from_pretrained("stablediffusionapi/realistic-vision-v51", torch_dtype = torch.float16)
pipeline = pipeline.to('cuda')
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_lora_weights('Luo-Yihong/yoso_sd1.5_lora')
generator = torch.manual_seed(318)
steps = 2
imgs= pipeline(prompt="A photo of a man, XT3",
num_inference_steps=steps,
num_images_per_prompt = 1,
generator = generator,
guidance_scale=1.5,
)[0]
imgs
You may try some interesting applications, like:
```python
from diffusers.utils import make_image_grid

generator = torch.manual_seed(318)
steps = 2
img_list = []
for age in [2, 20, 30, 50, 60, 80]:
    imgs = pipeline(prompt=f"A photo of a cute girl, {age} yr old, XT3",
                    num_inference_steps=steps,
                    num_images_per_prompt=1,
                    generator=generator,
                    guidance_scale=1.1,
                    )[0]
    img_list.append(imgs[0])
make_image_grid(img_list, rows=1, cols=len(img_list))
```
You can increase the number of steps to further improve sample quality.
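For example, a 4-step variant of the 2-step call above (the step count here is an illustrative choice, not a value from the paper):

```python
generator = torch.manual_seed(318)
imgs = pipeline(prompt="A photo of a man, XT3",
                num_inference_steps=4,
                num_images_per_prompt=1,
                generator=generator,
                guidance_scale=1.5,
                )[0]
imgs[0]
```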
## Bibtex
```bibtex
@misc{luo2024sample,
      title={You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs},
      author={Yihong Luo and Xiaolong Chen and Jing Tang},
      year={2024},
      eprint={2403.12931},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```