---
license: apache-2.0
datasets:
- yuvalkirstain/pickapic_v1
language:
- en
pipeline_tag: text-to-image
---
# Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

<a href="https://arxiv.org/abs/2406.04314"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a>
<a href="https://github.com/RockeyCoss/SPO"><img src="https://img.shields.io/badge/GitHub-Code-success?style=for-the-badge&logo=GitHub" height=22.5></a>
<a href="https://rockeycoss.github.io/spo.github.io/"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge" height=22.5></a>

<table>
  <tr>
    <td><img src="assets/1.png" alt="teaser example 0" width="200"/></td>
    <td><img src="assets/2.png" alt="teaser example 1" width="200"/></td>
    <td><img src="assets/3.png" alt="teaser example 2" width="200"/></td>
    <td><img src="assets/4.png" alt="teaser example 3" width="200"/></td>
  </tr>
</table>

## Abstract
<p>
    Generating visually appealing images is fundamental to modern text-to-image generation models. 
    A potential solution to better aesthetics is direct preference optimization (DPO), 
    which has been applied to diffusion models to improve general image quality including prompt alignment and aesthetics. 
    Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. 
    However, the preference labels provided in existing datasets blend layout and aesthetic opinions, so they can disagree with purely aesthetic preference. 
    Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps.
</p>
<p>
    To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization 
    (SPO) that discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, 
    at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 
    2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 
    3) randomly select one from the pool to initialize the next denoising step. 
    This strategy ensures that diffusion models focus on subtle, fine-grained visual differences 
    instead of layout aspects. We find that aesthetics can be significantly enhanced by accumulating these 
    minor improvements.
</p>
<p>
    When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields significant 
    improvements in aesthetics compared with existing DPO methods while not sacrificing image-text alignment 
    compared with vanilla models. Moreover, SPO converges much faster than DPO methods due to the step-by-step 
    alignment of fine-grained visual details.
</p>
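
The three per-step operations listed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `denoise_step` and `preference_score` are hypothetical stand-ins for the diffusion sampler and the step-aware preference model, latents are plain lists of floats, and choosing the highest- and lowest-scoring candidates as the win-lose pair is an assumption made here for simplicity.

```python
import random


def denoise_step(latent, t, rng, noise_scale=0.1):
    """Hypothetical stand-in for one stochastic denoising step."""
    return [0.9 * x + rng.gauss(0.0, noise_scale) for x in latent]


def preference_score(latent, t):
    """Hypothetical stand-in for a step-aware preference model (higher is better)."""
    return -sum(x * x for x in latent)


def spo_step(latent, t, rng, pool_size=4):
    # 1) Sample a pool of candidates by denoising from the shared latent.
    pool = [denoise_step(latent, t, rng) for _ in range(pool_size)]
    # 2) Rank the pool and take a win-lose pair (assumption: best vs. worst).
    ranked = sorted(pool, key=lambda c: preference_score(c, t))
    lose, win = ranked[0], ranked[-1]
    # 3) Randomly select one candidate to initialize the next step.
    next_latent = rng.choice(pool)
    return win, lose, next_latent


def collect_pairs(num_steps=5, dim=4, seed=0):
    """Walk one toy trajectory and collect a (t, win, lose) triple per step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    pairs = []
    for t in reversed(range(num_steps)):
        win, lose, latent = spo_step(latent, t, rng)
        pairs.append((t, win, lose))
    return pairs


pairs = collect_pairs()
```

In a real training loop, each win-lose pair would feed a preference loss on the diffusion model at that timestep. The sketch only shows how pairs are gathered from a shared latent instead of from two separate trajectories.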

## Model Description

This model is fine-tuned from [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5). It was trained on 4,000 prompts for 10 epochs. This checkpoint is a LoRA checkpoint. We also provide a LoRA checkpoint compatible with [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui), which can be accessed [here](https://civitai.com/models/526379/spo-sd-v1-54k-p10eplorawebui).

If you want to access the merged checkpoint that combines the LoRA checkpoint with the base model [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), please visit [SPO-SD-v1-5_4k-p_10ep](https://huggingface.co/SPO-Diffusion-Models/SPO-SD-v1-5_4k-p_10ep).
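
For readers unfamiliar with what "merging" a LoRA checkpoint means, the general idea is that a low-rank update is added to each adapted base weight matrix. The sketch below illustrates that arithmetic on a tiny matrix in plain Python. The helper names (`matmul`, `merge_lora`), the toy shapes, and the single `scale` factor are assumptions for illustration only; this is not the procedure used to produce the merged checkpoint linked above.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_down, lora_up, scale=1.0):
    """Return weight + scale * (lora_up @ lora_down)."""
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 update.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_down = [[1.0, 2.0]]   # shape: rank x in_features
lora_up = [[0.5], [0.0]]   # shape: out_features x rank

merged = merge_lora(weight, lora_down, lora_up, scale=1.0)
```

After merging, the adapter no longer needs to be loaded separately, which is why the merged checkpoint can be used like an ordinary base model.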



## A quick example
```python
from diffusers import StableDiffusionPipeline
import torch

# load pipeline
inference_dtype = torch.float16
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=inference_dtype,
)
pipe.load_lora_weights("SPO-Diffusion-Models/SPO-SD-v1-5_4k-p_10ep_LoRA")
pipe.to('cuda')

# fix the random seed so the output is reproducible
generator = torch.Generator(device='cuda').manual_seed(42)
image = pipe(
    prompt='an image of a beautiful lake',
    generator=generator,
    guidance_scale=7.5,
    output_type='pil',
).images[0]
image.save('lake.png')
```

## Citation
If you find our work or codebase useful, please consider giving us a star and citing our work.
```bibtex
@article{liang2024step,
  title={Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization},
  author={Liang, Zhanhao and Yuan, Yuhui and Gu, Shuyang and Chen, Bohan and Hang, Tiankai and Cheng, Mingxi and Li, Ji and Zheng, Liang},
  journal={arXiv preprint arXiv:2406.04314},
  year={2024}
}
```