<!doctype html>
<html lang="en">
<!-- === Header Starts === -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Ctrl-X</title>
<link href="./assets/bootstrap.min.css" rel="stylesheet">
<link href="./assets/font.css" rel="stylesheet" type="text/css">
<link href="./assets/style.css" rel="stylesheet" type="text/css">
</head>
<!-- === Header Ends === -->
<body>
<!-- === Home Section Starts === -->
<div class="section">
<!-- === Title Starts === -->
<div class="header">
<div class="logo">
<a href="https://genforce.github.io/" target="_blank"><img src="./assets/genforce.png"></a>
</div>
<div class="title", style="padding-top: 25pt;"> <!-- Set padding as 10 if title is with two lines. -->
Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
</div>
</div>
<!-- === Title Ends === -->
<div class="author">
<a href="https://kuanhenglin.github.io" target="_blank">Kuan Heng Lin</a><sup>1</sup>*
<a href="https://sichengmo.github.io/" target="_blank">Sicheng Mo</a><sup>1</sup>*
<a href="https://bklingher.github.io" target="_blank">Ben Klingher</a><sup>1</sup>
<a href="https://pages.cs.wisc.edu/~fmu/" target="_blank">Fangzhou Mu</a><sup>2</sup>
<a href="https://boleizhou.github.io/" target="_blank">Bolei Zhou</a><sup>1</sup>
</div>
<div class="institution">
<sup>1</sup>UCLA
<sup>2</sup>NVIDIA
</div>
<div class="note">
*Equal contribution
</div>
<div class="title" style="font-size: 18pt;margin: 15pt 0 15pt 0">
NeurIPS 2024
</div>
<div class="link">
[<a href="https://arxiv.org/abs/2406.07540" target="_blank">Paper</a>]
[<a href="https://github.com/genforce/ctrl-x" target="_blank">Code</a>]
</div>
<div class="teaser">
<img src="assets/ctrl-x.jpg" width="85%">
</div>
</div>
<!-- === Home Section Ends === -->
<!-- === Overview Section Starts === -->
<div class="section">
<div class="title">Overview</div>
<div class="body">
We present <b>Ctrl-X</b>, a simple <i>training-free</i> and <i>guidance-free</i> framework for text-to-image (T2I) generation with structure and appearance control. Given user-provided structure and appearance images, Ctrl-X applies feedforward structure control to align the output with the structure image and semantic-aware appearance transfer to carry over the look of the appearance image. Ctrl-X supports novel structure control with arbitrary condition images of any modality, is significantly faster than prior training-free appearance transfer methods, and plugs instantly into any T2I and text-to-video (T2V) diffusion model.
<table width="100%" style="margin: 20pt 0; text-align: center;">
<tr>
<td><img src="assets/pipeline.jpg" width="85%"></td>
</tr>
</table>
<b>How does it work?</b> Given clean structure and appearance latents, we first obtain noised structure and appearance latents via the diffusion forward process, then extract their U-Net features from a pretrained T2I diffusion model. When denoising the output latent, we inject convolution and self-attention features from the structure latent and leverage self-attention correspondence to transfer spatially-aware appearance statistics from the appearance latent, achieving structure and appearance control; a minimal sketch of these two operations follows below. We name our method "Ctrl-X" because we reformulate controllable generation by 'cutting' (and 'pasting') structure preservation and semantic-aware stylization together.
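<br><br>
<b>A minimal sketch.</b> The PyTorch snippet below illustrates the two operations described above. It is a hypothetical sketch for intuition, not the official implementation (see the <a href="https://github.com/genforce/ctrl-x" target="_blank">code release</a> for that): the tensor names, shapes, and the exact form of the statistics matching are assumptions.
<pre>
# Illustrative sketch of Ctrl-X's two core operations; names, shapes, and the
# statistics-matching form are assumptions, not the official implementation.
import torch
import torch.nn.functional as F

def inject_structure(feat_out, feat_struct):
    # Feedforward structure control: overwrite the output branch's
    # convolution / self-attention features with the structure latent's.
    return feat_struct.clone()

def transfer_appearance(q_out, k_app, feat_out, feat_app, eps=1e-6):
    # Self-attention correspondence between output queries and appearance keys.
    d = q_out.shape[-1]
    attn = F.softmax(q_out @ k_app.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Correspondence-weighted, spatially-aware appearance statistics.
    mean_app = attn @ feat_app
    std_app = (attn @ feat_app.pow(2) - mean_app.pow(2)).clamp_min(eps).sqrt()
    # Normalize the output features, then re-style them with the
    # appearance statistics (AdaIN-style, but per spatial location).
    norm = (feat_out - feat_out.mean(dim=1, keepdim=True)) \
           / (feat_out.std(dim=1, keepdim=True) + eps)
    return norm * std_app + mean_app

# Toy run: batch 1, 64 spatial tokens, 320 channels (an SDXL-like layer).
q, k = torch.randn(1, 64, 320), torch.randn(1, 64, 320)
f_out, f_app = torch.randn(1, 64, 320), torch.randn(1, 64, 320)
print(transfer_appearance(q, k, f_out, f_app).shape)  # torch.Size([1, 64, 320])
</pre>
In practice, such feature injection and statistics transfer are typically applied only at selected U-Net layers and over part of the denoising schedule, trading off structure fidelity against appearance quality.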
</div>
</div>
<!-- === Overview Section Ends === -->
<!-- === Result Section Starts === -->
<div class="section">
<div class="title">Results: Structure and appearance control</div>
<div class="body">
Results of training-free and guidance-free T2I diffusion with structure and appearance control. Ctrl-X supports a wide variety of structure images, including natural images, ControlNet-supported conditions (e.g., canny edge maps, normal maps), and in-the-wild conditions (e.g., wireframes, 3D meshes). The base model here is <a href="https://arxiv.org/abs/2307.01952" target="_blank">Stable Diffusion XL v1.0</a>.
<table width="100%" style="margin: 20pt 0; text-align: center;">
<tr>
<td><img src="assets/results_struct+app.jpg" width="100%"></td>
</tr>
</table>
<table width="100%" style="margin: 20pt 0; text-align: center;">
<tr>
<td><img src="assets/results_struct+app_2.jpg" width="85%"></td>
</tr>
</table>
</div>
</div>
<div class="section">
<div class="title">Results: Multi-subject structure and appearance control</div>
<div class="body">
Ctrl-X is capable of multi-subject generation with semantic correspondence between appearance and structure images across both subjects and backgrounds. In comparison, <a href="https://arxiv.org/abs/2302.05543" target="_blank">ControlNet</a> + <a href="https://arxiv.org/abs/2308.06721" target="_blank">IP-Adapter</a> often fails to transfer all subject and background appearances.
<table width="100%" style="margin: 20pt 0; text-align: center;">
<tr>
<td><img src="assets/results_multi_subject.jpg" width="90%"></td>
</tr>
</table>
</div>
</div>
<div class="section">
<div class="title">Results: Prompt-driven conditional generation</div>
<div class="body">
Ctrl-X also supports prompt-driven conditional generation: it generates an output image that complies with the given text prompt while aligning with the structure image. Any structure image or condition type remains supported here. The base model here is <a href="https://arxiv.org/abs/2307.01952" target="_blank">Stable Diffusion XL v1.0</a>.
<table width="100%" style="margin: 20pt 0; text-align: center;">
<tr>
<td><img src="assets/results_struct+prompt.jpg" width="100%"></td>
</tr>
</table>
</div>
</div>
<div class="section">
<div class="title">Results: Extension to video generation</div>
<div class="body">
Ctrl-X applies directly to text-to-video (T2V) models. Here we show results from <a href="https://animatediff.github.io/" target="_blank">AnimateDiff v1.5.3</a> (with base model <a href="https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE" target="_blank">Realistic Vision v5.1</a>).
<div style="position: relative; padding-top: 50%; margin: 20pt 0; text-align: center;">
<iframe src="assets/results_animatediff.mp4" frameborder="0"
style="position: absolute; top: 2.5%; left: 0%; width: 100%; height: 100%;"
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
</div>
</div>
</div>
<!-- === Result Section Ends === -->
<!-- === Reference Section Starts === -->
<div class="section">
<div class="bibtex">BibTeX</div>
<pre>
@inproceedings{lin2024ctrlx,
author = {Lin, {Kuan Heng} and Mo, Sicheng and Klingher, Ben and Mu, Fangzhou and Zhou, Bolei},
booktitle = {Advances in Neural Information Processing Systems},
title = {Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance},
year = {2024}
}
</pre>
<div class="ref">Related Work</div>
<div class="citation">
<div class="image"><img src="assets/freecontrol.jpg"></div>
<div class="comment">
<a href="https://genforce.github.io/freecontrol/" target="_blank">
Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou.
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition.
CVPR 2024.</a><br>
<b>Comment:</b>
Training-free conditional generation by guidance in diffusion U-Net subspaces for structure control and appearance regularization.
</div>
</div>
<div class="citation">
<div class="image"><img src="assets/cross_image_attention.jpg"></div>
<div class="comment">
<a href="https://garibida.github.io/cross-image-attention/" target="_blank">
Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or.
Cross-Image Attention for Zero-Shot Appearance Transfer.
SIGGRAPH 2024.</a><br>
<b>Comment:</b>
Guidance-free appearance transfer to natural images with self-attention key + value swaps via cross-image correspondence.
</div>
</div>
</div>
<!-- === Reference Section Ends === -->
</body>
</html>