Kandinsky

[[open-in-colab]]

Kandinsky ๋ชจ๋ธ์€ ์ผ๋ จ์˜ ๋‹ค๊ตญ์–ด text-to-image ์ƒ์„ฑ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. Kandinsky 2.0 ๋ชจ๋ธ์€ ๋‘ ๊ฐœ์˜ ๋‹ค๊ตญ์–ด ํ…์ŠคํŠธ ์ธ์ฝ”๋”๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ์—ฐ๊ฒฐํ•ด UNet์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

Kandinsky 2.1์€ ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ ๊ฐ„์˜ ๋งคํ•‘์„ ์ƒ์„ฑํ•˜๋Š” image prior ๋ชจ๋ธ(CLIP)์„ ํฌํ•จํ•˜๋„๋ก ์•„ํ‚คํ…์ฒ˜๋ฅผ ๋ณ€๊ฒฝํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋งคํ•‘์€ ๋” ๋‚˜์€ text-image alignment๋ฅผ ์ œ๊ณตํ•˜๋ฉฐ, ํ•™์Šต ์ค‘์— ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜์–ด ๋” ๋†’์€ ํ’ˆ์งˆ์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, Kandinsky 2.1์€ spatial conditional ์ •๊ทœํ™” ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์‚ฌ์‹ค๊ฐ์„ ๋†’์—ฌ์ฃผ๋Š” Modulating Quantized Vectors (MoVQ) ๋””์ฝ”๋”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ latents๋ฅผ ์ด๋ฏธ์ง€๋กœ ๋””์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค.

Kandinsky 2.2 improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.
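
Because the 2.2 prior saw multiple aspect ratios during training, non-square outputs work well. A small sketch (the prompt and output size here are just an illustration):

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

# a 2:3 portrait image instead of the usual 768x768 square
image = pipeline(prompt="A robot portrait, 4k photo", height=1152, width=768).images[0]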

Kandinsky 3 simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses Flan-UL2 to encode text, a UNet with BigGan-deep blocks, and Sber-MoVQGAN to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” text-to-image, image-to-image, ์ธํŽ˜์ธํŒ…, ๋ณด๊ฐ„ ๋“ฑ์„ ์œ„ํ•ด Kandinsky ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

Before you begin, make sure you have the following libraries installed:

# Colab์—์„œ ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ์„์„ ์ œ์™ธํ•˜์„ธ์š”
#!pip install -q diffusers transformers accelerate

Kandinsky 2.1 and 2.2 usage is very similar! The only difference is that Kandinsky 2.2 doesn't accept prompts as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts image_embeds during decoding.
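
The difference shows up in the decoder calls; a schematic comparison (the embeddings come from the prior pipelines shown below):

# Kandinsky 2.1 decoder call: prompt *and* embeddings
# image = pipeline(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]

# Kandinsky 2.2 decoder call: embeddings only
# image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]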


Kandinsky 3 has a more concise architecture and doesn't require a prior model. This means its usage is identical to other diffusion models like Stable Diffusion XL.

Text-to-image

To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates negative_image_embeds that correspond to the negative prompt "". For better results, you can pass an actual negative_prompt to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x.

from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # including a negative prompt is optional, but it usually improves results
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
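
If you prefer named access over to_tuple(), the prior call returns an output object whose fields you can read directly (a small sketch; field names assumed from the tuple above):

prior_output = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0)
image_embeds = prior_output.image_embeds
negative_image_embeds = prior_output.negative_image_embeds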

Now pass all the prompts and embeddings to the [KandinskyPipeline] to generate an image:

image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
image

The process is the same for Kandinsky 2.2, with different checkpoints and pipeline classes:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # including a negative prompt is optional, but it usually improves results
image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()

์ด๋ฏธ์ง€ ์ƒ์„ฑ์„ ์œ„ํ•ด image_embeds์™€ negative_image_embeds๋ฅผ [KandinskyV22Pipeline]์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค:

image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
image

Kandinsky 3 doesn't require a prior model, so you can load the [Kandinsky3Pipeline] directly and pass a prompt to generate an image:

from diffusers import Kandinsky3Pipeline
import torch

pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
image = pipeline(prompt).images[0]
image
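
Like other Diffusers pipelines, all of the calls above accept a generator for reproducible results (a minimal sketch, reusing the pipeline from the previous block; the seed is arbitrary):

generator = torch.Generator(device="cpu").manual_seed(31337)
image = pipeline(prompt, generator=generator).images[0]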

๐Ÿค— Diffusers also offers an end-to-end API with the [KandinskyCombinedPipeline] and [KandinskyV22CombinedPipeline], meaning you don't have to separately load the prior and text-to-image pipelines. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the prior_guidance_scale and prior_num_inference_steps parameters if you want.

๋‚ด๋ถ€์—์„œ ๊ฒฐํ•ฉ๋œ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ž๋™์œผ๋กœ ํ˜ธ์ถœํ•˜๋ ค๋ฉด [AutoPipelineForText2Image]๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image

And for Kandinsky 2.2:

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image

Image-to-image

For image-to-image, pass the initial image and a text prompt to condition the pipeline on the image. Start by loading the prior pipeline:

import torch
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

For Kandinsky 2.2:

import torch
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

Kandinsky 3 doesn't require a prior model, so you can load the image-to-image pipeline directly:

from diffusers import Kandinsky3Img2ImgPipeline
from diffusers.utils import load_image
import torch

pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

Download an image to condition on:

from diffusers.utils import load_image

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

Prior ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ image_embeds์™€ negative_image_embeds๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()

์ด์ œ ์›๋ณธ ์ด๋ฏธ์ง€์™€ ๋ชจ๋“  ํ”„๋กฌํ”„ํŠธ ๋ฐ ์ž„๋ฒ ๋”ฉ์„ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ์ „๋‹ฌํ•˜์—ฌ ์ด๋ฏธ์ง€๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

from diffusers.utils import make_image_grid

image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

For Kandinsky 2.2:

from diffusers.utils import make_image_grid

image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

And for Kandinsky 3:

image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, strength=0.75, num_inference_steps=25).images[0]
image

๐Ÿค— Diffusers also offers an end-to-end API with the [KandinskyImg2ImgCombinedPipeline] and [KandinskyV22Img2ImgCombinedPipeline], meaning you don't have to separately load the prior and image-to-image pipelines. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the prior_guidance_scale and prior_num_inference_steps parameters if you want.

๋‚ด๋ถ€์—์„œ ๊ฒฐํ•ฉ๋œ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ž๋™์œผ๋กœ ํ˜ธ์ถœํ•˜๋ ค๋ฉด [AutoPipelineForImage2Image]๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
pipeline.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)

original_image.thumbnail((768, 768))

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

For Kandinsky 2.2:

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)

original_image.thumbnail((768, 768))

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

Inpainting

โš ๏ธ Kandinsky ๋ชจ๋ธ์€ ์ด์ œ ๊ฒ€์€์ƒ‰ ํ”ฝ์…€ ๋Œ€์‹  โฌœ๏ธ ํฐ์ƒ‰ ํ”ฝ์…€์„ ์‚ฌ์šฉํ•˜์—ฌ ๋งˆ์Šคํฌ ์˜์—ญ์„ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๋•์…˜์—์„œ [KandinskyInpaintPipeline]์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ํฐ์ƒ‰ ํ”ฝ์…€์„ ์‚ฌ์šฉํ•˜๋„๋ก ๋งˆ์Šคํฌ๋ฅผ ๋ณ€๊ฒฝํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)

# For PyTorch and NumPy input
mask = 1 - mask
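
A self-contained sketch of both conversions with a synthetic mask (the shapes and values here are only for illustration):

import numpy as np
import PIL.Image
import PIL.ImageOps

# a black-pixel mask: 0 marks the area to inpaint, 255 the area to keep
old_mask = PIL.Image.fromarray(np.full((64, 64), 255, dtype=np.uint8))
new_mask = PIL.ImageOps.invert(old_mask)  # white pixels now mark the inpaint area

# the NumPy/PyTorch equivalent for float masks in [0, 1]
old_array = np.zeros((64, 64), dtype=np.float32)
new_array = 1 - old_array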

์ธํŽ˜์ธํŒ…์—์„œ๋Š” ์›๋ณธ ์ด๋ฏธ์ง€, ์›๋ณธ ์ด๋ฏธ์ง€์—์„œ ๋Œ€์ฒดํ•  ์˜์—ญ์˜ ๋งˆ์Šคํฌ, ์ธํŽ˜์ธํŒ…ํ•  ๋‚ด์šฉ์— ๋Œ€ํ•œ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. Prior ํŒŒ์ดํ”„๋ผ์ธ์„ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค:

from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

For Kandinsky 2.2:

from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

์ดˆ๊ธฐ ์ด๋ฏธ์ง€๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ๋งˆ์Šคํฌ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1

Prior ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

prompt = "a hat"
prior_output = prior_pipeline(prompt)
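
prior_output behaves like a mapping over its fields, so **prior_output in the calls below is shorthand for passing the embeddings explicitly; an equivalent sketch:

output_image = pipeline(prompt, image=init_image, mask_image=mask, image_embeds=prior_output.image_embeds, negative_image_embeds=prior_output.negative_image_embeds, height=768, width=768, num_inference_steps=150).images[0]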

์ด์ œ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์„ ์œ„ํ•ด ์ดˆ๊ธฐ ์ด๋ฏธ์ง€, ๋งˆ์Šคํฌ, ํ”„๋กฌํ”„ํŠธ์™€ ์ž„๋ฒ ๋”ฉ์„ ํŒŒ์ดํ”„๋ผ์ธ์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค:

output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

For Kandinsky 2.2:

output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

You can also use the end-to-end [KandinskyInpaintCombinedPipeline] and [KandinskyV22InpaintCombinedPipeline] to call the prior and decoder pipelines together under the hood. Use the [AutoPipelineForInpainting] for this:

import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# ๊ณ ์–‘์ด ๋จธ๋ฆฌ ์œ„ ๋งˆ์Šคํฌ ์ง€์—ญ
mask[:250, 250:-250] = 1
prompt = "a hat"

output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

For Kandinsky 2.2:

import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# ๊ณ ์–‘์ด ๋จธ๋ฆฌ ์œ„ ๋งˆ์Šคํฌ ์˜์—ญ
mask[:250, 250:-250] = 1
prompt = "a hat"

output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

Interpolation

Interpolation(๋ณด๊ฐ„)์„ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ ์‚ฌ์ด์˜ latent space๋ฅผ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ์–ด prior ๋ชจ๋ธ์˜ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฌผ์„ ๋ณผ ์ˆ˜ ์žˆ๋Š” ๋ฉ‹์ง„ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. Prior ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๋ณด๊ฐ„ํ•˜๋ ค๋Š” ๋‘ ๊ฐœ์˜ ์ด๋ฏธ์ง€๋ฅผ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค:

from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)

For Kandinsky 2.2:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
a cat (left) and Van Gogh's Starry Night painting (right)

Specify the text or images to interpolate and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation!

images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4]

Call the interpolate function to generate the embeddings, and then pass them to the pipeline to generate the image:

# the prompt can be left empty
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
image

For Kandinsky 2.2:

# the prompt can be left empty
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
image

ControlNet

โš ๏ธ ControlNet์€ Kandinsky 2.2์—์„œ๋งŒ ์ง€์›๋ฉ๋‹ˆ๋‹ค!

ControlNet์„ ์‚ฌ์šฉํ•˜๋ฉด depth map์ด๋‚˜ edge detection์™€ ๊ฐ™์€ ์ถ”๊ฐ€ ์ž…๋ ฅ์„ ํ†ตํ•ด ์‚ฌ์ „ํ•™์Šต๋œ large diffusion ๋ชจ๋ธ์„ conditioningํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ชจ๋ธ์ด depth map์˜ ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๊ณ  ๋ณด์กดํ•  ์ˆ˜ ์žˆ๋„๋ก ๊นŠ์ด ๋งต์œผ๋กœ Kandinsky 2.2๋ฅผ conditioningํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Let's load an image and extract its depth map:

from diffusers.utils import load_image

img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
img

๊ทธ๋Ÿฐ ๋‹ค์Œ ๐Ÿค— Transformers์˜ depth-estimation [~transformers.Pipeline]์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•ด depth map์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

import torch
import numpy as np

from transformers import pipeline

def make_hint(image, depth_estimator):
    # run depth estimation and convert the PIL depth map to a NumPy array
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    # replicate the single depth channel into a 3-channel image
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    # scale to [0, 1] and reorder to channels-first (C, H, W)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint

depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")

Text-to-image [[controlnet-text-to-image]]

Prior ํŒŒ์ดํ”„๋ผ์ธ๊ณผ [KandinskyV22ControlnetPipeline]๋ฅผ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

Generate the image embeddings from a prompt and a negative prompt:

prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator(device="cuda").manual_seed(43)

image_emb, zero_image_emb = prior_pipeline(
    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()

Finally, pass the image embeddings and the depth image to the [KandinskyV22ControlnetPipeline] to generate an image:

image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
image

Image-to-image [[controlnet-image-to-image]]

ControlNet์„ ์‚ฌ์šฉํ•œ image-to-image์˜ ๊ฒฝ์šฐ, ๋‹ค์Œ์„ ์‚ฌ์šฉํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • [KandinskyV22PriorEmb2EmbPipeline]๋กœ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ์™€ ์ด๋ฏธ์ง€์—์„œ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • [KandinskyV22ControlnetImg2ImgPipeline]๋กœ ์ดˆ๊ธฐ ์ด๋ฏธ์ง€์™€ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ์—์„œ ์ด๋ฏธ์ง€๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

๐Ÿค— Transformers์—์„œ depth-estimation [~transformers.Pipeline]์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ์–‘์ด์˜ ์ดˆ๊ธฐ ์ด๋ฏธ์ง€์˜ depth map์„ ์ฒ˜๋ฆฌํ•ด ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค:

import torch
import numpy as np

from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
from diffusers.utils import load_image, make_image_grid
from transformers import pipeline

img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))

def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint

depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")

Prior ํŒŒ์ดํ”„๋ผ์ธ๊ณผ [KandinskyV22ControlnetImg2ImgPipeline]์„ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค:

prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

Pass the text prompt and the initial image to the prior pipeline to generate the image embeddings:

prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator(device="cuda").manual_seed(43)

img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)

Now you can run the [KandinskyV22ControlnetImg2ImgPipeline] to generate an image from the initial image and the image embeddings:

image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

์ตœ์ ํ™”

Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Most of the computation happens in the second pipeline, so that is where optimization efforts should be focused. Here are some tips to improve Kandinsky during inference.

  1. Enable xFormers if you're using PyTorch < 2.0:
  from diffusers import DiffusionPipeline
  import torch

  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_xformers_memory_efficient_attention()
  2. Enable torch.compile if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
  pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

This is the same as explicitly setting the attention processor to use [~models.attention_processor.AttnAddedKVProcessor2_0]:

from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
  3. Offload the model to the CPU with [~KandinskyPriorPipeline.enable_model_cpu_offload] to avoid out-of-memory errors:
  from diffusers import DiffusionPipeline
  import torch

  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
  1. ๊ธฐ๋ณธ์ ์œผ๋กœ text-to-image ํŒŒ์ดํ”„๋ผ์ธ์€ [DDIMScheduler]๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ, [DDPMScheduler]์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ์Šค์ผ€์ค„๋Ÿฌ๋กœ ๋Œ€์ฒดํ•˜์—ฌ ์ถ”๋ก  ์†๋„์™€ ์ด๋ฏธ์ง€ ํ’ˆ์งˆ ๊ฐ„์˜ ๊ท ํ˜•์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
from diffusers import DDPMScheduler, DiffusionPipeline
import torch

scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")