# Stable Video Diffusion

[[open-in-colab]]

Stable Video Diffusion (SVD) is a powerful image-to-video generation model that can generate a 2-4 second high-resolution (576x1024) video conditioned on an input image.

이 κ°€μ΄λ“œμ—μ„œλŠ” SVDλ₯Ό μ‚¬μš©ν•˜μ—¬ μ΄λ―Έμ§€μ—μ„œ 짧은 λ™μ˜μƒμ„ μƒμ„±ν•˜λŠ” 방법을 μ„€λͺ…ν•©λ‹ˆλ‹€.

Before you begin, make sure you have the following libraries installed:

```py
!pip install -q -U diffusers transformers accelerate
```

이 λͺ¨λΈμ—λŠ” SVD와 SVD-XT 두 가지 μ’…λ₯˜κ°€ μžˆμŠ΅λ‹ˆλ‹€. SVD μ²΄ν¬ν¬μΈνŠΈλŠ” 14개의 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λ„λ‘ ν•™μŠ΅λ˜μ—ˆκ³ , SVD-XT μ²΄ν¬ν¬μΈνŠΈλŠ” 25개의 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λ„λ‘ νŒŒμΈνŠœλ‹λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

이 κ°€μ΄λ“œμ—μ„œλŠ” SVD-XT 체크포인트λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.

```py
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```
*Source image of a rocket, and the video generated from it.*

## torch.compile

Compiling the UNet slightly increases memory usage, but gives a 20-25% speedup.

```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

## Reduce memory usage

Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for lower memory usage:

- Enable model offloading: each component of the pipeline is offloaded to the CPU once it's no longer needed.
- Enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
- Reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (it's best to adjust this value based on your GPU memory), but the video might flicker slightly.
```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```
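To get a feel for what `decode_chunk_size` does, here is a small pure-Python sketch of how the VAE's decoding work is split into chunks (illustrative only, not the pipeline's actual code):

```python
def chunk_frames(num_frames, decode_chunk_size):
    """Split frame indices into the batches the VAE decodes per pass."""
    indices = list(range(num_frames))
    return [indices[i:i + decode_chunk_size] for i in range(0, num_frames, decode_chunk_size)]

# 25 frames decoded 8 at a time: 4 passes (8 + 8 + 8 + 1 frames)
print([len(c) for c in chunk_frames(25, 8)])  # [8, 8, 8, 1]
# decode_chunk_size=2 lowers peak decode memory further, at the cost of 13 passes
print(len(chunk_frames(25, 2)))               # 13
```

Fewer frames per pass means a smaller decode batch in memory, which is why lowering `decode_chunk_size` reduces VRAM usage while slowing inference down.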

μ΄λŸ¬ν•œ λͺ¨λ“  방법듀을 μ‚¬μš©ν•˜λ©΄ λ©”λͺ¨λ¦¬ μ‚¬μš©λŸ‰μ΄ 8GAM VRAM보닀 적을 κ²ƒμž…λ‹ˆλ‹€.

## Micro-conditioning

Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:

- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. It can be used to control the motion of the generated video; increasing the motion bucket id increases the motion of the generated video.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
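Conceptually, `noise_aug_strength` scales Gaussian noise that is added to the conditioning image before it is encoded. A rough NumPy sketch of that idea (an illustration of the concept, not the pipeline's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in conditioning image with values in [0, 1]
image = rng.random((576, 1024, 3)).astype(np.float32)

def noise_augment(image, noise_aug_strength, rng):
    # Add Gaussian noise scaled by noise_aug_strength; larger values drift
    # further from the original image, so the video resembles it less.
    noise = rng.standard_normal(image.shape).astype(np.float32)
    return image + noise_aug_strength * noise

mild = noise_augment(image, 0.02, rng)
strong = noise_augment(image, 0.10, rng)
# The stronger augmentation deviates more from the source image on average
print(np.abs(mild - image).mean() < np.abs(strong - image).mean())  # True
```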

For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:

```py
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```