HunyuanVideo
HunyuanVideo is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and to enable wider community access. It uses a "dual-stream to single-stream" architecture: video and text tokens are first processed independently in separate streams, then concatenated and processed jointly so the multimodal information is fused. A pretrained multimodal large language model (MLLM) serves as the text encoder because it offers better image-text alignment, better image detail description and reasoning, and can act as a zero-shot learner when system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to process video data more efficiently at the original resolution and frame rate.
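As a rough sketch of how these pieces map onto Diffusers classes (assuming the hunyuanvideo-community checkpoint layout used in the examples below), the transformer and VAE can also be loaded individually:

import torch
from diffusers import AutoencoderKLHunyuanVideo, HunyuanVideoTransformer3DModel

# the "dual-stream to single-stream" diffusion transformer
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# the 3D causal VAE that compresses video at the original resolution and frame rate
vae = AutoencoderKLHunyuanVideo.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16
)

# the MLLM text encoder (and a secondary CLIP text encoder) live in the same repository
# and are loaded automatically by HunyuanVideoPipeline.from_pretrained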
You can find all the original HunyuanVideo checkpoints under the Tencent organization.
Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.
The examples below demonstrate how to generate a video optimized either for memory or for inference speed.
Refer to the Reduce memory usage guide for more details about the various memory saving techniques.
The quantized HunyuanVideo model below requires ~14GB of VRAM.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
torch.compile can speed up inference by compiling the transformer into optimized kernels. Compilation is slow the first time, but subsequent calls to the pipeline are faster.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# compile the transformer; the first call is slow while optimized kernels are generated
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
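Because compilation only happens on the first call, its cost can be amortized by reusing the compiled pipeline for several prompts. A minimal sketch (the second prompt is only an illustration; keeping the resolution, frame count, and step count fixed avoids recompilation):

# subsequent calls reuse the compiled transformer as long as tensor shapes stay the same
prompts = [
    "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.",
    "A corgi running across a sunny meadow in slow motion.",
]
for i, prompt in enumerate(prompts):
    video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
    export_to_video(video, f"output_{i}.mp4", fps=15)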
Notes
HunyuanVideo supports LoRAs with [~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights], as shown in the example below.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# load LoRA weights
pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie", 0.9)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# use "In the style of SWR" to trigger the LoRA
prompt = """
In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
"""

video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
Refer to the table below for recommended inference values.
| parameter | recommended value |
|---|---|
| text encoder dtype | torch.float16 |
| transformer dtype | torch.bfloat16 |
| vae dtype | torch.float16 |
| num_frames (k) | 4 * k + 1 |

Try lower shift values (2.0 to 5.0) for lower resolution videos and higher shift values (7.0 to 12.0) for higher resolution videos.
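A sketch that follows these recommendations, assuming the checkpoint ships a flow-match scheduler whose shift is exposed as a config value (the shift=10.0 and k=15 values below are illustrative):

import torch
from diffusers import (
    FlowMatchEulerDiscreteScheduler,
    HunyuanVideoPipeline,
    HunyuanVideoTransformer3DModel,
)

# transformer in bfloat16, text encoders and VAE in float16 (per the table above)
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)

# num_frames should satisfy 4 * k + 1, e.g. k = 15 -> 61 frames
k = 15
num_frames = 4 * k + 1

# raise shift for higher resolution videos, lower it for lower resolutions
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=10.0
)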
HunyuanVideoPipeline
[[autodoc]] HunyuanVideoPipeline
- all
- __call__
HunyuanVideoPipelineOutput
[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput