
HunyuanVideo

HunyuanVideo is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models while enabling wider community access. It uses a "dual-stream to single-stream" architecture: video and text tokens are first processed independently in separate streams, then concatenated and passed through a single stream of transformer blocks to fuse the multimodal information. A pretrained multimodal large language model (MLLM) serves as the text encoder because it provides better image-text alignment, richer image detail description and reasoning, and can act as a zero-shot learner when system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

You can find all the original HunyuanVideo checkpoints under the Tencent organization.

Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.

The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.
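
The architecture pieces described above (the MLLM text encoder, the dual-stream to single-stream transformer, and the 3D causal VAE) are exposed as separate pipeline components. The snippet below is an illustrative sketch rather than part of the original guide; it simply loads the community checkpoint and prints which class implements each component.

import torch
from diffusers import HunyuanVideoPipeline

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)

# each architectural piece maps to its own pipeline component
print(type(pipeline.text_encoder).__name__)  # MLLM text encoder
print(type(pipeline.transformer).__name__)   # dual-stream to single-stream video transformer
print(type(pipeline.vae).__name__)           # 3D causal VAE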

The examples below demonstrate how to generate a video optimized for memory or for inference speed.

Refer to the Reduce memory usage guide for more details about the various memory saving techniques.

The quantized HunyuanVideo model below requires ~14GB of VRAM.

import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
      "load_in_4bit": True,
      "bnb_4bit_quant_type": "nf4",
      "bnb_4bit_compute_dtype": torch.bfloat16
      },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
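
If ~14GB is still too much, the same configuration can also quantize the large Llama-based text encoder. The variant below is an illustrative sketch rather than part of the original example; it assumes the component is named "text_encoder", as in the hunyuanvideo-community layout, and trades some text-encoding precision for additional VRAM savings.

import torch
from diffusers.quantizers import PipelineQuantizationConfig

# sketch: quantize the text encoder in addition to the transformer for extra VRAM savings
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
      "load_in_4bit": True,
      "bnb_4bit_quant_type": "nf4",
      "bnb_4bit_compute_dtype": torch.bfloat16
      },
    components_to_quantize=["transformer", "text_encoder"]
)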

Compiling the transformer with torch.compile is slow the first time, but subsequent calls to the pipeline are faster.

import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
      "load_in_4bit": True,
      "bnb_4bit_quant_type": "nf4",
      "bnb_4bit_compute_dtype": torch.bfloat16
      },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
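
The compile settings above are a trade-off: mode="max-autotune" spends extra time benchmarking kernel choices (and may use CUDA graphs) in exchange for faster steady-state inference, while fullgraph=True makes torch.compile raise an error instead of silently falling back if it encounters a graph break. Changing the resolution or num_frames between calls can trigger recompilation, so keep them fixed where possible.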

Notes

  • HunyuanVideo supports LoRAs with [~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights].

    Example:
    import torch
    from diffusers import AutoModel, HunyuanVideoPipeline
    from diffusers.quantizers import PipelineQuantizationConfig
    from diffusers.utils import export_to_video
    
    # quantize weights to int4 with bitsandbytes
    pipeline_quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
          "load_in_4bit": True,
          "bnb_4bit_quant_type": "nf4",
          "bnb_4bit_compute_dtype": torch.bfloat16
          },
        components_to_quantize=["transformer"]
    )
    
    pipeline = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo",
        quantization_config=pipeline_quant_config,
        torch_dtype=torch.bfloat16,
    )
    
    # load LoRA weights
    pipeline.load_lora_weights("lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
    pipeline.set_adapters("steamboat-willie", 0.9)
    
    # model-offloading and tiling
    pipeline.enable_model_cpu_offload()
    pipeline.vae.enable_tiling()
    
    # use "In the style of SWR" to trigger the LoRA
    prompt = """
    In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
    """
    video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
    export_to_video(video, "output.mp4", fps=15)
    
  • Refer to the table below for recommended inference values.

    | parameter          | recommended value |
    |--------------------|-------------------|
    | text encoder dtype | torch.float16     |
    | transformer dtype  | torch.bfloat16    |
    | vae dtype          | torch.float16     |
    | num_frames (k)     | 4 * k + 1         |
  • Try lower shift values (2.0 to 5.0) for lower resolution videos and higher shift values (7.0 to 12.0) for higher resolution videos, as in the sketch after this list.
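
The sketch below (not part of the original notes) shows one way to apply these recommendations; the shift value of 7.0 and the use of FlowMatchEulerDiscreteScheduler.from_config are assumptions based on the pipeline's default flow-matching scheduler.

import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline

# load the text encoders and VAE in float16, then cast the transformer to bfloat16
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16
)
pipeline.transformer = pipeline.transformer.to(torch.bfloat16)

# raise the scheduler shift for higher resolution videos (assumed value in the 7.0 to 12.0 range)
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=7.0
)

# num_frames should follow 4 * k + 1, e.g. k = 15 gives 61 frames
num_frames = 4 * 15 + 1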

HunyuanVideoPipeline

[[autodoc]] HunyuanVideoPipeline

  • all
  • call

HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput