# HunyuanVideo

[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the text encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

The examples below demonstrate how to generate a video optimized for memory or for inference speed.

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. The quantized HunyuanVideo model below requires ~14GB of VRAM.

```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model offloading and VAE tiling to reduce peak memory
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```
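If memory is still tight, the MLLM text encoder (itself a multi-billion parameter model) can be quantized alongside the transformer. The snippet below is a minimal sketch of that variant, assuming the same setup as above; the only change is the `components_to_quantize` list, and the rest of the generation code stays the same.

```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# quantize the MLLM text encoder in addition to the transformer
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer", "text_encoder"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model offloading and VAE tiling to reduce peak memory
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
```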
[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.

```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model offloading and VAE tiling to reduce peak memory
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# compile the transformer; the first call is slow, subsequent calls are faster
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```
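The components described in the introduction (the diffusion transformer, the 3D causal VAE, and the text encoders) can also be loaded individually and passed to the pipeline. The sketch below is one way to do this, assuming the standard component layout of the `hunyuanvideo-community/HunyuanVideo` checkpoint; it loads the transformer and VAE explicitly with the per-component dtypes recommended in the Notes section below.

```py
import torch
from diffusers import (
    AutoencoderKLHunyuanVideo,
    HunyuanVideoPipeline,
    HunyuanVideoTransformer3DModel,
)

# load the diffusion transformer and 3D causal VAE with the recommended dtypes
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
vae = AutoencoderKLHunyuanVideo.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16
)

# the remaining components (text encoders, tokenizers, scheduler) are loaded from the repository
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
```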
## Notes

- HunyuanVideo supports LoRAs with [`~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights`].

  <details>
  <summary>Show example code</summary>

  ```py
  import torch
  from diffusers import HunyuanVideoPipeline
  from diffusers.quantizers import PipelineQuantizationConfig
  from diffusers.utils import export_to_video

  # quantize weights to int4 with bitsandbytes
  pipeline_quant_config = PipelineQuantizationConfig(
      quant_backend="bitsandbytes_4bit",
      quant_kwargs={
          "load_in_4bit": True,
          "bnb_4bit_quant_type": "nf4",
          "bnb_4bit_compute_dtype": torch.bfloat16
      },
      components_to_quantize=["transformer"]
  )

  pipeline = HunyuanVideoPipeline.from_pretrained(
      "hunyuanvideo-community/HunyuanVideo",
      quantization_config=pipeline_quant_config,
      torch_dtype=torch.bfloat16,
  )

  # load LoRA weights
  pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
  pipeline.set_adapters("steamboat-willie", 0.9)

  # model offloading and VAE tiling to reduce peak memory
  pipeline.enable_model_cpu_offload()
  pipeline.vae.enable_tiling()

  # use "In the style of SWR" to trigger the LoRA
  prompt = """
  In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
  """

  video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
  export_to_video(video, "output.mp4", fps=15)
  ```

  </details>
- Refer to the table below for recommended inference values.

  | parameter | recommended value |
  |---|---|
  | text encoder dtype | `torch.float16` |
  | transformer dtype | `torch.bfloat16` |
  | vae dtype | `torch.float16` |
  | `num_frames (k)` | 4 * `k` + 1 |

- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.

## HunyuanVideoPipeline

[[autodoc]] HunyuanVideoPipeline
  - all
  - __call__

## HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput