# CogVideoX [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI. The abstract from the paper is: *We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficently model video data, we propose to levearge a 3D Variational Autoencoder (VAE) to compresses videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effectively text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.* Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). ## Inference Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. First, load the pipeline: ```python import torch from diffusers import CogVideoXPipeline from diffusers.utils import export_to_video pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda") prompt = ( "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. " "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other " "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, " "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. " "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical " "atmosphere of this unique musical performance." ) video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] export_to_video(video, "output.mp4", fps=8) ``` Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`: ```python pipeline.transformer.to(memory_format=torch.channels_last) pipeline.vae.to(memory_format=torch.channels_last) ``` Finally, compile the components and run inference: ```python pipeline.transformer = torch.compile(pipeline.transformer) pipeline.vae.decode = torch.compile(pipeline.vae.decode) # CogVideoX works very well with long and well-described prompts prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance." video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] ``` The [benchmark](TODO: link) results on an 80GB A100 machine are: ``` Without torch.compile(): Average inference time: TODO seconds. With torch.compile(): Average inference time: TODO seconds. ``` ## CogVideoXPipeline [[autodoc]] CogVideoXPipeline - all - __call__ ## CogVideoXPipelineOutput [[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput