# Tune-A-Video

This repository is the official implementation of [Tune-A-Video](https://arxiv.org/abs/2212.11565).

**[Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565)**
[Jay Zhangjie Wu](https://zhangjiewu.github.io/), [Yixiao Ge](https://geyixiao.com/), [Xintao Wang](https://xinntao.github.io/), [Stan Weixian Lei](), [Yuchao Gu](https://ycgu.site/), [Wynne Hsu](https://www.comp.nus.edu.sg/~whsu/), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en), [Xiaohu Qie](https://scholar.google.com/citations?user=mk-F69UAAAAJ&hl=en), [Mike Zheng Shou](https://sites.google.com/view/showlab)
[Project Page](https://tuneavideo.github.io/) | [arXiv](https://arxiv.org/abs/2212.11565)

## Setup

### Requirements

```shell
pip install -r requirements.txt
```

Installing [xformers](https://github.com/facebookresearch/xformers) is highly recommended for better efficiency and speed on GPUs. To enable xformers, set `enable_xformers_memory_efficient_attention=True` (the default).

### Weights

You can download the pre-trained [Stable Diffusion](https://arxiv.org/abs/2112.10752) weights (e.g., [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)):

```shell
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
```

Alternatively, you can use a personalized [DreamBooth](https://arxiv.org/abs/2208.12242) model (e.g., [mr-potato-head](https://huggingface.co/sd-dreambooth-library/mr-potato-head)):

```shell
git lfs install
git clone https://huggingface.co/sd-dreambooth-library/mr-potato-head
```

## Training

To fine-tune a text-to-image diffusion model for text-to-video generation, run this command:

```shell
accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
```

## Inference

Once training is done, run inference:

```python
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "path-to-your-trained-model"
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")
```
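Once the pipeline is loaded, it can be reused for several editing prompts in the same session. The sketch below continues from the snippet above (it assumes `pipe` and `save_videos_grid` are already defined); the prompt list and `outputs/` directory are illustrative, not part of the repository.

```python
import os

# Example editing prompts (illustrative only; replace with your own).
prompts = [
    "a panda is surfing",
    "Iron Man is surfing in the desert",
]

os.makedirs("outputs", exist_ok=True)
for prompt in prompts:
    # Same sampling settings as in the snippet above.
    video = pipe(prompt, video_length=8, height=512, width=512,
                 num_inference_steps=50, guidance_scale=7.5).videos
    save_videos_grid(video, f"outputs/{prompt}.gif")
```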
## Results

### Fine-tuning on Stable Diffusion

Result videos (training prompt first, edited prompts after):

- [Training] a man is surfing.
- a panda is surfing.
- Iron Man is surfing in the desert.
- a raccoon is surfing, cartoon style.
### Fine-tuning on DreamBooth
Result videos (training prompt first, edited prompts after):

- sks mr potato head.
- sks mr potato head, wearing a pink hat, is surfing.
- sks mr potato head, wearing sunglasses, is surfing.
- sks mr potato head is surfing in the forest.
## BibTeX

```
@article{wu2022tuneavideo,
    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},
    year={2022}
}
```