# Tune-A-Video

This repository is the official implementation of [Tune-A-Video](https://arxiv.org/abs/2212.11565).

**[Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565)**
<br/>
[Jay Zhangjie Wu](https://zhangjiewu.github.io/), 
[Yixiao Ge](https://geyixiao.com/), 
[Xintao Wang](https://xinntao.github.io/), 
Stan Weixian Lei, 
[Yuchao Gu](https://ycgu.site/), 
[Wynne Hsu](https://www.comp.nus.edu.sg/~whsu/), 
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en), 
[Xiaohu Qie](https://scholar.google.com/citations?user=mk-F69UAAAAJ&hl=en), 
[Mike Zheng Shou](https://sites.google.com/view/showlab)
<br/>

[Project Page](https://tuneavideo.github.io/) | [arXiv](https://arxiv.org/abs/2212.11565)

## Setup

### Requirements

```shell
pip install -r requirements.txt
```

Installing [xformers](https://github.com/facebookresearch/xformers) is highly recommended for better memory efficiency and speed on GPUs. 
To enable xformers, set `enable_xformers_memory_efficient_attention=True` (the default).
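
Before enabling the flag, it can help to confirm that xformers is actually importable in your environment. This is just a small sanity check, not part of the original setup steps:

```python
# Quick check that xformers is available before relying on memory-efficient attention.
try:
    import xformers
    import xformers.ops  # the attention kernels live here
    print("xformers", xformers.__version__, "is available")
except ImportError:
    print("xformers is not installed; attention falls back to the default implementation")
```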

### Weights

You can download the pre-trained [Stable Diffusion](https://arxiv.org/abs/2112.10752) models 
(e.g., [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)):

```shell
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
```

Alternatively, you can use a personalized [DreamBooth](https://arxiv.org/abs/2208.12242) model (e.g., [mr-potato-head](https://huggingface.co/sd-dreambooth-library/mr-potato-head)):
```shell
git lfs install
git clone https://huggingface.co/sd-dreambooth-library/mr-potato-head
```
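
If `git lfs` is not convenient, the same repositories can also be fetched with the `huggingface_hub` Python client. A minimal sketch (the checkpoints download into the local Hugging Face cache, and the returned paths can be passed to `from_pretrained`):

```python
from huggingface_hub import snapshot_download

# Base Stable Diffusion weights (same repo as the git clone above).
sd_path = snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")

# Or a personalized DreamBooth checkpoint.
dreambooth_path = snapshot_download(repo_id="sd-dreambooth-library/mr-potato-head")

print(sd_path, dreambooth_path)
```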

## Training

To fine-tune a pre-trained text-to-image diffusion model for text-to-video generation, run this command:

```shell
accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
```

## Inference

Once training is done, run inference:

```python
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "path-to-your-trained-model"  # output directory produced by the training step above
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")
```
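
The same pattern works with a personalized DreamBooth base: point `from_pretrained` at the DreamBooth checkpoint instead of `CompVis/stable-diffusion-v1-4`. A hedged variation with a fixed seed (the paths are placeholders, and the `generator` keyword assumes the pipeline follows the standard diffusers calling convention):

```python
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

base_model = "./mr-potato-head"          # DreamBooth base cloned in the Weights step
model_id = "path-to-your-trained-model"  # checkpoint fine-tuned on that base

unet = UNet3DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(base_model, unet=unet, torch_dtype=torch.float16).to("cuda")

# Fixed seed so repeated runs produce the same sample (assumes a diffusers-style `generator` kwarg).
generator = torch.Generator(device="cuda").manual_seed(33)

prompt = "sks mr potato head is surfing"
video = pipe(prompt, video_length=8, height=512, width=512,
             num_inference_steps=50, guidance_scale=7.5, generator=generator).videos

save_videos_grid(video, f"{prompt}.gif")
```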

## Results

### Fine-tuning on Stable Diffusion

<table width="100%" align="center">
<tr>
  <td><img src="https://tuneavideo.github.io/static/results/man-surfing/train.gif"></td>
  <td><img src="https://tuneavideo.github.io/static/results/repo/stablediffusion/panda-surfing.gif"></td>
  <td><img src="https://tuneavideo.github.io/static/results/repo/stablediffusion/ironman-desert.gif"></td>              
  <td><img src="https://tuneavideo.github.io/static/results/repo/stablediffusion/raccoon-cartoon.gif"></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">[Training] a man is surfing.</td>
  <td width=25% style="text-align:center;">a panda is surfing.</td>
  <td width=25% style="text-align:center;">Iron Man is surfing in the desert.</td>
  <td width=25% style="text-align:center;">a raccoon is surfing, cartoon style.</td>
</tr>
</table>

### Fine-tuning on DreamBooth

<table width="100%" align="center">
<tr>
  <td><img src="https://tuneavideo.github.io/static/results/repo/dreambooth/mr-potato-head.png"></td>
  <td><img src="https://tuneavideo.github.io/static/results/repo/dreambooth/pink-hat.gif"></td>
  <td><img src="https://tuneavideo.github.io/static/results/repo/dreambooth/potato-sunglasses.gif"></td>              
  <td><img src="https://tuneavideo.github.io/static/results/repo/dreambooth/potato-forest.gif"></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">sks mr potato head.</td>
  <td width=25% style="text-align:center;">sks mr potato head, wearing a pink hat, is surfing.</td>
  <td width=25% style="text-align:center;">sks mr potato head, wearing sunglasses, is surfing.</td>
  <td width=25% style="text-align:center;">sks mr potato head is surfing in the forest.</td>
</tr>
</table>

## BibTeX
```
@article{wu2022tuneavideo,
    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},
    year={2022}
}
```