nielsr HF staff committed on
Commit 1d324dc · verified · 1 Parent(s): a793814

Add library_name and clarify license

This PR adds the `library_name` to the model card metadata. The code examples and mention of a diffusers version suggest compatibility with the Diffusers library, making this a valuable addition for discoverability. The license is also clarified to explicitly state MIT.

Files changed (1)
  1. README.md +205 -4
README.md CHANGED
@@ -1,16 +1,18 @@
  ---
- license: other
- language:
- - en
  base_model:
  - THUDM/CogVideoX-5b
  tags:
  - video
  - video-generation
  - cogvideox
  - alibaba
- pipeline_tag: text-to-video
  ---
  <div align="center">

  <img src="icon.jpg" width="250"/>
@@ -56,6 +58,21 @@ Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable
  - `2024/08/27` We released our v2 paper including appendix.
  - `2024/07/31` We submitted our paper on arXiv and released our project page.

  ## 🎞️ Showcases

  https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1
@@ -66,6 +83,190 @@ https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

  All videos are available in this [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip)

  ## 🤝 Acknowledgements

  We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
 
  ---
  base_model:
  - THUDM/CogVideoX-5b
+ language:
+ - en
+ license: mit
+ pipeline_tag: text-to-video
  tags:
  - video
  - video-generation
  - cogvideox
  - alibaba
+ library_name: diffusers
  ---
+
  <div align="center">

  <img src="icon.jpg" width="250"/>
 
  - `2024/08/27` We released our v2 paper including appendix.
  - `2024/07/31` We submitted our paper on arXiv and released our project page.

+ ## 📑 Table of Contents
+
+ - [🎞️ Showcases](#%EF%B8%8F-showcases)
+ - [✅ TODO List](#-todo-list)
+ - [🧨 Diffusers version](#-diffusers-version)
+ - [🐍 Installation](#-installation)
+ - [📦 Model Weights](#-model-weights)
+ - [🔄 Inference](#-inference)
+ - [🖥️ Gradio Demo](#%EF%B8%8F-gradio-demo)
+ - [🧠 Training](#-training)
+ - [🎯 Troubleshooting](#-troubleshooting)
+ - [🤝 Acknowledgements](#-acknowledgements)
+ - [📄 Our previous work](#-our-previous-work)
+ - [📚 Citation](#-citation)
+
  ## 🎞️ Showcases

  https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1
 
  All videos are available in this [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip)

+ ## ✅ TODO List
+
+ - [x] Release our inference code and model weights
+ - [x] Provide a ModelScope Demo
+ - [x] Release our training code
+ - [x] Release diffusers version and optimize the GPU memory usage
+ - [x] Release complete version of Tora
+
+ ## 🧨 Diffusers version
+
+ Please refer to [the diffusers version](diffusers-version/README.md) for details.
+
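For orientation, here is a minimal sketch of the underlying CogVideoX-5b pipeline in 🤗 Diffusers; Tora's trajectory conditioning lives in the code under `diffusers-version/`, so the pipeline class and call below illustrate the base model rather than the exact Tora entry point.

```python
# Minimal sketch: base CogVideoX-5b via Diffusers. Tora's trajectory control
# requires the code in diffusers-version/; this only shows the underlying pipeline.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory

frames = pipe(
    prompt="A sailboat glides along a winding river at sunset.",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)
```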
+ ## 🐍 Installation
+
+ Please make sure your Python version is between 3.10 and 3.12 (inclusive).
+
+ ```bash
+ # Clone this repository.
+ git clone https://github.com/alibaba/Tora.git
+ cd Tora
+
+ # Install PyTorch (we use PyTorch 2.4.0) and torchvision following the official instructions: https://pytorch.org/get-started/previous-versions/. For example:
+ conda create -n tora python==3.10
+ conda activate tora
+ conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia
+
+ # Install requirements
+ cd modules/SwissArmyTransformer
+ pip install -e .
+ cd ../../sat
+ pip install -r requirements.txt
+ cd ..
+ ```
+
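Optionally, a quick sanity check (plain PyTorch calls, nothing Tora-specific) to confirm the environment sees a GPU and the intended PyTorch version:

```python
# Environment sanity check after installation.
import torch

print("PyTorch:", torch.__version__)            # expect 2.4.0 per the steps above
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```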
+ ## 📦 Model Weights
+
+ ### Folder Structure
+
+ ```
+ Tora
+ └── sat
+     └── ckpts
+         ├── t5-v1_1-xxl
+         │   ├── model-00001-of-00002.safetensors
+         │   └── ...
+         ├── vae
+         │   └── 3d-vae.pt
+         ├── tora
+         │   ├── i2v
+         │   │   └── mp_rank_00_model_states.pt
+         │   └── t2v
+         │       └── mp_rank_00_model_states.pt
+         └── CogVideoX-5b-sat # for training stage 1
+             └── mp_rank_00_model_states.pt
+ ```
+
+ ### Download Links
+
+ *Note: Downloading the `tora` weights requires following the [CogVideoX License](CogVideoX_LICENSE).* You can choose one of the following options: HuggingFace, ModelScope, or native links.\
+ After downloading the model weights, you can put them in the `Tora/sat/ckpts` folder.
+
+ #### HuggingFace
+
+ ```bash
+ # This can be faster
+ pip install "huggingface_hub[hf_transfer]"
+ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts
+ ```
+
+ or
+
+ ```bash
+ # use git
+ git lfs install
+ git clone https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora
+ ```
+
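Equivalently, a small Python sketch using `huggingface_hub` (assuming it is run from `Tora/sat`, so the weights land in `ckpts` as in the folder structure above):

```python
# Download the Tora weights into ./ckpts with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alibaba-Research-Intelligence-Computing/Tora",
    local_dir="ckpts",
)
```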
+ #### ModelScope
+
+ - SDK
+
+ ```python
+ from modelscope import snapshot_download
+ model_dir = snapshot_download('xiaoche/Tora')
+ ```
+
+ - Git
+
+ ```bash
+ git clone https://www.modelscope.cn/xiaoche/Tora.git
+ ```
+
+ #### Native
+
+ - Download the VAE and T5 model following [CogVideo](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#2-download-model-weights):
+   - VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
+   - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
+ - Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).
+
+ ## 🔄 Inference
+
+ ### Text to Video
+ It requires around 30 GiB of GPU memory (tested on an NVIDIA A100).
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
+ ```
+
+ You can change `--input-file` and `--point_path` to your own prompt and trajectory-point files. Please note that the trajectory is drawn on a 256x256 canvas.
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
+
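The exact trajectory file format read via `--point_path` is defined by the scripts in this repo (see the provided files under `trajs/`), so the snippet below is only a hypothetical sketch of the one constraint stated above: the trajectory lives on a 256x256 canvas. It rescales (x, y) points from an arbitrary source resolution and writes one pair per line.

```python
# Hypothetical helper: rescale trajectory points onto the 256x256 canvas
# mentioned above. The on-disk layout (one "x y" pair per line) is illustrative;
# check the files in trajs/ for the format the sampler actually expects.
def rescale_points(points, src_w, src_h, canvas=256):
    return [(x * canvas / src_w, y * canvas / src_h) for x, y in points]

points = [(100, 400), (300, 350), (500, 200)]  # drawn on a 640x480 frame
with open("my_traj.txt", "w") as f:
    for x, y in rescale_points(points, 640, 480):
        f.write(f"{x:.1f} {y:.1f}\n")
```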
+ ### Image to Video
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora_i2v.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/i2v --output-dir samples --point_path trajs/sawtooth.txt --input-file assets/text/i2v/examples.txt --img_dir assets/images --image2video
+ ```
+
+ The first frame images should be placed in the `--img_dir`. The names of these images should be specified in the corresponding text prompt in `--input-file`, separated by `@@`.
+
+ ### Recommendations for Text Prompts
+
+ For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
+
+ You can refer to the following resources for guidance:
+
+ - [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
+ - [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
+
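As an illustration of that recommendation, here is a minimal prompt-enhancement sketch using the OpenAI Python SDK; the model name and instruction text are placeholders, and the linked CogVideoX and Open-Sora scripts show the exact prompting used by those projects.

```python
# Minimal sketch: expand a short idea into a detailed text-to-video prompt.
# Requires OPENAI_API_KEY in the environment; model and instructions are placeholders.
from openai import OpenAI

client = OpenAI()

def enhance_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Rewrite the user's video idea as one detailed, visually specific text-to-video prompt."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(enhance_prompt("A sailboat moving along a curved path on a lake"))
```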
+ ## 🖥️ Gradio Demo
+
+ Usage:
+
+ ```bash
+ cd sat
+ python app.py --load ckpts/tora/t2v
+ ```
+
+ ## 🧠 Training
+
+ ### Data Preparation
+
+ Following [this guide](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset), structure the datasets as follows:
+
+ ```
+ .
+ ├── labels
+ │   ├── 1.txt
+ │   ├── 2.txt
+ │   ├── ...
+ └── videos
+     ├── 1.mp4
+     ├── 2.mp4
+     ├── ...
+ ```
+
+ Training data examples are in `sat/training_examples`.
+
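Since every `N.mp4` needs a matching `N.txt` caption, a small check like the following (a hypothetical helper, not part of the training scripts) can catch mismatches before launching a run:

```python
# Hypothetical check: every video should have a caption file, and vice versa.
from pathlib import Path

root = Path(".")  # dataset root containing labels/ and videos/
labels = {p.stem for p in (root / "labels").glob("*.txt")}
videos = {p.stem for p in (root / "videos").glob("*.mp4")}

print("videos missing captions:", sorted(videos - labels))
print("captions missing videos:", sorted(labels - videos))
```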
+ ### Text to Video
+
+ It requires around 60 GiB of GPU memory (tested on an NVIDIA A100).
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
+
+ - Stage 1
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
+ ```
+
+ - Stage 2
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"
+ ```
+
+ ## 🎯 Troubleshooting
+
+ ### 1. ValueError: Non-consecutive added token...
+
+ Upgrade the transformers package to 4.44.2. See [this issue](https://github.com/THUDM/CogVideo/issues/213).
+
  ## 🤝 Acknowledgements

  We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project: