---
language:
- "en"
tags:
- video generation
- CreateAI
license: apache-2.0
pipeline_tag: image-to-video
---
# Ruyi-Mini-7B
[Hugging Face](https://huggingface.co/IamCreateAI/Ruyi-Mini-7B) | [GitHub](https://github.com/IamCreateAI/Ruyi-Models)
An image-to-video model by CreateAI.
## Overview
Ruyi-Mini-7B is an open-source image-to-video generation model. Starting with an input image, Ruyi produces subsequent video frames at resolutions ranging from 360p to 720p, supporting various aspect ratios and a maximum duration of 5 seconds. Enhanced with motion and camera control, Ruyi offers greater flexibility and creativity in video generation. We are releasing the model under the permissive Apache 2.0 license.
## Installation
Install the code from GitHub:
```bash
git clone https://github.com/IamCreateAI/Ruyi-Models
cd Ruyi-Models
pip install -r requirements.txt
```
## Running
We provide two ways to run our model. The first is to run the Python script directly:
```bash
python3 predict_i2v.py
```
Alternatively, use the ComfyUI wrapper in our [GitHub repo](https://github.com/IamCreateAI/Ruyi-Models).
## Model Architecture
Ruyi-Mini-7B is an advanced image-to-video model with about 7.1 billion parameters. The model architecture is modified from the [EasyAnimate V4 model](https://github.com/aigc-apps/EasyAnimate), whose transformer module is inherited from [HunyuanDiT](https://github.com/Tencent/HunyuanDiT). It comprises three key components:
1. Causal VAE Module: Handles video compression and decompression. It reduces spatial resolution to 1/8 and temporal resolution to 1/4, and each latent pixel is represented by 16 BF16 channels after compression.
2. Diffusion Transformer Module: Generates compressed video data using 3D full attention, with:
   - 2D Normalized-RoPE for spatial dimensions;
   - Sin-cos position embedding for temporal dimensions;
   - DDPM (Denoising Diffusion Probabilistic Models) for model training.
3. CLIP Model: Extracts semantic features from the input image to guide the whole video generation. The CLIP features are introduced into the transformer through cross-attention.
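To make the VAE compression factors concrete, the following is a minimal sketch of how the latent tensor size could be estimated for a given video. The function names are hypothetical and not part of the Ruyi codebase, and the sketch assumes plain ceiling division by the stated factors; the actual causal VAE may treat the first frame or padding differently.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def latent_shape(height, width, frames,
                 spatial_factor=8, temporal_factor=4, channels=16):
    """Estimate the latent shape as (channels, frames, height, width).

    Assumption: each dimension is divided by its compression factor and
    rounded up. The real VAE may handle edge frames differently.
    """
    return (
        channels,
        ceil_div(frames, temporal_factor),
        ceil_div(height, spatial_factor),
        ceil_div(width, spatial_factor),
    )

def latent_bytes(shape, bytes_per_value=2):
    """Size of the latent tensor in bytes (BF16 uses 2 bytes per value)."""
    c, t, h, w = shape
    return c * t * h * w * bytes_per_value

shape = latent_shape(720, 1280, 120)
print(shape)                # (16, 30, 90, 160)
print(latent_bytes(shape))  # 13824000
```

Under these assumptions, a 720x1280 video with 120 frames stored as 8-bit RGB (331,776,000 bytes) is about 24 times larger than its latent representation.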
## Training Data and Methodology
The training process is divided into four phases:
- Phase 1: Pre-training from scratch with ~200M video clips and ~30M images at 256 resolution, using a batch size of 4096 for 350,000 iterations to achieve full convergence.
- Phase 2: Fine-tuning with ~60M video clips for multi-scale resolutions (384–512), with a batch size of 1024 for 60,000 iterations.
- Phase 3: High-quality fine-tuning with ~20M video clips and ~8M images at 384–1024 resolutions, using dynamic batch sizes based on memory for 10,000 iterations.
- Phase 4: Final video training with ~10M curated high-quality video clips, using a batch size of 1024 for ~10,000 iterations.
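As a rough worked example, the batch sizes and iteration counts above imply how many samples each phase processes. The sketch below assumes that one iteration consumes exactly one batch and treats the approximate dataset sizes as exact; Phase 3 is omitted because its batch size is dynamic. The variable and function names are illustrative only.

```python
# (name, batch size, iterations, approximate number of training samples)
phases = [
    ("Phase 1", 4096, 350_000, 200_000_000 + 30_000_000),  # videos + images
    ("Phase 2", 1024, 60_000, 60_000_000),
    ("Phase 4", 1024, 10_000, 10_000_000),
]

def samples_seen(batch_size, iterations):
    """Total samples processed, assuming one full batch per iteration."""
    return batch_size * iterations

for name, batch_size, iterations, dataset_size in phases:
    seen = samples_seen(batch_size, iterations)
    passes = seen / dataset_size
    print(f"{name}: {seen:,} samples, ~{passes:.1f} passes over the data")

# Phase 1: 1,433,600,000 samples, ~6.2 passes over the data
# Phase 2: 61,440,000 samples, ~1.0 passes over the data
# Phase 4: 10,240,000 samples, ~1.0 passes over the data
```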
## Hardware Requirements
The VRAM cost of Ruyi depends on the resolution and duration of the video. Here we list the costs for some typical video sizes, tested on a single A100.
|Video Size | 360x480x120 | 384x672x120 | 480x640x120 | 630x1120x120 | 720x1280x120 |
|:--:|:--:|:--:|:--:|:--:|:--:|
|Memory | 21.5GB | 25.5GB | 27.7GB | 44.9GB | 54.8GB |
|Time | 03:10 | 05:29 | 06:49 | 24:18 | 39:02 |
For 24GB VRAM cards such as the RTX 4090, we provide `low_gpu_memory_mode`, under which the model can generate 720x1280x120 videos at the cost of a longer generation time.
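The table above can be turned into a simple lookup to decide when `low_gpu_memory_mode` is likely needed. This is only an illustrative sketch: the helper name is hypothetical and not part of the Ruyi repository, the figures are the A100 measurements from the table, and real usage on other cards may differ and should leave some headroom.

```python
# Peak VRAM in GB, copied from the table above (measured on a single A100).
MEASURED_VRAM_GB = {
    "360x480x120": 21.5,
    "384x672x120": 25.5,
    "480x640x120": 27.7,
    "630x1120x120": 44.9,
    "720x1280x120": 54.8,
}

def needs_low_gpu_memory_mode(video_size, available_vram_gb):
    """Return True if the measured peak VRAM exceeds what the card offers.

    Hypothetical helper: it only compares against the table and does not
    account for other processes or framework overhead on the GPU.
    """
    return MEASURED_VRAM_GB[video_size] > available_vram_gb

for size in MEASURED_VRAM_GB:
    print(size, needs_low_gpu_memory_mode(size, available_vram_gb=24.0))
```

With a 24GB card, only the 360x480x120 setting fits within the measured figures; the larger sizes would call for `low_gpu_memory_mode`.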
## Showcase
### Image to Video Effects
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
<tr>
<td><video src="https://github.com/user-attachments/assets/4dedf40b-82f2-454c-9a67-5f4ed243f5ea" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
<td><video src="https://github.com/user-attachments/assets/905fef17-8c5d-49b0-a49a-6ae7e212fa07" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
<td><video src="https://github.com/user-attachments/assets/20daab12-b510-448a-9491-389d7bdbbf2e" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
<td><video src="https://github.com/user-attachments/assets/f1bb0a91-d52a-4611-bac2-8fcf9658cac0" width="100%" style="max-height:640px; min-height: 200px" controls autoplay loop></video></td>
</tr>
</table>
### Camera Control
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
<tr>
<td align=center><img src="https://github.com/user-attachments/assets/8aedcea6-3b8e-4c8b-9fed-9ceca4d41954" width="100%" style="max-height:360px; min-height: 200px"></img>input</td>
<td align=center><video src="https://github.com/user-attachments/assets/d9d027d4-0d4f-45f5-9d46-49860b562c69" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video>left</td>
<td align=center><video src="https://github.com/user-attachments/assets/7716a67b-1bb8-4d44-b128-346cbc35e4ee" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video>right</td>
</tr>
<tr>
<td align=center><video src="https://github.com/user-attachments/assets/cc1f1928-cab7-4c4b-90af-928936102e66" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video>static</td>
<td align=center><video src="https://github.com/user-attachments/assets/c742ea2c-503a-454f-a61a-10b539100cd9" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video>up</td>
<td align=center><video src="https://github.com/user-attachments/assets/442839fa-cc53-4b75-b015-909e44c065e0" width="100%" style="max-height:360px; min-height: 200px" controls autoplay loop></video>down</td>
</tr>
</table>
### Motion Amplitude Control
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
<tr>
<td align=center><video src="https://github.com/user-attachments/assets/0020bd54-0ff6-46ad-91ee-d9f0df013772" width="100%" controls autoplay loop></video>motion 1</td>
<td align=center><video src="https://github.com/user-attachments/assets/d1c26419-54e3-4b86-8ae3-98e12de3022e" width="100%" controls autoplay loop></video>motion 2</td>
<td align=center><video src="https://github.com/user-attachments/assets/535147a2-049a-4afc-8d2a-017bc778977e" width="100%" controls autoplay loop></video>motion 3</td>
<td align=center><video src="https://github.com/user-attachments/assets/bf893d53-2e11-406f-bb9a-2aacffcecd44" width="100%" controls autoplay loop></video>motion 4</td>
</tr>
</table>
## Limitations
There are some known limitations in this experimental release. Text, hands, and crowded human faces may be distorted. The video may cut to another scene when the model does not know how to generate future frames. We are still working on these problems and will update the model as we make progress.
## BibTeX
```
@misc{createai2024ruyi,
title={Ruyi-Mini-7B},
author={CreateAI Team},
year={2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished={\url{https://github.com/IamCreateAI/Ruyi-Models}}
}
```