---
license: apache-2.0
language:
- en
library_name: diffusers
---
<p align="center">
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/banner.gif" width="500" height="400"/>
</p>
<p align="center">
 <a href="https://rhymes.ai/" target="_blank">Gallery</a> · <a href="https://github.com/rhymes-ai/Allegro" target="_blank">GitHub</a> · <a href="https://www.rhymes.ai/blog-details/" target="_blank">Blog</a> · <a href="https://arxiv.org/pdf/2410.05993" target="_blank">Paper</a> · <a href="https://discord" target="_blank">Discord</a>
 
</p> 

# Gallery
<img src="https://huggingface.co/rhymes-ai/Allegro/resolve/main/gallery.gif" width="1000" height="800"/>

For more demos and corresponding prompts, see the [Allegro Gallery](TBD).


# Key Features

- **Open Source**: [Full model weights](https://huggingface.co/rhymes-ai/Allegro) and [code](https://github.com/rhymes-ai/Allegro) are available to the community under the Apache 2.0 license.
- **Versatile Content Creation**: Generates a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
- **High-Quality Output**: Generates detailed 6-second videos at 15 FPS and 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI).
- **Small and Efficient**: Features a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2k, equivalent to 88 frames.
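
The frame count, duration, and context length quoted above can be related with a little arithmetic. The sketch below is only illustrative: the VAE compression factors (4x temporal, 8x8 spatial) and the DiT patch size (1x2x2) are assumptions that are not stated in this model card, chosen because they reproduce the 79.2k figure.

```python
# Figures quoted in this model card.
NUM_FRAMES, HEIGHT, WIDTH, FPS = 88, 720, 1280, 15

# Assumptions (not stated in this card): VAE compression and DiT patch size.
VAE_T, VAE_S = 4, 8      # assumed temporal and spatial downsampling
PATCH_T, PATCH_S = 1, 2  # assumed temporal and spatial patch size


def video_seconds(num_frames: int, fps: int) -> float:
    """Clip duration in seconds."""
    return num_frames / fps


def context_length(num_frames: int, height: int, width: int) -> int:
    """Number of latent patches (tokens) under the assumed factors."""
    t = num_frames // (VAE_T * PATCH_T)
    h = height // (VAE_S * PATCH_S)
    w = width // (VAE_S * PATCH_S)
    return t * h * w


print(round(video_seconds(NUM_FRAMES, FPS), 2))   # 5.87, i.e. roughly 6 seconds
print(context_length(NUM_FRAMES, HEIGHT, WIDTH))  # 79200, i.e. 79.2k
```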

# Model info 

<table>
  <tr>
    <th>Model</th>
    <td>Allegro</td>
  </tr>
  <tr>
    <th>Description</th>
    <td>Text-to-Video Generation Model</td>
  </tr>
  <tr>
    <th>Download</th>
    <td>&lt;HF link - TBD&gt;</td>
  </tr>
  <tr>
    <th rowspan="2">Parameters</th>
    <td>VAE: 175M</td>
  </tr>
  <tr>
    <td>DiT: 2.8B</td>
  </tr>
  <tr>
    <th rowspan="2">Inference Precision</th>
    <td>VAE: FP32/TF32/BF16/FP16 (best in FP32/TF32)</td>
  </tr>
  <tr>
    <td>DiT/T5: BF16/FP32/TF32</td>
  </tr>
  <tr>
    <th>Context Length</th>
    <td>79.2k</td>
  </tr>
  <tr>
    <th>Resolution</th>
    <td>720 x 1280</td>
  </tr>
  <tr>
    <th>Frames</th>
    <td>88</td>
  </tr>
  <tr>
    <th>Video Length</th>
    <td>6 seconds @ 15 fps</td>
  </tr>
  <tr>
    <th>Single GPU Memory Usage</th>
    <td>9.3 GB in BF16 (with cpu_offload)</td>
  </tr>
</table>
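
As a rough back-of-the-envelope check, the parameter counts and precisions in the table translate into weight storage as sketched below. This covers weights only: the T5 text encoder (whose size is not listed here), activations, and framework overhead are excluded, so the numbers are not a prediction of the measured 9.3 GB figure. The helper name `weight_gb` is hypothetical.

```python
# Bytes used to store one parameter. TF32 is a compute mode; weights stay in 32-bit storage.
BYTES_PER_PARAM = {"fp32": 4, "tf32": 4, "bf16": 2, "fp16": 2}


def weight_gb(num_params: float, precision: str) -> float:
    """Weight storage in GB (10**9 bytes); excludes activations and buffers."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


dit_gb = weight_gb(2.8e9, "bf16")  # DiT: 2.8B parameters in BF16
vae_gb = weight_gb(175e6, "fp32")  # VAE: 175M parameters in FP32

print(f"DiT (BF16): {dit_gb:.1f} GB")  # DiT (BF16): 5.6 GB
print(f"VAE (FP32): {vae_gb:.1f} GB")  # VAE (FP32): 0.7 GB
```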


# Quick start
You can get started quickly with Allegro using the Hugging Face Diffusers library.
For more tutorials, see the Allegro GitHub (link-tbd).

1. Install the necessary requirements. Refer to [requirements.txt](https://github.com/rhymes-ai) on the Allegro GitHub.
2. Run inference on a single GPU.
```python
import imageio
import torch
from diffusers import DiffusionPipeline

# Load the pipeline in BF16 and move it to the GPU.
allegro_pipeline = DiffusionPipeline.from_pretrained(
    "rhymes-ai/Allegro", trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")

# The VAE gives its best results in FP32/TF32, so cast it back to FP32.
allegro_pipeline.vae = allegro_pipeline.vae.to(torch.float32)

prompt = "a video of an astronaut riding a horse on mars"

# Template that wraps the user prompt in quality-related keywords.
positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked),
{}
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo,
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality,
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

num_sampling_steps, guidance_scale, seed = 100, 7.5, 42

# Insert the normalized user prompt into the template.
user_prompt = positive_prompt.format(prompt.lower().strip())
out_video = allegro_pipeline(
    user_prompt,
    negative_prompt=negative_prompt,
    num_frames=88,
    height=720,
    width=1280,
    num_inference_steps=num_sampling_steps,
    guidance_scale=guidance_scale,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda:0").manual_seed(seed),
).video[0]

# Save the frames as an MP4 at the model's native 15 FPS.
imageio.mimwrite("test_video.mp4", out_video, fps=15, quality=8)
```
Tips:
- It is highly recommended to use a video frame interpolation model (such as EMA-VFI) to enhance the result to 30 FPS.
- For more tutorials, see [Allegro GitHub](https://github.com/rhymes-ai).
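
To illustrate what raising the frame rate from 15 to roughly 30 FPS involves, here is a minimal sketch that inserts one blended frame between each pair of neighbours. This is plain linear blending, not EMA-VFI: it is a naive stand-in that tends to produce ghosting on fast motion, and it does not use or imitate the EMA-VFI API. The function name `double_frame_rate` is hypothetical.

```python
import numpy as np


def double_frame_rate(frames: np.ndarray) -> np.ndarray:
    """Insert the average of each neighbouring frame pair between them.

    frames: array of shape (T, H, W, C). Returns shape (2*T - 1, H, W, C).
    """
    frames_f = frames.astype(np.float32)
    # Midpoint frame between frame i and frame i + 1.
    midpoints = (frames_f[:-1] + frames_f[1:]) / 2.0
    out = np.empty((2 * len(frames) - 1, *frames.shape[1:]), dtype=np.float32)
    out[0::2] = frames_f   # original frames at even positions
    out[1::2] = midpoints  # blended frames at odd positions
    return np.round(out).astype(frames.dtype)


# Tiny random clip with the same frame count as the model output.
clip = np.random.randint(0, 256, size=(88, 4, 4, 3), dtype=np.uint8)
print(double_frame_rate(clip).shape)  # (175, 4, 4, 3)
```

If the generated frames can be stacked into an array of this shape (an assumption about the pipeline's output format), the result could be written at 30 FPS instead of 15. A learned interpolator such as EMA-VFI should give noticeably better motion than this blend.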

# License
This repo is released under the Apache 2.0 License.