---
license: mit
pipeline_tag: text-to-video
library_name: diffusers
---

# VidToMe: Video Token Merging for Zero-Shot Video Editing

Edit videos instantly with just a prompt! 🎥

This Diffusers implementation of VidToMe is a diffusion-based pipeline for zero-shot video editing that improves temporal consistency and reduces memory usage by merging self-attention tokens across video frames.

It enables coherent video generation and editing without fine-tuning the model: by aligning and compressing redundant tokens across frames, VidToMe produces smooth transitions and consistent output, improving over traditional video editing methods.

It is based on [this paper](https://arxiv.org/abs/2312.10656).
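
For intuition, here is a minimal, illustrative sketch of the merging idea: each frame's self-attention tokens are matched to their most similar tokens in a key frame by cosine similarity, and the most redundant ones are averaged in. The function below and its merging strategy are simplified assumptions for illustration, not the pipeline's actual implementation.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(tokens: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Toy cross-frame token merging (illustrative only, not the VidToMe code).

    tokens: (num_frames, num_tokens, dim) self-attention tokens for a chunk of frames.
    ratio:  fraction of each non-key frame's tokens to fold into the key frame.
    """
    key, rest = tokens[0], tokens[1:]  # treat frame 0 as the key frame
    key_norm = F.normalize(key, dim=-1)
    merged = key.clone()
    counts = torch.ones(key.shape[0], device=tokens.device)

    for frame in rest:
        # cosine similarity between this frame's tokens and the key frame's tokens
        sim = F.normalize(frame, dim=-1) @ key_norm.T
        best_sim, best_idx = sim.max(dim=-1)    # closest key-frame token per token
        num_merge = int(ratio * frame.shape[0])
        src = best_sim.topk(num_merge).indices  # the most redundant tokens
        # average each selected token into its matched key-frame token
        merged.index_add_(0, best_idx[src], frame[src])
        counts.index_add_(0, best_idx[src], torch.ones(num_merge, device=tokens.device))

    return merged / counts.unsqueeze(-1)        # (num_tokens, dim) compressed tokens

# example: tokens from 4 frames (256 tokens each, dim 64) compress to one set
compressed = merge_tokens_across_frames(torch.randn(4, 256, 64))
print(compressed.shape)  # torch.Size([256, 64])
```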
## Usage

```python
from diffusers import DiffusionPipeline

# load the pretrained custom pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "jadechoghari/VidToMe",
    trust_remote_code=True,
    custom_pipeline="jadechoghari/VidToMe",
    sd_version="depth",
    device="cuda",
    float_precision="fp16",
)

# prompt describing the input video, used for inversion
inversion_prompt = "flamingos standing in the water near a tree."

# edit prompt(s) for generation, keyed by a name for each edited output
generation_prompt = {"origami": "rainbow-colored origami flamingos standing in the water near a tree."}

# additional control and parameters
control_type = "none"  # no extra control; use "depth" for depth guidance
negative_prompt = ""   # optional; see the GitHub repository for full configuration

# run the zero-shot video editing pipeline
generated_images = pipeline(
    video_path="path/to/video.mp4",  # path to the input video
    video_prompt=inversion_prompt,   # inversion prompt
    edit_prompt=generation_prompt,   # edit prompt(s) for generation
    control_type=control_type,       # control type (e.g., "none", "depth")
)
```
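
Assuming the pipeline returns a list of PIL image frames (the exact return structure may differ; check the repository), the result can be written out with diffusers' `export_to_video` utility:

```python
from diffusers.utils import export_to_video

# `generated_images` is assumed to be a list of PIL.Image frames;
# adjust if the pipeline returns a different structure
export_to_video(generated_images, "edited_video.mp4", fps=8)
```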
#### Note: For more control, consider creating a configuration file and following the instructions in the GitHub repository.
## Applications

- Zero-shot video editing for content creators
- Video transformation using natural language prompts
- Memory-optimized video generation for longer or complex sequences

**Model Authors:**

- [Xirui Li](https://github.com/lixirui142)
- Chao Ma
- Xiaokang Yang
- Ming-Hsuan Yang

For more details, check the [GitHub repo](https://github.com/lixirui142/VidToMe).