VidTwin: Video VAE with Decoupled Structure and Dynamics
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules that extract the Structure and Dynamics latents, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
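To make the two-branch design concrete, below is a minimal PyTorch-style sketch of how the Structure and Dynamics latents could be extracted from encoder features. All module names, shapes, and hyperparameters (e.g. num_queries, the single Linear standing in for the downsampling blocks) are illustrative assumptions, not the released implementation; see the repository linked above for the actual code.

```python
# Illustrative sketch only: shapes and module choices are assumptions, not the VidTwin release.
import torch
import torch.nn as nn

class StructureBranch(nn.Module):
    """Q-Former-style extraction of overall content, then a stand-in downsampling step."""
    def __init__(self, dim: int, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))          # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.downsample = nn.Linear(dim, dim // 4)  # stand-in for the downsampling blocks

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) spatio-temporal features from the encoder backbone
        b, t, h, w, c = feats.shape
        tokens = feats.reshape(b, t * h * w, c)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, tokens, tokens)   # queries attend to all tokens
        return self.downsample(attended)                   # (B, num_queries, C // 4)

class DynamicsBranch(nn.Module):
    """Spatial averaging keeps a compact per-frame vector for rapid motion."""
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) -> average over H and W -> (B, T, C)
        return feats.mean(dim=(2, 3))
```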
Community
Demo Page: https://wangyuchi369.github.io/VidTwin/
We introduce VidTwin, a novel and compact video autoencoder that decouples video content into two distinct latent spaces: Structure Latent Vectors, which capture the overall content and global movement, and Dynamics Latent Vectors, which encode fine-grained details and rapid movements.
The example below illustrates this decoupling. From left to right, the sequence shows the original video, the reconstructed video, the structure-generated video, and the dynamics-generated video. The Structure Latent captures the main semantic content, such as the table and screw, while the Dynamics Latent encodes fine-grained details like color, texture, and rapid local movements, such as the screw’s downward motion and rotation.
Additionally, we perform a cross-reenactment experiment in which the Structure latent from the first video is combined with the Dynamics latent from the second. The resulting video retains the rapid rotation from the second video, while the camera's downward movement adapts to the scene structure of the first.
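The cross-reenactment setup can be summarized by the short sketch below, which assumes hypothetical encode/decode helpers (encode_structure, encode_dynamics, decode); these names are for illustration and are not the released API.

```python
# Illustrative cross-reenactment sketch; helper names are assumptions, not the released API.
import torch

@torch.no_grad()
def cross_reenact(model, video_a: torch.Tensor, video_b: torch.Tensor) -> torch.Tensor:
    """Combine the Structure latent of video_a with the Dynamics latent of video_b."""
    structure_a = model.encode_structure(video_a)  # overall content and global movement of A
    dynamics_b = model.encode_dynamics(video_b)    # fine details and rapid motion of B
    return model.decode(structure_a, dynamics_b)   # video with A's scene layout and B's motion
```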
Through this decoupling, VidTwin achieves a remarkable compression rate of 0.20%, maintaining high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset). Extensive experiments show its efficiency and effectiveness in downstream generative tasks. Moreover, our model offers explainability and scalability, paving the way for future advancements in video latent representation and generation.
For more examples, visit: https://wangyuchi369.github.io/VidTwin/
Related papers recommended by the Semantic Scholar API:
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (2024)
- Improved Video VAE for Latent Video Diffusion Model (2024)
- Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation (2024)
- CPA: Camera-pose-awareness Diffusion Transformer for Video Generation (2024)
- M3-CVC: Controllable Video Compression with Multimodal Generative Models (2024)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer (2024)
- SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device (2024)