VidTwin: Video VAE with Decoupled Structure and Dynamics
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules that extract the Structure and Dynamics latents, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
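To make the two-branch design concrete, below is a minimal PyTorch-style sketch of how the Structure and Dynamics latents could be extracted from encoder features. All module names, shapes, and hyperparameters (e.g. num_queries, the single Linear standing in for the downsampling blocks) are illustrative assumptions, not the released implementation; see the repository linked above for the actual code.

```python
# Illustrative sketch only: shapes and module choices are assumptions, not the VidTwin release.
import torch
import torch.nn as nn

class StructureBranch(nn.Module):
    """Q-Former-style extraction of overall content, then a stand-in downsampling step."""
    def __init__(self, dim: int, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))          # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.downsample = nn.Linear(dim, dim // 4)  # stand-in for the downsampling blocks

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) spatio-temporal features from the encoder backbone
        b, t, h, w, c = feats.shape
        tokens = feats.reshape(b, t * h * w, c)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, tokens, tokens)   # queries attend to all tokens
        return self.downsample(attended)                   # (B, num_queries, C // 4)

class DynamicsBranch(nn.Module):
    """Spatial averaging keeps a compact per-frame vector for rapid motion."""
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) -> average over H and W -> (B, T, C)
        return feats.mean(dim=(2, 3))
```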
Community
Demo Page: https://wangyuchi369.github.io/VidTwin/
We introduce VidTwin, a novel and compact video autoencoder that decouples video content into two distinct latent spaces: Structure Latent Vectors, which capture the overall content and global movement, and Dynamics Latent Vectors, which encode fine-grained details and rapid movements.
The example below illustrates this decoupling. From left to right, the sequence shows the original video, the reconstructed video, the structure-generated video, and the dynamics-generated video. The Structure Latent captures the main semantic content, such as the table and screw, while the Dynamics Latent encodes fine-grained details like color, texture, and rapid local movements, such as the screw’s downward motion and rotation.
Additionally, we perform a cross-reenactment experiment in which the Structure latent from the first video is combined with the Dynamics latent from the second. The resulting video retains the rapid rotation from the second video, while the camera's downward movement adapts to the scene structure of the first.
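The cross-reenactment setup can be summarized by the short sketch below, which assumes hypothetical encode/decode helpers (encode_structure, encode_dynamics, decode); these names are for illustration and are not the released API.

```python
# Illustrative cross-reenactment sketch; helper names are assumptions, not the released API.
import torch

@torch.no_grad()
def cross_reenact(model, video_a: torch.Tensor, video_b: torch.Tensor) -> torch.Tensor:
    """Combine the Structure latent of video_a with the Dynamics latent of video_b."""
    structure_a = model.encode_structure(video_a)  # overall content and global movement of A
    dynamics_b = model.encode_dynamics(video_b)    # fine details and rapid motion of B
    return model.decode(structure_a, dynamics_b)   # video with A's scene layout and B's motion
```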
Through this decoupling, VidTwin achieves a remarkable compression rate of 0.20%, maintaining high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset). Extensive experiments show its efficiency and effectiveness in downstream generative tasks. Moreover, our model offers explainability and scalability, paving the way for future advancements in video latent representation and generation.
For more examples, visit: https://wangyuchi369.github.io/VidTwin/
Related papers recommended by the Semantic Scholar API:
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (2024)
- Improved Video VAE for Latent Video Diffusion Model (2024)
- Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation (2024)
- CPA: Camera-pose-awareness Diffusion Transformer for Video Generation (2024)
- M3-CVC: Controllable Video Compression with Multimodal Generative Models (2024)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer (2024)
- SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device (2024)