arxiv:2410.07763

HARIVO: Harnessing Text-to-Image Models for Video Generation

Published on Oct 10, 2024

Abstract

We present a method for building diffusion-based video models from pretrained Text-to-Image (T2I) models. AnimateDiff recently proposed freezing the T2I model and training only the temporal layers. We advance this approach with an architecture tailored for video generation, incorporating a mapping network and frame-wise tokens, that maintains the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together yield realistic and temporally consistent video generation despite limited public video data. Video-specific inductive biases are integrated into both the architecture and the loss functions. Our method, built on the frozen StableDiffusion model, simplifies training and allows seamless integration with off-the-shelf models such as ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO
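The abstract does not state the form of its temporal-smoothness losses, so the following is only a generic illustration of the idea, not the paper's loss. It assumes each frame (or frame latent) is flattened into a list of floats, and the name `temporal_smoothness_penalty` is hypothetical. The sketch penalizes the mean squared difference between consecutive frames, which is one common way to discourage flicker.

```python
from typing import Sequence

# Hypothetical representation: one flattened frame (or frame latent) per entry.
Frame = Sequence[float]


def temporal_smoothness_penalty(frames: Sequence[Frame]) -> float:
    """Mean squared difference between consecutive frames.

    Assumption: this is a generic smoothness term for illustration only;
    it is not the loss defined in the HARIVO paper.
    """
    if len(frames) < 2:
        return 0.0  # a single frame has no temporal variation

    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same length")
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count if count else 0.0


# A static clip incurs no penalty; a flickering clip incurs a large one.
static_clip = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
flicker_clip = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
print(temporal_smoothness_penalty(static_clip))
print(temporal_smoothness_penalty(flicker_clip))
```

In a training setup like the one the abstract describes, a term of this kind would presumably be added to the usual diffusion objective and would update only the trainable temporal layers, since the T2I backbone stays frozen.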
