GuanjieChen committed
Commit d064a8f · verified · 1 Parent(s): d2d09b6

Update README.md

Files changed (1)
1. README.md +18 -21
README.md CHANGED
@@ -1,11 +1,20 @@
# Accelerating Vision Diffusion Transformers with Skip Branches

- ### About
- This repository contains the official PyTorch implementation of the paper **[Accelerating Vision Diffusion Transformers with Skip Branches](https://arxiv.org/abs/2411.17616)**. In this work, we enhance standard DiT models by introducing **Skip-DiT**, which incorporates skip branches to improve feature smoothness. We also propose **Skip-Cache**, a method that leverages skip branches to cache DiT features across timesteps during inference. The effectiveness of our approach is validated on various DiT backbones for both video and image generation, demonstrating how skip branches preserve generation quality while achieving significant speedup. Experimental results show that **Skip-Cache** provides a $1.5\times$ speedup with minimal computational cost and a $2.2\times$ speedup with only a slight reduction in quantitative metrics. All code and checkpoints are publicly available on [Hugging Face](https://huggingface.co/GuanjieChen/Skip-DiT/tree/main) and [GitHub](https://github.com/OpenSparseLLMs/Skip-DiT.git). More visualizations can be found [here](#visualization).
 
### Pipeline
![pipeline](visuals/pipeline.jpg)
- Illustration of Skip-DiT and Skip-Cache for DiT visual generation caching. (a) The vanilla DiT block for image and video generation. (b) Skip-DiT modifies the vanilla DiT model with skip branches that connect shallow and deep DiT blocks. (c) Given a Skip-DiT with $L$ layers, during inference at step $t-1$, the first-layer output ${x'}^{t-1}_{0}$ and the cached $(L-1)$-th layer output ${x'}^{t}_{L-1}$ are forwarded through the skip branch to the final DiT block to produce the denoising output, without executing DiT blocks $2$ to $L-1$.
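To make the cached step concrete, below is a minimal PyTorch sketch of the idea: on a full step all $L$ blocks run and the output of block $L-1$ is cached; on a cached step only the first and last blocks execute, with the skip branch fusing the fresh shallow feature and the cached deep feature. The names (`SkipDiTSketch`, `skip_fuse`, `full_step`) and the fusion details are illustrative assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class SkipDiTSketch(nn.Module):
    # Minimal sketch of Skip-Cache inference; names and fusion details are
    # illustrative assumptions, not the repository's actual implementation.
    def __init__(self, blocks: nn.ModuleList, skip_fuse: nn.Module):
        super().__init__()
        self.blocks = blocks        # the L DiT blocks
        self.skip_fuse = skip_fuse  # skip-branch fusion, e.g. nn.Linear(2 * d, d)
        self.cache = None           # output of block L-1 from the last full step

    def forward(self, x: torch.Tensor, full_step: bool) -> torch.Tensor:
        h0 = self.blocks[0](x)  # block 1 always executes
        if full_step or self.cache is None:
            h = h0
            for blk in self.blocks[1:-1]:  # blocks 2 .. L-1
                h = blk(h)
            self.cache = h  # cache {x'}^t_{L-1} for later cached steps
        else:
            h = self.cache  # cached step: skip blocks 2 .. L-1 entirely
        # Skip branch: fuse the shallow feature with the deep feature
        # before the final block produces the denoising output.
        return self.blocks[-1](self.skip_fuse(torch.cat([h0, h], dim=-1)))
```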
 
### Feature Smoothness
![feature](visuals/feature.jpg)
@@ -24,23 +33,11 @@ Feature smoothness analysis of DiT in the class-to-video generation task using D
The pretrained text-to-image model of [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT) can be found on [Hugging Face](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2/tree/main/t2i/model) and [Tencent Cloud](https://dit.hunyuan.tencent.com/download/HunyuanDiT/model-v1_2.zip).

### Demo
- <div align="center">
- <img src="visuals/video-demo.gif" width="90%"></img>
- <br>
- <em>
- (Results of Latte with skip branches on text-to-video and class-to-video tasks. Left: text-to-video with 1.7x and 2.0x speedup. Right: class-to-video with 2.2x and 2.5x speedup. Latency is measured on one A100.)
- </em>
- </div>
- <br>
-
- <div align="center">
- <img src="visuals/image-demo.jpg" width="100%"></img>
- <br>
- <em>
- (Results of HunYuan-DiT with skip branches on the text-to-image task. Latency is measured on one A100.)
- </em>
- </div>
- <br>
 
### Acknowledgement
Skip-DiT has been greatly inspired by the following amazing works and teams: [DeepCache](https://arxiv.org/abs/2312.00858), [Latte](https://github.com/Vchitect/Latte), [DiT](https://github.com/facebookresearch/DiT), and [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT). We thank all the contributors for open-sourcing their work.
@@ -57,4 +54,4 @@ The code and model weights are licensed under [LICENSE](./class-to-image/LICENSE
#### Text-to-image
![text-to-image visualizations](visuals/case_t2i.jpg)
#### Class-to-image
- ![class-to-image visualizations](visuals/case_c2i.jpg)
 
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ base_model:
+ - maxin-cn/Latte-1
+ - facebook/DiT-XL-2-256
+ - Tencent-Hunyuan/HunyuanDiT
+ ---
# Accelerating Vision Diffusion Transformers with Skip Branches

+ This repository contains all the model checkpoints for the paper **[Accelerating Vision Diffusion Transformers with Skip Branches](https://arxiv.org/abs/2411.17616)**. In this work, we enhance standard DiT models by introducing **Skip-DiT**, which incorporates skip branches to improve feature smoothness. We also propose **Skip-Cache**, a method that leverages skip branches to cache DiT features across timesteps during inference. The effectiveness of our approach is validated on various DiT backbones for both video and image generation, demonstrating how skip branches preserve generation quality while achieving significant speedup. Experimental results show that **Skip-Cache** provides a 1.5x speedup with minimal computational cost and a 2.2x speedup with only a slight reduction in quantitative metrics. All code and checkpoints are publicly available on [Hugging Face](https://huggingface.co/GuanjieChen/Skip-DiT/tree/main) and [GitHub](https://github.com/OpenSparseLLMs/Skip-DiT.git). More visualizations can be found [here](#visualization).
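As a sketch of how caching across timesteps fits into a sampler, the loop below runs the full network only every few steps and reuses the cached deep feature otherwise. The interface (the `full_step` flag, the `step_fn` sampler update, and the interval value) is an assumption for illustration, not the repository's actual API; larger intervals trade a little quality for more speedup.

```python
import torch

@torch.no_grad()
def skip_cache_sampling(model, x, timesteps, step_fn, cache_interval=2):
    # model: a skip-branch DiT exposing a `full_step` flag, as sketched earlier
    # step_fn: any standard sampler update, e.g. a DDPM/DDIM step
    # cache_interval: how often to refresh the cache of deep features
    for i, t in enumerate(timesteps):
        full_step = (i % cache_interval == 0)  # refresh the cache periodically
        eps = model(x, full_step=full_step)    # cached steps skip blocks 2..L-1
        x = step_fn(x, eps, t)                 # denoising update
    return x
```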
 

### Pipeline
![pipeline](visuals/pipeline.jpg)
+ Illustration of Skip-DiT and Skip-Cache for DiT visual generation caching. (a) The vanilla DiT block for image and video generation. (b) Skip-DiT modifies the vanilla DiT model using skip branches to connect shallow and deep DiT blocks. (c) Pipeline of Skip-Cache.

### Feature Smoothness
![feature](visuals/feature.jpg)
 
The pretrained text-to-image model of [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT) can be found on [Hugging Face](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2/tree/main/t2i/model) and [Tencent Cloud](https://dit.hunyuan.tencent.com/download/HunyuanDiT/model-v1_2.zip).
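For convenience, one way to fetch just those weights with `huggingface_hub` is sketched below; the target directory and file pattern are illustrative, not prescribed by the repository.

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Download only the t2i/model weights from the repository linked above.
local_dir = snapshot_download(
    repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
    allow_patterns=["t2i/model/*"],
    local_dir="./ckpts/HunyuanDiT-v1.2",  # example path
)
print(f"Checkpoints downloaded to {local_dir}")
```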

### Demo
+ ![demo1](visuals/video-demo.gif)
+ (Results of Latte with skip branches on text-to-video and class-to-video tasks. Left: text-to-video with 1.7x and 2.0x speedup. Right: class-to-video with 2.2x and 2.5x speedup. Latency is measured on one A100.)
+
+ ![demo2](visuals/image-demo.jpg)
+ (Results of HunYuan-DiT with skip branches on the text-to-image task. Latency is measured on one A100.)
 
### Acknowledgement
Skip-DiT has been greatly inspired by the following amazing works and teams: [DeepCache](https://arxiv.org/abs/2312.00858), [Latte](https://github.com/Vchitect/Latte), [DiT](https://github.com/facebookresearch/DiT), and [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT). We thank all the contributors for open-sourcing their work.
 
#### Text-to-image
![text-to-image visualizations](visuals/case_t2i.jpg)
#### Class-to-image
+ ![class-to-image visualizations](visuals/case_c2i.jpg)