|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- Lykon/DreamShaper |
|
pipeline_tag: text-to-image |
|
library_name: diffusers |
|
tags: |
|
- lora |
|
--- |
|
# TDM: Learning Few-Step Diffusion Models by Trajectory Distribution Matching |
|
<div style="text-align: center;"> |
|
<a href="https://tdm-t2x.github.io/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github-Page&color=blue&logo=github-pages" style="display: inline;"></a>   |
|
<a href="https://arxiv.org/abs/2503.06674"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:TDM&color=red&logo=arxiv" style="display: inline;"></a> |
|
</div> |
|
|
|
This is the Official Repository of "[Learning Few-Step Diffusion Models by Trajectory Distribution Matching](https://arxiv.org/abs/2503.06674)", by *Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang*. |
|
|
|
|
|
## User Study Time! |
|
 |
|
Which one do you think is better? Some of the images are generated by PixArt-α (50 NFE), and the others by **TDM (4 NFE)**, distilled from PixArt-α in a data-free way with only 500 training iterations and 2 A800 GPU hours.
|
|
|
<details> |
|
|
|
<summary style="color: #1E88E5; cursor: pointer; font-size: 1.2em;"> Click for answer</summary> |
|
|
|
<p style="font-size: 1.2em; margin-top: 8px;">TDM's positions (left to right): bottom, bottom, top, bottom, top.</p>
|
|
|
</details> |
|
|
|
## Fast Text-to-Video Generation
|
|
|
Our proposed TDM can be easily extended to text-to-video. |
|
|
|
<p align="center"> |
|
<img src="teacher.gif" alt="Teacher" width="100%"> |
|
<img src="student.gif" alt="Student" width="100%"> |
|
</p> |
|
|
|
The upper video was generated by CogVideoX-2B (100 NFE). In the same amount of time, **TDM (4 NFE)** can generate 25 videos, as shown in the lower video, achieving an impressive **25× speedup without performance degradation**. (Note: the noise in the GIFs is due to compression.)
|
|
|
## Usage |
|
### TDM-SD3-LoRA |
|
```python |
|
import torch |
|
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny, DPMSolverMultistepScheduler |
|
from huggingface_hub import hf_hub_download |
|
from safetensors.torch import load_file |
|
from diffusers.utils import make_image_grid |
|
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16).to("cuda") |
|
pipe.load_lora_weights('Luo-Yihong/TDM_sd3_lora', adapter_name = 'tdm') # Load TDM-LoRA |
|
pipe.set_adapters(["tdm"], [0.125])# IMPORTANT. Please set LoRA scale to 0.125. |
|
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16) # Save GPU memory. |
|
pipe.vae.config.shift_factor = 0.0 |
|
pipe = pipe.to("cuda") |
|
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler") |
|
pipe.scheduler.config['flow_shift'] = 6  # flow_shift can be set anywhere from 1 to 6.
|
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) |
|
generator = torch.manual_seed(8888) |
|
image = pipe( |
|
prompt="A cute panda holding a sign says TDM SOTA!", |
|
negative_prompt="", |
|
num_inference_steps=4, |
|
height=1024, |
|
width=1024, |
|
num_images_per_prompt = 1, |
|
guidance_scale=1., |
|
generator = generator, |
|
).images[0] |
|
|
|
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler") |
|
pipe.set_adapters(["tdm"], [0.]) # Unload lora |
|
generator = torch.manual_seed(8888) |
|
teacher_image = pipe( |
|
prompt="A cute panda holding a sign says TDM SOTA!", |
|
negative_prompt="", |
|
num_inference_steps=28, |
|
height=1024, |
|
width=1024, |
|
num_images_per_prompt = 1, |
|
guidance_scale=7., |
|
generator = generator, |
|
).images[0] |
|
make_image_grid([image, teacher_image], rows=1, cols=2)
|
``` |
|
 |
|
The sample generated by SD3 with 56 NFE is on the right, and the sample generated by **TDM** with 4 NFE is on the left. Which one do you think is better?
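
Since the comment in the block above notes that `flow_shift` can be set anywhere from 1 to 6, a quick way to see its effect is to sweep a few values and compare the 4-step samples. The snippet below is a minimal sketch that reuses `pipe` and the imports from the block above; the specific values swept (1, 3, 6) are arbitrary choices for illustration, not recommendations from the paper.

```python
# Illustrative flow_shift sweep (not part of the official example).
# Re-enable the TDM LoRA and regenerate the 4-step sample for a few flow_shift values.
images = []
pipe.set_adapters(["tdm"], [0.125])  # make sure the TDM LoRA is active again
for flow_shift in [1, 3, 6]:  # arbitrary points in the documented 1-6 range
    scheduler_config = dict(pipe.scheduler.config)  # plain copy so we can edit it
    scheduler_config["flow_shift"] = flow_shift
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(scheduler_config)
    generator = torch.manual_seed(8888)
    images.append(pipe(
        prompt="A cute panda holding a sign says TDM SOTA!",
        negative_prompt="",
        num_inference_steps=4,
        height=1024,
        width=1024,
        guidance_scale=1.,
        generator=generator,
    ).images[0])
make_image_grid(images, rows=1, cols=3)
```

The resulting grid makes it easy to pick the value that looks best for your prompt.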
|
|
|
### TDM-Dreamshaper-v7-LoRA |
|
```python |
|
import torch |
|
from diffusers import DiffusionPipeline, UNet2DConditionModel, DPMSolverMultistepScheduler |
|
from huggingface_hub import hf_hub_download |
|
from safetensors.torch import load_file |
|
repo_name = "Luo-Yihong/TDM_dreamshaper_LoRA" |
|
ckpt_name = "tdm_dreamshaper.pt" |
|
pipe = DiffusionPipeline.from_pretrained('lykon/dreamshaper-7', torch_dtype=torch.float16).to("cuda") |
|
pipe.load_lora_weights(hf_hub_download(repo_name, ckpt_name)) |
|
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler") |
|
generator = torch.manual_seed(317) |
|
image = pipe( |
|
prompt="A close-up photo of an Asian lady with sunglasses", |
|
negative_prompt="", |
|
num_inference_steps=4, |
|
num_images_per_prompt = 1, |
|
generator = generator, |
|
guidance_scale=1., |
|
).images[0] |
|
image |
|
``` |
|
 |
|
|
|
### TDM-CogVideoX-2B-LoRA
|
```python |
|
import torch |
|
from diffusers import CogVideoXPipeline |
|
from diffusers.utils import export_to_video |
|
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16) |
|
pipe.vae.enable_slicing() # Save memory |
|
pipe.vae.enable_tiling() # Save memory |
|
pipe.load_lora_weights("Luo-Yihong/TDM_CogVideoX-2B_LoRA") |
|
pipe.to("cuda") |
|
prompt = ( |
|
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The " |
|
"panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other " |
|
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, " |
|
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. " |
|
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical " |
|
"atmosphere of this unique musical performance" |
|
) |
|
# We train the generator on timesteps [999, 856, 665, 399]. |
|
# The official CogVideoX scheduler uses uniform timestep spacing, which may cause inferior results.
|
# But TDM-LoRA still works well under 4 NFE. |
|
# We will update the TDM-CogVideoX-LoRA soon for better performance! |
|
generator = torch.manual_seed(8888) |
|
frames = pipe(prompt, guidance_scale=1, |
|
num_inference_steps=4, |
|
num_frames=49, |
|
generator = generator, |
|
use_dynamic_cfg=True).frames[0] |
|
export_to_video(frames, "output-TDM.mp4", fps=8) |
|
``` |
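
If you want to reproduce the 100-NFE teacher baseline mentioned in the text-to-video section for a timing or quality comparison, one option is the sketch below. It loads a fresh CogVideoX-2B pipeline without the TDM LoRA and reuses the imports and `prompt` from the block above; the 50-step, CFG-6 configuration (roughly 100 NFE with classifier-free guidance) is the commonly used CogVideoX default and is an assumption here, not a setting published in this repository.

```python
# Hypothetical teacher baseline (assumed defaults, not from the official example).
# Fresh CogVideoX-2B pipeline without the TDM LoRA: 50 steps + CFG ~= 100 NFE.
teacher_pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
teacher_pipe.vae.enable_slicing()  # Save memory
teacher_pipe.vae.enable_tiling()   # Save memory
teacher_pipe.to("cuda")

generator = torch.manual_seed(8888)
teacher_frames = teacher_pipe(
    prompt,                   # reuse the prompt defined in the block above
    guidance_scale=6,         # assumed teacher CFG scale
    num_inference_steps=50,   # assumed teacher step count
    num_frames=49,
    generator=generator,
).frames[0]
export_to_video(teacher_frames, "output-teacher.mp4", fps=8)
```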
|
## 🔥 Pre-trained Models |
|
We release a collection of TDM-LoRAs. Please enjoy!
|
- [TDM-SD3-LoRA](https://huggingface.co/Luo-Yihong/TDM_sd3_lora) |
|
- [TDM-CogVideoX-2B-LoRA](https://huggingface.co/Luo-Yihong/TDM_CogVideoX-2B_LoRA) |
|
- [TDM-Dreamshaper-LoRA](https://huggingface.co/Luo-Yihong/TDM_dreamshaper_LoRA) |
|
|
|
|
|
## Contact |
|
|
|
Please contact Yihong Luo ([email protected]) if you have any questions about this work. |
|
|
|
## Bibtex |
|
|
|
``` |
|
@misc{luo2025tdm, |
|
title={Learning Few-Step Diffusion Models by Trajectory Distribution Matching}, |
|
author={Yihong Luo and Tianyang Hu and Jiacheng Sun and Yujun Cai and Jing Tang}, |
|
year={2025}, |
|
eprint={2503.06674}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2503.06674}, |
|
} |
|
``` |