---
license: apache-2.0
base_model:
- Lykon/DreamShaper
pipeline_tag: text-to-image
library_name: diffusers
tags:
- lora
---
# TDM: Learning Few-Step Diffusion Models by Trajectory Distribution Matching
<div style="text-align: center;">
<a href="https://tdm-t2x.github.io/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github-Page&color=blue&logo=github-pages" style="display: inline;"></a>  
<a href="https://arxiv.org/abs/2503.06674"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:TDM&color=red&logo=arxiv" style="display: inline;"></a>
</div>
This is the Official Repository of "[Learning Few-Step Diffusion Models by Trajectory Distribution Matching](https://arxiv.org/abs/2503.06674)", by *Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang*.
## User Study Time!

Which one do you think is better? Some of the images are generated by PixArt-α (50 NFE), and the others by **TDM (4 NFE)**, distilled from PixArt-α in a data-free way with merely 500 training iterations and 2 A800 GPU-hours.
<details>
<summary style="color: #1E88E5; cursor: pointer; font-size: 1.2em;"> Click for answer</summary>
<p style="font-size: 1.2em; margin-top: 8px;">TDM's position in each pair (left to right): bottom, bottom, top, bottom, top.</p>
</details>
## Fast Text-to-Video Generation
Our proposed TDM can be easily extended to text-to-video.
<p align="center">
<img src="teacher.gif" alt="Teacher" width="100%">
<img src="student.gif" alt="Student" width="100%">
</p>
The video above was generated by CogVideoX-2B (100 NFE). In the same wall-clock time, **TDM (4 NFE)** can generate 25 videos, as shown below, achieving an impressive **25× speedup without performance degradation**. (Note: the noise in the GIFs is due to compression.)
## Usage
### TDM-SD3-LoRA
```python
import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny, DPMSolverMultistepScheduler
from diffusers.utils import make_image_grid

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("Luo-Yihong/TDM_sd3_lora", adapter_name="tdm")  # Load TDM-LoRA.
pipe.set_adapters(["tdm"], [0.125])  # IMPORTANT: please set the LoRA scale to 0.125.
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)  # Save GPU memory.
pipe.vae.config.shift_factor = 0.0
pipe = pipe.to("cuda")

# Load a DPM-Solver scheduler config that supports flow_shift (borrowed from Sana).
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler"
)
pipe.scheduler.config["flow_shift"] = 6  # flow_shift can be varied from 1 to 6.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# TDM student: 4 steps, no CFG.
generator = torch.manual_seed(8888)
image = pipe(
    prompt="A cute panda holding a sign says TDM SOTA!",
    negative_prompt="",
    num_inference_steps=4,
    height=1024,
    width=1024,
    num_images_per_prompt=1,
    guidance_scale=1.0,
    generator=generator,
).images[0]

# Teacher baseline: reset the scheduler, disable the LoRA, and sample with CFG.
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler"
)
pipe.set_adapters(["tdm"], [0.0])  # Unload the LoRA.
generator = torch.manual_seed(8888)
teacher_image = pipe(
    prompt="A cute panda holding a sign says TDM SOTA!",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    num_images_per_prompt=1,
    guidance_scale=7.0,
    generator=generator,
).images[0]

make_image_grid([image, teacher_image], 1, 2)
```

The sample generated by SD3 with 56 NFE is on the right, and the sample generated by **TDM** with 4 NFE is on the left. Which one do you think is better?
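
The grid returned by `make_image_grid` is an ordinary PIL image, so the comparison can be saved directly; the filename below is just an example:

```python
# Continues from the snippet above; the output filename is arbitrary.
grid = make_image_grid([image, teacher_image], rows=1, cols=2)
grid.save("tdm_vs_sd3_comparison.png")
```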
### TDM-Dreamshaper-v7-LoRA
```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from huggingface_hub import hf_hub_download

repo_name = "Luo-Yihong/TDM_dreamshaper_LoRA"
ckpt_name = "tdm_dreamshaper.pt"

pipe = DiffusionPipeline.from_pretrained("lykon/dreamshaper-7", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights(hf_hub_download(repo_name, ckpt_name))  # Load TDM-LoRA.
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

# TDM student: 4 steps, no CFG.
generator = torch.manual_seed(317)
image = pipe(
    prompt="A close-up photo of an Asian lady with sunglasses",
    negative_prompt="",
    num_inference_steps=4,
    num_images_per_prompt=1,
    generator=generator,
    guidance_scale=1.0,
).images[0]
image
```

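As with the SD3 example above, you can compare against the teacher by disabling the LoRA. The sketch below is not part of the original snippet: it assumes diffusers' standard `unload_lora_weights()` helper and a typical 25-step, CFG-7.5 DreamShaper setting, and continues from the code above.

```python
from diffusers.utils import make_image_grid

# Hypothetical teacher baseline (illustrative settings, not from the original card):
# drop the TDM-LoRA and sample DreamShaper-v7 itself with a standard multi-step CFG setup.
pipe.unload_lora_weights()
generator = torch.manual_seed(317)
teacher_image = pipe(
    prompt="A close-up photo of an Asian lady with sunglasses",
    negative_prompt="",
    num_inference_steps=25,
    guidance_scale=7.5,
    generator=generator,
).images[0]
make_image_grid([image, teacher_image], rows=1, cols=2)
```
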
### TDM-CogVideoX-2B-LoRA
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.vae.enable_slicing()  # Save memory.
pipe.vae.enable_tiling()   # Save memory.
pipe.load_lora_weights("Luo-Yihong/TDM_CogVideoX-2B_LoRA")  # Load TDM-LoRA.
pipe.to("cuda")

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The "
    "panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance"
)

# The generator was trained on timesteps [999, 856, 665, 399].
# The official CogVideoX scheduler uses uniform timestep spacing, which may give slightly inferior results,
# but TDM-LoRA still works well under 4 NFE.
# We will update TDM-CogVideoX-LoRA soon for better performance!
generator = torch.manual_seed(8888)
frames = pipe(
    prompt,
    guidance_scale=1,
    num_inference_steps=4,
    num_frames=49,
    generator=generator,
    use_dynamic_cfg=True,
).frames[0]
export_to_video(frames, "output-TDM.mp4", fps=8)
```
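
If GPU memory is tight, diffusers' generic offloading helpers can be used in place of the `pipe.to("cuda")` call above. This is standard diffusers usage rather than anything TDM-specific, so treat it as an optional tweak:

```python
# Optional, generic diffusers memory savers (slower, but much lower peak VRAM).
# Use these instead of pipe.to("cuda").
pipe.enable_model_cpu_offload()         # Keep each sub-model on GPU only while it runs.
# pipe.enable_sequential_cpu_offload()  # Even lower memory, at a larger speed cost.
```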
## 🔥 Pre-trained Models
We release a collection of TDM-LoRAs. Please enjoy them!
- [TDM-SD3-LoRA](https://huggingface.co/Luo-Yihong/TDM_sd3_lora)
- [TDM-CogVideoX-2B-LoRA](https://huggingface.co/Luo-Yihong/TDM_CogVideoX-2B_LoRA)
- [TDM-Dreamshaper-LoRA](https://huggingface.co/Luo-Yihong/TDM_dreamshaper_LoRA)
## Contact
Please contact Yihong Luo ([email protected]) if you have any questions about this work.
## Bibtex
```bibtex
@misc{luo2025tdm,
      title={Learning Few-Step Diffusion Models by Trajectory Distribution Matching},
      author={Yihong Luo and Tianyang Hu and Jiacheng Sun and Yujun Cai and Jing Tang},
      year={2025},
      eprint={2503.06674},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06674},
}
```