Add library name and pipeline tag
Adds a pipeline tag and ensures the model can be found at https://huggingface.co/models?pipeline_tag=unconditional-image-generation.
Adds the library_name.
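
As a rough way to confirm that the metadata is in place, the sketch below extracts the two keys from a README's YAML front matter. The temporary sample file and the variable names are illustrative assumptions; only the two key/value pairs come from this change.

```shell
# Write a sample README that starts with the front matter added by this change.
readme=$(mktemp)
cat > "$readme" <<'EOF'
---
library_name: diffusers
pipeline_tag: unconditional-image-generation
---

# Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
EOF

# Keep only the lines between the leading pair of '---' delimiters.
front_matter=$(awk 'NR==1 && $0=="---" {inside=1; next} inside && $0=="---" {exit} inside {print}' "$readme")

# Pull out the value of each key, dropping the key name and leading spaces.
library_name=$(printf '%s\n' "$front_matter" | sed -n 's/^library_name:[[:space:]]*//p')
pipeline_tag=$(printf '%s\n' "$front_matter" | sed -n 's/^pipeline_tag:[[:space:]]*//p')

echo "library_name=$library_name"
echo "pipeline_tag=$pipeline_tag"
rm -f "$readme"
```

Pointing `readme` at a real `README.md` instead of the temporary file would run the same check against the repository copy.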
README.md
CHANGED
|
@@ -1,3 +1,8 @@
| 1 |
# Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression <br><sub>Official PyTorch Implementation</sub>
|
| 2 |
|
| 3 |
[](https://arxiv.org/pdf/2506.09482)
|
|
@@ -22,10 +27,10 @@ This is a PyTorch/GPU implementation of the paper [Marrying Autoregressive Trans
|
|
| 22 |
|
| 23 |
This repo contains:
|
| 24 |
|
| 25 |
-
*
|
| 26 |
-
*
|
| 27 |
-
*
|
| 28 |
-
*
|
| 29 |
|
| 30 |
## Preparation
|
| 31 |
|
|
@@ -71,10 +76,10 @@ Given that our data augmentation consists of simple center cropping and random f
|
|
| 71 |
the VAE latents can be pre-computed and saved to `CACHED_PATH` to save computations during TransDiff training:
|
| 72 |
|
| 73 |
```
|
| 74 |
-
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0
|
| 75 |
-
main_cache.py
|
| 76 |
-
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16
|
| 77 |
-
--batch_size 128
|
| 78 |
--data_path ${IMAGENET_PATH} --cached_path ${CACHED_PATH}
|
| 79 |
```
|
| 80 |
|
|
@@ -86,13 +91,13 @@ Run our interactive visualization [demo](demo.ipynb).
|
|
| 86 |
### Training
|
| 87 |
Script for the TransDiff-L 1StepAR setting (Pretrain TransDiff-L with a width of 1024 channels, 800 epochs):
|
| 88 |
```
|
| 89 |
-
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}
|
| 90 |
-
main.py
|
| 91 |
-
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 92 |
-
--model transdiff_large --diffloss_w 1024
|
| 93 |
-
--diffusion_batch_mul 4
|
| 94 |
-
--epochs 800 --warmup_epochs 100 --blr 1.0e-4 --batch_size 32
|
| 95 |
-
--output_dir ${OUTPUT_DIR} --resume ${OUTPUT_DIR}
|
| 96 |
--data_path ${IMAGENET_PATH}
|
| 97 |
```
|
| 98 |
- Training time is ~115h on 64 A100 GPUs with `--batch_size 32`.
|
|
@@ -103,25 +108,25 @@ main.py \
|
|
| 103 |
|
| 104 |
Script for the TransDiff-L MRAR setting (Finetune TransDiff-L MRAR with a width of 1024 channels, 40 epochs):
|
| 105 |
```
|
| 106 |
-
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}
|
| 107 |
-
main.py
|
| 108 |
-
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 109 |
-
--model transdiff_large --diffloss_w 1024 --mrar --bf16
|
| 110 |
-
--diffusion_batch_mul 2
|
| 111 |
-
--epochs 40 --warmup_epochs 10 --lr 5.0e-5 --batch_size 16 --gradient_accumulation_steps 2
|
| 112 |
-
--output_dir ${OUTPUT_DIR} --resume ${Transdiff-L_1StepAR_DIR}
|
| 113 |
--data_path ${IMAGENET_PATH}
|
| 114 |
```
|
| 115 |
Script for the TransDiff-L 512x512 setting (Finetune TransDiff-L 512x512 with a width of 1024 channels, 150 epochs):
|
| 116 |
```
|
| 117 |
-
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}
|
| 118 |
-
main.py
|
| 119 |
-
--img_size 512 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 120 |
-
--model transdiff_large --diffloss_w 1024 --ema_rate 0.999 --bf16
|
| 121 |
-
--diffusion_batch_mul 4
|
| 122 |
-
--epochs 150 --warmup_epochs 10 --lr 1.0e-4 --batch_size 16 --gradient_accumulation_steps 2
|
| 123 |
-
--only_train_diff
|
| 124 |
-
--output_dir ${OUTPUT_DIR} --resume ${Transdiff-L_1StepAR_DIR}
|
| 125 |
--data_path ${IMAGENET_PATH}
|
| 126 |
```
|
| 127 |
|
|
@@ -129,34 +134,34 @@ main.py \
|
|
| 129 |
|
| 130 |
Evaluate TransDiff-L 1StepAR with classifier-free guidance:
|
| 131 |
```
|
| 132 |
-
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0
|
| 133 |
-
main.py
|
| 134 |
-
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 135 |
-
--model transdiff_large --diffloss_w 1024
|
| 136 |
-
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l/
|
| 137 |
-
--evaluate --eval_bsz 256 --num_images 50000
|
| 138 |
--cfg 1.3 --scale_0 0.89 --scale_1 0.95
|
| 139 |
```
|
| 140 |
|
| 141 |
Evaluate TransDiff-L MRAR with classifier-free guidance:
|
| 142 |
```
|
| 143 |
-
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0
|
| 144 |
-
main.py
|
| 145 |
-
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 146 |
-
--model transdiff_large --diffloss_w 1024
|
| 147 |
-
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l_mrar/
|
| 148 |
-
--evaluate --eval_bsz 256 --num_images 50000
|
| 149 |
--cfg 1.3 --scale_0 0.91 --scale_1 0.93
|
| 150 |
```
|
| 151 |
|
| 152 |
Evaluate TransDiff-L 512x512 with classifier-free guidance:
|
| 153 |
```
|
| 154 |
-
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0
|
| 155 |
-
main.py
|
| 156 |
-
--img_size 512 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1
|
| 157 |
-
--model transdiff_large --diffloss_w 1024
|
| 158 |
-
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l_512/
|
| 159 |
-
--evaluate --eval_bsz 64 --num_images 50000
|
| 160 |
--cfg 1.3 --scale_0 0.87 --scale_1 0.87
|
| 161 |
```
|
| 162 |
|
|
|
|
| 1 |
+
---
|
| 2 |
+
library_name: diffusers
|
| 3 |
+
pipeline_tag: unconditional-image-generation
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression <br><sub>Official PyTorch Implementation</sub>
|
| 7 |
|
| 8 |
[](https://arxiv.org/pdf/2506.09482)
|
|
|
|
| 27 |
|
| 28 |
This repo contains:
|
| 29 |
|
| 30 |
+
* 🪐 A simple PyTorch implementation of [TransDiff Model](models/transdiff.py) and [TransDiff Model with MRAR](models/transdiff_mrar.py)
|
| 31 |
+
* ⚡️ Pre-trained class-conditional TransDiff models trained on ImageNet 256x256 and 512x512
|
| 32 |
+
* 💥 A self-contained [notebook](demo.ipynb) for running various pre-trained TransDiff models
|
| 33 |
+
* 🛸 A TransDiff [training and evaluation script](main.py) using PyTorch DDP
|
| 34 |
|
| 35 |
## Preparation
|
| 36 |
|
|
|
|
| 76 |
the VAE latents can be pre-computed and saved to `CACHED_PATH` to save computations during TransDiff training:
|
| 77 |
|
| 78 |
```
|
| 79 |
+
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
|
| 80 |
+
main_cache.py \
|
| 81 |
+
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 \
|
| 82 |
+
--batch_size 128 \
|
| 83 |
--data_path ${IMAGENET_PATH} --cached_path ${CACHED_PATH}
|
| 84 |
```
|
| 85 |
|
|
|
|
| 91 |
### Training
|
| 92 |
Script for the TransDiff-L 1StepAR setting (Pretrain TransDiff-L with a width of 1024 channels, 800 epochs):
|
| 93 |
```
|
| 94 |
+
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
|
| 95 |
+
main.py \
|
| 96 |
+
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 97 |
+
--model transdiff_large --diffloss_w 1024 \
|
| 98 |
+
--diffusion_batch_mul 4 \
|
| 99 |
+
--epochs 800 --warmup_epochs 100 --blr 1.0e-4 --batch_size 32 \
|
| 100 |
+
--output_dir ${OUTPUT_DIR} --resume ${OUTPUT_DIR} \
|
| 101 |
--data_path ${IMAGENET_PATH}
|
| 102 |
```
|
| 103 |
- Training time is ~115h on 64 A100 GPUs with `--batch_size 32`.
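
The figures in the pre-training command can be cross-checked with a little arithmetic. The sketch below derives the GPU count, global batch size, and total GPU-hours from the flags and the stated ~115 h. Two points are assumptions that this README does not confirm: that `--batch_size` is a per-GPU value, and that the script scales `--blr` by global batch size / 256, as MAE-style training scripts commonly do.

```shell
# Values taken from the pre-training command and the timing note.
nproc_per_node=8
nnodes=8
batch_size=32        # --batch_size, ASSUMED to be per GPU
train_hours=115      # approximate wall-clock time from the note

gpus=$((nproc_per_node * nnodes))
global_batch=$((gpus * batch_size))
gpu_hours=$((train_hours * gpus))

# ASSUMPTION (not stated in the README): lr = blr * global_batch / 256.
blr=1.0e-4
lr=$(awk -v blr="$blr" -v gb="$global_batch" 'BEGIN { printf "%.1e", blr * gb / 256 }')

echo "GPUs: $gpus"
echo "global batch: $global_batch"
echo "GPU-hours: ~$gpu_hours"
echo "lr under the assumed scaling rule: $lr"
```

The GPU count of 8 x 8 = 64 agrees with the "64 A100 GPUs" in the timing note; the learning-rate line is only as reliable as the assumed scaling rule.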
|
|
|
|
| 108 |
|
| 109 |
Script for the TransDiff-L MRAR setting (Finetune TransDiff-L MRAR with a width of 1024 channels, 40 epochs):
|
| 110 |
```
|
| 111 |
+
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
|
| 112 |
+
main.py \
|
| 113 |
+
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 114 |
+
--model transdiff_large --diffloss_w 1024 --mrar --bf16 \
|
| 115 |
+
--diffusion_batch_mul 2 \
|
| 116 |
+
--epochs 40 --warmup_epochs 10 --lr 5.0e-5 --batch_size 16 --gradient_accumulation_steps 2 \
|
| 117 |
+
--output_dir ${OUTPUT_DIR} --resume ${Transdiff-L_1StepAR_DIR} \
|
| 118 |
--data_path ${IMAGENET_PATH}
|
| 119 |
```
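
One detail of the fine-tuning command is worth checking before copying it into a POSIX shell: a variable name cannot contain a hyphen, so `${Transdiff-L_1StepAR_DIR}` is read as the `${parameter-word}` default-value expansion rather than as a single variable. The sketch below shows the effect; `TRANSDIFF_L_1STEPAR_DIR` and its value are hypothetical stand-ins, not names used by the repository.

```shell
# With no variable named Transdiff set, the expansion falls back to the
# literal text after the hyphen instead of a checkpoint directory.
unset Transdiff
resume_as_written="${Transdiff-L_1StepAR_DIR}"
echo "as written: $resume_as_written"

# Hypothetical replacement name using only letters, digits, and underscores.
TRANSDIFF_L_1STEPAR_DIR="ckpt/transdiff_l"   # example value only
resume_fixed="${TRANSDIFF_L_1STEPAR_DIR}"
echo "fixed:      $resume_fixed"
```

Substituting the path of the pre-trained 1StepAR checkpoint directly, or exporting it under a hyphen-free name, avoids the silent fallback.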
|
| 120 |
Script for the TransDiff-L 512x512 setting (Finetune TransDiff-L 512x512 with a width of 1024 channels, 150 epochs):
|
| 121 |
```
|
| 122 |
+
torchrun --nproc_per_node=8 --nnodes=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
|
| 123 |
+
main.py \
|
| 124 |
+
--img_size 512 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 125 |
+
--model transdiff_large --diffloss_w 1024 --ema_rate 0.999 --bf16 \
|
| 126 |
+
--diffusion_batch_mul 4 \
|
| 127 |
+
--epochs 150 --warmup_epochs 10 --lr 1.0e-4 --batch_size 16 --gradient_accumulation_steps 2 \
|
| 128 |
+
--only_train_diff \
|
| 129 |
+
--output_dir ${OUTPUT_DIR} --resume ${Transdiff-L_1StepAR_DIR} \
|
| 130 |
--data_path ${IMAGENET_PATH}
|
| 131 |
```
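
Both fine-tuning commands halve the batch size relative to pre-training and add `--gradient_accumulation_steps 2`. A minimal sketch of the resulting effective batch size follows, assuming `--batch_size` is per GPU and that accumulated steps contribute to a single optimizer update (neither is confirmed by this README).

```shell
# Flags shared by the MRAR and 512x512 fine-tuning commands.
gpus=$((8 * 8))                  # --nproc_per_node x --nnodes
batch_size=16                    # --batch_size, ASSUMED to be per GPU
gradient_accumulation_steps=2    # --gradient_accumulation_steps

samples_per_pass=$((gpus * batch_size))
effective_batch=$((samples_per_pass * gradient_accumulation_steps))

echo "samples per forward/backward pass: $samples_per_pass"
echo "effective batch per optimizer update: $effective_batch"
```

Under these assumptions the effective batch matches the 64 x 32 used for pre-training, while each pass holds half as many samples in memory.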
|
| 132 |
|
|
|
|
| 134 |
|
| 135 |
Evaluate TransDiff-L 1StepAR with classifier-free guidance:
|
| 136 |
```
|
| 137 |
+
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
|
| 138 |
+
main.py \
|
| 139 |
+
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 140 |
+
--model transdiff_large --diffloss_w 1024 \
|
| 141 |
+
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l/ \
|
| 142 |
+
--evaluate --eval_bsz 256 --num_images 50000 \
|
| 143 |
--cfg 1.3 --scale_0 0.89 --scale_1 0.95
|
| 144 |
```
|
| 145 |
|
| 146 |
Evaluate TransDiff-L MRAR with classifier-free guidance:
|
| 147 |
```
|
| 148 |
+
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
|
| 149 |
+
main.py \
|
| 150 |
+
--img_size 256 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 151 |
+
--model transdiff_large --diffloss_w 1024 \
|
| 152 |
+
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l_mrar/ \
|
| 153 |
+
--evaluate --eval_bsz 256 --num_images 50000 \
|
| 154 |
--cfg 1.3 --scale_0 0.91 --scale_1 0.93
|
| 155 |
```
|
| 156 |
|
| 157 |
Evaluate TransDiff-L 512x512 with classifier-free guidance:
|
| 158 |
```
|
| 159 |
+
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
|
| 160 |
+
main.py \
|
| 161 |
+
--img_size 512 --vae_path ckpt/vae/kl16.ckpt --vae_embed_dim 16 --patch_size 1 \
|
| 162 |
+
--model transdiff_large --diffloss_w 1024 \
|
| 163 |
+
--output_dir ${OUTPUT_DIR} --resume ckpt/transdiff_l_512/ \
|
| 164 |
+
--evaluate --eval_bsz 64 --num_images 50000 \
|
| 165 |
--cfg 1.3 --scale_0 0.87 --scale_1 0.87
|
| 166 |
```
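
To get a feel for how long an evaluation run is, the sketch below estimates the number of generation steps per GPU for the three commands. It assumes each of the 8 GPUs generates `--eval_bsz` images per step until `--num_images` is reached; the actual loop in `main.py` may divide the work differently.

```shell
# ceil_div A B: integer division rounded up.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

gpus=8            # --nproc_per_node=8 with --nnodes=1
num_images=50000  # --num_images

# ASSUMPTION: every GPU produces --eval_bsz images per generation step.
steps_256=$(ceil_div "$num_images" $((gpus * 256)))   # 1StepAR and MRAR, --eval_bsz 256
steps_512=$(ceil_div "$num_images" $((gpus * 64)))    # 512x512, --eval_bsz 64

echo "256x256 settings: $steps_256 steps per GPU"
echo "512x512 setting:  $steps_512 steps per GPU"
```

The smaller `--eval_bsz` for 512x512 images roughly quadruples the number of steps, on top of the higher per-image cost.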
|
| 167 |
|