Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
Abstract
TransDiff, combining an Autoregressive Transformer and diffusion models, achieves superior image generation performance and speed, while Multi-Reference Autoregression further enhances its quality and diversity.
We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference latency compared to state-of-the-art methods based on AR Transformer and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
Community
TransDiff: The Simplest AR Transformer + Diffusion Image Generation Method
Hello, I’m excited to introduce our new work—Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression, which we’ll refer to as TransDiff.
TL;DR
First of all, TransDiff is currently the simplest method for combining an AR Transformer with Diffusion for image generation. TransDiff uses an AR Transformer to encode discrete inputs (such as class labels and text) and continuous inputs (such as images) into high-level image semantic representations, and then decodes those representations into images with a smaller Diffusion Decoder.
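To make the pipeline concrete, here is a minimal sketch of the two stages in PyTorch. The module names, sizes, and interfaces below are our own illustration, not the actual TransDiff implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the TransDiff pipeline, assuming a 256-dim semantic space and a tiny
# decoder stub; module names and sizes are illustrative, not the actual implementation.

class SemanticEncoder(nn.Module):
    """AR Transformer: encodes class labels (and optional reference image features)
    into high-level semantic representations."""
    def __init__(self, num_classes=1000, dim=256, layers=12):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, dim)
        self.image_proj = nn.Linear(dim, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, labels, ref_feats=None):
        tokens = self.label_embed(labels).unsqueeze(1)               # [B, 1, dim] discrete input
        if ref_feats is not None:                                    # continuous input (e.g. prior images)
            tokens = torch.cat([tokens, self.image_proj(ref_feats)], dim=1)
        return self.backbone(tokens)                                 # [B, T, dim] semantic features


class TinyDiffusionDecoder(nn.Module):
    """Stand-in for the smaller diffusion decoder that maps semantics to image latents."""
    def __init__(self, dim=256, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + latent_dim + 1, 512), nn.SiLU(),
                                 nn.Linear(512, latent_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))           # predicted velocity


encoder, decoder = SemanticEncoder(), TinyDiffusionDecoder()
labels = torch.randint(0, 1000, (4,))
semantic = encoder(labels).mean(dim=1)                               # pool to one conditioning vector
v = decoder(torch.randn(4, 256), torch.rand(4, 1), semantic)         # one denoising step of the sketch
```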
Additionally, we propose a new autoregressive paradigm, MRAR (Multi-Reference Autoregression). This paradigm is similar to In-context Learning (ICL) in NLP: by conditioning on previous images of the same category, the model generates better and more diverse images. The only difference is that the previous images are generated by the model itself rather than provided as examples.
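Conceptually, the MRAR loop looks roughly like the sketch below, where `encode` and `decode` stand in for the AR Transformer and the diffusion decoder; the function names and signatures are assumptions for illustration only.

```python
import torch

# Illustrative MRAR generation loop: each new image is conditioned on the class label plus
# all previously generated images of that class. `encode` and `decode` are stand-ins, not
# the real TransDiff API.

@torch.no_grad()
def mrar_generate(encode, decode, label, num_refs=3):
    references = []                                   # the model's own previous generations
    image = None
    for _ in range(num_refs + 1):
        semantic = encode(label, references)          # condition on label + earlier generations
        image = decode(semantic)                      # predict the "next image"
        references.append(image)                      # feed it back as a reference
    return image                                      # the last image sees all references


# Toy usage with placeholder callables:
fake_encode = lambda label, refs: torch.randn(1, 256)
fake_decode = lambda semantic: torch.randn(1, 3, 256, 256)
sample = mrar_generate(fake_encode, fake_decode, label=torch.tensor([207]))
```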
Detailed Introduction
To save readers' time, we've abandoned the traditional paper structure and instead introduce TransDiff in a more conversational Q&A format. The questions below are also the motivations behind our research.
Q: Why use a Transformer? What information does the AR Transformer encode in our work?
A: The early CLIP work and the subsequent large vision-language (VL) models have already demonstrated the advantages of Transformers for image understanding. In CLIP in particular, the ViT image encoder aligns image representations with a semantic space (via the cosine similarity between the BERT text representation and the ViT image representation).
Similarly, our experiments show that the AR Transformer in TransDiff encodes class labels and images into a high-level (as opposed to pixel-level) semantic space. The figure below shows images generated by randomly splicing together 256-dimensional features from different categories. Unlike models such as VAR and LlamaGen, which edit at the pixel level, these qualitative results demonstrate the model's ability to edit images semantically.
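To make the idea of editing in semantic space concrete, here is a rough sketch of such a splicing experiment (using the illustrative encoder/decoder from the first code block; the real feature layout and decoding loop differ).

```python
import torch

# Sketch of semantic splicing: take the 256-dim semantic features of two class labels,
# randomly mix their dimensions, and decode the result. `encoder`/`decoder` are the
# illustrative modules defined earlier, not the actual TransDiff code.

def splice_semantics(encoder, decoder, label_a, label_b):
    sem_a = encoder(label_a).mean(dim=1)              # [B, 256] semantic feature for class A
    sem_b = encoder(label_b).mean(dim=1)              # [B, 256] semantic feature for class B
    pick_a = torch.rand_like(sem_a) < 0.5             # randomly take each dimension from A or B
    mixed = torch.where(pick_a, sem_a, sem_b)         # editing happens in semantic space, not pixel space
    x_t = torch.randn_like(mixed)
    t = torch.full((mixed.shape[0], 1), 0.5)
    return decoder(x_t, t, mixed)                     # a single decoding step, for illustration
```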
Q: Does the use of a smaller Diffusion Decoder in TransDiff have limitations? Is it better than pure Diffusion and AR Transformer methods?
A: TransDiff's decoder uses the DiT architecture and follows the Flow Matching paradigm. The diffusion component accounts for roughly one-third of the total parameters, far fewer than in mainstream diffusion models. Even so, on benchmarks TransDiff holds its own against all available pure Diffusion and pure AR Transformer methods, trading wins with the strongest of them.
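For readers unfamiliar with Flow Matching, a generic conditional flow-matching training step looks roughly like the following; the actual DiT decoder and conditioning scheme in TransDiff are more involved than this sketch.

```python
import torch
import torch.nn.functional as F

# Generic conditional flow-matching training step, as a sketch of how such a decoder could be
# trained. `decoder(x_t, t, cond)` is assumed to predict a velocity field (e.g. the
# TinyDiffusionDecoder above); this is not the exact TransDiff training code.

def flow_matching_loss(decoder, x0, cond):
    noise = torch.randn_like(x0)                          # x1 ~ N(0, I)
    t = torch.rand(x0.shape[0], 1, device=x0.device)      # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * noise                        # straight-line path from data to noise
    target_velocity = noise - x0                          # time derivative of that path
    pred_velocity = decoder(x_t, t, cond)
    return F.mse_loss(pred_velocity, target_velocity)
```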
Q: TransDiff looks like MAR; is it just a simple imitation of MAR?
A: Although TransDiff and MAR are structurally similar, the two models behave quite differently. First, MAR generates at the pixel (or patch) level without an explicit semantic representation. In addition, MAR uses a very simple Diffusion Decoder (a few MLP layers), which limits the decoder's expressive power. As a result, as shown in the figure below, MAR cannot generate an image in a single step; its image patches are gradually refined over the autoregressive iterations.
Q: What’s good about MRAR? Does it have advantages over the commonly used Token-Level AR and Scale-Level AR in AR Transformers?
A: First, compared to Token-Level AR and Scale-Level AR, TransDiff with MRAR shows a clear advantage on benchmarks. Second, we found that the more diverse the semantic representations, the higher the quality of the generated images, and MRAR increases the diversity of these representations significantly compared to Scale-Level AR.
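One simple way to quantify this notion of representation diversity (our illustration, not necessarily the exact metric used in the paper) is the mean pairwise cosine distance among same-class semantic features:

```python
import torch
import torch.nn.functional as F

# Sketch of a diversity measure over semantic representations: mean pairwise cosine
# distance within a set of same-class samples. Higher values = more diverse features.

def representation_diversity(features):                    # features: [N, dim]
    z = F.normalize(features, dim=-1)
    cos = z @ z.t()                                        # [N, N] pairwise cosine similarities
    off_diag = cos[~torch.eye(len(z), dtype=torch.bool)]   # drop self-similarities
    return (1.0 - off_diag).mean()                         # mean pairwise cosine distance

print(representation_diversity(torch.randn(8, 256)))
```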
Finally, some demos
One More Thing
TransDiff with MRAR has shown the potential to generate consecutive frames without ever being trained on video data. We therefore plan to apply TransDiff to video generation in the future, so stay tuned.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction (2025)
- RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration (2025)
- TensorAR: Refinement is All You Need in Autoregressive Image Generation (2025)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (2025)
- AR-RAG: Autoregressive Retrieval Augmentation for Image Generation (2025)
- Context-Aware Autoregressive Models for Multi-Conditional Image Generation (2025)
- Fast Autoregressive Models for Continuous Latent Generation (2025)