|
--- |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
license: apache-2.0 |
|
tags: |
|
- bd3lm |
|
- diffusion |
|
- autoregressive |
|
- language-modeling |
|
--- |
|
|
|
# Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)
|
|
|
By [Marianne Arriola](https://m-arriola.com/), [Aaron Gokaslan](https://skylion007.github.io), [Justin T Chiu](https://justinchiu.netlify.app), [Zhihan Yang](https://zhihanyang2022.github.io/), [Zhixuan Qi](https://zhixuanqi.com/), [Jiaqi Han](https://hanjq17.github.io/), [Subham Sekhar Sahoo](https://s-sahoo.github.io), [Volodymyr Kuleshov](https://www.cs.cornell.edu/~kuleshov/) |
|
|
|
[Paper](https://arxiv.org/abs/2503.09573)

[Code](https://github.com/kuleshov-group/bd3lms)

[Project Page](https://m-arriola.com/bd3lms/)

[Model Collection](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)
|
|
|
|
|
We introduce ***BD3-LMs***, a family of **B**lock **D**iscrete **D**enoising **D**iffusion **L**anguage **M**odels that achieve state-of-the-art likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. Tuning the block size interpolates between autoregressive and diffusion models, introducing a trade-off between sample quality and generation efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules that minimize this variance.
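To make the block decomposition concrete, here is a toy sketch (not the paper's training code) of how a sequence can be split into contiguous blocks, with each block corrupted at its own noise level. The `MASK` sentinel, the uniform noise level, and the function name are placeholder assumptions; note how `block_size=1` corrupts one token at a time (autoregressive-like), while `block_size=len(tokens)` applies a single noise level to the whole sequence (full diffusion).

```python
import random

MASK = -1  # placeholder mask-token id, not the real vocabulary's


def noise_blocks(tokens, block_size, rng):
    """Split `tokens` into contiguous blocks; within each block, mask every
    token independently with that block's own noise level t ~ U(0, 1)."""
    noised = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        t = rng.random()  # per-block noise level
        noised.extend(MASK if rng.random() < t else tok for tok in block)
    return noised


tokens = list(range(16))
# block_size=4: four blocks, each with its own corruption level
print(noise_blocks(tokens, block_size=4, rng=random.Random(0)))
```

The corruption keeps the sequence length fixed; a denoiser is then trained to recover each block's original tokens conditioned on all preceding (clean) blocks.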
|
|
|
## Model Description |
|
BD3-LMs are Block Discrete Denoising Diffusion Language Models. They combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. |
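Generation then proceeds block by block, which is what allows arbitrary-length sequences: each new block starts fully masked and is iteratively denoised conditioned on everything generated so far. The sketch below is a toy illustration of that loop only; `toy_denoiser`, the re-masking rule, and `steps_per_block` are stand-in assumptions, not the model's actual sampler (fewer denoising steps per block trades quality for speed).

```python
import random

MASK = -1  # placeholder mask-token id


def toy_denoiser(context, block):
    """Stand-in for the learned model: fills each masked position with a
    token derived from the context length. The real BD3-LM predicts tokens
    with a transformer conditioned on all previous blocks."""
    return [len(context) + i if tok == MASK else tok
            for i, tok in enumerate(block)]


def sample(num_blocks, block_size, steps_per_block=4, seed=0):
    rng = random.Random(seed)
    seq = []
    for _ in range(num_blocks):
        block = [MASK] * block_size          # start from a fully masked block
        for _ in range(steps_per_block):     # iterative denoising within the block
            proposal = toy_denoiser(seq, block)
            # toy rule: randomly re-mask half the positions and refine again
            block = [tok if rng.random() > 0.5 else MASK for tok in proposal]
        seq.extend(toy_denoiser(seq, block))  # final pass fills remaining masks
    return seq


print(sample(num_blocks=3, block_size=4))
```

Setting `block_size=1` collapses the inner loop to per-token generation (autoregressive), while a single block covering the whole sequence recovers vanilla diffusion sampling with a fixed length.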
|
|
|
## How to use |
|
See our [GitHub README](https://github.com/kuleshov-group/bd3lms), where we provide sample scripts for training, likelihood evaluation, and generation. |
|
|
|
## Citation |
|
```
@inproceedings{arriola2025block,
  title={Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author={Marianne Arriola and Aaron Gokaslan and Justin T Chiu and Zhihan Yang and Zhixuan Qi and Jiaqi Han and Subham Sekhar Sahoo and Volodymyr Kuleshov},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09573}
}
```