arxiv:2405.20494

Slight Corruption in Pre-training Data Makes Better Diffusion Models

Published on May 30, 2024
Abstract

Diffusion models (DMs) have shown remarkable capabilities in generating realistic, high-quality images, audio, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets inevitably contain corrupted pairs in which the condition does not accurately describe the data. This paper presents the first comprehensive study of the impact of such corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth for the data distribution generated by the corruptly trained DMs. Inspired by this analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
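The abstract describes CEP only at a high level: Gaussian-style perturbations are added to the condition embedding during training. A minimal sketch of that idea is below; the noise scale `gamma` and the plain additive-Gaussian form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def perturb_condition_embedding(cond_emb, gamma=0.1, rng=None):
    """Condition embedding perturbation (CEP) sketch.

    Adds isotropic Gaussian noise to a condition embedding (e.g. a class
    or text embedding) before it conditions the diffusion model. The
    scale `gamma` is a hypothetical hyperparameter; the paper's exact
    noise schedule may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(cond_emb.shape)
    return cond_emb + gamma * noise

# During training, the perturbed embedding would replace the clean one
# when computing the denoising loss; at sampling time the clean
# embedding is used as usual.
emb = np.ones(16)                       # stand-in condition embedding
noisy_emb = perturb_condition_embedding(emb, gamma=0.1,
                                        rng=np.random.default_rng(0))
```

The perturbation plays a role analogous to the "slight corruption" studied empirically: it mildly randomizes the condition without destroying its semantics.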
