Grounded Text-to-Image Synthesis with Attention Refocusing
Abstract
Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved. In this paper, we identify potential causes of this misalignment in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between generated images and text prompts.
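To make the idea of layout-guided attention refocusing concrete, below is a minimal sketch (not the authors' released code) of a cross-attention refocusing loss: for each grounded token, attention mass is encouraged to fall inside its layout box and penalized outside, and the gradient of this loss can be used to nudge the latent at each sampling step. The tensor shapes, function names, and exact loss form are illustrative assumptions.

```python
# Sketch of a cross-attention refocusing loss guided by bounding boxes.
# All names and shapes are assumptions for illustration, not the paper's API.
import torch


def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary mask."""
    mask = torch.zeros(height, width)
    x0, y0, x1, y1 = box
    mask[int(y0 * height):int(y1 * height), int(x0 * width):int(x1 * width)] = 1.0
    return mask


def cross_attention_refocus_loss(attn_maps, token_boxes, height, width):
    """
    attn_maps:   (heads, height*width, num_tokens) softmaxed cross-attention.
    token_boxes: dict mapping a prompt-token index to its normalized layout box.
    """
    loss = 0.0
    for token_idx, box in token_boxes.items():
        mask = box_to_mask(box, height, width).flatten()   # (H*W,)
        attn = attn_maps[:, :, token_idx].mean(dim=0)      # average over heads
        inside = (attn * mask).sum()
        total = attn.sum() + 1e-8
        # Maximize the fraction of this token's attention inside its box.
        loss = loss + (1.0 - inside / total) ** 2
    return loss / max(len(token_boxes), 1)


# Usage: at each denoising step, compute the loss on the captured attention
# maps and update the latent with its gradient before the next denoising step.
if __name__ == "__main__":
    H = W = 16                                             # latent-resolution attention
    attn = torch.rand(8, H * W, 77).softmax(dim=1).requires_grad_(True)
    boxes = {5: (0.1, 0.1, 0.5, 0.5), 9: (0.55, 0.2, 0.9, 0.8)}
    loss = cross_attention_refocus_loss(attn, boxes, H, W)
    loss.backward()                                        # gradient guides the latent
    print(float(loss))
```

The paper additionally proposes a self-attention loss; a hypothetical analogue would penalize self-attention that leaks between spatial regions assigned to different objects, following the same masking pattern.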
Community
Amazing paper. I think grounded generation has not been explored enough in text-to-image generation settings! Check out these two papers that use grounding in the video domain.
Grounded Video Editing: Ground-A-Video (https://ground-a-video.github.io/)
Grounded Video Generation: LLM-grounded VDM (https://llm-grounded-video-diffusion.github.io/)