Grounded Text-to-Image Synthesis with Attention Refocusing
Abstract
Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved. In this paper, we identify potential causes of this misalignment in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between generated images and text prompts.
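To make the idea of layout-guided attention refocusing concrete, below is a minimal sketch (not the authors' released code) of a cross-attention refocusing loss: for each grounded token, attention mass is encouraged to fall inside its layout box and penalized outside, and the gradient of this loss can be used to nudge the latent at each sampling step. The tensor shapes, function names, and exact loss form are illustrative assumptions.

```python
# Sketch of a cross-attention refocusing loss guided by bounding boxes.
# All names and shapes are assumptions for illustration, not the paper's API.
import torch


def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary mask."""
    mask = torch.zeros(height, width)
    x0, y0, x1, y1 = box
    mask[int(y0 * height):int(y1 * height), int(x0 * width):int(x1 * width)] = 1.0
    return mask


def cross_attention_refocus_loss(attn_maps, token_boxes, height, width):
    """
    attn_maps:   (heads, height*width, num_tokens) softmaxed cross-attention.
    token_boxes: dict mapping a prompt-token index to its normalized layout box.
    """
    loss = 0.0
    for token_idx, box in token_boxes.items():
        mask = box_to_mask(box, height, width).flatten()   # (H*W,)
        attn = attn_maps[:, :, token_idx].mean(dim=0)      # average over heads
        inside = (attn * mask).sum()
        total = attn.sum() + 1e-8
        # Maximize the fraction of this token's attention inside its box.
        loss = loss + (1.0 - inside / total) ** 2
    return loss / max(len(token_boxes), 1)


# Usage: at each denoising step, compute the loss on the captured attention
# maps and update the latent with its gradient before the next denoising step.
if __name__ == "__main__":
    H = W = 16                                             # latent-resolution attention
    attn = torch.rand(8, H * W, 77).softmax(dim=1).requires_grad_(True)
    boxes = {5: (0.1, 0.1, 0.5, 0.5), 9: (0.55, 0.2, 0.9, 0.8)}
    loss = cross_attention_refocus_loss(attn, boxes, H, W)
    loss.backward()                                        # gradient guides the latent
    print(float(loss))
```

The paper additionally proposes a self-attention loss; a hypothetical analogue would penalize self-attention that leaks between spatial regions assigned to different objects, following the same masking pattern.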
Community
Amazing paper. I think grounded generation has not been explored enough in text-to-image generation settings! Check out these two papers that use grounding in the video domain.
Grounded Video Editing: Ground-A-Video (https://ground-a-video.github.io/)
Grounded Video Generation: LLM-grounded VDM (https://llm-grounded-video-diffusion.github.io/)