arXiv:2010.04295

Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

Published on Oct 8, 2020

Abstract

Natural language descriptions of user interface (UI) elements, such as alternative text, are crucial for accessibility and for language-based interaction in general. Yet these descriptions are constantly missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input, including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning through crowdsourcing. Our dataset contains 162,859 language phrases created by human workers to annotate 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and we train and evaluate a set of deep model configurations to investigate how each feature modality, as well as the choice of learning strategy, affects the quality of predicted captions. The task formulation, the dataset, and our benchmark models provide a solid basis for this novel multimodal captioning task, which connects language and user interfaces.
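
To give a rough sense of the dataset's density, the three counts quoted in the abstract can be turned into per-element and per-screen averages. The sketch below is plain arithmetic on those counts. The derived ratios are not stated in the abstract, and the function and variable names are illustrative only.

```python
# Counts quoted in the abstract.
NUM_PHRASES = 162_859   # crowdsourced language phrases
NUM_ELEMENTS = 61_285   # annotated UI elements
NUM_SCREENS = 21_750    # unique UI screens


def dataset_ratios(phrases: int, elements: int, screens: int) -> dict:
    """Return simple averages derived from the three dataset counts.

    These are means over the whole dataset; they say nothing about
    how annotations are distributed across individual screens.
    """
    return {
        "phrases_per_element": phrases / elements,
        "elements_per_screen": elements / screens,
        "phrases_per_screen": phrases / screens,
    }


ratios = dataset_ratios(NUM_PHRASES, NUM_ELEMENTS, NUM_SCREENS)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

On average this works out to about 2.66 phrases per annotated element and about 2.82 annotated elements per screen.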
