arxiv:2402.19150

Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model

Published on Feb 29, 2024

Abstract

Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, the typographic attack, which disrupts vision-language models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), is also expected to pose a security threat to LVLMs. First, we verify typographic attacks on well-known commercial and open-source LVLMs and show that this threat is widespread. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The dataset not only evaluates typographic attacks across various multi-modal tasks but also measures how attack effectiveness varies with the factors used to generate the overlaid text. Based on these evaluation results, we investigate why typographic attacks affect VLMs and LVLMs, arriving at three insightful findings. By examining these findings and validating them experimentally on the Typographic Dataset, we reduce the performance degradation LVLMs suffer under typographic attacks from 42.07% to 13.90%.
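To make the threat model concrete, below is a minimal sketch of a typographic attack against CLIP zero-shot classification: misleading text is painted onto an image, which can flip the model's prediction. This is an illustration only, not the paper's method or dataset; the checkpoint name, the labels, and the local file cat.jpg are assumed placeholders.

```python
# Sketch of a typographic attack on CLIP zero-shot classification.
# Assumes the `transformers`, `torch`, and `Pillow` packages; the image
# path and label set are hypothetical examples, not the paper's dataset.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo of a cat", "a photo of a dog"]

def classify(image):
    # Score the image against each candidate caption and softmax the logits.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return {label: round(p.item(), 3) for label, p in zip(labels, probs)}

clean = Image.open("cat.jpg").convert("RGB")  # placeholder input image
print("clean:", classify(clean))

# The "attack": render a misleading class name directly onto the image.
attacked = clean.copy()
ImageDraw.Draw(attacked).text((10, 10), "dog", fill="white")
print("attacked:", classify(attacked))
```

On many images, the overlaid word alone shifts probability mass toward the written class, which is the vulnerability the abstract describes for CLIP-based VLMs and, by extension, LVLMs built on such encoders.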
