arxiv:2412.04378

Discriminative Fine-tuning of LVLMs

Published on Dec 5 · Submitted by adrianb1 on Dec 6

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including gains on standard image-text retrieval benchmarks and notable improvements in compositionality.
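
To make the two losses mentioned in the abstract concrete, here is a minimal PyTorch sketch (not the authors' code): a symmetric contrastive (InfoNCE) loss over pooled image/text embeddings from the LVLM, combined with a standard next-token prediction loss on the caption tokens. The toy tensors, dimensions, and the `lambda_ntp` weighting are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def next_token_loss(lm_logits, labels):
    """Causal LM loss: predict token t+1 from tokens up to t."""
    shift_logits = lm_logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

# Toy tensors standing in for LVLM outputs (batch=4, hidden=512, seq=16, vocab=32000).
B, D, T, V = 4, 512, 16, 32000
img_emb = torch.randn(B, D, requires_grad=True)       # pooled image embedding from the LVLM
txt_emb = torch.randn(B, D, requires_grad=True)       # pooled text embedding from the LVLM
lm_logits = torch.randn(B, T, V, requires_grad=True)  # language-model logits over the caption
labels = torch.randint(0, V, (B, T))                  # caption token ids

lambda_ntp = 1.0  # hypothetical weighting between the two objectives
loss = contrastive_loss(img_emb, txt_emb) + lambda_ntp * next_token_loss(lm_logits, labels)
loss.backward()   # gradients flow back to whatever parameters produced these tensors
```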

Community

Paper author · Paper submitter

TL;DR: The paper introduces **VladVA: Vision-Language Adaptation for Discriminative Visual Assistant**, a novel approach to enhance the image-text discriminative abilities of Large Vision-Language Models (LVLMs). While current CLIP-style models excel in zero-shot tasks, they often struggle with language comprehension and compositional reasoning, exhibiting a "bag of words" behavior. In contrast, LVLMs demonstrate superior vision-language reasoning but are less suitable for discriminative tasks due to their generative nature.

VladVA addresses this by transforming a generative LVLM into a discriminative one, unlocking its potential for powerful image-text discrimination and enhanced language understanding. Key innovations include:

  • Tailored Training Framework: Leverages diverse image-text pairs, training with both contrastive and next-token prediction losses to boost discrimination while preserving compositional capabilities.
  • Efficient Adaptation: Incorporates soft prompting and LoRA adapters for parameter-efficient fine-tuning, balancing effectiveness and computational cost (a minimal sketch of this setup follows the list).
  • Performance Gains: Delivers significant improvements over state-of-the-art models in benchmarks for image-text retrieval and compositional reasoning, achieving up to 15% better accuracy.
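
As referenced above, the following is a small, self-contained sketch of the parameter-efficient adaptation idea: the backbone stays frozen while a handful of soft prompt vectors and low-rank LoRA updates are trained. The module names, dimensions, and rank/alpha values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # backbone weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.t() @ self.lora_B.t())

class SoftPromptedEncoder(nn.Module):
    """Prepends n_prompt learnable vectors to the token embeddings of a frozen block."""
    def __init__(self, d_model: int = 512, n_prompt: int = 8):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        # Stand-in for a frozen LVLM projection; it receives a LoRA adapter.
        self.proj = LoRALinear(nn.Linear(d_model, d_model))

    def forward(self, token_embeds):                  # (B, T, D)
        B = token_embeds.size(0)
        prompts = self.soft_prompt.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, token_embeds], dim=1) # (B, n_prompt + T, D)
        return self.proj(x).mean(dim=1)               # pooled embedding for contrastive use

model = SoftPromptedEncoder()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only ['soft_prompt', 'proj.lora_A', 'proj.lora_B'] are updated
```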

Models citing this paper 0

No model linking this paper

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 2