arxiv:2412.04378

Discriminative Fine-tuning of LVLMs

Published on Dec 5 · Submitted by adrianb1 on Dec 6

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including gains on standard image-text retrieval benchmarks and notable improvements in compositionality.
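
To make the two losses mentioned in the abstract concrete, here is a minimal PyTorch sketch (not the authors' code): a symmetric contrastive (InfoNCE) loss over pooled image/text embeddings from the LVLM, combined with a standard next-token prediction loss on the caption tokens. The toy tensors, dimensions, and the `lambda_ntp` weighting are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def next_token_loss(lm_logits, labels):
    """Causal LM loss: predict token t+1 from tokens up to t."""
    shift_logits = lm_logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

# Toy tensors standing in for LVLM outputs (batch=4, hidden=512, seq=16, vocab=32000).
B, D, T, V = 4, 512, 16, 32000
img_emb = torch.randn(B, D, requires_grad=True)       # pooled image embedding from the LVLM
txt_emb = torch.randn(B, D, requires_grad=True)       # pooled text embedding from the LVLM
lm_logits = torch.randn(B, T, V, requires_grad=True)  # language-model logits over the caption
labels = torch.randint(0, V, (B, T))                  # caption token ids

lambda_ntp = 1.0  # hypothetical weighting between the two objectives
loss = contrastive_loss(img_emb, txt_emb) + lambda_ntp * next_token_loss(lm_logits, labels)
loss.backward()   # gradients flow back to whatever parameters produced these tensors
```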

Community

Paper author · Paper submitter

TL;DR: The paper introduces **VladVA: Vision-Language Adaptation for Discriminative Visual Assistant**, a novel approach to enhance the image-text discriminative abilities of Large Vision-Language Models (LVLMs). While current CLIP-style models excel in zero-shot tasks, they often struggle with language comprehension and compositional reasoning, exhibiting a "bag of words" behavior. In contrast, LVLMs demonstrate superior vision-language reasoning but are less suitable for discriminative tasks due to their generative nature.

VladVA addresses this by transforming a generative LVLM into a discriminative one, unlocking its potential for powerful image-text discrimination and enhanced language understanding. Key innovations include:

  • Tailored Training Framework: Leverages diverse image-text pairs, training with both contrastive and next-token prediction losses to boost discrimination while preserving compositional capabilities.
  • Efficient Adaptation: Incorporates soft prompting and LoRA adapters for parameter-efficient fine-tuning, balancing effectiveness and computational cost (a minimal sketch of this setup follows the list).
  • Performance Gains: Delivers significant improvements over state-of-the-art models in benchmarks for image-text retrieval and compositional reasoning, achieving up to 15% better accuracy.
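
As referenced above, the following is a small, self-contained sketch of the parameter-efficient adaptation idea: the backbone stays frozen while a handful of soft prompt vectors and low-rank LoRA updates are trained. The module names, dimensions, and rank/alpha values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # backbone weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.t() @ self.lora_B.t())

class SoftPromptedEncoder(nn.Module):
    """Prepends n_prompt learnable vectors to the token embeddings of a frozen block."""
    def __init__(self, d_model: int = 512, n_prompt: int = 8):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        # Stand-in for a frozen LVLM projection; it receives a LoRA adapter.
        self.proj = LoRALinear(nn.Linear(d_model, d_model))

    def forward(self, token_embeds):                  # (B, T, D)
        B = token_embeds.size(0)
        prompts = self.soft_prompt.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, token_embeds], dim=1) # (B, n_prompt + T, D)
        return self.proj(x).mean(dim=1)               # pooled embedding for contrastive use

model = SoftPromptedEncoder()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only ['soft_prompt', 'proj.lora_A', 'proj.lora_B'] are updated
```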

Models citing this paper 0

No model linking this paper

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 2