CoLLM: A Large Language Model for Composed Image Retrieval
Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of the desired modification, and a target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches that utilize synthetic triplets or leverage vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of the vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to a 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
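To give a rough sense of the on-the-fly triplet generation described in the abstract, the Python sketch below mines pseudo-triplets from image-caption pairs: each captioned image is treated as a target, a visually similar image is picked as a synthetic reference, and a modification text is templated from the target caption. The nearest-neighbour heuristic, the helper names, and the text template are illustrative assumptions, not CoLLM's exact procedure.

```python
# Minimal sketch of mining pseudo-triplets from image-caption pairs.
# The nearest-neighbour heuristic and templated modification text are
# assumptions for illustration, not the paper's exact procedure.
import numpy as np

def mine_triplets(image_embs: np.ndarray, captions: list[str]):
    """Treat each captioned image as a target, pick a visually similar image
    as a synthetic reference, and build a modification text from the caption."""
    # Cosine similarity between all image embeddings.
    normed = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches

    triplets = []
    for tgt_idx in range(len(captions)):
        ref_idx = int(np.argmax(sims[tgt_idx]))  # nearest neighbour as reference
        # Template-based modification text; a real system might rewrite this with an LLM.
        mod_text = f"change it so that {captions[tgt_idx]}"
        triplets.append({"reference": ref_idx,
                         "modification": mod_text,
                         "target": tgt_idx})
    return triplets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(8, 16)).astype(np.float32)   # toy image embeddings
    caps = [f"a photo of object {i}" for i in range(8)]
    print(mine_triplets(embs, caps)[:2])
```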
Community
CoLLM Framework: Introduces a unified solution for Composed Image Retrieval (CIR) that overcomes limitations of existing methods by generating synthetic triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation.
LLM-Powered Multimodal Fusion: Utilizes Large Language Models (LLMs) to create joint embeddings of reference images and modification texts, enhancing the fusion and understanding of vision and language modalities for complex queries (see the sketch after this list).
MTCIR Dataset: Presents a new, large-scale dataset with 3.4M samples for CIR, improving training and benchmarking with diverse, high-quality data.
Refined Benchmarks: Enhances existing CIR benchmarks (CIRR and Fashion-IQ) for more reliable evaluation metrics, boosting the field’s assessment standards.
State-of-the-Art Performance: Demonstrates superior results across multiple CIR benchmarks, with MTCIR contributing up to 15% performance gains.
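To make the LLM-powered fusion concrete, here is a minimal PyTorch sketch of a multimodal query encoder: projected reference-image features are prepended to the modification-text tokens, passed through a transformer backbone that stands in for the LLM, and the fused query is trained against target-image embeddings with an in-batch contrastive loss. The module names, dimensions, and stand-in backbone are assumptions for illustration, not CoLLM's exact architecture.

```python
# Minimal PyTorch sketch of LLM-based query fusion for CIR.
# Module names and dimensions are illustrative assumptions, not CoLLM's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalQueryEncoder(nn.Module):
    def __init__(self, img_dim=768, llm_dim=1024, vocab=32000, n_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, llm_dim)        # map image features into the LLM token space
        self.tok_emb = nn.Embedding(vocab, llm_dim)        # stand-in for the LLM's token embeddings
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, n_layers)  # stand-in for the LLM backbone
        self.out_proj = nn.Linear(llm_dim, img_dim)        # back to the retrieval embedding space

    def forward(self, ref_img_feats, mod_text_ids):
        # Prepend the projected reference-image features to the modification-text tokens.
        img_tok = self.img_proj(ref_img_feats).unsqueeze(1)      # (B, 1, D)
        txt_tok = self.tok_emb(mod_text_ids)                     # (B, T, D)
        hidden = self.llm(torch.cat([img_tok, txt_tok], dim=1))  # joint multimodal fusion
        query = self.out_proj(hidden[:, -1])                     # last position as the query embedding
        return F.normalize(query, dim=-1)

def contrastive_loss(query, target_img_feats, temperature=0.07):
    # Standard in-batch InfoNCE between fused queries and target-image embeddings.
    target = F.normalize(target_img_feats, dim=-1)
    logits = query @ target.T / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    enc = MultimodalQueryEncoder()
    ref = torch.randn(4, 768)               # reference-image features (e.g., from a frozen vision encoder)
    txt = torch.randint(0, 32000, (4, 12))  # tokenized modification text
    tgt = torch.randn(4, 768)               # target-image features
    print(contrastive_loss(enc(ref, txt), tgt).item())
```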
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval (2025)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval (2025)
- PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval (2025)
- good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval (2025)
- A Comprehensive Survey on Composed Image Retrieval (2025)
- ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning (2025)
- Composed Multi-modal Retrieval: A Survey of Approaches and Applications (2025)