Llama-NemoRetriever-ColEmbed: Developer-Focused Guide to NVIDIA's State-of-the-Art Text-Image Retrieval
The demand for robust retrieval systems that operate across both text and image modalities is rapidly increasing. The Llama-NemoRetriever-ColEmbed family introduces a unified approach to text-image retrieval, achieving state-of-the-art results on several benchmarks. This post provides a technical overview of the model architecture, training strategies, evaluation results, and practical trade-offs, focusing on what matters most to developers.
Model Architecture
Bi-Encoder with Late Interaction
- Foundation: Built on NVIDIA's Eagle2 Vision Language Model (VLM), the architecture replaces causal attention with bidirectional attention.
- Dynamic Image Tiling: Supports variable input resolutions, controlled by the `max_input_tiles` and `min_input_tiles` parameters.
- ColBERT-Style Late Interaction: Instead of compressing each sequence into a single vector, every query token embedding interacts with all document token embeddings through a MaxSim operator. This enables fine-grained, token-level matching (see the sketch below).
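To make the MaxSim operator concrete, here is a minimal scoring sketch in PyTorch. The shapes and the L2-normalization convention are illustrative assumptions, not the model's exact implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    # Full token-to-token similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # MaxSim: each query token keeps its best-matching document token;
    # summing the per-token maxima yields the document's relevance score.
    return sim.max(dim=1).values.sum()
```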
| Model Variant | Parameters (B) | Embedding Dim |
|---|---|---|
| 1B | 2.42 | 2048 |
| 3B | 4.41 | 3072 |
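Both variants are published on Hugging Face under the names listed in the results table further down. A minimal loading sketch, assuming the usual `trust_remote_code` pattern; the embedding entry points ship with the checkpoint, so consult the model card for the exact helper methods:

```python
import torch
from transformers import AutoModel

# The checkpoint ships its own modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    "nvidia/llama-nemoretriever-colembed-3b-v1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()

# Dynamic tiling is governed by the max_input_tiles / min_input_tiles
# parameters discussed above; where they are exposed (config vs. call
# site) is defined by the checkpoint's remote code.
# Hypothetical usage; the real helper names are defined by that code:
# query_embs = model.forward_queries(["how does late interaction work?"])
# doc_embs   = model.forward_passages([page_image])
```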
Training Pipeline
Two-Stage Training
Stage 1: Text-Only Pretraining
- The model is trained on large-scale text-only retrieval datasets using a contrastive loss.
- Establishes strong semantic representations for text.
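The report's exact loss configuration isn't reproduced here, but a standard InfoNCE-style contrastive objective with in-batch negatives captures the idea; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: d[i] is the positive for q[i],
    and every other row in the batch serves as a negative."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = (q @ d.T) / temperature                    # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```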
Stage 2: Text-Image Fine-Tuning
- Fine-tuning on diverse text-image pairs aligns text and visual representations in a shared embedding space.
Datasets
- Text-only: HotpotQA, MIRACL, Natural Questions, Stack Exchange, SQuAD, Tiger Math/Stack.
- Text-Image: ColPali, Wiki-SS-NQ, VDR, VisRAG-Ret-Train-Synthetic, VisRAG-Ret-Train-In-domain, Docmatix.
Evaluation Results
Benchmarks
- ViDoRe V1 & V2: The 3B model achieves nDCG@5 scores of 91.0 (V1) and 63.5 (V2), leading both leaderboards as of June 2025.
- MTEB Visual Document Retrieval: 3B model scores 83.1, outperforming larger 7B models.
- MIRACL-VISION: Demonstrates strong multilingual retrieval, with the 3B variant achieving the highest overall average score (0.5841).
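Both ViDoRe leaderboards report nDCG@5, which discounts relevant hits by their rank and normalizes against the ideal ordering. A minimal reference implementation:

```python
import math

def ndcg_at_k(relevances, k=5):
    """nDCG@k over a ranked list of (graded) relevance judgments."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0: the relevant page is ranked first
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5: the same page buried at rank 3
```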
| Model | Params | Embedding Dim | MTEB VDR | ViDoRe V1 | ViDoRe V2 |
|---|---|---|---|---|---|
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2B | 2048 | 82.63 | 90.5 | 62.1 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4B | 3072 | 83.10 | 91.0 | 63.5 |
System Trade-Offs
Storage and Latency
- Late-Interaction Models: Require storing all token embeddings, leading to significant storage overhead. For example, a 3B model with 3072-dim embeddings needs over 10 TB for one million images.
- Bi-Encoder Models: Store a single vector per document, requiring only a few GB for the same corpus size.
- Dimensionality Reduction: Linear projection layers can reduce storage by up to 88% with minimal accuracy loss.
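The storage gap follows directly from the arithmetic. The per-page token count below is an assumption chosen to line up with the reported figures (the table's sizes match binary gigabytes), so treat this as a sanity check rather than an official spec:

```python
GIB = 1024 ** 3
NUM_IMAGES = 1_000_000
BYTES_PER_DIM = 2  # fp16

def index_size_gib(tokens_per_doc: int, dim: int) -> float:
    return NUM_IMAGES * tokens_per_doc * dim * BYTES_PER_DIM / GIB

# Late interaction: every patch/token embedding is stored per page.
print(index_size_gib(tokens_per_doc=1800, dim=3072))  # ~10,300 GiB
# Bi-encoder: a single vector per page.
print(index_size_gib(tokens_per_doc=1, dim=2048))     # ~3.8 GiB
```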
Retrieval Pipeline Choices
- Late-Interaction: Higher accuracy, higher storage and latency.
- Bi-Encoder + Reranker: Lower storage, competitive accuracy with reranking, increased inference time per query.
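Schematically, the second option looks like the sketch below. The reranker callable is hypothetical (the one used in the table is an internal NVIDIA VLM reranker), and the top-25 shortlist matches the setup described in the table footnotes:

```python
import numpy as np

def retrieve_then_rerank(query, query_vec, doc_vecs, rerank_fn,
                         k_candidates=25, k_final=5):
    """Two-stage pipeline: cheap bi-encoder recall, then costly reranking.

    query_vec: (dim,) bi-encoder query embedding
    doc_vecs:  (num_docs, dim) single-vector document index
    rerank_fn: hypothetical callable scoring (query, doc_id) with a
               heavier model such as a VLM reranker
    """
    # Stage 1: coarse candidates by dot product over the compact index.
    candidates = np.argsort(-(doc_vecs @ query_vec))[:k_candidates]
    # Stage 2: rescore only the shortlist; this step adds the per-query
    # latency shown in the table below.
    return sorted(candidates, key=lambda i: -rerank_fn(query, i))[:k_final]
```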
| Architecture | Storage (1M images, GB) | ViDoRe V1 | ViDoRe V2 | Additional Latency (ms/query) |
|---|---|---|---|---|
| ColEmbed 3B (3072d) | 10,311.1 | 0.9106 | 0.6357 | N/A |
| ColEmbed 3B (512d) | 1,230.2 | 0.9064 | 0.6109 | N/A |
| Bi-Encoder llama-vlm-embed-v1 (2048d)\*¹ | 3.8 | 0.8313 | 0.5178 | N/A |
| Bi-Encoder llama-vlm-embed-v1 + Rerank\*\*¹ | 3.8 | 0.9064 | 0.6214 | 2,368 |
- \* A commercial multimodal retrieval model that represents user queries as text and documents as images.
- \*\* Results obtained with an internally developed VLM reranker applied to the top 25 retrieved documents.
- ¹ Numbers may differ slightly from the leaderboards: we evaluated ViDoRe V1 and V2 with a different codebase before the leaderboard results were calculated with the mteb package. Refer to the full technical report for more details.
Practical Considerations
- Deployment: Choose model size and retrieval architecture based on your storage, latency, and accuracy requirements.
- Small-Corpus, High Query Volume: Larger embedding models without rerankers may be preferable.
- Large Corpus, Moderate Query Volume: Smaller embedding models with rerankers can be more cost-efficient.
- Vector Database Support: Late-interaction models require specialized support for token-level similarity search.
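That last point is worth making concrete: without native multi-vector support, late-interaction retrieval degenerates into a brute-force scan over per-document token matrices, as in this sketch:

```python
import torch

def brute_force_search(query_emb, doc_embs, k=5):
    """query_emb: (num_query_tokens, dim); doc_embs: list of
    (num_doc_tokens_i, dim) tensors, one per document."""
    # MaxSim score against every document: a linear scan that a vector
    # database with token-level (multi-vector) indexing would avoid.
    scores = torch.stack([
        (query_emb @ d.T).max(dim=1).values.sum() for d in doc_embs
    ])
    return scores.topk(min(k, len(doc_embs)))
```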
Llama-NemoRetriever-ColEmbed represents a significant advancement in scalable, high-performing text-image retrieval, achieving state-of-the-art results on ViDoRe V1, ViDoRe V2, and MIRACL-VISION benchmarks. The two-stage training pipeline, combining large-scale text and image data, results in strong generalization and multilingual retrieval capabilities. The release of both 1B and 3B model variants provides a robust foundation for future research and practical deployment in multimodal retrieval scenarios.
For a deeper technical understanding and comprehensive analysis of trade-offs, read the full research paper. If you're interested in experimenting with these models, try NeMo Retriever models directly at build.nvidia.com/explore/retrieval. This is an excellent opportunity to explore state-of-the-art retrieval in your own applications and workflows.