Llama-NemoRetriever-ColEmbed: A Developer-Focused Guide to NVIDIA's State-of-the-Art Text-Image Retrieval

Community Article · Published July 9, 2025

The demand for robust retrieval systems that operate across both text and image modalities is rapidly increasing. The Llama-NemoRetriever-ColEmbed family introduces a unified approach to text-image retrieval, achieving state-of-the-art results on several benchmarks. This post provides a technical overview of the model architecture, training strategies, evaluation results, and practical trade-offs, focusing on what matters most to developers.

Model Architecture

Bi-Encoder with Late Interaction

  • Foundation: Built on NVIDIA's Eagle2 Vision Language Model (VLM), the architecture replaces causal attention with bidirectional attention.
  • Dynamic Image Tiling: Supports variable input resolutions, controlled by the max_input_tiles and min_input_tiles parameters.
  • ColBERT-Style Late Interaction: Instead of compressing each sequence into a single vector, every query token embedding is compared against all document token embeddings with a MaxSim operator (see the sketch after the table below). This enables fine-grained, token-level matching.

| Model Variant | Parameters (B) | Embedding Dim |
|---|---|---|
| 1B | 2.42 | 2048 |
| 3B | 4.41 | 3072 |
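
To make late interaction concrete, here is a minimal PyTorch sketch of the MaxSim scoring step described above. The function name and tensor shapes are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take the similarity of its
    best-matching document token, then sum over query tokens.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # MaxSim over doc tokens, summed over query tokens

# Toy usage: 8 query tokens against 1,024 document tokens at 3072 dims (3B variant)
q = F.normalize(torch.randn(8, 3072), dim=-1)
d = F.normalize(torch.randn(1024, 3072), dim=-1)
print(maxsim_score(q, d))
```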

Training Pipeline

Two-Stage Training

  1. Stage 1: Text-Only Pretraining

    • The model is trained on large-scale, text-only retrieval datasets with a contrastive loss (a sketch follows this list).
    • Establishes strong semantic representations for text.
  2. Stage 2: Text-Image Fine-Tuning

    • Fine-tuning on diverse text-image pairs aligns text and visual representations in a shared embedding space.
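
The post does not spell out the exact training objective, but contrastive training with in-batch negatives typically reduces to an InfoNCE-style loss like the sketch below; the temperature value is an assumption, not a reported hyperparameter.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss: the i-th query's positive is the
    i-th document; all other documents in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```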

Datasets

  • Text-only: HotpotQA, MIRACL, Natural Questions, Stack Exchange, SQuAD, Tiger Math/Stack.
  • Text-Image: ColPali, Wiki-SS-NQ, VDR, VisRAG-Ret-Train-Synthetic, VisRAG-Ret-Train-In-domain, Docmatix.

Evaluation Results

Benchmarks

  • ViDoRe V1 & V2: The 3B model achieves nDCG@5 scores of 91.0 (V1) and 63.5 (V2), leading both leaderboards as of June 2025.
  • MTEB Visual Document Retrieval: 3B model scores 83.1, outperforming larger 7B models.
  • MIRACL-VISION: Demonstrates strong multilingual retrieval, with the 3B variant achieving the highest overall average score (0.5841).

| Model | Params | Embedding Dim | MTEB VDR | ViDoRe V1 | ViDoRe V2 |
|---|---|---|---|---|---|
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2B | 2048 | 82.63 | 90.5 | 62.1 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4B | 3072 | 83.10 | 91.0 | 63.5 |
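
To reproduce these numbers or embed your own corpus, the Hugging Face model cards expose a remote-code API roughly like the sketch below. The forward_queries and forward_passages methods follow the model card at the time of writing and may change, so treat this as an assumption and check the card for current usage.

```python
import torch
from transformers import AutoModel

# Loading sketch; trust_remote_code is required because the scoring logic ships with the checkpoint.
model = AutoModel.from_pretrained(
    "nvidia/llama-nemoretriever-colembed-3b-v1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()

queries = ["How does dynamic image tiling work?"]
passages = ["page_1.png"]  # placeholder path; passages are document page images

# Method names are an assumption based on the model card.
query_embs = model.forward_queries(queries, batch_size=8)      # per-token query embeddings
passage_embs = model.forward_passages(passages, batch_size=8)  # per-token page embeddings
```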

System Trade-Offs

Storage and Latency

  • Late-Interaction Models: Require storing all token embeddings, leading to significant storage overhead. For example, a 3B model with 3072-dim embeddings needs over 10 TB for one million images.
  • Bi-Encoder Models: Store a single vector per document, requiring only a few GB for the same corpus size.
  • Dimensionality Reduction: Linear projection layers can reduce storage by up to 88% with minimal accuracy loss.
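
The storage gap follows from simple arithmetic: a late-interaction index stores one vector per token rather than one per document. In the sketch below, the ~1,700 tokens per page and fp16 precision are assumptions chosen to land near the reported figure; actual token counts depend on the tiling settings.

```python
def late_interaction_storage_gb(num_docs: int, tokens_per_doc: int,
                                dim: int, bytes_per_value: int = 2) -> float:
    """Raw embedding storage for a token-level (late-interaction) index."""
    return num_docs * tokens_per_doc * dim * bytes_per_value / 1e9

# 1M pages at ~1,700 tokens/page with 3072-dim fp16 embeddings:
print(late_interaction_storage_gb(1_000_000, 1_700, 3072))  # ~10444.8 GB, near the 10,311.1 GB below
```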

Retrieval Pipeline Choices

  • Late-Interaction: Higher accuracy, higher storage and latency.
  • Bi-Encoder + Reranker: Lower storage and competitive accuracy with reranking, at the cost of added inference time per query (see the sketch after the table notes below).

| Architecture | Storage (1M images, GB) | ViDoRe V1 | ViDoRe V2 | Additional Latency (ms/query) |
|---|---|---|---|---|
| ColEmbed 3B (3072d) | 10,311.1 | 0.9106 | 0.6357 | N/A |
| ColEmbed 3B (512d) | 1,230.2 | 0.9064 | 0.6109 | N/A |
| Bi-Encoder llama-vlm-embed-v1 (2048d) *¹ | 3.8 | 0.8313 | 0.5178 | N/A |
| Bi-Encoder llama-vlm-embed-v1 + Rerank **¹ | 3.8 | 0.9064 | 0.6214 | 2,368 |
  • * A commercial multimodal retrieval model that represents user queries as text and documents as page images.
  • ** Results obtained with an internally developed VLM reranker re-scoring the top 25 retrieved documents.
  • ¹ The numbers may differ slightly, as these datasets were evaluated on ViDoRe V1 and V2 with a different codebase before the leaderboard results were calculated with the mteb package. Refer to the full technical report for details.
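
As referenced above, a bi-encoder + reranker pipeline trades a short burst of per-query reranking compute for a far smaller index. The index, embed_query, and reranker objects below are hypothetical placeholders, not a real API.

```python
def retrieve_and_rerank(query: str, index, embed_query, reranker, k: int = 25):
    """Two-stage retrieval: cheap single-vector ANN search, then an expensive
    cross-modal reranker over the top-k candidates (k=25 as in the table above)."""
    # Stage 1: bi-encoder lookup over a single-vector index (a few GB for 1M pages)
    candidates = index.search(embed_query(query), top_k=k)
    # Stage 2: reranker re-scores each (query, page) pair; this step is the
    # source of the extra per-query latency reported in the table
    scored = [(reranker.score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```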

Practical Considerations

  • Deployment: Choose the model and retrieval architecture based on your storage, latency, and accuracy requirements.
  • Small Corpus, High Query Volume: Larger embedding models without rerankers may be preferable, since per-query reranking latency dominates cost.
  • Large Corpus, Moderate Query Volume: Smaller embedding models paired with rerankers can be more cost-efficient, since index storage dominates cost.
  • Vector Database Support: Late-interaction models require specialized support for token-level similarity search.

Llama-NemoRetriever-ColEmbed represents a significant advancement in scalable, high-performing text-image retrieval, achieving state-of-the-art results on ViDoRe V1, ViDoRe V2, and MIRACL-VISION benchmarks. The two-stage training pipeline, combining large-scale text and image data, results in strong generalization and multilingual retrieval capabilities. The release of both 1B and 3B model variants provides a robust foundation for future research and practical deployment in multimodal retrieval scenarios.

For a deeper technical understanding and comprehensive analysis of trade-offs, read the full research paper. If you're interested in experimenting with these models, try NeMo Retriever models directly at build.nvidia.com/explore/retrieval. This is an excellent opportunity to explore state-of-the-art retrieval in your own applications and workflows.
