Llama-NemoRetriever-ColEmbed: Developer-Focused Guide to NVIDIA's State-of-the-Art Text-Image Retrieval
The demand for robust retrieval systems that operate across both text and image modalities is rapidly increasing. The Llama-NemoRetriever-ColEmbed family introduces a unified approach to text-image retrieval, achieving state-of-the-art results on several benchmarks. This post provides a technical overview of the model architecture, training strategies, evaluation results, and practical trade-offs, focusing on what matters most to developers.
Model Architecture
Bi-Encoder with Late Interaction
- Foundation: Built on NVIDIA's Eagle2 Vision Language Model (VLM), the architecture replaces causal attention with bidirectional attention.
- Dynamic Image Tiling: Supports variable input resolutions, controlled by the `max_input_tiles` and `min_input_tiles` parameters.
- ColBERT-Style Late Interaction: Instead of compressing each sequence into a single vector, every query token embedding interacts with all document token embeddings through a MaxSim operator. This enables fine-grained, token-level matching (see the sketch below).
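To make the MaxSim operator concrete, here is a minimal scoring sketch in PyTorch. The shapes and the L2-normalization convention are illustrative assumptions, not the model's exact implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    # Full token-to-token similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # MaxSim: each query token keeps its best-matching document token;
    # summing the per-token maxima yields the document's relevance score.
    return sim.max(dim=1).values.sum()
```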
| Model Variant | Parameters (B) | Embedding Dim |
|---|---|---|
| 1B | 2.42 | 2048 |
| 3B | 4.41 | 3072 |
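Both variants are published on Hugging Face under the names listed in the results table further down. A minimal loading sketch, assuming the usual `trust_remote_code` pattern; the embedding entry points ship with the checkpoint, so consult the model card for the exact helper methods:

```python
import torch
from transformers import AutoModel

# The checkpoint ships its own modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    "nvidia/llama-nemoretriever-colembed-3b-v1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()

# Dynamic tiling is governed by the max_input_tiles / min_input_tiles
# parameters discussed above; where they are exposed (config vs. call
# site) is defined by the checkpoint's remote code.
# Hypothetical usage; the real helper names are defined by that code:
# query_embs = model.forward_queries(["how does late interaction work?"])
# doc_embs   = model.forward_passages([page_image])
```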
Training Pipeline
Two-Stage Training
Stage 1: Text-Only Pretraining
- The model is trained on large-scale text-only retrieval datasets using a contrastive loss.
- Establishes strong semantic representations for text.
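The report's exact loss configuration isn't reproduced here, but a standard InfoNCE-style contrastive objective with in-batch negatives captures the idea; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: d[i] is the positive for q[i],
    and every other row in the batch serves as a negative."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = (q @ d.T) / temperature                    # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```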
Stage 2: Text-Image Fine-Tuning
- Fine-tuning on diverse text-image pairs aligns text and visual representations in a shared embedding space.
Datasets
- Text-only: HotpotQA, MIRACL, Natural Questions, Stack Exchange, SQuAD, Tiger Math/Stack.
- Text-Image: ColPali, Wiki-SS-NQ, VDR, VisRAG-Ret-Train-Synthetic, VisRAG-Ret-Train-In-domain, Docmatix.
Evaluation Results
Benchmarks
- ViDoRe V1 & V2: The 3B model achieves nDCG@5 scores of 91.0 (V1) and 63.5 (V2), leading both leaderboards as of June 2025.
- MTEB Visual Document Retrieval: 3B model scores 83.1, outperforming larger 7B models.
- MIRACL-VISION: Demonstrates strong multilingual retrieval, with the 3B variant achieving the highest overall average score (0.5841).
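Both ViDoRe leaderboards report nDCG@5, which discounts relevant hits by their rank and normalizes against the ideal ordering. A minimal reference implementation:

```python
import math

def ndcg_at_k(relevances, k=5):
    """nDCG@k over a ranked list of (graded) relevance judgments."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0: the relevant page is ranked first
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5: the same page buried at rank 3
```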
| Model | Params | Embedding Dim | MTEB VDR | ViDoRe V1 | ViDoRe V2 |
|---|---|---|---|---|---|
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2B | 2048 | 82.63 | 90.5 | 62.1 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4B | 3072 | 83.10 | 91.0 | 63.5 |
System Trade-Offs
Storage and Latency
- Late-Interaction Models: Require storing all token embeddings, leading to significant storage overhead. For example, a 3B model with 3072-dim embeddings needs over 10 TB for one million images.
- Bi-Encoder Models: Store a single vector per document, requiring only a few GB for the same corpus size.
- Dimensionality Reduction: Linear projection layers can reduce storage by up to 88% with minimal accuracy loss.
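The storage gap follows directly from the arithmetic. The per-page token count below is an assumption chosen to line up with the reported figures (the table's sizes match binary gigabytes), so treat this as a sanity check rather than an official spec:

```python
GIB = 1024 ** 3
NUM_IMAGES = 1_000_000
BYTES_PER_DIM = 2  # fp16

def index_size_gib(tokens_per_doc: int, dim: int) -> float:
    return NUM_IMAGES * tokens_per_doc * dim * BYTES_PER_DIM / GIB

# Late interaction: every patch/token embedding is stored per page.
print(index_size_gib(tokens_per_doc=1800, dim=3072))  # ~10,300 GiB
# Bi-encoder: a single vector per page.
print(index_size_gib(tokens_per_doc=1, dim=2048))     # ~3.8 GiB
```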
Retrieval Pipeline Choices
- Late-Interaction: Higher accuracy, higher storage and latency.
- Bi-Encoder + Reranker: Lower storage, competitive accuracy with reranking, increased inference time per query.
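Schematically, the second option looks like the sketch below. The reranker callable is hypothetical (the one used in the table is an internal NVIDIA VLM reranker), and the top-25 shortlist matches the setup described in the table footnotes:

```python
import numpy as np

def retrieve_then_rerank(query, query_vec, doc_vecs, rerank_fn,
                         k_candidates=25, k_final=5):
    """Two-stage pipeline: cheap bi-encoder recall, then costly reranking.

    query_vec: (dim,) bi-encoder query embedding
    doc_vecs:  (num_docs, dim) single-vector document index
    rerank_fn: hypothetical callable scoring (query, doc_id) with a
               heavier model such as a VLM reranker
    """
    # Stage 1: coarse candidates by dot product over the compact index.
    candidates = np.argsort(-(doc_vecs @ query_vec))[:k_candidates]
    # Stage 2: rescore only the shortlist; this step adds the per-query
    # latency shown in the table below.
    return sorted(candidates, key=lambda i: -rerank_fn(query, i))[:k_final]
```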
| Architecture | Storage (1M images, GB) | ViDoRe V1 | ViDoRe V2 | Additional Latency (ms/query) |
|---|---|---|---|---|
| ColEmbed 3B (3072d) | 10,311.1 | 0.9106 | 0.6357 | N/A |
| ColEmbed 3B (512d) | 1,230.2 | 0.9064 | 0.6109 | N/A |
| Bi-Encoder llama-vlm-embed-v1 (2048d)\*¹ | 3.8 | 0.8313 | 0.5178 | N/A |
| Bi-Encoder llama-vlm-embed-v1 + Rerank\*\*¹ | 3.8 | 0.9064 | 0.6214 | 2,368 |
- \* A commercial multimodal retrieval model that represents user queries as text and documents as images.
- \*\* Results obtained with an internally developed VLM reranker applied to the top 25 retrieved documents.
- ¹ Numbers may differ slightly from the leaderboards: we evaluated ViDoRe V1 and V2 with a different codebase before the leaderboard results were calculated with the mteb package. Refer to the full technical report for more details.
Practical Considerations
- Deployment: Choose model size and retrieval architecture based on your storage, latency, and accuracy requirements.
- Small-Corpus, High Query Volume: Larger embedding models without rerankers may be preferable.
- Large Corpus, Moderate Query Volume: Smaller embedding models with rerankers can be more cost-efficient.
- Vector Database Support: Late-interaction models require specialized support for token-level similarity search.
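That last point is worth making concrete: without native multi-vector support, late-interaction retrieval degenerates into a brute-force scan over per-document token matrices, as in this sketch:

```python
import torch

def brute_force_search(query_emb, doc_embs, k=5):
    """query_emb: (num_query_tokens, dim); doc_embs: list of
    (num_doc_tokens_i, dim) tensors, one per document."""
    # MaxSim score against every document: a linear scan that a vector
    # database with token-level (multi-vector) indexing would avoid.
    scores = torch.stack([
        (query_emb @ d.T).max(dim=1).values.sum() for d in doc_embs
    ])
    return scores.topk(min(k, len(doc_embs)))
```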
Llama-NemoRetriever-ColEmbed represents a significant advancement in scalable, high-performing text-image retrieval, achieving state-of-the-art results on ViDoRe V1, ViDoRe V2, and MIRACL-VISION benchmarks. The two-stage training pipeline, combining large-scale text and image data, results in strong generalization and multilingual retrieval capabilities. The release of both 1B and 3B model variants provides a robust foundation for future research and practical deployment in multimodal retrieval scenarios.
For a deeper technical understanding and comprehensive analysis of trade-offs, read the full research paper. If you're interested in experimenting with these models, try NeMo Retriever models directly at build.nvidia.com/explore/retrieval. This is an excellent opportunity to explore state-of-the-art retrieval in your own applications and workflows.