Tasks

Visual Document Retrieval

Visual document retrieval is the task of searching for relevant image-based documents, such as PDFs. These models take a text query and multiple documents as input and return the top-most relevant documents and relevancy scores as output.

Inputs
Question

Is the model in this paper the fastest for inference?

Visual Document Retrieval Model
Output
Page 10
0.700
Page 11
0.060
Page 9
0.003

About Visual Document Retrieval

Use Cases

Multimodal Document Retrieval

Visual document retrieval models can be used to retrieve relevant documents when given a text query. One needs to index the documents first, which is a one-time operation. After indexing is done, the retrieval model takes in a text query (question) and number k of documents to return, and the model returns the top-k most relevant documents for the query. The index can be used repetitively for inference.

Multimodal Retrieval Augmented Generation (RAG)

Multimodal RAG is the task of generating answers from documents (texts or images) when given a text query and a bunch of documents. These documents and the text query can be fed to a vision language model to get the actual answer.

Inference

You can use transformers to infer visual document retrieval models. To calculate similarity between images and text, simply process both separately and pass each processed input through the model. The model outputs can then be passed to calculate similarity scores.

import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = ColPaliForRetrieval.from_pretrained(
    "vidore/colpali-v1.2-hf",
    torch_dtype=torch.bfloat16,
).to(device)

processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2-hf")

# Your inputs (replace dummy images with screenshots of your documents)
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the image and text
batch_images = processor(images=images).to(device)
batch_queries = processor(text=queries).to(device)

with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

Useful Resources

Compatible libraries

Visual Document Retrieval demo

No example widget is defined for this task.

Note Contribute by proposing a widget for this task !

Models for Visual Document Retrieval
Browse Models (52)

Note Very accurate visual document retrieval model for multilingual queries and documents.

Note Very fast and efficient visual document retrieval model that works on five languages.

Datasets for Visual Document Retrieval
Browse Datasets (3)

Note A large dataset used to train visual document retrieval models.

Spaces using Visual Document Retrieval

Note A leaderboard of visual document retrieval models.

Metrics for Visual Document Retrieval
Normalized Discounted Cumulative Gain at K
NDCG@k scores ranked recommendation lists for top-k results. 0 is the worst, 1 is the best.