What libraries can I use for Visual Document Retrieval?

The transformers library is compatible with Visual Document Retrieval.

What models can I use for Visual Document Retrieval?

The vidore/colqwen2-v1.0 and marco/mcdse-2b-v1 models can be used for Visual Document Retrieval.

What datasets can I use for Visual Document Retrieval?

The vidore/colpali_train_set dataset can be used for Visual Document Retrieval.

What metrics can I use for Visual Document Retrieval?

The Normalized Discounted Cumulative Gain at K (NDCG@K) metric can be used for Visual Document Retrieval.
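
As a rough illustration of what NDCG at K measures, the sketch below scores a single ranked list. It assumes linear gain and a log2 rank discount, and it assumes the list contains a relevance label for every candidate document. The function names and the relevance labels are hypothetical and are not taken from any particular library.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 0 is divided by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical labels, listed in the order the model ranked the documents
perfect = ndcg_at_k([1, 0, 0, 0], k=3)  # relevant document ranked first: 1.0
swapped = ndcg_at_k([0, 1, 0, 0], k=3)  # relevant document ranked second: about 0.63
```

A benchmark score is typically the average of this value over all queries.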

What is Visual Document Retrieval?

Visual document retrieval is the task of searching for relevant image-based documents, such as PDFs. These models take a text query and multiple documents as input and return the most relevant documents along with their relevance scores.

Use Cases

Multimodal Document Retrieval

Visual document retrieval models can be used to retrieve relevant documents when given a text query. The documents need to be indexed first, which is a one-time operation. Once indexing is done, the retrieval model takes a text query (question) and the number k of documents to return, and returns the top-k most relevant documents for the query. The same index can be reused for every subsequent query.
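
The index-once, query-many pattern can be sketched with a toy in-memory index. This is a simplified illustration and not how any specific library implements indexing: it assumes one vector per document and dot-product similarity, and the `DocumentIndex` class, the `embed_document` callable, and the hand-written vectors are all hypothetical stand-ins for a real model.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class DocumentIndex:
    """Toy index: embeds each document once, then answers many queries."""

    def __init__(self, embed_document: Callable[[str], Vector]):
        self._embed_document = embed_document
        self._ids: List[str] = []
        self._vectors: List[Vector] = []

    def add(self, doc_id: str) -> None:
        # One-time cost per document: compute and store its embedding.
        self._ids.append(doc_id)
        self._vectors.append(self._embed_document(doc_id))

    def search(self, query_vector: Vector, k: int) -> List[Tuple[str, float]]:
        # Score the query against every stored vector and keep the k best.
        scored = ((doc_id, dot(query_vector, vec)) for doc_id, vec in zip(self._ids, self._vectors))
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical stand-in for a real model: hand-written 2-d vectors
fake_embeddings = {
    "org_chart.png": [1.0, 0.0],
    "balance_sheet.png": [0.0, 1.0],
    "budget.png": [0.2, 0.9],
}
index = DocumentIndex(embed_document=lambda doc_id: fake_embeddings[doc_id])
for doc_id in fake_embeddings:
    index.add(doc_id)

# The same index serves any number of queries
results = index.search([0.1, 1.0], k=2)
top_ids = [doc_id for doc_id, _ in results]
```

In practice the stored embeddings would come from the retrieval model and would usually live in a vector store instead of Python lists.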

Multimodal Retrieval Augmented Generation (RAG)

Multimodal RAG is the task of generating answers from documents (text or images) when given a text query and a set of documents. The retrieved documents and the text query can be fed to a vision language model to produce the answer.
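
The two stages can be sketched as a function that composes a retriever with a generator. Everything here is a hypothetical placeholder: `retrieve` stands in for a visual document retrieval model plus its index, `generate` stands in for a vision language model, and the fake implementations exist only so the sketch runs without any model.

```python
from typing import Callable, List, Sequence


def answer_with_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, Sequence[str]], str],
    k: int = 3,
) -> str:
    """Retrieve the top-k documents for the query, then answer from them."""
    documents = retrieve(query, k)
    return generate(query, documents)


# Hypothetical stand-ins for a retriever and a vision language model
corpus = ["page_1.png", "page_2.png", "page_3.png", "page_4.png"]


def fake_retrieve(query: str, k: int) -> List[str]:
    return corpus[:k]


def fake_generate(query: str, documents: Sequence[str]) -> str:
    return f"Answer to {query!r} based on {len(documents)} page(s)"


answer = answer_with_rag("What was the revenue?", fake_retrieve, fake_generate, k=2)
```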

Inference

You can run inference with visual document retrieval models using transformers. To calculate the similarity between images and text, process both separately and pass each processed input through the model. The resulting embeddings can then be passed to the processor's scoring method to compute similarity scores.

import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = ColPaliForRetrieval.from_pretrained(
    "vidore/colpali-v1.2-hf",
    torch_dtype=torch.bfloat16,
).to(device)

processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2-hf")

# Your inputs (replace dummy images with screenshots of your documents)
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the images and the text queries separately
batch_images = processor(images=images).to(device)
batch_queries = processor(text=queries).to(device)

with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Score the queries against the images; scores has shape (n_queries, n_images)
scores = processor.score_retrieval(query_embeddings, image_embeddings)
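
To make the scoring step more concrete, here is a minimal sketch of late-interaction (MaxSim) scoring on plain Python lists, followed by turning one row of scores into a ranking. It assumes, as is commonly described for ColPali-style models, that each query and each document is represented by several vectors. The helper names and the toy embeddings are hypothetical, and this is not the library's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], doc_vectors: Sequence[Vector]) -> float:
    """For each query vector take its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rank_documents(score_row: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring documents, best first."""
    order = sorted(range(len(score_row)), key=lambda i: score_row[i], reverse=True)
    return order[:k]


# Hypothetical toy embeddings: one query and two documents, two vectors each
query = [[1.0, 0.0], [0.0, 1.0]]
documents = [
    [[1.0, 0.0], [0.5, 0.0]],  # matches only the first query vector
    [[1.0, 0.0], [0.0, 1.0]],  # matches both query vectors
]

score_row = [maxsim_score(query, doc) for doc in documents]  # [1.0, 2.0]
ranking = rank_documents(score_row, k=2)  # [1, 0]
```

With the real `scores` tensor from the example above, the same ranking step could be applied to each row, for instance after converting it with `scores.tolist()`.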
