Implementing ColPali with Qwen2-VL

In [1]:
from byaldi import RAGMultiModalModel

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali")



Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.01it/s]


In [2]:
RAG.index(
    input_path="image.png",
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True,
)

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Added page 1 of document 0 to index.
Index exported to .byaldi\image_index


{0: 'image.png'}

In [3]:
text_query = "What is the structure of the compiler?"
results = RAG.search(text_query, k=1)
results

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.


[{'doc_id': 0, 'page_num': 1, 'score': 18.75, 'metadata': {}, 'base64': None}]

In [5]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00, 6.88s/it]


In [7]:
# page_num is 1-based; subtract 1 to get a zero-based page index
results[0]["page_num"] - 1

0
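
The cell above turns the 1-based `page_num` of the top hit into a zero-based index. A minimal sketch of how a hit could be mapped back to its source file and page index, assuming the doc-id mapping and hit fields look like the outputs shown earlier. The names `resolve_hit`, `example_mapping`, and `example_hits` are hypothetical, and the values are copied from those outputs rather than produced by a live call:

```python
# Values copied from the cell outputs above (assumption: in the notebook they
# would come from RAG.index(...) and RAG.search(...)).
example_mapping = {0: "image.png"}
example_hits = [
    {"doc_id": 0, "page_num": 1, "score": 18.75, "metadata": {}, "base64": None}
]


def resolve_hit(hit, doc_id_to_file):
    """Return (file_path, zero_based_page_index) for one search hit."""
    return doc_id_to_file[hit["doc_id"]], hit["page_num"] - 1


# Pick the highest-scoring hit, then look up its file and page index.
best = max(example_hits, key=lambda h: h["score"])
path, page_index = resolve_hit(best, example_mapping)
print(path, page_index)
```

The resolved path could then stand in for the hard-coded "image.png" when the chat messages are built.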

In [8]:
from PIL import Image
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": Image.open("image.png"),
            },
            {"type": "text", "text": text_query},
        ],
    }
]
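
The message structure above can be wrapped in a small helper so the same layout is reused for other images and queries. This is only a sketch: `build_messages` is a hypothetical name, not part of any library, and a placeholder string stands in for the PIL image so the example runs without files on disk:

```python
def build_messages(image, query):
    """Build a single-turn user message holding one image and one text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": query},
            ],
        }
    ]


# Placeholder value (assumption): in the notebook this would be Image.open(...).
example = build_messages("<PIL image here>", "What is the structure of the compiler?")
print([part["type"] for part in example[0]["content"]])
```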

In [9]:
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [11]:
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = inputs.to(device)
model = model.to(device)

In [12]:
generated_ids = model.generate(**inputs, max_new_tokens=50)
# generate() returns the prompt tokens followed by the new tokens; keep only the new ones
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
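
The slicing step above drops the prompt tokens from each generated sequence, so that only the model's answer is decoded. A toy illustration with plain Python lists follows; the token ids are made up, and `trim_prompt` is a hypothetical helper name:

```python
# Made-up token ids standing in for inputs.input_ids and generated_ids.
toy_prompt_ids = [[101, 7, 8, 9], [101, 5, 6, 0]]
toy_generated_ids = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 0, 44, 45]]


def trim_prompt(prompt_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(prompt_ids, generated_ids)]


toy_new_tokens = trim_prompt(toy_prompt_ids, toy_generated_ids)
print(toy_new_tokens)  # [[42, 43], [44, 45]]
```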


In [13]:
print(output_text)

['The structure of the compiler, as described in the syllabus, includes the following components:\n\n1. **Lexical Analysis**: This involves the role of the lexical analyzer, input buffering, and the design of lexical analyzers, specification and recognition of tokens']
