Atulit23 committed · verified
Commit 738e435 · Parent: 4dd036e

Upload folder using huggingface_hub
.byaldi/image_index/doc_ids_to_file_names.json.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5adce3e520525f462d8f71c09a42b3ca10cc5039b79cd1640e0c0d97acd9e17
+ size 68
.byaldi/image_index/embed_id_to_doc_id.json.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60aedd13c343e38d2cb81b0c953b2f4b3db530f44b96af3167f63ff218c831ba
+ size 79
.byaldi/image_index/embeddings/embeddings_0.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:404010331a12bd6c9dd18c87358a10c1c1ecf58e19c2dd402ae1757cded340e6
+ size 264885
.byaldi/image_index/index_config.json.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83da74cae33705bdcbdc43f436f45ec099b683245a56d7fd72336954916e9a3c
+ size 174
.byaldi/image_index/metadata.json.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a23514d5d0a1b04d797c42e596342a4b3203e7ed7886d6cad63c97ee0ae49b58
+ size 38
.github/workflows/update_space.yml ADDED
@@ -0,0 +1,28 @@
+ name: Run Python script
+
+ on:
+   push:
+     branches:
+       - main
+
+ jobs:
+   build:
+     runs-on: ubuntu-latest
+
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v2
+
+       - name: Set up Python
+         uses: actions/setup-python@v2
+         with:
+           python-version: '3.9'
+
+       - name: Install Gradio
+         run: python -m pip install gradio
+
+       - name: Log in to Hugging Face
+         run: python -c 'import huggingface_hub; huggingface_hub.login(token="${{ secrets.hf_token }}")'
+
+       - name: Deploy to Spaces
+         run: gradio deploy
README.md CHANGED
@@ -1,12 +1,34 @@
  ---
  title: ColPali
- emoji: 🔥
- colorFrom: gray
- colorTo: pink
- sdk: gradio
- sdk_version: 4.44.0
  app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: ColPali
  app_file: app.py
+ sdk: gradio
+ sdk_version: 4.41.0
  ---
+ # RAG-based PDF Search and Keyword Extraction using Qwen2VL
+
+ This repository contains an implementation of a **RAG (Retrieval-Augmented Generation)** based PDF search system built with **ColPali** (via the **Byaldi** library) and **Qwen2VL**. Additionally, the repository includes a Gradio app that allows users to extract text from images and highlight searched keywords using **Qwen2VL**.
+
+ ## Table of Contents
+ - [Overview](#overview)
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [RAG PDF Search](#rag-pdf-search)
+ - [Gradio App for Keyword Extraction](#gradio-app-for-keyword-extraction)
+ - [License](#license)
+
+ ## Overview
+
+ ### RAG PDF Search
+
+ In `copali-qwen.ipynb`, you will find the complete implementation of the **RAG-based PDF search**. The pipeline is built with **ColPali** through the **Byaldi** library, along with **Qwen2VL**. By default, the code indexes and searches a single image (`image.png`), but you can easily change the path to a PDF file or any other document.
+
+ ### Gradio App for Keyword Extraction
+
+ The `app.py` file contains a **Gradio app** that uses only **Qwen2VL** to extract text from an image and highlight the keywords matching the user's search query. The app is an easy-to-use interface for keyword extraction from images.
+
+ ## Installation
+
+ To run this project, you will need to install the following dependencies:

+ ```bash
+ pip install transformers byaldi qwen-vl-utils gradio pillow torch
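As the README's Overview notes, the notebook indexes a single `image.png` by default, but the same Byaldi calls accept a PDF path. A minimal sketch of that swap, reusing the API exactly as it appears in `copali-qwen.ipynb` (`document.pdf` is a placeholder filename; rendering PDF pages additionally relies on `pdf2image` from `requirements.txt` and the `poppler-utils` package from `packages.txt`):

```python
# Sketch: index a PDF instead of image.png and retrieve the best-matching page.
# Mirrors the calls used in copali-qwen.ipynb; "document.pdf" is a placeholder path.
from byaldi import RAGMultiModalModel

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali")

RAG.index(
    input_path="document.pdf",          # was "image.png" in the notebook
    index_name="pdf_index",
    store_collection_with_index=False,
    overwrite=True,
)

results = RAG.search("What is the structure of the compiler?", k=1)
print(results[0]["doc_id"], results[0]["page_num"], results[0]["score"])
```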
app.py ADDED
@@ -0,0 +1,82 @@
+ import gradio as gr
+ import torch
+ from PIL import Image
+ from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+ import re
+
+ min_pixels = 256 * 28 * 28
+ max_pixels = 1280 * 28 * 28
+
+ def model_inference(images):
+     model = Qwen2VLForConditionalGeneration.from_pretrained(
+         "Qwen/Qwen2-VL-2B-Instruct",
+         trust_remote_code=True,
+         torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
+     )
+     processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
+
+     images = [{"type": "image", "image": Image.open(image[0])} for image in images]
+
+     messages = [{"role": "user", "content": images}]
+
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     image_inputs, video_inputs = process_vision_info(messages)
+     inputs = processor(
+         text=[text],
+         images=image_inputs,
+         videos=video_inputs,
+         padding=True,
+         return_tensors="pt",
+     )
+
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     inputs = inputs.to(device)
+     model = model.to(device)
+
+     generated_ids = model.generate(**inputs, max_new_tokens=512)
+     generated_ids_trimmed = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+     ]
+
+     output_text = processor.batch_decode(
+         generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+     )
+
+     del model
+     del processor
+     return output_text[0]
+
+ def search_and_highlight(text, keywords):
+     if not keywords:
+         return text
+
+     keywords = [kw.strip().lower() for kw in keywords.split(',')]
+     highlighted_text = text
+
+     for keyword in keywords:
+         pattern = re.compile(re.escape(keyword), re.IGNORECASE)
+         highlighted_text = pattern.sub(f'**{keyword}**', highlighted_text)
+
+     return highlighted_text
+
+ def extract_and_search(images, keywords):
+     extracted_text = model_inference(images)
+     highlighted_text = search_and_highlight(extracted_text, keywords)
+     return extracted_text, highlighted_text
+
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     with gr.Row():
+         output_gallery = gr.Gallery(label="Image", height=300, show_label=True)
+         keywords = gr.Textbox(placeholder="Enter keywords to search (comma-separated)", label="Search Keywords")
+
+     extract_button = gr.Button("Extract Text and Search", variant="primary")
+
+     with gr.Row():
+         raw_output = gr.Textbox(label="Interpreted Text")
+         highlighted_output = gr.Markdown(label="Highlighted Search Results")
+
+     extract_button.click(extract_and_search, inputs=[output_gallery, keywords], outputs=[raw_output, highlighted_output])
+
+ if __name__ == "__main__":
+     demo.queue(max_size=10).launch(share=True)
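Two design points in `app.py` worth noting: `model_inference` loads the Qwen2-VL-2B model and processor on every call and deletes them before returning, trading extra latency per request for a smaller steady-state memory footprint; and `search_and_highlight` wraps case-insensitive matches in `**…**` so the `gr.Markdown` component renders them bold, although each match is replaced with the lowercased keyword, so the original casing is not preserved. A standalone copy of the helper, shown only to illustrate its output:

```python
# Sketch: standalone copy of search_and_highlight from app.py, for illustration.
import re

def search_and_highlight(text, keywords):
    if not keywords:
        return text
    keywords = [kw.strip().lower() for kw in keywords.split(',')]
    highlighted_text = text
    for keyword in keywords:
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        highlighted_text = pattern.sub(f'**{keyword}**', highlighted_text)
    return highlighted_text

print(search_and_highlight("Lexical Analysis and parsing", "lexical, parsing"))
# -> **lexical** Analysis and **parsing**   (note the lowercased match)
```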
copali-qwen.ipynb ADDED
@@ -0,0 +1,280 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Implementing Colpali with Qwen2VL"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "c:\\Users\\atuli\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Verbosity is set to 1 (active). Pass verbose=0 to make quieter.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.\n",
+ "Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use\n",
+ "`config.hidden_activation` if you want to override this behaviour.\n",
+ "See https://github.com/huggingface/transformers/pull/29402 for more details.\n",
+ "Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.01it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "from byaldi import RAGMultiModalModel\n",
+ "\n",
+ "RAG = RAGMultiModalModel.from_pretrained(\"vidore/colpali\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.\n",
+ "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Added page 1 of document 0 to index.\n",
+ "Index exported to .byaldi\\image_index\n",
+ "Index exported to .byaldi\\image_index\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{0: 'image.png'}"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "RAG.index(\n",
+ " input_path=\"image.png\",\n",
+ " index_name=\"image_index\",\n",
+ " store_collection_with_index=False,\n",
+ " overwrite=True\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'doc_id': 0, 'page_num': 1, 'score': 18.75, 'metadata': {}, 'base64': None}]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "text_query = \"What is the structure of the compiler?\"\n",
+ "results = RAG.search(text_query, k=1)\n",
+ "results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.\n",
+ "Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}\n",
+ "Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00, 6.88s/it]\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor\n",
+ "from qwen_vl_utils import process_vision_info\n",
+ "import torch\n",
+ "\n",
+ "model = Qwen2VLForConditionalGeneration.from_pretrained(\n",
+ " \"Qwen/Qwen2-VL-2B-Instruct\",\n",
+ " trust_remote_code=True,\n",
+ " torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results[0][\"page_num\"] -1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from PIL import Image\n",
+ "processor = AutoProcessor.from_pretrained(\"Qwen/Qwen2-VL-2B-Instruct\", trust_remote_code=True)\n",
+ "\n",
+ "messages = [\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": [\n",
+ " {\n",
+ " \"type\": \"image\",\n",
+ " \"image\": Image.open(\"image.png\"),\n",
+ " },\n",
+ " {\"type\": \"text\", \"text\": text_query},\n",
+ " ],\n",
+ " }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text = processor.apply_chat_template(\n",
+ " messages, tokenize=False, add_generation_prompt=True\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "image_inputs, video_inputs = process_vision_info(messages)\n",
+ "inputs = processor(\n",
+ " text=[text],\n",
+ " images=image_inputs,\n",
+ " videos=video_inputs,\n",
+ " padding=True,\n",
+ " return_tensors=\"pt\",\n",
+ ")\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "inputs = inputs.to(device)\n",
+ "model = model.to(device)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "generated_ids = model.generate(**inputs, max_new_tokens=50)\n",
+ "generated_ids_trimmed = [\n",
+ " out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n",
+ "]\n",
+ "output_text = processor.batch_decode(\n",
+ " generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['The structure of the compiler, as described in the syllabus, includes the following components:\\n\\n1. **Lexical Analysis**: This involves the role of the lexical analyzer, input buffering, and the design of lexical analyzers, specification and recognition of tokens']\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(output_text)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
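The notebook computes `results[0]["page_num"] - 1` but, because only a single image was indexed, it then opens `image.png` directly. When a PDF is indexed instead, that retrieved page number selects which rendered page to hand to Qwen2VL. A hedged sketch of that step, assuming `model`, `processor`, `text_query`, and `results` exist as defined in the notebook, with a placeholder `document.pdf` as the indexed file (page rendering uses `pdf2image` and the `poppler-utils` package listed in this commit):

```python
# Sketch: pass the page retrieved by Byaldi to Qwen2VL when a PDF (not image.png) was indexed.
# Assumes model, processor, text_query and results already exist as in the notebook;
# "document.pdf" is a placeholder for the indexed file.
from pdf2image import convert_from_path
from qwen_vl_utils import process_vision_info

pages = convert_from_path("document.pdf")            # one PIL image per page (needs poppler-utils)
retrieved_page = pages[results[0]["page_num"] - 1]   # page_num is 1-based

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": retrieved_page},
        {"type": "text", "text": text_query},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=50)
```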
image.png ADDED
packages.txt ADDED
@@ -0,0 +1 @@
+ poppler-utils
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ colpali-engine==0.2.0
+ pdf2image
+ GPUtil
+ accelerate==0.30.1
+ mteb>=1.12.22
+ git+https://github.com/huggingface/transformers
+ qwen-vl-utils
+ torchvision
+ fastapi<0.113.0
+ byaldi