vk98 committed on
Commit
a54266b
·
0 Parent(s):

Initial deployment of ColPali Visual Retrieval backend

.gitignore ADDED
@@ -0,0 +1,15 @@
1
+ .sesskey
2
+ .venv/
3
+ __pycache__/
4
+ ipynb_checkpoints/
5
+ .python-version
6
+ .env
7
+ template/
8
+ *.json
9
+ output/
10
+ pdfs/
11
+ colpalidemo/
12
+ src/static/full_images/
13
+ src/static/sim_maps/
14
+ embeddings/
15
+ hf_dataset/
Dockerfile ADDED
@@ -0,0 +1,35 @@
1
+ FROM python:3.11-slim
2
+
3
+ # Install system dependencies
4
+ RUN apt-get update && apt-get install -y \
5
+ git \
6
+ build-essential \
7
+ libglib2.0-0 \
8
+ libsm6 \
9
+ libxext6 \
10
+ libxrender-dev \
11
+ libgomp1 \
12
+ wget \
13
+ && rm -rf /var/lib/apt/lists/*
14
+
15
+ # Create a non-root user
16
+ RUN useradd -m -u 1000 user
17
+ USER user
18
+ ENV HOME=/home/user \
19
+ PATH=/home/user/.local/bin:$PATH
20
+
21
+ # Set working directory
22
+ WORKDIR $HOME/app
23
+
24
+ # Copy files to container
25
+ COPY --chown=user . $HOME/app
26
+
27
+ # Install Python dependencies
28
+ RUN pip install --no-cache-dir --upgrade pip
29
+ RUN pip install --no-cache-dir -e .
30
+
31
+ # Expose the port the app runs on
32
+ EXPOSE 7860
33
+
34
+ # Run the application
35
+ CMD ["python", "main.py"]
README.md ADDED
@@ -0,0 +1,54 @@
1
+ ---
2
+ title: ColPali Visual Retrieval
3
+ emoji: 🔍
4
+ colorFrom: green
5
+ colorTo: blue
6
+ sdk: docker
7
+ sdk_version: "3.11"
8
+ app_file: app.py
9
+ pinned: false
10
+ ---
11
+
12
+ # ColPali Visual Retrieval with Vespa
13
+
14
+ A powerful visual document retrieval system that combines **ColPali** (ColBERT-style late interaction over the PaliGemma vision-language model) with **Vespa** for scalable, intelligent document search and question-answering.
15
+
16
+ ## 🌟 Features
17
+
18
+ - **Visual Document Search**: Search through PDF documents using natural language queries
19
+ - **Token-level Similarity Maps**: Visualize exactly which parts of documents match your query
20
+ - **AI-Powered Chat**: Ask questions about retrieved documents using Google Gemini
21
+ - **Multiple Ranking Methods**: Choose between ColPali, BM25, or Hybrid ranking
22
+
23
+ ## 🚀 Try It Out
24
+
25
+ 1. Enter a natural language query in the search box
26
+ 2. Select your preferred ranking method
27
+ 3. Click on token buttons to see visual attention maps
28
+ 4. Ask follow-up questions in the chat interface
29
+
30
+ ## 📄 Sample Queries
31
+
32
+ - "Pie chart with model comparison"
33
+ - "Speaker diarization evaluation"
34
+ - "Results table from dense retrieval"
35
+ - "Graph showing training loss"
36
+ - "Architecture diagram with transformer"
37
+
38
+ ## 🛠️ Technology Stack
39
+
40
+ - **ColPali**: Visual-language model for document understanding
41
+ - **Vespa**: Distributed search engine for scalability
42
+ - **FastHTML**: Modern web framework for the UI
43
+ - **Google Gemini**: AI-powered question answering
44
+
45
+ ## 📊 About the Dataset
46
+
47
+ This demo uses ~400 pages from AI-related research papers published in 2024. The documents are processed using ColPali to create visual embeddings that enable semantic search across document images.
48
+
49
+ ## 🔗 Links
50
+
51
+ - [ColPali Paper](https://arxiv.org/abs/2407.01449)
52
+ - [Vespa Documentation](https://docs.vespa.ai/)
53
+ - [Blog Post](https://blog.vespa.ai/visual-retrieval-with-colpali-and-vespa/)
54
+ - [GitHub Repository](https://github.com/vespa-engine/vespa/tree/master/examples/colpali-visual-retrieval)
README_HF.md ADDED
@@ -0,0 +1,54 @@
1
+ ---
2
+ title: ColPali Visual Retrieval
3
+ emoji: 🔍
4
+ colorFrom: green
5
+ colorTo: blue
6
+ sdk: docker
7
+ sdk_version: "3.11"
8
+ app_file: app.py
9
+ pinned: false
10
+ ---
11
+
12
+ # ColPali Visual Retrieval with Vespa
13
+
14
+ A powerful visual document retrieval system that combines **ColPali** (ColBERT-style late interaction over the PaliGemma vision-language model) with **Vespa** for scalable, intelligent document search and question-answering.
15
+
16
+ ## 🌟 Features
17
+
18
+ - **Visual Document Search**: Search through PDF documents using natural language queries
19
+ - **Token-level Similarity Maps**: Visualize exactly which parts of documents match your query
20
+ - **AI-Powered Chat**: Ask questions about retrieved documents using Google Gemini
21
+ - **Multiple Ranking Methods**: Choose between ColPali, BM25, or Hybrid ranking
22
+
23
+ ## 🚀 Try It Out
24
+
25
+ 1. Enter a natural language query in the search box
26
+ 2. Select your preferred ranking method
27
+ 3. Click on token buttons to see visual attention maps
28
+ 4. Ask follow-up questions in the chat interface
29
+
30
+ ## 📄 Sample Queries
31
+
32
+ - "Pie chart with model comparison"
33
+ - "Speaker diarization evaluation"
34
+ - "Results table from dense retrieval"
35
+ - "Graph showing training loss"
36
+ - "Architecture diagram with transformer"
37
+
38
+ ## 🛠️ Technology Stack
39
+
40
+ - **ColPali**: Visual-language model for document understanding
41
+ - **Vespa**: Distributed search engine for scalability
42
+ - **FastHTML**: Modern web framework for the UI
43
+ - **Google Gemini**: AI-powered question answering
44
+
45
+ ## 📊 About the Dataset
46
+
47
+ This demo uses ~400 pages from AI-related research papers published in 2024. The documents are processed using ColPali to create visual embeddings that enable semantic search across document images.
48
+
49
+ ## 🔗 Links
50
+
51
+ - [ColPali Paper](https://arxiv.org/abs/2407.01449)
52
+ - [Vespa Documentation](https://docs.vespa.ai/)
53
+ - [Blog Post](https://blog.vespa.ai/visual-retrieval-with-colpali-and-vespa/)
54
+ - [GitHub Repository](https://github.com/vespa-engine/vespa/tree/master/examples/colpali-visual-retrieval)
app.py ADDED
@@ -0,0 +1,17 @@
1
+ #!/usr/bin/env python3
2
+ # app.py - Entry point for Hugging Face Spaces
3
+ from main import app, APP_DIR, setup_static_routes, IMG_DIR, SIM_MAP_DIR
4
+ import os
5
+
6
+ # Ensure directories exist
7
+ os.makedirs(APP_DIR, exist_ok=True)
8
+ os.makedirs(IMG_DIR, exist_ok=True)
9
+ os.makedirs(SIM_MAP_DIR, exist_ok=True)
10
+
11
+ # Set up static routes
12
+ setup_static_routes(app)
13
+
14
+ # For Hugging Face Spaces
15
+ if __name__ == "__main__":
16
+ import uvicorn
17
+ uvicorn.run(app, host="0.0.0.0", port=7860)
backend/__init__.py ADDED
File without changes
backend/about.md ADDED
@@ -0,0 +1,985 @@
1
+ # ColPali 🤝 Vespa - Visual Retrieval System
2
+
3
+ A powerful visual document retrieval system that combines **ColPali** (ColBERT-style late interaction over the PaliGemma vision-language model) with **Vespa** for scalable, intelligent document search and question-answering.
4
+
5
+ ## 🌟 Features
6
+
7
+ ### 🔍 **Visual Document Search**
8
+
9
+ - **Multi-modal retrieval**: Search through PDF documents using natural language queries
10
+ - **Visual understanding**: ColPali model processes document images and text simultaneously
11
+ - **Token-level similarity maps**: Visualize exactly which parts of documents match your query
12
+ - **Multiple ranking algorithms**: Choose between hybrid, semantic, and other ranking methods
13
+
14
+ ### 🧠 **AI-Powered Chat**
15
+
16
+ - **Intelligent Q&A**: Ask questions about retrieved documents using Google Gemini 2.0
17
+ - **Context-aware responses**: AI analyzes document images to provide accurate answers
18
+ - **Real-time streaming**: Get responses as they're generated
19
+
20
+ ### ⚡ **Scalable Infrastructure**
21
+
22
+ - **Vespa integration**: Enterprise-grade search platform for large document collections
23
+ - **Real-time processing**: Instant search results and similarity map generation
24
+ - **Cloud-ready**: Supports Vespa Cloud deployment with secure authentication
25
+
26
+ ## 🏗️ Architecture
27
+
28
+ ```
29
+ ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
30
+ │    Frontend     │    │     Backend     │    │   Vespa Cloud   │
31
+ │   (Browser)     │    │   (Your Local   │    │    (Remote)     │
32
+ │                 │    │    Computer)    │    │                 │
33
+ │ • Search UI     │◄──►│ • ColPali Model │◄──►│ • Document Store│
34
+ │ • Similarity    │    │ • Query Proc.   │    │ • Vector Search │
35
+ │   Maps          │    │ • Sim Map Gen.  │    │ • Ranking       │
36
+ │ • Chat Interface│    │ • Gemini Int.   │    │                 │
37
+ └─────────────────┘    └─────────────────┘    └─────────────────┘
38
+          ↑                      ↑                      ↑
39
+     Web Browser             LOCAL AI            REMOTE Storage
40
+ ```
41
+
42
+ ### 🏠 **LOCAL Processing (Your Computer)**
43
+
44
+ **All AI model inference happens on YOUR local machine:**
45
+
46
+ - **ColPali Model**: Runs locally on your GPU/CPU (~7GB model)
47
+ - **Document Processing**: PDF → Images → Embeddings (local)
48
+ - **Query Processing**: Text → Embeddings (local)
49
+ - **Similarity Maps**: Visual attention generation (local)
50
+ - **Gemini Chat**: Processes retrieved images locally
51
+
52
+ **Device Detection:**
53
+
54
+ ```python
55
+ device = get_torch_device("auto") # Detects: CUDA, MPS (Apple), or CPU
56
+ print(f"Using device: {device}") # Shows YOUR hardware
57
+ ```
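+
+ As a minimal sketch of what "runs locally" means in practice, this mirrors how `backend/colpali.py` loads the model (same model name and dtype as the repo code):
+
+ ```python
+ import torch
+ from colpali_engine.models import ColPali, ColPaliProcessor
+ from colpali_engine.utils.torch_utils import get_torch_device
+
+ # Pick CUDA, MPS (Apple Silicon), or CPU automatically.
+ device = get_torch_device("auto")
+
+ # Load the ColPali checkpoint on YOUR hardware; nothing is sent to Vespa here.
+ model = ColPali.from_pretrained(
+     "vidore/colpali-v1.2",
+     torch_dtype=torch.bfloat16,   # keeps memory usage manageable
+     device_map=device,
+ ).eval()
+ processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")
+ ```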
58
+
59
+ ### ☁️ **REMOTE Processing (Vespa Cloud)**
60
+
61
+ **Only storage and search index operations happen remotely:**
62
+
63
+ - **Document Storage**: Stores processed embeddings (not raw models)
64
+ - **Vector Search**: Fast similarity search across document collection
65
+ - **Query Routing**: Handles search requests and ranking
66
+ - **Metadata Storage**: Document titles, URLs, page numbers
67
+
68
+ ### 🔄 **Complete Data Flow**
69
+
70
+ #### **Document Upload Process:**
71
+
72
+ 1. **LOCAL**: Your computer downloads PDF from URL
73
+ 2. **LOCAL**: ColPali converts PDF pages to images
74
+ 3. **LOCAL**: ColPali generates visual embeddings (1024 patches × 128 dims)
75
+ 4. **LOCAL**: Embeddings converted to binary format for efficiency
76
+ 5. **REMOTE**: Binary embeddings uploaded to Vespa Cloud
77
+ 6. **REMOTE**: Vespa indexes embeddings for fast search
78
+
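+ A rough sketch of the binarization in steps 3-5 (illustrative only; the exact encoding is defined by `feed_vespa.py` and the Vespa schema). Values above zero become 1-bits, the rest 0-bits, so each 128-dim float vector shrinks to 16 packed bytes:
+
+ ```python
+ import numpy as np
+
+ def binarize_patch_embeddings(patch_embeddings: np.ndarray) -> list[str]:
+     """Pack (n_patches, 128) float embeddings into hex-encoded binary vectors."""
+     bits = patch_embeddings > 0                   # one bit per embedding dimension
+     packed = np.packbits(bits, axis=1)            # (n_patches, 16) bytes for 128 dims
+     # Hex strings are a convenient literal form for Vespa int8 tensor cells.
+     return [row.tobytes().hex() for row in packed]
+ ```
+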
79
+ #### **Search Query Process:**
80
+
81
+ 1. **LOCAL**: You enter search query in browser
82
+ 2. **LOCAL**: ColPali processes query → generates query embeddings
83
+ 3. **REMOTE**: Query embeddings sent to Vespa Cloud
84
+ 4. **REMOTE**: Vespa searches document index, returns matches
85
+ 5. **LOCAL**: ColPali generates similarity maps for results
86
+ 6. **BROWSER**: Results displayed with visual attention maps
87
+
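+ For intuition, the similarity search Vespa performs in step 4 is a late-interaction ("MaxSim") score between query-token embeddings and page-patch embeddings. A toy version in plain PyTorch (illustrative; the real scoring runs inside Vespa's rank profile):
+
+ ```python
+ import torch
+
+ def maxsim_score(query_embs: torch.Tensor, page_embs: torch.Tensor) -> float:
+     """query_embs: (n_query_tokens, dim), page_embs: (n_patches, dim)."""
+     sims = query_embs @ page_embs.T              # (n_query_tokens, n_patches)
+     # Each query token keeps only its best-matching patch; sum over tokens.
+     return sims.max(dim=1).values.sum().item()
+ ```
+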
88
+ #### **AI Chat Process:**
89
+
90
+ 1. **LOCAL**: Retrieved document images processed by your machine
91
+ 2. **REMOTE**: Images + query sent to Google Gemini API
92
+ 3. **REMOTE**: Gemini generates response based on visual content
93
+ 4. **BROWSER**: Streaming response displayed in real-time
94
+
95
+ ### Core Components
96
+
97
+ - **ColPali Model**: Visual-language model for document understanding (LOCAL)
98
+ - **Vespa Search**: Distributed search and storage engine (REMOTE)
99
+ - **FastHTML Frontend**: Modern, responsive web interface (BROWSER)
100
+ - **Gemini Integration**: AI-powered question answering (REMOTE API)
101
+ - **Similarity Map Generator**: Visual attention visualization (LOCAL)
102
+
103
+ ## 💻 **System Requirements**
104
+
105
+ ### **LOCAL Machine Requirements (For AI Processing)**
106
+
107
+ **Minimum:**
108
+
109
+ - **CPU**: Modern multi-core processor (Intel/AMD/Apple Silicon)
110
+ - **RAM**: 8GB+ (16GB recommended)
111
+ - **Storage**: 10GB free space (for model cache)
112
+ - **Python**: 3.10+ (< 3.13)
113
+
114
+ **Recommended:**
115
+
116
+ - **GPU**: NVIDIA GPU with 8GB+ VRAM (RTX 3070/4060 or better)
117
+ - **Apple**: M1/M2/M3 Mac (uses Metal Performance Shaders)
118
+ - **RAM**: 16GB+ for smoother processing
119
+ - **Storage**: SSD for faster model loading
120
+
121
+ **Performance Examples:**
122
+
123
+ - **RTX 4090**: ~1-2 seconds per query
124
+ - **RTX 3070**: ~3-5 seconds per query
125
+ - **Apple M2**: ~4-6 seconds per query
126
+ - **CPU Only**: ~15-30 seconds per query
127
+
128
+ ### **REMOTE Requirements (Vespa Cloud)**
129
+
130
+ **What you need:**
131
+
132
+ - **Vespa Cloud account** (handles all remote processing)
133
+ - **Internet connection** (for uploading embeddings and search queries)
134
+ - **Authentication tokens** (provided by Vespa Cloud)
135
+
136
+ **What Vespa Cloud provides:**
137
+
138
+ - **Scalable storage** for any number of documents
139
+ - **Sub-second search** across millions of embeddings
140
+ - **High availability** with automatic failover
141
+ - **Global CDN** for fast access worldwide
142
+
143
+ ## 💰 **Cost Breakdown**
144
+
145
+ ### **FREE Components**
146
+
147
+ - **ColPali Model**: Open source, runs locally (no per-query costs)
148
+ - **Python Application**: MIT/Apache licensed, completely free
149
+ - **Local Processing**: Uses your own hardware (no cloud AI fees)
150
+
151
+ ### **PAID Components**
152
+
153
+ - **Vespa Cloud**: Pay for storage and search operations
154
+ - ~$0.001 per 1000 searches
155
+ - ~$0.10 per GB storage per month
156
+ - **Google Gemini API**: Optional, for chat features only
157
+ - ~$0.01 per 1000 image tokens
158
+ - Only used when you ask questions about documents
159
+
160
+ ### **Cost Examples (Monthly)**
161
+
162
+ - **Personal Use** (100 documents, 1000 searches): ~$5-10/month
163
+ - **Small Business** (1000 documents, 10k searches): ~$20-50/month
164
+ - **Enterprise** (10k+ documents, 100k+ searches): $200+/month
165
+
166
+ **💡 Cost Optimization Tips:**
167
+
168
+ - Use local Vespa installation to avoid cloud costs
169
+ - Disable Gemini chat if not needed (saves API costs)
170
+ - Process documents in batches to minimize upload time
171
+
172
+ ## 🚀 Quick Start
173
+
174
+ ### Prerequisites
175
+
176
+ - Python 3.10+ (< 3.13)
177
+ - **8GB+ RAM** for ColPali model
178
+ - **Vespa Cloud account** or local Vespa installation
179
+ - **Google Gemini API key** (optional, for chat features)
180
+ - **GPU recommended** but not required
181
+
182
+ ### 1. Installation
183
+
184
+ ```bash
185
+ # Clone the repository
186
+ git clone <repository-url>
187
+ cd colpali-vespa-visual-retrieval
188
+
189
+ # Install dependencies
190
+ pip install -e .
191
+
192
+ # For development
193
+ pip install -e ".[dev]"
194
+
195
+ # For document feeding capabilities
196
+ pip install -e ".[feed]"
197
+ ```
198
+
199
+ ### 2. Environment Configuration
200
+
201
+ Create a `.env` file with your configuration:
202
+
203
+ ```bash
204
+ # Vespa Configuration
205
+ VESPA_APP_TOKEN_URL=https://your-app.vespa-cloud.com
206
+ VESPA_CLOUD_SECRET_TOKEN=your_secret_token
207
+
208
+ # Alternative: mTLS Authentication
209
+ USE_MTLS=false
210
+ VESPA_APP_MTLS_URL=https://your-app.vespa-cloud.com
211
+ VESPA_CLOUD_MTLS_KEY="-----BEGIN PRIVATE KEY-----..."
212
+ VESPA_CLOUD_MTLS_CERT="-----BEGIN CERTIFICATE-----..."
213
+
214
+ # Optional: Gemini AI (for chat features)
215
+ GEMINI_API_KEY=your_gemini_api_key
216
+
217
+ # Optional: Logging
218
+ LOG_LEVEL=INFO
219
+ HOT_RELOAD=false
220
+ ```
221
+
222
+ ### 3. Deploy Vespa Application
223
+
224
+ ```bash
225
+ # Deploy the Vespa schema and configuration
226
+ python deploy_vespa_app.py \
227
+ --tenant_name your_tenant \
228
+ --vespa_application_name colpalidemo \
229
+ --token_id_write colpalidemo_write \
230
+ --token_id_read colpalidemo_read
231
+ ```
232
+
233
+ ### 4. Run the Application
234
+
235
+ ```bash
236
+ python main.py
237
+ ```
238
+
239
+ The application will be available at `http://localhost:7860`
240
+
241
+ ## 📚 Document Management
242
+
243
+ ### Uploading Documents
244
+
245
+ Use the feeding script to process and upload PDF documents:
246
+
247
+ ```bash
248
+ python feed_vespa.py \
249
+ --application_name colpalidemo \
250
+ --vespa_schema_name pdf_page
251
+ ```
252
+
253
+ **Document Processing Pipeline (LOCAL → REMOTE):**
254
+
255
+ 1. **PDF Download** (LOCAL): Your computer downloads PDFs from URLs
256
+ 2. **PDF Conversion** (LOCAL): PDFs converted to images (one per page)
257
+ 3. **ColPali Processing** (LOCAL): Each page processed by ColPali model on YOUR GPU/CPU
258
+ 4. **Embedding Generation** (LOCAL): Visual embeddings created (1024 patches × 128 dimensions)
259
+ 5. **Binary Encoding** (LOCAL): Embeddings converted to efficient binary format
260
+ 6. **Vespa Upload** (REMOTE): Binary embeddings uploaded to Vespa Cloud
261
+ 7. **Search Indexing** (REMOTE): Vespa indexes embeddings for fast retrieval
262
+
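+ A sketch of what the upload step (6) can look like with pyvespa; the `pdf_page` schema name comes from the deploy command above, while the document id, field names, and the `binary_embedding_cells` variable are purely illustrative and must match the deployed schema:
+
+ ```python
+ from vespa.application import Vespa
+
+ # Connect to the deployed endpoint (token/mTLS details as configured in .env).
+ app = Vespa(url="https://your-app.vespa-cloud.com")
+
+ # One Vespa document per PDF page.
+ app.feed_data_point(
+     schema="pdf_page",
+     data_id="example-paper-page-3",
+     fields={
+         "title": "Example paper",             # illustrative fields
+         "page_number": 3,
+         "embedding": binary_embedding_cells,  # packed patch vectors, format set by the schema
+     },
+ )
+ ```
+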
263
+ **⚠️ Important Notes:**
264
+
265
+ - **Processing Time**: Expect 5-30 seconds per page depending on your hardware
266
+ - **Network Usage**: Only final embeddings uploaded (~1KB per page vs ~1MB original)
267
+ - **Privacy**: Original PDFs and images stay on your local machine
268
+ - **Storage**: Raw images cached locally for similarity map generation
269
+
270
+ ### Supported Operations
271
+
272
+ - ✅ **Upload Documents**: Add new PDFs to the system
273
+ - ✅ **Search Documents**: Query existing documents
274
+ - ✅ **View Documents**: Browse stored documents
275
+ - ❌ **Remove Documents**: _Not currently implemented_
276
+ - ❌ **Update Documents**: _Not currently implemented_
277
+
278
+ ## 🔐 Authentication & Security
279
+
280
+ ### 🛡️ **Current Security Implementation**
281
+
282
+ #### **SECURE Components:**
283
+
284
+ **Vespa Authentication (REMOTE)**
285
+
286
+ - **Token Authentication**: Bearer tokens for Vespa Cloud API access
287
+ - **mTLS Certificates**: Mutual TLS for enterprise security
288
+ - **Encrypted Communication**: HTTPS/TLS for all Vespa connections
289
+
290
+ **API Key Management (LOCAL)**
291
+
292
+ - **Environment Variables**: Sensitive keys stored in `.env` files
293
+ - **API Key Rotation**: Google Gemini supports key rotation
294
+ - **Local Storage**: Keys never transmitted except to authorized APIs
295
+
296
+ #### **LIMITED Security Components:**
297
+
298
+ **Session Management**
299
+
300
+ ```python
301
+ # Basic UUID session tracking (FastHTML)
302
+ session["session_id"] = str(uuid.uuid4())
303
+
304
+ # HTTP-only cookies (Next.js)
305
+ cookieStore.set(SESSION_KEY, newSessionId, {
306
+ httpOnly: true,
307
+ secure: process.env.NODE_ENV === "production",
308
+ sameSite: "lax",
309
+ maxAge: 60 * 60 * 24 * 30, // 30 days
310
+ });
311
+ ```
312
+
313
+ **Basic Request Validation**
314
+
315
+ ```python
316
+ # HTMX request validation
317
+ if "hx-request" not in request.headers:
318
+ return RedirectResponse("/search")
319
+
320
+ # Parameter validation
321
+ if not query:
322
+ return NextResponse.json({ error: "Query is required" }, { status: 400 });
323
+ ```
324
+
325
+ ### ⚠️ **Security Limitations & Risks**
326
+
327
+ #### **MISSING Security Features:**
328
+
329
+ **❌ No API Authentication**
330
+
331
+ - Local API endpoints are **completely open**
332
+ - No rate limiting or abuse protection
333
+ - No user authentication or authorization
334
+ - Anyone can access `/fetch_results`, `/get_sim_map` endpoints
335
+
336
+ **❌ No Input Sanitization**
337
+
338
+ ```python
339
+ # Raw user input passed directly to models
340
+ query = searchParams.get("query") # No validation/sanitization
341
+ ranking = searchParams.get("ranking") # No input filtering
342
+ ```
343
+
344
+ **❌ No Security Headers**
345
+
346
+ - No CORS configuration
347
+ - No Content Security Policy (CSP)
348
+ - No X-Frame-Options protection
349
+ - No X-Content-Type-Options validation
350
+
351
+ **❌ No Rate Limiting**
352
+
353
+ - Unlimited API requests
354
+ - No protection against DoS attacks
355
+ - No query throttling or user limits
356
+
357
+ **❌ No CSRF Protection**
358
+
359
+ - No token validation for state-changing operations
360
+ - Cross-site request forgery possible
361
+
362
+ ### 🎯 **Security Recommendations**
363
+
364
+ #### **IMMEDIATE (High Priority)**
365
+
366
+ **1. Add API Authentication**
367
+
368
+ ```typescript
369
+ // middleware.ts - Add API key validation
370
+ export function middleware(request: NextRequest) {
371
+ const apiKey = request.headers.get("X-API-Key");
372
+ if (!apiKey || apiKey !== process.env.COLPALI_API_KEY) {
373
+ return new Response("Unauthorized", { status: 401 });
374
+ }
375
+ }
376
+ ```
377
+
378
+ **2. Implement Rate Limiting**
379
+
380
+ ```typescript
381
+ // Use next-rate-limit or similar
382
+ import rateLimit from "@/lib/rate-limit";
383
+
384
+ const limiter = rateLimit({
385
+ interval: 60 * 1000, // 1 minute
386
+ uniqueTokenPerInterval: 500, // Max number of unique client IPs tracked per interval
387
+ });
388
+
389
+ await limiter.check(10, getClientIP(request)); // 10 requests per minute
390
+ ```
391
+
392
+ **3. Add Security Headers**
393
+
394
+ ```typescript
395
+ // next.config.js
396
+ const securityHeaders = [
397
+ { key: "X-Frame-Options", value: "DENY" },
398
+ { key: "X-Content-Type-Options", value: "nosniff" },
399
+ { key: "Referrer-Policy", value: "strict-origin-when-cross-origin" },
400
+ {
401
+ key: "Content-Security-Policy",
402
+ value: "default-src 'self'; script-src 'self' 'unsafe-inline'",
403
+ },
404
+ ];
405
+ ```
406
+
407
+ **4. Input Validation & Sanitization**
408
+
409
+ ```typescript
410
+ import { z } from "zod";
411
+
412
+ const SearchSchema = z.object({
413
+ query: z
414
+ .string()
415
+ .min(1)
416
+ .max(500)
417
+ .regex(/^[a-zA-Z0-9\s\.\?\!]*$/),
418
+ ranking: z.enum(["hybrid", "colpali", "bm25"]),
419
+ });
420
+ ```
421
+
422
+ #### **MEDIUM Priority**
423
+
424
+ **5. CORS Configuration**
425
+
426
+ ```typescript
427
+ // Restrict origins to known domains
428
+ const corsHeaders = {
429
+ "Access-Control-Allow-Origin": "https://yourdomain.com",
430
+ "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
431
+ "Access-Control-Allow-Headers": "Content-Type, Authorization",
432
+ };
433
+ ```
434
+
435
+ **6. Request Size Limits**
436
+
437
+ ```typescript
438
+ // Limit request payload sizes
439
+ export const config = {
440
+ api: {
441
+ bodyParser: {
442
+ sizeLimit: "1mb",
443
+ },
444
+ },
445
+ };
446
+ ```
447
+
448
+ **7. Audit Logging**
449
+
450
+ ```python
451
+ # Log all API access with IP, timestamp, and queries
452
+ logger.info(f"API_ACCESS: {client_ip} - {endpoint} - {query[:100]}")
453
+ ```
454
+
455
+ #### **LONG-TERM (Production Ready)**
456
+
457
+ **8. User Authentication (Optional)**
458
+
459
+ ```typescript
460
+ // Add NextAuth.js or similar for user accounts
461
+ // Implement role-based access control
462
+ // Add document ownership and permissions
463
+ ```
464
+
465
+ **9. Network Security**
466
+
467
+ ```bash
468
+ # Deploy behind reverse proxy (nginx/cloudflare)
469
+ # Enable DDoS protection
470
+ # Use Web Application Firewall (WAF)
471
+ ```
472
+
473
+ **10. Data Privacy Controls**
474
+
475
+ ```typescript
476
+ // Implement data retention policies
477
+ // Add user data deletion capabilities
478
+ // GDPR compliance features
479
+ ```
480
+
481
+ ### 🔒 **Security Best Practices**
482
+
483
+ #### **For LOCAL Development:**
484
+
485
+ - **Never commit API keys** to version control
486
+ - **Use strong environment variable names** (avoid `API_KEY`)
487
+ - **Rotate API keys regularly** (monthly)
488
+ - **Enable firewall** on development machines
489
+ - **Use HTTPS even locally** for production testing
490
+
491
+ #### **For PRODUCTION Deployment:**
492
+
493
+ - **Deploy behind CDN/WAF** (Cloudflare, AWS Shield)
494
+ - **Enable rate limiting** at infrastructure level
495
+ - **Use container security scanning**
496
+ - **Implement monitoring and alerting**
497
+ - **Regular security audits and penetration testing**
498
+
499
+ #### **For REMOTE Services:**
500
+
501
+ - **Vespa Cloud**: Follows enterprise security standards
502
+ - **Gemini API**: Google-managed security and compliance
503
+ - **Environment Isolation**: Separate dev/staging/prod credentials
504
+
505
+ ### 🚨 **Current Risk Level: MEDIUM**
506
+
507
+ **Suitable for:**
508
+
509
+ - ✅ **Personal projects and demos**
510
+ - ✅ **Internal company tools** (behind firewall)
511
+ - ✅ **Research and development** environments
512
+
513
+ **NOT suitable for:**
514
+
515
+ - ❌ **Public internet deployment**
516
+ - ❌ **Customer-facing applications**
517
+ - ❌ **Production environments** with sensitive data
518
+ - ❌ **Commercial applications** without security hardening
519
+
520
+ ## 🎯 Usage Guide
521
+
522
+ ### Basic Search
523
+
524
+ 1. Navigate to the homepage
525
+ 2. Enter your search query in natural language
526
+ 3. Select ranking method (hybrid, semantic, etc.)
527
+ 4. View results with similarity maps
528
+
529
+ ### Similarity Maps
530
+
531
+ - Click on token buttons to see which parts of documents match specific query terms
532
+ - Visual heatmaps show attention patterns
533
+ - Reset button returns to original document view
534
+
535
+ ### AI Chat
536
+
537
+ - Ask questions about retrieved documents
538
+ - Chat responses are based on document content
539
+ - Streaming responses for real-time interaction
540
+
541
+ ### Search Rankings
542
+
543
+ - **Hybrid**: Combines multiple ranking signals
544
+ - **Semantic**: Pure semantic similarity
545
+ - **BM25**: Traditional text-based ranking
546
+ - **ColPali**: Visual-first ranking
547
+
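+ Under the hood, each option selects a Vespa rank profile. A hedged sketch of what such a query can look like with pyvespa (the profile, schema, and field names here are assumptions that must match the deployed application package):
+
+ ```python
+ from vespa.application import Vespa
+
+ app = Vespa(url="https://your-app.vespa-cloud.com")
+
+ response = app.query(
+     body={
+         "yql": "select title, page_number from pdf_page where userQuery()",
+         "query": "graph showing training loss",
+         "ranking.profile": "bm25",   # or e.g. "colpali" / "hybrid"
+         "hits": 5,
+     }
+ )
+ for hit in response.hits:
+     print(hit["relevance"], hit["fields"]["title"])
+ ```
+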
548
+ ## 🛠️ Development
549
+
550
+ ### Project Structure
551
+
552
+ ```
553
+ ├── main.py                 # Application entry point
554
+ ├── backend/
555
+ │   ├── colpali.py          # ColPali model integration
556
+ │   ├── vespa_app.py        # Vespa client and queries
557
+ │   └── modelmanager.py     # Model management utilities
558
+ ├── frontend/
559
+ │   ├── app.py              # UI components
560
+ │   └── layout.py           # Layout templates
561
+ ├── feed_vespa.py           # Document upload script
562
+ ├── deploy_vespa_app.py     # Vespa deployment script
563
+ ├── colpali-with-snippets/  # Vespa schema definitions
564
+ └── static/                 # Static assets and generated files
565
+ ```
566
+
567
+ ### Running in Development
568
+
569
+ ```bash
570
+ # Enable hot reload
571
+ export HOT_RELOAD=true
572
+ python main.py
573
+
574
+ # Or set in .env
575
+ echo "HOT_RELOAD=true" >> .env
576
+ ```
577
+
578
+ ### Code Quality
579
+
580
+ ```bash
581
+ # Format code
582
+ ruff format .
583
+
584
+ # Lint code
585
+ ruff check .
586
+ ```
587
+
588
+ ## 📊 API Endpoints
589
+
590
+ ### **Current API Routes (⚠️ UNSECURED)**
591
+
592
+ | Endpoint | Method | Description | Security Status |
593
+ | ---------------- | ------ | ----------------------- | ---------------- |
594
+ | `/` | GET | Homepage | ✅ Public (safe) |
595
+ | `/search` | GET | Search interface | ✅ Public (safe) |
596
+ | `/fetch_results` | GET | Fetch search results | ⚠️ **OPEN API** |
597
+ | `/get_sim_map` | GET | Get similarity maps | ⚠️ **OPEN API** |
598
+ | `/get-message` | GET | Chat with AI (SSE) | ⚠️ **OPEN API** |
599
+ | `/full_image` | GET | Get full document image | ⚠️ **OPEN API** |
600
+ | `/suggestions` | GET | Query autocomplete | ⚠️ **OPEN API** |
601
+ | `/static/*` | GET | Static file serving | ✅ Public (safe) |
602
+
603
+ ### **Security Analysis by Endpoint**
604
+
605
+ #### **🔒 SECURE Endpoints**
606
+
607
+ - **`/`** and **`/search`**: Static HTML pages, no sensitive data
608
+ - **`/static/*`**: Public assets (CSS, JS, images)
609
+
610
+ #### **⚠️ UNSECURED Endpoints (Risk)**
611
+
612
+ **`/fetch_results`** - **HIGH RISK**
613
+
614
+ ```bash
615
+ # Anyone can perform unlimited searches
616
+ curl "http://localhost:7860/fetch_results?query=secret&ranking=hybrid"
617
+ ```
618
+
619
+ - **Risks**: Resource abuse, server overload, competitive intelligence gathering
620
+ - **Exposes**: Search capabilities, document metadata, processing times
621
+
622
+ **`/get_sim_map`** - **MEDIUM RISK**
623
+
624
+ ```bash
625
+ # Access similarity maps without authentication
626
+ curl "http://localhost:7860/get_sim_map?query_id=123&idx=0&token=word&token_idx=5"
627
+ ```
628
+
629
+ - **Risks**: Unauthorized access to visual analysis
630
+ - **Exposes**: Document visual patterns, query insights
631
+
632
+ **`/get-message`** - **HIGH RISK**
633
+
634
+ ```bash
635
+ # Trigger AI processing without limits
636
+ curl "http://localhost:7860/get-message?query_id=123&query=question&doc_ids=doc1,doc2"
637
+ ```
638
+
639
+ - **Risks**: Gemini API abuse, cost exploitation, resource exhaustion
640
+ - **Exposes**: AI-generated insights, document content analysis
641
+
642
+ **`/full_image`** - **HIGH RISK**
643
+
644
+ ```bash
645
+ # Download any document image
646
+ curl "http://localhost:7860/full_image?doc_id=any_document_id"
647
+ ```
648
+
649
+ - **Risks**: Unauthorized document access, data leakage
650
+ - **Exposes**: Full document images, potentially sensitive content
651
+
652
+ ### **Immediate Security Fixes**
653
+
654
+ #### **1. Add API Key Authentication**
655
+
656
+ ```python
657
+ # Python FastHTML middleware
658
+ @app.middleware("http")
659
+ async def verify_api_key(request, call_next):
660
+ if request.url.path.startswith("/fetch_results"):
661
+ api_key = request.headers.get("X-API-Key")
662
+ if not api_key or api_key != os.getenv("COLPALI_API_KEY"):
663
+ return JSONResponse({"error": "Unauthorized"}, status_code=401)
664
+ return await call_next(request)
665
+ ```
666
+
667
+ #### **2. Implement Rate Limiting**
668
+
669
+ ```python
670
+ from slowapi import Limiter, _rate_limit_exceeded_handler
671
+ from slowapi.util import get_remote_address
672
+
673
+ limiter = Limiter(key_func=get_remote_address)
674
+
675
+ @rt("/fetch_results")
676
+ @limiter.limit("10/minute") # 10 requests per minute per IP
677
+ async def get_results(request, query: str, ranking: str):
678
+ # ... existing code
679
+ ```
680
+
681
+ #### **3. Input Validation**
682
+
683
+ ```python
684
+ from pydantic import BaseModel, validator
685
+
686
+ class SearchRequest(BaseModel):
687
+ query: str
688
+ ranking: str
689
+
690
+ @validator('query')
691
+ def query_must_be_safe(cls, v):
692
+ if len(v) > 500:
693
+ raise ValueError('Query too long')
694
+ # Add sanitization logic
695
+ return v.strip()
696
+ ```
697
+
698
+ #### **4. Request Origin Validation**
699
+
700
+ ```python
701
+ ALLOWED_ORIGINS = ["http://localhost:3000", "https://yourdomain.com"]
702
+
703
+ @app.middleware("http")
704
+ async def cors_middleware(request, call_next):
705
+ origin = request.headers.get("origin")
706
+ if origin not in ALLOWED_ORIGINS:
707
+ return JSONResponse({"error": "Forbidden"}, status_code=403)
708
+ return await call_next(request)
709
+ ```
710
+
711
+ ### **📈 Recommended API Security Architecture**
712
+
713
+ ```
714
+ ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
715
+ │    Frontend     │    │  Rate Limiter   │    │   Backend API   │
716
+ │                 │    │                 │    │                 │
717
+ │ • API Key       │◄──►│ • IP Limiting   │◄──►│ • Input Valid.  │
718
+ │ • CORS Headers  │    │ • User Quotas   │    │ • Auth Checks   │
719
+ │ • Request Valid.│    │ • DoS Protection│    │ • Audit Logs    │
720
+ └─────────────────┘    └─────────────────┘    └─────────────────┘
721
+ ```
722
+
723
+ **Benefits:**
724
+
725
+ - **Layer 1**: Frontend validates requests before sending
726
+ - **Layer 2**: Rate limiter prevents abuse and DoS attacks
727
+ - **Layer 3**: Backend performs final validation and authorization
728
+
729
+ ### **🔒 Security Implementation Checklist**
730
+
731
+ #### **Before Production Deployment:**
732
+
733
+ **CRITICAL (Must Do):**
734
+
735
+ - [ ] **Generate API Key**: Create strong API key for endpoint authentication
736
+ - [ ] **Enable Rate Limiting**: Implement per-IP request limits
737
+ - [ ] **Add Security Headers**: X-Frame-Options, CSP, X-Content-Type-Options
738
+ - [ ] **Input Validation**: Sanitize all user inputs (query, ranking)
739
+ - [ ] **CORS Configuration**: Restrict origins to known domains only
740
+ - [ ] **Environment Security**: Never commit API keys, use secure .env
741
+ - [ ] **HTTPS Only**: Force TLS in production (no HTTP)
742
+
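+ For the header items above, a minimal sketch on the Python side, using the same `@app.middleware("http")` hook as the earlier examples (the header values are a starting point, not a vetted policy):
+
+ ```python
+ @app.middleware("http")
+ async def add_security_headers(request, call_next):
+     response = await call_next(request)
+     # Checklist items: clickjacking, MIME sniffing, referrer leakage, CSP.
+     response.headers["X-Frame-Options"] = "DENY"
+     response.headers["X-Content-Type-Options"] = "nosniff"
+     response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
+     response.headers["Content-Security-Policy"] = "default-src 'self'"
+     return response
+ ```
+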
743
+ **HIGH Priority:**
744
+
745
+ - [ ] **Audit Logging**: Log all API requests with IP and timestamp
746
+ - [ ] **Request Size Limits**: Prevent large payload attacks
747
+ - [ ] **Error Handling**: Don't expose stack traces or internal details
748
+ - [ ] **Session Security**: HTTP-only, secure, SameSite cookies
749
+ - [ ] **API Documentation**: Document authentication requirements
750
+
751
+ **MEDIUM Priority:**
752
+
753
+ - [ ] **User Authentication**: Consider adding user accounts for access control
754
+ - [ ] **Request Timeout**: Prevent long-running request abuse
755
+ - [ ] **Content Validation**: Verify response content types
756
+ - [ ] **Monitoring**: Set up alerts for unusual API usage patterns
757
+ - [ ] **Backup Strategy**: Secure backup of environment variables
758
+
759
+ #### **Security Testing Commands:**
760
+
761
+ **Test API Authentication:**
762
+
763
+ ```bash
764
+ # Should fail without API key
765
+ curl "http://localhost:7860/fetch_results?query=test&ranking=hybrid"
766
+
767
+ # Should succeed with API key
768
+ curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=test&ranking=hybrid"
769
+ ```
770
+
771
+ **Test Rate Limiting:**
772
+
773
+ ```bash
774
+ # Run multiple requests to trigger rate limit
775
+ for i in {1..15}; do
776
+ curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=test$i&ranking=hybrid"
777
+ echo "Request $i"
778
+ done
779
+ ```
780
+
781
+ **Test Input Validation:**
782
+
783
+ ```bash
784
+ # Should reject invalid/malicious inputs
785
+ curl -H "X-API-Key: your_api_key" "http://localhost:7860/fetch_results?query=<script>alert('xss')</script>&ranking=invalid"
786
+ ```
787
+
788
+ **Test Security Headers:**
789
+
790
+ ```bash
791
+ # Check security headers in response
792
+ curl -I "http://localhost:7860/"
793
+ # Should see: X-Frame-Options, X-Content-Type-Options, etc.
794
+ ```
795
+
796
+ #### **Security Monitoring:**
797
+
798
+ **Log Analysis Queries:**
799
+
800
+ ```bash
801
+ # Monitor API usage patterns
802
+ grep "API_ACCESS" /var/log/colpali.log | tail -100
803
+
804
+ # Detect potential abuse
805
+ grep "RATE_LIMIT_EXCEEDED" /var/log/colpali.log
806
+
807
+ # Check authentication failures
808
+ grep "UNAUTHORIZED" /var/log/colpali.log
809
+ ```
810
+
811
+ **Alerting Setup:**
812
+
813
+ - **Rate Limit Violations**: Alert when >50 requests/minute from single IP
814
+ - **Authentication Failures**: Alert on repeated unauthorized attempts
815
+ - **Unusual Queries**: Alert on suspicious query patterns or injection attempts
816
+ - **Resource Usage**: Alert on high CPU/memory usage (potential DoS)
817
+
818
+ ## 🧪 Models Used
819
+
820
+ - **ColPali v1.2**: Visual document understanding
821
+ - **PaliGemma 3B**: Base vision-language model
822
+ - **Google Gemini 2.0**: AI chat and question answering
823
+
824
+ ## 🔧 Configuration Options
825
+
826
+ ### Environment Variables
827
+
828
+ | Variable | Required | Description | Security Impact |
829
+ | -------------------------- | -------- | ------------------------------------------- | ----------------------------------- |
830
+ | `VESPA_APP_TOKEN_URL` | Yes\* | Vespa application URL (token auth) | **HIGH** - Remote access |
831
+ | `VESPA_CLOUD_SECRET_TOKEN` | Yes\* | Vespa secret token | **CRITICAL** - Full database access |
832
+ | `USE_MTLS` | No | Use mTLS instead of token auth | **MEDIUM** - Auth method |
833
+ | `VESPA_APP_MTLS_URL` | Yes\*\* | Vespa application URL (mTLS) | **HIGH** - Remote access |
834
+ | `VESPA_CLOUD_MTLS_KEY` | Yes\*\* | mTLS private key | **CRITICAL** - TLS credentials |
835
+ | `VESPA_CLOUD_MTLS_CERT` | Yes\*\* | mTLS certificate | **HIGH** - TLS credentials |
836
+ | `GEMINI_API_KEY` | No | Google Gemini API key | **HIGH** - AI access/costs |
837
+ | `LOG_LEVEL` | No | Logging level (DEBUG, INFO, WARNING, ERROR) | **LOW** - Debug info |
838
+ | `HOT_RELOAD` | No | Enable hot reload in development | **LOW** - Dev convenience |
839
+
840
+ #### **🔒 Security-Related Environment Variables (Recommended)**
841
+
842
+ | Variable | Required | Description | Default |
843
+ | -------------------------- | --------- | ------------------------------------ | ------- |
844
+ | `COLPALI_API_KEY` | **YES\*** | API key for endpoint authentication | None |
845
+ | `ALLOWED_ORIGINS` | **YES\*** | Comma-separated allowed CORS origins | None |
846
+ | `RATE_LIMIT_REQUESTS` | No | Max requests per minute per IP | `10` |
847
+ | `RATE_LIMIT_WINDOW` | No | Rate limit window in seconds | `60` |
848
+ | `MAX_QUERY_LENGTH` | No | Maximum query string length | `500` |
849
+ | `ENABLE_AUDIT_LOGGING` | No | Log all API requests for security | `false` |
850
+ | `SECURITY_HEADERS_ENABLED` | No | Enable security headers | `true` |
851
+ | `CSRF_SECRET` | **YES\*** | Secret for CSRF token generation | None |
852
+
853
+ **Example Security-Enhanced `.env`:**
854
+
855
+ ```bash
856
+ # Existing configuration
857
+ VESPA_APP_TOKEN_URL=https://your-app.vespa-cloud.com
858
+ VESPA_CLOUD_SECRET_TOKEN=your_vespa_secret_token
859
+ GEMINI_API_KEY=your_gemini_api_key
860
+
861
+ # NEW: Security configuration
862
+ COLPALI_API_KEY=your_strong_random_api_key_here
863
+ ALLOWED_ORIGINS=http://localhost:3000,https://yourdomain.com
864
+ RATE_LIMIT_REQUESTS=10
865
+ RATE_LIMIT_WINDOW=60
866
+ MAX_QUERY_LENGTH=500
867
+ ENABLE_AUDIT_LOGGING=true
868
+ SECURITY_HEADERS_ENABLED=true
869
+ CSRF_SECRET=your_random_csrf_secret_here
870
+
871
+ # Development vs Production
872
+ NODE_ENV=production # Enable secure cookies
873
+ LOG_LEVEL=INFO # Don't expose debug info in production
874
+ ```
875
+
876
+ \*Required for token authentication
877
+ \*\*Required for mTLS authentication
878
+ \*\*\*Required for production security
879
+
880
+ ## 🚨 Troubleshooting
881
+
882
+ ### **LOCAL Processing Issues**
883
+
884
+ **ColPali model fails to load:**
885
+
886
+ ```bash
887
+ # Check GPU memory
888
+ nvidia-smi # For NVIDIA GPUs
889
+ # or
890
+ system_profiler SPDisplaysDataType # For Apple Silicon
891
+
892
+ # Clear model cache if corrupted
893
+ rm -rf ~/.cache/huggingface/hub/models--vidore--colpali-v1.2
894
+ ```
895
+
896
+ **Out of memory errors:**
897
+
898
+ - Reduce batch size in `feed_vespa.py` (try `batch_size=1`)
899
+ - Close other applications to free RAM/VRAM
900
+ - Use CPU processing if GPU memory insufficient: `CUDA_VISIBLE_DEVICES="" python main.py`
901
+
902
+ **Slow processing on CPU:**
903
+
904
+ - Expected behavior - ColPali requires significant computation
905
+ - Consider upgrading to GPU or Apple Silicon for 5-10x speedup
906
+ - Process documents overnight for large collections
907
+
908
+ ### **REMOTE Processing Issues**
909
+
910
+ **Connection to Vespa fails:**
911
+
912
+ - Verify your Vespa URL and credentials in `.env`
913
+ - Check if the Vespa application is deployed and running
914
+ - Ensure network connectivity: `ping your-app.vespa-cloud.com`
915
+ - Validate authentication tokens haven't expired
916
+
917
+ **Document upload fails:**
918
+
919
+ - Check Vespa Cloud storage quota and billing
920
+ - Verify embedding format matches Vespa schema
921
+ - Ensure stable internet connection for large uploads
922
+
923
+ **Search returns no results:**
924
+
925
+ - Confirm documents were successfully uploaded to Vespa
926
+ - Check if embeddings were properly indexed
927
+ - Verify query processing isn't failing locally
928
+
929
+ ### **MIXED (Local + Remote) Issues**
930
+
931
+ **Chat features don't work:**
932
+
933
+ - **LOCAL**: Verify document images are being generated locally
934
+ - **REMOTE**: Check `GEMINI_API_KEY` is set correctly
935
+ - **REMOTE**: Verify Gemini API quota and billing
936
+ - **NETWORK**: Ensure images can be sent to Gemini API
937
+
938
+ **Similarity maps missing:**
939
+
940
+ - **LOCAL**: Confirm ColPali model loaded successfully
941
+ - **LOCAL**: Check if similarity map generation completed
942
+ - **REMOTE**: Verify Vespa returned similarity data
943
+ - **BROWSER**: Clear browser cache for static files
944
+
945
+ ### Performance Tips
946
+
947
+ **LOCAL Optimization:**
948
+
949
+ - Use GPU acceleration for 5-10x faster model inference
950
+ - Optimize batch sizes based on available memory
951
+ - Use SSD storage for faster model loading
952
+ - Consider quantized models for lower memory usage
953
+
954
+ **REMOTE Optimization:**
955
+
956
+ - Use Vespa's HNSW indexing for faster search
957
+ - Optimize embedding dimensions vs accuracy tradeoff
958
+ - Enable compression for faster network transfer
959
+ - Use multiple Vespa instances for high availability
960
+
961
+ **NETWORK Optimization:**
962
+
963
+ - Process documents in batches to reduce upload overhead
964
+ - Use compression for embedding transfer
965
+ - Consider regional Vespa deployment for lower latency
966
+
967
+ ## 📄 License
968
+
969
+ Apache-2.0
970
+
971
+ ## 🤝 Contributing
972
+
973
+ 1. Fork the repository
974
+ 2. Create a feature branch
975
+ 3. Make your changes
976
+ 4. Run tests and linting
977
+ 5. Submit a pull request
978
+
979
+ ## 📞 Support
980
+
981
+ For issues and questions:
982
+
983
+ - Check the troubleshooting section
984
+ - Review Vespa and ColPali documentation
985
+ - Open an issue on the repository
backend/cache.py ADDED
@@ -0,0 +1,26 @@
1
+ from collections import OrderedDict
2
+
3
+
4
+ # Initialize LRU Cache
5
+ class LRUCache:
6
+ def __init__(self, max_size=20):
7
+ self.max_size = max_size
8
+ self.cache = OrderedDict()
9
+
10
+ def get(self, key):
11
+ if key in self.cache:
12
+ self.cache.move_to_end(key)
13
+ return self.cache[key]
14
+ return None
15
+
16
+ def set(self, key, value):
17
+ if key in self.cache:
18
+ self.cache.move_to_end(key)
19
+ else:
20
+ if len(self.cache) >= self.max_size:
21
+ self.cache.popitem(last=False)
22
+ self.cache[key] = value
23
+
24
+ def delete(self, key):
25
+ if key in self.cache:
26
+ del self.cache[key]
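+
+
+ # Illustrative usage (not executed by this module); the backend can keep, e.g.,
+ # generated similarity-map images keyed by (query_id, doc_idx, token_idx):
+ #
+ #     sim_map_cache = LRUCache(max_size=20)
+ #     sim_map_cache.set(("q1", 0, 5), "<base64 PNG>")
+ #     sim_map_cache.get(("q1", 0, 5))   # -> "<base64 PNG>", now most recently used
+ #     sim_map_cache.get("missing")      # -> None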
backend/colpali.py ADDED
@@ -0,0 +1,281 @@
1
+ import torch
2
+ from PIL import Image
3
+ import numpy as np
4
+ from typing import Generator, Tuple, List, Union, Dict
5
+ from pathlib import Path
6
+ import base64
7
+ from io import BytesIO
8
+ import re
9
+ import io
10
+ import matplotlib.cm as cm
11
+
12
+ from colpali_engine.models import ColPali, ColPaliProcessor
13
+ from colpali_engine.utils.torch_utils import get_torch_device
14
+ from vidore_benchmark.interpretability.torch_utils import (
15
+ normalize_similarity_map_per_query_token,
16
+ )
17
+ from functools import lru_cache
18
+ import logging
19
+
20
+
21
+ class SimMapGenerator:
22
+ """
23
+ Generates similarity maps based on query embeddings and image patches using the ColPali model.
24
+ """
25
+
26
+ colormap = cm.get_cmap("viridis") # Preload colormap for efficiency
27
+
28
+ def __init__(
29
+ self,
30
+ logger: logging.Logger,
31
+ model_name: str = "vidore/colpali-v1.2",
32
+ n_patch: int = 32,
33
+ ):
34
+ """
35
+ Initializes the SimMapGenerator class with a specified model and patch dimension.
36
+
37
+ Args:
38
+ model_name (str): The model name for loading the ColPali model.
39
+ n_patch (int): The number of patches per dimension.
40
+ """
41
+ self.model_name = model_name
42
+ self.n_patch = n_patch
43
+ self.device = get_torch_device("auto")
44
+ self.logger = logger
45
+ self.logger.info(f"Using device: {self.device}")
46
+ self.model, self.processor = self.load_model()
47
+
48
+ def load_model(self) -> Tuple[ColPali, ColPaliProcessor]:
49
+ """
50
+ Loads the ColPali model and processor.
51
+
52
+ Returns:
53
+ Tuple[ColPali, ColPaliProcessor]: Loaded model and processor.
54
+ """
55
+ model = ColPali.from_pretrained(
56
+ self.model_name,
57
+ torch_dtype=torch.bfloat16, # Note that the embeddings created during feed were float32 -> binarized, yet setting this seems to produce the most similar results both locally (MPS) and on HF (CUDA)
58
+ device_map=self.device,
59
+ ).eval()
60
+
61
+ processor = ColPaliProcessor.from_pretrained(self.model_name)
62
+ return model, processor
63
+
64
+ def gen_similarity_maps(
65
+ self,
66
+ query: str,
67
+ query_embs: torch.Tensor,
68
+ token_idx_map: Dict[int, str],
69
+ images: List[Union[Path, str]],
70
+ vespa_sim_maps: List[Dict],
71
+ ) -> Generator[Tuple[int, str, int, str], None, None]:
72
+ """
73
+ Generates similarity maps for the provided images and query, and returns base64-encoded blended images.
74
+
75
+ Args:
76
+ query (str): The query string.
77
+ query_embs (torch.Tensor): Query embeddings tensor.
78
+ token_idx_map (dict): Mapping from indices to tokens.
79
+ images (List[Union[Path, str]]): List of image paths or base64-encoded strings.
80
+ vespa_sim_maps (List[Dict]): List of Vespa similarity maps.
81
+
82
+ Yields:
83
+ Tuple[int, str, int, str]: A tuple containing the image index, selected token, token index, and base64-encoded image.
84
+ """
85
+ processed_images, original_images, original_sizes = [], [], []
86
+ for img in images:
87
+ img_pil = self._load_image(img)
88
+ original_images.append(img_pil.copy())
89
+ original_sizes.append(img_pil.size)
90
+ processed_images.append(img_pil)
91
+
92
+ vespa_sim_map_tensor = self._prepare_similarity_map_tensor(
93
+ query_embs, vespa_sim_maps
94
+ )
95
+ similarity_map_normalized = normalize_similarity_map_per_query_token(
96
+ vespa_sim_map_tensor
97
+ )
98
+
99
+ for idx, img in enumerate(original_images):
100
+ for token_idx, token in token_idx_map.items():
101
+ if self.should_filter_token(token):
102
+ continue
103
+
104
+ sim_map = similarity_map_normalized[idx, token_idx, :, :]
105
+ blended_img_base64 = self._blend_image(
106
+ img, sim_map, original_sizes[idx]
107
+ )
108
+ yield idx, token, token_idx, blended_img_base64
109
+
110
+ def _load_image(self, img: Union[Path, str]) -> Image:
111
+ """
112
+ Loads an image from a file path or a base64-encoded string.
113
+
114
+ Args:
115
+ img (Union[Path, str]): The image to load.
116
+
117
+ Returns:
118
+ Image: The loaded PIL image.
119
+ """
120
+ try:
121
+ if isinstance(img, Path):
122
+ return Image.open(img).convert("RGB")
123
+ elif isinstance(img, str):
124
+ return Image.open(BytesIO(base64.b64decode(img))).convert("RGB")
125
+ except Exception as e:
126
+ raise ValueError(f"Failed to load image: {e}")
127
+
128
+ def _prepare_similarity_map_tensor(
129
+ self, query_embs: torch.Tensor, vespa_sim_maps: List[Dict]
130
+ ) -> torch.Tensor:
131
+ """
132
+ Prepares a similarity map tensor from Vespa similarity maps.
133
+
134
+ Args:
135
+ query_embs (torch.Tensor): Query embeddings tensor.
136
+ vespa_sim_maps (List[Dict]): List of Vespa similarity maps.
137
+
138
+ Returns:
139
+ torch.Tensor: The prepared similarity map tensor.
140
+ """
141
+ vespa_sim_map_tensor = torch.zeros(
142
+ (len(vespa_sim_maps), query_embs.size(1), self.n_patch, self.n_patch)
143
+ )
144
+ for idx, vespa_sim_map in enumerate(vespa_sim_maps):
145
+ for cell in vespa_sim_map["quantized"]["cells"]:
146
+ patch = int(cell["address"]["patch"])
147
+ query_token = int(cell["address"]["querytoken"])
148
+ value = cell["value"]
149
+ if hasattr(self.processor, "image_seq_length"):
150
+ image_seq_length = self.processor.image_seq_length
151
+ else:
152
+ image_seq_length = 1024
153
+
154
+ if patch >= image_seq_length:
155
+ continue
156
+ vespa_sim_map_tensor[
157
+ idx,
158
+ query_token,
159
+ patch // self.n_patch,
160
+ patch % self.n_patch,
161
+ ] = value
162
+ return vespa_sim_map_tensor
163
+
164
+ def _blend_image(
165
+ self, img: Image, sim_map: torch.Tensor, original_size: Tuple[int, int]
166
+ ) -> str:
167
+ """
168
+ Blends an image with a similarity map and encodes it to base64.
169
+
170
+ Args:
171
+ img (Image): The original image.
172
+ sim_map (torch.Tensor): The similarity map tensor.
173
+ original_size (Tuple[int, int]): The original size of the image.
174
+
175
+ Returns:
176
+ str: The base64-encoded blended image.
177
+ """
178
+ SCALING_FACTOR = 8
179
+ sim_map_resolution = (
180
+ max(32, int(original_size[0] / SCALING_FACTOR)),
181
+ max(32, int(original_size[1] / SCALING_FACTOR)),
182
+ )
183
+
184
+ sim_map_np = sim_map.cpu().float().numpy()
185
+ sim_map_img = Image.fromarray(sim_map_np).resize(
186
+ sim_map_resolution, resample=Image.BICUBIC
187
+ )
188
+ sim_map_resized_np = np.array(sim_map_img, dtype=np.float32)
189
+ sim_map_normalized = self._normalize_sim_map(sim_map_resized_np)
190
+
191
+ heatmap = self.colormap(sim_map_normalized)
192
+ heatmap_img = Image.fromarray((heatmap * 255).astype(np.uint8)).convert("RGBA")
193
+
194
+ buffer = io.BytesIO()
195
+ heatmap_img.save(buffer, format="PNG")
196
+ return base64.b64encode(buffer.getvalue()).decode("utf-8")
197
+
198
+ @staticmethod
199
+ def _normalize_sim_map(sim_map: np.ndarray) -> np.ndarray:
200
+ """
201
+ Normalizes a similarity map to range [0, 1].
202
+
203
+ Args:
204
+ sim_map (np.ndarray): The similarity map.
205
+
206
+ Returns:
207
+ np.ndarray: The normalized similarity map.
208
+ """
209
+ sim_map_min, sim_map_max = sim_map.min(), sim_map.max()
210
+ if sim_map_max - sim_map_min > 1e-6:
211
+ return (sim_map - sim_map_min) / (sim_map_max - sim_map_min)
212
+ return np.zeros_like(sim_map)
213
+
214
+ @staticmethod
215
+ def should_filter_token(token: str) -> bool:
216
+ """
217
+ Determines if a token should be filtered out based on predefined patterns.
218
+
219
+ The function filters out tokens that:
220
+
221
+ - Start with '<' (e.g., '<bos>')
222
+ - Consist entirely of whitespace
223
+ - Are purely punctuation (excluding tokens that contain digits or start with '▁')
224
+ - Start with an underscore '_'
225
+ - Exactly match the word 'Question'
226
+ - Are exactly the single character '▁'
227
+
228
+ Output of test:
229
+ Token: '2' | False
230
+ Token: '0' | False
231
+ Token: '2' | False
232
+ Token: '3' | False
233
+ Token: '▁2' | False
234
+ Token: '▁hi' | False
235
+ Token: 'norwegian' | False
236
+ Token: 'unlisted' | False
237
+ Token: '<bos>' | True
238
+ Token: 'Question' | True
239
+ Token: ':' | True
240
+ Token: '<pad>' | True
241
+ Token: '\n' | True
242
+ Token: '▁' | True
243
+ Token: '?' | True
244
+ Token: ')' | True
245
+ Token: '%' | True
246
+ Token: '/)' | True
247
+
248
+
249
+ Args:
250
+ token (str): The token to check.
251
+
252
+ Returns:
253
+ bool: True if the token should be filtered out, False otherwise.
254
+ """
255
+ pattern = re.compile(
256
+ r"^<.*$|^\s+$|^(?!.*\d)(?!▁)[^\w\s]+$|^_.*$|^Question$|^▁$"
257
+ )
258
+ return bool(pattern.match(token))
259
+
260
+ @lru_cache(maxsize=128)
261
+ def get_query_embeddings_and_token_map(
262
+ self, query: str
263
+ ) -> Tuple[torch.Tensor, dict]:
264
+ """
265
+ Retrieves query embeddings and a token index map.
266
+
267
+ Args:
268
+ query (str): The query string.
269
+
270
+ Returns:
271
+ Tuple[torch.Tensor, dict]: Query embeddings and token index map.
272
+ """
273
+ inputs = self.processor.process_queries([query]).to(self.model.device)
274
+ with torch.no_grad():
275
+ q_emb = self.model(**inputs).to("cpu")[0]
276
+
277
+ query_tokens = self.processor.tokenizer.tokenize(
278
+ self.processor.decode(inputs.input_ids[0])
279
+ )
280
+ idx_to_token = {idx: token for idx, token in enumerate(query_tokens)}
281
+ return q_emb, idx_to_token
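+
+
+ # Illustrative end-to-end usage (assumes a configured logger and similarity-map
+ # cells already fetched from Vespa elsewhere; not executed by this module):
+ #
+ #     gen = SimMapGenerator(logger=logging.getLogger("colpali"))
+ #     q_emb, idx_to_token = gen.get_query_embeddings_and_token_map("training loss graph")
+ #     for idx, token, token_idx, img_b64 in gen.gen_similarity_maps(
+ #         query="training loss graph",
+ #         query_embs=q_emb,
+ #         token_idx_map=idx_to_token,
+ #         images=[Path("page_0.png")],
+ #         vespa_sim_maps=vespa_sim_maps,
+ #     ):
+ #         ...  # hand img_b64 to the frontend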
backend/modelmanager.py ADDED
@@ -0,0 +1,24 @@
1
+ from .colpali import load_model
2
+
3
+
4
+ class ModelManager:
5
+ _instance = None
6
+ model = None
7
+ processor = None
8
+ use_dummy_model = False
9
+
10
+ @staticmethod
11
+ def get_instance():
12
+ if ModelManager._instance is None:
13
+ ModelManager._instance = ModelManager()
14
+ if not ModelManager.use_dummy_model:
15
+ ModelManager._instance.initialize_model_and_processor()
16
+ return ModelManager._instance
17
+
18
+ def initialize_model_and_processor(self):
19
+ if self.model is None or self.processor is None: # Ensure no reinitialization
20
+ self.model, self.processor, self.device = load_model()
21
+ if self.model is None or self.processor is None:
22
+ print("Failed to initialize model or processor at startup")
23
+ else:
24
+ print("Model and processor loaded at startup")
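+
+
+ # Illustrative usage (assumes backend.colpali provides the load_model helper
+ # imported above):
+ #
+ #     manager = ModelManager.get_instance()
+ #     model, processor = manager.model, manager.processor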
backend/stopwords.py ADDED
@@ -0,0 +1,18 @@
1
+ import spacy
2
+ import os
3
+
4
+ # Download the model if it is not already present
5
+ if not spacy.util.is_package("en_core_web_sm"):
6
+ spacy.cli.download("en_core_web_sm")
7
+ nlp = spacy.load("en_core_web_sm")
8
+
9
+
10
+ # It would be possible to remove bolding for stopwords without removing them from the query,
11
+ # but that would require a java plugin which we didn't want to complicate this sample app with.
12
+ def filter(text):
13
+ doc = nlp(text)
14
+ tokens = [token.text for token in doc if not token.is_stop]
15
+ if len(tokens) == 0:
16
+ # if we remove all the words we don't have a query at all, so use the original
17
+ return text
18
+ return " ".join(tokens)
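+
+
+ # Illustrative example (the exact output depends on spaCy's stop-word list):
+ #
+ #     filter("what is the capital of Norway")   # -> "capital Norway"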
backend/testquery.py ADDED
@@ -0,0 +1,3013 @@
+ import torch
+
+ token_to_idx = {
+     "<bos>": 0,
+     "Question": 1,
+     ":": 2,
+     "▁Percentage": 3,
+     "▁of": 4,
+     "▁non": 5,
+     "-": 6,
+     "fresh": 7,
+     "▁water": 8,
+     "▁as": 9,
+     "▁source": 10,
+     "?": 11,
+     "<pad>": 21,
+     "\n": 22,
+ }
+ idx_to_token = {v: k for k, v in token_to_idx.items()}
+ q_embs = torch.tensor(
+     [
+         # ... one hard-coded 128-dimensional embedding row per query token;
+         # the remainder of this 3,013-line file is the raw float literals ...
2910
+ 1.2963e-01,
2911
+ 4.3868e-03,
2912
+ 1.3081e-01,
2913
+ 1.7308e-01,
2914
+ 5.3891e-02,
2915
+ -1.2683e-01,
2916
+ 6.0787e-02,
2917
+ -7.8336e-02,
2918
+ -1.0347e-01,
2919
+ 5.8472e-02,
2920
+ -2.7212e-02,
2921
+ 7.1385e-02,
2922
+ -6.8297e-03,
2923
+ 8.7485e-02,
2924
+ -7.1364e-02,
2925
+ -1.9182e-02,
2926
+ 8.7217e-02,
2927
+ 8.2944e-02,
2928
+ 3.4400e-02,
2929
+ -8.5778e-03,
2930
+ -1.1191e-01,
2931
+ 1.4871e-01,
2932
+ 1.0602e-01,
2933
+ 5.5060e-02,
2934
+ 4.0554e-02,
2935
+ -1.0835e-01,
2936
+ -1.4749e-01,
2937
+ 1.1336e-01,
2938
+ 8.1897e-03,
2939
+ -5.1693e-02,
2940
+ 4.9600e-02,
2941
+ -3.6377e-02,
2942
+ -1.6864e-03,
2943
+ -1.4149e-01,
2944
+ -3.8906e-02,
2945
+ 5.3481e-02,
2946
+ -1.1160e-01,
2947
+ 1.3861e-01,
2948
+ 6.3040e-02,
2949
+ -2.2296e-02,
2950
+ 8.1680e-02,
2951
+ 7.9567e-02,
2952
+ -6.1083e-02,
2953
+ 6.7573e-02,
2954
+ 9.8196e-02,
2955
+ -7.0956e-02,
2956
+ 8.7717e-02,
2957
+ -1.3439e-01,
2958
+ -4.4087e-02,
2959
+ -1.7677e-01,
2960
+ 5.1971e-02,
2961
+ 1.1013e-01,
2962
+ -1.2838e-01,
2963
+ -1.8450e-02,
2964
+ 9.7373e-02,
2965
+ 1.5415e-02,
2966
+ -6.8535e-02,
2967
+ 5.8374e-02,
2968
+ -1.0450e-01,
2969
+ 1.3363e-01,
2970
+ 1.2089e-01,
2971
+ -4.1817e-02,
2972
+ 3.1015e-02,
2973
+ 2.7185e-01,
2974
+ 6.9890e-02,
2975
+ -3.8259e-03,
2976
+ -7.7094e-02,
2977
+ 2.4651e-02,
2978
+ -1.0086e-01,
2979
+ -8.4380e-02,
2980
+ 4.9062e-02,
2981
+ 4.7470e-02,
2982
+ 1.0739e-01,
2983
+ -2.4693e-03,
2984
+ -5.9990e-02,
2985
+ -6.2990e-02,
2986
+ 1.2398e-03,
2987
+ 4.8100e-03,
2988
+ -6.0338e-02,
2989
+ 6.4579e-02,
2990
+ 4.2505e-04,
2991
+ -2.9926e-02,
2992
+ 1.3627e-01,
2993
+ -3.0724e-02,
2994
+ 4.2972e-02,
2995
+ 4.6971e-02,
2996
+ 1.2441e-01,
2997
+ -2.4336e-02,
2998
+ 6.9954e-02,
2999
+ 1.0403e-01,
3000
+ -3.2450e-02,
3001
+ -4.3154e-02,
3002
+ -4.9959e-02,
3003
+ 1.5666e-01,
3004
+ 1.3688e-01,
3005
+ 5.3450e-02,
3006
+ 7.9968e-02,
3007
+ 1.3858e-01,
3008
+ -1.6817e-01,
3009
+ 1.2637e-01,
3010
+ 7.3937e-02,
3011
+ ],
3012
+ ]
3013
+ )
backend/vespa_app.py ADDED
@@ -0,0 +1,458 @@
1
+ import os
2
+ import time
3
+ from typing import Any, Dict, Tuple
4
+ import asyncio
5
+ import numpy as np
6
+ import torch
7
+ from dotenv import load_dotenv
8
+ from vespa.application import Vespa
9
+ from vespa.io import VespaQueryResponse
10
+ from .colpali import SimMapGenerator
11
+ import backend.stopwords
12
+ import logging
13
+
14
+
15
+ class VespaQueryClient:
16
+ MAX_QUERY_TERMS = 64
17
+ VESPA_SCHEMA_NAME = "pdf_page"
18
+ SELECT_FIELDS = "id,title,url,blur_image,page_number,snippet,text"
19
+
20
+ def __init__(self, logger: logging.Logger):
21
+ """
22
+ Initialize the VespaQueryClient by loading environment variables and establishing a connection to the Vespa application.
23
+ """
24
+ load_dotenv()
25
+ self.logger = logger
26
+
27
+ if os.environ.get("USE_MTLS") == "true":
28
+ self.logger.info("Connecting using mTLS")
29
+ mtls_key = os.environ.get("VESPA_CLOUD_MTLS_KEY")
30
+ mtls_cert = os.environ.get("VESPA_CLOUD_MTLS_CERT")
31
+
32
+ self.vespa_app_url = os.environ.get("VESPA_APP_MTLS_URL")
33
+ if not self.vespa_app_url:
34
+ raise ValueError(
35
+ "Please set the VESPA_APP_MTLS_URL environment variable"
36
+ )
37
+
38
+ if not mtls_cert or not mtls_key:
39
+ raise ValueError(
40
+ "USE_MTLS was true, but VESPA_CLOUD_MTLS_KEY and VESPA_CLOUD_MTLS_CERT were not set"
41
+ )
42
+
43
+ # write the key and cert to a file
44
+ mtls_key_path = "/tmp/vespa-data-plane-private-key.pem"
45
+ with open(mtls_key_path, "w") as f:
46
+ f.write(mtls_key)
47
+
48
+ mtls_cert_path = "/tmp/vespa-data-plane-public-cert.pem"
49
+ with open(mtls_cert_path, "w") as f:
50
+ f.write(mtls_cert)
51
+
52
+ # Instantiate Vespa connection
53
+ self.app = Vespa(
54
+ url=self.vespa_app_url, key=mtls_key_path, cert=mtls_cert_path
55
+ )
56
+ else:
57
+ self.logger.info("Connecting using token")
58
+ self.vespa_app_url = os.environ.get("VESPA_APP_TOKEN_URL")
59
+ if not self.vespa_app_url:
60
+ raise ValueError(
61
+ "Please set the VESPA_APP_TOKEN_URL environment variable"
62
+ )
63
+
64
+ self.vespa_cloud_secret_token = os.environ.get("VESPA_CLOUD_SECRET_TOKEN")
65
+
66
+ if not self.vespa_cloud_secret_token:
67
+ raise ValueError(
68
+ "Please set the VESPA_CLOUD_SECRET_TOKEN environment variable"
69
+ )
70
+
71
+ # Instantiate Vespa connection
72
+ self.app = Vespa(
73
+ url=self.vespa_app_url,
74
+ vespa_cloud_secret_token=self.vespa_cloud_secret_token,
75
+ )
76
+
77
+ self.app.wait_for_application_up()
78
+ self.logger.info(f"Connected to Vespa at {self.vespa_app_url}")
79
+
80
+ def get_fields(self, sim_map: bool = False):
81
+ if not sim_map:
82
+ return self.SELECT_FIELDS
83
+ else:
84
+ return "summaryfeatures"
85
+
86
+ def format_query_results(
87
+ self, query: str, response: VespaQueryResponse, hits: int = 5
88
+ ) -> dict:
89
+ """
90
+ Format the Vespa query results.
91
+
92
+ Args:
93
+ query (str): The query text.
94
+ response (VespaQueryResponse): The response from Vespa.
95
+ hits (int, optional): Number of hits to display. Defaults to 5.
96
+
97
+ Returns:
98
+ dict: The JSON content of the response.
99
+ """
100
+ query_time = response.json.get("timing", {}).get("searchtime", -1)
101
+ query_time = round(query_time, 2)
102
+ count = response.json.get("root", {}).get("fields", {}).get("totalCount", 0)
103
+ result_text = f"Query text: '{query}', query time {query_time}s, count={count}, top results:\n"
104
+ self.logger.debug(result_text)
105
+ return response.json
106
+
107
+ async def query_vespa_bm25(
108
+ self,
109
+ query: str,
110
+ q_emb: torch.Tensor,
111
+ hits: int = 3,
112
+ timeout: str = "10s",
113
+ sim_map: bool = False,
114
+ **kwargs,
115
+ ) -> dict:
116
+ """
117
+ Query Vespa using the BM25 ranking profile.
118
+ This corresponds to the "BM25" radio button in the UI.
119
+
120
+ Args:
121
+ query (str): The query text.
122
+ q_emb (torch.Tensor): Query embeddings.
123
+ hits (int, optional): Number of hits to retrieve. Defaults to 3.
124
+ timeout (str, optional): Query timeout. Defaults to "10s".
125
+
126
+ Returns:
127
+ dict: The formatted query results.
128
+ """
129
+ async with self.app.asyncio(connections=1) as session:
130
+ query_embedding = self.format_q_embs(q_emb)
131
+
132
+ start = time.perf_counter()
133
+ response: VespaQueryResponse = await session.query(
134
+ body={
135
+ "yql": (
136
+ f"select {self.get_fields(sim_map=sim_map)} from {self.VESPA_SCHEMA_NAME} where userQuery();"
137
+ ),
138
+ "ranking": self.get_rank_profile("bm25", sim_map),
139
+ "query": query,
140
+ "timeout": timeout,
141
+ "hits": hits,
142
+ "input.query(qt)": query_embedding,
143
+ "presentation.timing": True,
144
+ **kwargs,
145
+ },
146
+ )
147
+ assert response.is_successful(), response.json
148
+ stop = time.perf_counter()
149
+ self.logger.debug(
150
+ f"Query time + data transfer took: {stop - start} s, Vespa reported searchtime was "
151
+ f"{response.json.get('timing', {}).get('searchtime', -1)} s"
152
+ )
153
+ return self.format_query_results(query, response)
154
+
155
+ def float_to_binary_embedding(self, float_query_embedding: dict) -> dict:
156
+ """
157
+ Convert float query embeddings to binary embeddings.
158
+
159
+ Args:
160
+ float_query_embedding (dict): Dictionary of float embeddings.
161
+
162
+ Returns:
163
+ dict: Dictionary of binary embeddings.
164
+ """
165
+ binary_query_embeddings = {}
166
+ for key, vector in float_query_embedding.items():
167
+ binary_vector = (
168
+ np.packbits(np.where(np.array(vector) > 0, 1, 0))
169
+ .astype(np.int8)
170
+ .tolist()
171
+ )
172
+ binary_query_embeddings[key] = binary_vector
173
+ if len(binary_query_embeddings) >= self.MAX_QUERY_TERMS:
174
+ self.logger.warning(
175
+ f"Warning: Query has more than {self.MAX_QUERY_TERMS} terms. Truncating."
176
+ )
177
+ break
178
+ return binary_query_embeddings
179
+
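+ # Illustration of the binarization above (made-up numbers, not real model output,
+ # and "client" stands for any VespaQueryClient instance): each 128-dim float vector
+ # is thresholded at 0 and packed MSB-first by np.packbits into 16 bytes, e.g.
+ # >>> client.float_to_binary_embedding({0: [0.12, -0.03] + [0.0] * 126})
+ # {0: [-128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
+ # which matches the int8[16] query tensors (rq0..rqN, qtb) defined in deploy_vespa_app.py.
+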
180
+ def create_nn_query_strings(
181
+ self, binary_query_embeddings: dict, target_hits_per_query_tensor: int = 20
182
+ ) -> Tuple[str, dict]:
183
+ """
184
+ Create nearest neighbor query strings for Vespa.
185
+
186
+ Args:
187
+ binary_query_embeddings (dict): Binary query embeddings.
188
+ target_hits_per_query_tensor (int, optional): Target hits per query tensor. Defaults to 20.
189
+
190
+ Returns:
191
+ Tuple[str, dict]: Nearest neighbor query string and query tensor dictionary.
192
+ """
193
+ nn_query_dict = {}
194
+ for i in range(len(binary_query_embeddings)):
195
+ nn_query_dict[f"input.query(rq{i})"] = binary_query_embeddings[i]
196
+ nn = " OR ".join(
197
+ [
198
+ f"({{targetHits:{target_hits_per_query_tensor}}}nearestNeighbor(embedding,rq{i}))"
199
+ for i in range(len(binary_query_embeddings))
200
+ ]
201
+ )
202
+ return nn, nn_query_dict
203
+
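+ # For illustration (not real output): with two binarized query tokens and
+ # target_hits_per_query_tensor=20, the method above produces roughly
+ # nn = "({targetHits:20}nearestNeighbor(embedding,rq0)) OR ({targetHits:20}nearestNeighbor(embedding,rq1))"
+ # nn_query_dict = {"input.query(rq0)": [...], "input.query(rq1)": [...]}
+ # i.e. one nearestNeighbor operator and one rq tensor per query token.
+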
204
+ def format_q_embs(self, q_embs: torch.Tensor) -> dict:
205
+ """
206
+ Convert query embeddings to a dictionary of lists.
207
+
208
+ Args:
209
+ q_embs (torch.Tensor): Query embeddings tensor.
210
+
211
+ Returns:
212
+ dict: Dictionary where each key is an index and value is the embedding list.
213
+ """
214
+ return {idx: emb.tolist() for idx, emb in enumerate(q_embs)}
215
+
216
+ async def get_result_from_query(
217
+ self,
218
+ query: str,
219
+ q_embs: torch.Tensor,
220
+ ranking: str,
221
+ idx_to_token: dict,
222
+ ) -> Dict[str, Any]:
223
+ """
224
+ Get query results from Vespa based on the ranking method.
225
+
226
+ Args:
227
+ query (str): The query text.
228
+ q_embs (torch.Tensor): Query embeddings.
229
+ ranking (str): The ranking method to use.
230
+ idx_to_token (dict): Index to token mapping.
231
+
232
+ Returns:
233
+ Dict[str, Any]: The query results.
234
+ """
235
+
236
+ # Remove stopwords from the query to avoid visual emphasis on irrelevant words (e.g., "the", "and", "of")
237
+ query = backend.stopwords.filter(query)
238
+
239
+ rank_method = ranking.split("_")[0]
240
+ sim_map: bool = len(ranking.split("_")) > 1 and ranking.split("_")[1] == "sim"
241
+ if rank_method == "colpali": # ColPali
242
+ result = await self.query_vespa_colpali(
243
+ query=query, ranking=rank_method, q_emb=q_embs, sim_map=sim_map
244
+ )
245
+ elif rank_method == "hybrid": # Hybrid ColPali+BM25
246
+ result = await self.query_vespa_colpali(
247
+ query=query, ranking=rank_method, q_emb=q_embs, sim_map=sim_map
248
+ )
249
+ elif rank_method == "bm25":
250
+ result = await self.query_vespa_bm25(query, q_embs, sim_map=sim_map)
251
+ else:
252
+ raise ValueError(f"Unsupported ranking: {rank_method}")
253
+ if "root" not in result or "children" not in result["root"]:
254
+ result["root"] = {"children": []}
255
+ return result
256
+ for single_result in result["root"]["children"]:
257
+ self.logger.debug(single_result["fields"].keys())
258
+ return result
259
+
260
+ def get_sim_maps_from_query(
261
+ self, query: str, q_embs: torch.Tensor, ranking: str, idx_to_token: dict
262
+ ):
263
+ """
264
+ Get similarity maps from Vespa based on the ranking method.
265
+
266
+ Args:
267
+ query (str): The query text.
268
+ q_embs (torch.Tensor): Query embeddings.
269
+ ranking (str): The ranking method to use.
270
+ idx_to_token (dict): Index to token mapping.
271
+
272
+ Returns:
273
+ list: The summaryfeatures (similarity-map data) for each hit.
274
+ """
275
+ # Get the result by calling asyncio.run
276
+ result = asyncio.run(
277
+ self.get_result_from_query(query, q_embs, ranking, idx_to_token)
278
+ )
279
+ vespa_sim_maps = []
280
+ for single_result in result["root"]["children"]:
281
+ vespa_sim_map = single_result["fields"].get("summaryfeatures", None)
282
+ if vespa_sim_map is not None:
283
+ vespa_sim_maps.append(vespa_sim_map)
284
+ else:
285
+ raise ValueError("No sim_map found in Vespa response")
286
+ return vespa_sim_maps
287
+
288
+ async def get_full_image_from_vespa(self, doc_id: str) -> str:
289
+ """
290
+ Retrieve the full image from Vespa for a given document ID.
291
+
292
+ Args:
293
+ doc_id (str): The document ID.
294
+
295
+ Returns:
296
+ str: The full image data.
297
+ """
298
+ async with self.app.asyncio(connections=1) as session:
299
+ start = time.perf_counter()
300
+ response: VespaQueryResponse = await session.query(
301
+ body={
302
+ "yql": f'select full_image from {self.VESPA_SCHEMA_NAME} where id contains "{doc_id}"',
303
+ "ranking": "unranked",
304
+ "presentation.timing": True,
305
+ "ranking.matching.numThreadsPerSearch": 1,
306
+ },
307
+ )
308
+ assert response.is_successful(), response.json
309
+ stop = time.perf_counter()
310
+ self.logger.debug(
311
+ f"Getting image from Vespa took: {stop - start} s, Vespa reported searchtime was "
312
+ f"{response.json.get('timing', {}).get('searchtime', -1)} s"
313
+ )
314
+ return response.json["root"]["children"][0]["fields"]["full_image"]
315
+
316
+ def get_results_children(self, result: VespaQueryResponse) -> list:
317
+ return result["root"]["children"]
318
+
319
+ def results_to_search_results(
320
+ self, result: VespaQueryResponse, idx_to_token: dict
321
+ ) -> list:
322
+ # Initialize sim_map_ fields in the result
323
+ fields_to_add = [
324
+ f"sim_map_{token}_{idx}"
325
+ for idx, token in idx_to_token.items()
326
+ if not SimMapGenerator.should_filter_token(token)
327
+ ]
328
+ for child in result["root"]["children"]:
329
+ for sim_map_key in fields_to_add:
330
+ child["fields"][sim_map_key] = None
331
+ return self.get_results_children(result)
332
+
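+ # Note on the naming above: with idx_to_token = {3: "water"} (illustrative), each hit
+ # gets a placeholder field "sim_map_water_3" set to None; the actual similarity-map
+ # images are presumably filled in later by the SimMapGenerator / frontend.
+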
333
+ async def get_suggestions(self, query: str) -> list:
334
+ async with self.app.asyncio(connections=1) as session:
335
+ start = time.perf_counter()
336
+ yql = f'select questions from {self.VESPA_SCHEMA_NAME} where questions matches (".*{query}.*")'
337
+ response: VespaQueryResponse = await session.query(
338
+ body={
339
+ "yql": yql,
340
+ "query": query,
341
+ "ranking": "unranked",
342
+ "presentation.timing": True,
343
+ "presentation.summary": "suggestions",
344
+ "ranking.matching.numThreadsPerSearch": 1,
345
+ },
346
+ )
347
+ assert response.is_successful(), response.json
348
+ stop = time.perf_counter()
349
+ self.logger.debug(
350
+ f"Getting suggestions from Vespa took: {stop - start} s, Vespa reported searchtime was "
351
+ f"{response.json.get('timing', {}).get('searchtime', -1)} s"
352
+ )
353
+ search_results = (
354
+ response.json["root"]["children"]
355
+ if "root" in response.json and "children" in response.json["root"]
356
+ else []
357
+ )
358
+ questions = [
359
+ result["fields"]["questions"]
360
+ for result in search_results
361
+ if "questions" in result["fields"]
362
+ ]
363
+
364
+ unique_questions = set([item for sublist in questions for item in sublist])
365
+
366
+ # remove an artifact from our data generation
367
+ if "string" in unique_questions:
368
+ unique_questions.remove("string")
369
+
370
+ return list(unique_questions)
371
+
372
+ def get_rank_profile(self, ranking: str, sim_map: bool) -> str:
373
+ if sim_map:
374
+ return f"{ranking}_sim"
375
+ else:
376
+ return ranking
377
+
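+ # Examples of the mapping above: ("colpali", sim_map=True) -> "colpali_sim",
+ # ("bm25", sim_map=False) -> "bm25". The *_sim profiles are assumed to be the
+ # variants that also return summaryfeatures for the token-level similarity maps
+ # (see get_fields above).
+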
378
+ async def query_vespa_colpali(
379
+ self,
380
+ query: str,
381
+ ranking: str,
382
+ q_emb: torch.Tensor,
383
+ target_hits_per_query_tensor: int = 100,
384
+ hnsw_explore_additional_hits: int = 300,
385
+ hits: int = 3,
386
+ timeout: str = "10s",
387
+ sim_map: bool = False,
388
+ **kwargs,
389
+ ) -> dict:
390
+ """
391
+ Query Vespa using nearest neighbor search with mixed tensors for MaxSim calculations.
392
+ This corresponds to the "ColPali" radio button in the UI.
393
+
394
+ Args:
395
+ query (str): The query text.
396
+ q_emb (torch.Tensor): Query embeddings.
397
+ target_hits_per_query_tensor (int, optional): Target hits per query tensor. Defaults to 100.
+ hnsw_explore_additional_hits (int, optional): Additional HNSW hits to explore. Defaults to 300.
398
+ hits (int, optional): Number of hits to retrieve. Defaults to 3.
399
+ timeout (str, optional): Query timeout. Defaults to "10s".
400
+
401
+ Returns:
402
+ dict: The formatted query results.
403
+ """
404
+ async with self.app.asyncio(connections=1) as session:
405
+ float_query_embedding = self.format_q_embs(q_emb)
406
+ binary_query_embeddings = self.float_to_binary_embedding(
407
+ float_query_embedding
408
+ )
409
+
410
+ # Mixed tensors for MaxSim calculations
411
+ query_tensors = {
412
+ "input.query(qtb)": binary_query_embeddings,
413
+ "input.query(qt)": float_query_embedding,
414
+ }
415
+ nn_string, nn_query_dict = self.create_nn_query_strings(
416
+ binary_query_embeddings, target_hits_per_query_tensor
417
+ )
418
+ query_tensors.update(nn_query_dict)
419
+ response: VespaQueryResponse = await session.query(
420
+ body={
421
+ **query_tensors,
422
+ "presentation.timing": True,
423
+ "yql": (
424
+ f"select {self.get_fields(sim_map=sim_map)} from {self.VESPA_SCHEMA_NAME} where {nn_string} or userQuery()"
425
+ ),
426
+ "ranking.profile": self.get_rank_profile(
427
+ ranking=ranking, sim_map=sim_map
428
+ ),
429
+ "timeout": timeout,
430
+ "hits": hits,
431
+ "query": query,
432
+ "hnsw.exploreAdditionalHits": hnsw_explore_additional_hits,
433
+ "ranking.rerankCount": 100,
434
+ **kwargs,
435
+ },
436
+ )
437
+ assert response.is_successful(), response.json
438
+ return self.format_query_results(query, response)
439
+
440
+ async def keepalive(self) -> bool:
441
+ """
442
+ Query Vespa to keep the connection alive.
443
+
444
+ Returns:
445
+ bool: True if the connection is alive.
446
+ """
447
+ async with self.app.asyncio(connections=1) as session:
448
+ response: VespaQueryResponse = await session.query(
449
+ body={
450
+ "yql": f"select title from {self.VESPA_SCHEMA_NAME} where true limit 1;",
451
+ "ranking": "unranked",
452
+ "query": "keepalive",
453
+ "timeout": "3s",
454
+ "hits": 1,
455
+ },
456
+ )
457
+ assert response.is_successful(), response.json
458
+ return True
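+
+ # Minimal usage sketch (assumes the VESPA_* environment variables read in __init__
+ # are set; the embedding shape is illustrative for ColPali's 128-dim token vectors):
+ #
+ # import asyncio, logging
+ # client = VespaQueryClient(logging.getLogger(__name__))
+ # q_embs = ...  # torch.Tensor of shape (n_query_tokens, 128) from the ColPali model
+ # result = asyncio.run(
+ #     client.get_result_from_query("pie chart with model comparison", q_embs, "colpali", {})
+ # )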
colpali.py ADDED
@@ -0,0 +1,521 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import torch
4
+ from PIL import Image
5
+ import numpy as np
6
+ from typing import cast
7
+ import pprint
8
+ from pathlib import Path
9
+ import base64
10
+ from io import BytesIO
11
+ from typing import Union, Tuple
12
+ import matplotlib
13
+ import re
14
+
15
+ from colpali_engine.models import ColPali, ColPaliProcessor
16
+ from colpali_engine.utils.torch_utils import get_torch_device
17
+ from einops import rearrange
18
+ from vidore_benchmark.interpretability.plot_utils import plot_similarity_heatmap
19
+ from vidore_benchmark.interpretability.torch_utils import (
20
+ normalize_similarity_map_per_query_token,
21
+ )
22
+ from vidore_benchmark.interpretability.vit_configs import VIT_CONFIG
23
+ from vidore_benchmark.utils.image_utils import scale_image
24
+ from vespa.application import Vespa
25
+ from vespa.io import VespaQueryResponse
26
+
27
+ matplotlib.use("Agg")
28
+
29
+ MAX_QUERY_TERMS = 64
30
+ # OUTPUT_DIR = Path(__file__).parent.parent / "output" / "sim_maps"
31
+ # OUTPUT_DIR.mkdir(exist_ok=True)
32
+
33
+ COLPALI_GEMMA_MODEL_ID = "vidore--colpaligemma-3b-pt-448-base"
34
+ COLPALI_GEMMA_MODEL_SNAPSHOT = "12c59eb7e23bc4c26876f7be7c17760d5d3a1ffa"
35
+ COLPALI_GEMMA_MODEL_PATH = (
36
+ Path().home()
37
+ / f".cache/huggingface/hub/models--{COLPALI_GEMMA_MODEL_ID}/snapshots/{COLPALI_GEMMA_MODEL_SNAPSHOT}"
38
+ )
39
+ COLPALI_MODEL_ID = "vidore--colpali-v1.2"
40
+ COLPALI_MODEL_SNAPSHOT = "9912ce6f8a462d8cf2269f5606eabbd2784e764f"
41
+ COLPALI_MODEL_PATH = (
42
+ Path().home()
43
+ / f".cache/huggingface/hub/models--{COLPALI_MODEL_ID}/snapshots/{COLPALI_MODEL_SNAPSHOT}"
44
+ )
45
+ COLPALI_GEMMA_MODEL_NAME = COLPALI_GEMMA_MODEL_ID.replace("--", "/")
46
+
47
+
48
+ def load_model() -> Tuple[ColPali, ColPaliProcessor]:
49
+ model_name = "vidore/colpali-v1.2"
50
+
51
+ device = get_torch_device("auto")
52
+ print(f"Using device: {device}")
53
+
54
+ # Load the model
55
+ model = cast(
56
+ ColPali,
57
+ ColPali.from_pretrained(
58
+ model_name,
59
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
60
+ device_map=device,
61
+ ),
62
+ ).eval()
63
+
64
+ # Load the processor
65
+ processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
66
+ return model, processor
67
+
68
+
69
+ def load_vit_config(model):
70
+ # Load the ViT config
71
+ print(f"VIT config: {VIT_CONFIG}")
72
+ vit_config = VIT_CONFIG[COLPALI_GEMMA_MODEL_NAME]
73
+ return vit_config
74
+
75
+
76
+ # Create dummy image
77
+ dummy_image = Image.new("RGB", (448, 448), (255, 255, 255))
78
+
79
+
80
+ def gen_similarity_map(
81
+ model, processor, device, vit_config, query, image: Union[Path, str]
82
+ ):
83
+ # Should take in the b64 image from Vespa query result
84
+ # And possibly the tensor representing the output_image
85
+ if isinstance(image, Path):
86
+ # image is a file path
87
+ try:
88
+ image = Image.open(image)
89
+ except Exception as e:
90
+ raise ValueError(f"Failed to open image from path: {e}")
91
+ elif isinstance(image, str):
92
+ # image is b64 string
93
+ try:
94
+ image = Image.open(BytesIO(base64.b64decode(image)))
95
+ except Exception as e:
96
+ raise ValueError(f"Failed to open image from b64: {e}")
97
+
98
+ # Preview the image
99
+ scale_image(image, 512)
100
+ # Preprocess inputs
101
+ input_text_processed = processor.process_queries([query]).to(device)
102
+ input_image_processed = processor.process_images([image]).to(device)
103
+ # Forward passes
104
+ with torch.no_grad():
105
+ output_text = model.forward(**input_text_processed)
106
+ output_image = model.forward(**input_image_processed)
107
+ # output_image is the tensor that we could get from the Vespa query
108
+ # Print shape of output_text and output_image
109
+ # Output image shape: torch.Size([1, 1030, 128])
110
+ # Remove the special tokens from the output
111
+ output_image = output_image[
112
+ :, : processor.image_seq_length, :
113
+ ] # (1, n_patches_x * n_patches_y, dim)
114
+
115
+ # Rearrange the output image tensor to explicitly represent the 2D grid of patches
116
+ output_image = rearrange(
117
+ output_image,
118
+ "b (h w) c -> b h w c",
119
+ h=vit_config.n_patch_per_dim,
120
+ w=vit_config.n_patch_per_dim,
121
+ ) # (1, n_patches_x, n_patches_y, dim)
122
+ # Get the similarity map
123
+ similarity_map = torch.einsum(
124
+ "bnk,bijk->bnij", output_text, output_image
125
+ ) # (1, query_tokens, n_patches_x, n_patches_y)
126
+
127
+ # Normalize the similarity map
128
+ similarity_map_normalized = normalize_similarity_map_per_query_token(
129
+ similarity_map
130
+ ) # (1, query_tokens, n_patches_x, n_patches_y)
131
+ # Use this cell output to choose a token using its index
132
+ query_tokens = processor.tokenizer.tokenize(
133
+ processor.decode(input_text_processed.input_ids[0])
134
+ )
135
+ # Choose a token
136
+ token_idx = (
137
+ 10 # e.g. if "12: '▁Kazakhstan',", set 12 to choose the token 'Kazakhstan'
138
+ )
139
+ selected_token = processor.decode(input_text_processed.input_ids[0, token_idx])
140
+ # strip whitespace
141
+ selected_token = selected_token.strip()
142
+ print(f"Selected token: `{selected_token}`")
143
+ # Retrieve the similarity map for the chosen token
144
+ pprint.pprint({idx: val for idx, val in enumerate(query_tokens)})
145
+ # Resize the image to square
146
+ input_image_square = image.resize((vit_config.resolution, vit_config.resolution))
147
+
148
+ # Plot the similarity map
149
+ fig, ax = plot_similarity_heatmap(
150
+ input_image_square,
151
+ patch_size=vit_config.patch_size,
152
+ image_resolution=vit_config.resolution,
153
+ similarity_map=similarity_map_normalized[0, token_idx, :, :],
154
+ )
155
+ ax = annotate_plot(ax, query, selected_token)
156
+ return fig, ax
157
+
158
+
159
+ # def save_figure(fig, filename: str = "similarity_map.png"):
160
+ # fig.savefig(
161
+ # OUTPUT_DIR / filename,
162
+ # bbox_inches="tight",
163
+ # pad_inches=0,
164
+ # )
165
+
166
+
167
+ def annotate_plot(ax, query, selected_token):
168
+ # Add the query text
169
+ ax.set_title(query, fontsize=18)
170
+ # Add annotation with selected token
171
+ ax.annotate(
172
+ f"Selected token:`{selected_token}`",
173
+ xy=(0.5, 0.95),
174
+ xycoords="axes fraction",
175
+ ha="center",
176
+ va="center",
177
+ fontsize=18,
178
+ color="black",
179
+ bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=1),
180
+ )
181
+ return ax
182
+
183
+
184
+ def gen_similarity_map_new(
185
+ processor: ColPaliProcessor,
186
+ model: ColPali,
187
+ device,
188
+ vit_config,
189
+ query: str,
190
+ query_embs: torch.Tensor,
191
+ token_idx_map: dict,
192
+ token_to_show: str,
193
+ image: Union[Path, str],
194
+ ):
195
+ if isinstance(image, Path):
196
+ # image is a file path
197
+ try:
198
+ image = Image.open(image)
199
+ except Exception as e:
200
+ raise ValueError(f"Failed to open image from path: {e}")
201
+ elif isinstance(image, str):
202
+ # image is b64 string
203
+ try:
204
+ image = Image.open(BytesIO(base64.b64decode(image)))
205
+ except Exception as e:
206
+ raise ValueError(f"Failed to open image from b64: {e}")
207
+ token_idx = token_idx_map[token_to_show]
208
+ print(f"Selected token: `{token_to_show}`")
209
+ # strip whitespace
210
+ # Preview the image
211
+ # scale_image(image, 512)
212
+ # Preprocess inputs
213
+ input_image_processed = processor.process_images([image]).to(device)
214
+ # Forward passes
215
+ with torch.no_grad():
216
+ output_image = model.forward(**input_image_processed)
217
+ # output_image is the tensor that we could get from the Vespa query
218
+ # Print shape of output_text and output_image
219
+ # Output image shape: torch.Size([1, 1030, 128])
220
+ # Remove the special tokens from the output
221
+ print(f"Output image shape before dim: {output_image.shape}")
222
+ output_image = output_image[
223
+ :, : processor.image_seq_length, :
224
+ ] # (1, n_patches_x * n_patches_y, dim)
225
+ print(f"Output image shape after dim: {output_image.shape}")
226
+ # Rearrange the output image tensor to explicitly represent the 2D grid of patches
227
+ output_image = rearrange(
228
+ output_image,
229
+ "b (h w) c -> b h w c",
230
+ h=vit_config.n_patch_per_dim,
231
+ w=vit_config.n_patch_per_dim,
232
+ ) # (1, n_patches_x, n_patches_y, dim)
233
+ # Get the similarity map
234
+ print(f"Query embs shape: {query_embs.shape}")
235
+ # Add 1 extra dim to start of query_embs
236
+ query_embs = query_embs.unsqueeze(0).to(device)
237
+ print(f"Output image shape: {output_image.shape}")
238
+ similarity_map = torch.einsum(
239
+ "bnk,bijk->bnij", query_embs, output_image
240
+ ) # (1, query_tokens, n_patches_x, n_patches_y)
241
+ print(f"Similarity map shape: {similarity_map.shape}")
242
+ # Normalize the similarity map
243
+ similarity_map_normalized = normalize_similarity_map_per_query_token(
244
+ similarity_map
245
+ ) # (1, query_tokens, n_patches_x, n_patches_y)
246
+ print(f"Similarity map normalized shape: {similarity_map_normalized.shape}")
247
+ # Use this cell output to choose a token using its index
248
+ input_image_square = image.resize((vit_config.resolution, vit_config.resolution))
249
+
250
+ # Plot the similarity map
251
+ fig, ax = plot_similarity_heatmap(
252
+ input_image_square,
253
+ patch_size=vit_config.patch_size,
254
+ image_resolution=vit_config.resolution,
255
+ similarity_map=similarity_map_normalized[0, token_idx, :, :],
256
+ )
257
+ ax = annotate_plot(ax, query, token_to_show)
258
+ # save the figure
259
+ # save_figure(fig, f"similarity_map_{token_to_show}.png")
260
+ return fig, ax
261
+
262
+
263
+ def get_query_embeddings_and_token_map(
264
+ processor, model, query, image
265
+ ) -> Tuple[torch.Tensor, dict]:
266
+ inputs = processor.process_queries([query]).to(model.device)
267
+ with torch.no_grad():
268
+ embeddings_query = model(**inputs)
269
+ q_emb = embeddings_query.to("cpu")[0] # Extract the single embedding
270
+ # Use this cell output to choose a token using its index
271
+ query_tokens = processor.tokenizer.tokenize(processor.decode(inputs.input_ids[0]))
272
+ # reverse key, values in dictionary
273
+ print(query_tokens)
274
+ token_to_idx = {val: idx for idx, val in enumerate(query_tokens)}
275
+ return q_emb, token_to_idx
276
+
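+ # Illustrative output (exact tokens depend on the processor's query prompt): for the
+ # query "pie chart", q_emb has shape (n_query_tokens, 128) and token_to_idx looks
+ # roughly like {"<bos>": 0, "▁pie": ..., "▁chart": ..., ...}, so a token string can
+ # be mapped back to its row in the query embedding.
+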
277
+
278
+ def format_query_results(query, response, hits=5) -> dict:
279
+ query_time = response.json.get("timing", {}).get("searchtime", -1)
280
+ query_time = round(query_time, 2)
281
+ count = response.json.get("root", {}).get("fields", {}).get("totalCount", 0)
282
+ result_text = f"Query text: '{query}', query time {query_time}s, count={count}, top results:\n"
283
+ print(result_text)
284
+ return response.json
285
+
286
+
287
+ async def query_vespa_default(
288
+ app: Vespa,
289
+ query: str,
290
+ q_emb: torch.Tensor,
291
+ hits: int = 3,
292
+ timeout: str = "10s",
293
+ **kwargs,
294
+ ) -> dict:
295
+ async with app.asyncio(connections=1, total_timeout=120) as session:
296
+ query_embedding = format_q_embs(q_emb)
297
+ response: VespaQueryResponse = await session.query(
298
+ body={
299
+ "yql": "select id,title,url,image,page_number,text from pdf_page where userQuery();",
300
+ "ranking": "default",
301
+ "query": query,
302
+ "timeout": timeout,
303
+ "hits": hits,
304
+ "input.query(qt)": query_embedding,
305
+ "presentation.timing": True,
306
+ **kwargs,
307
+ },
308
+ )
309
+ assert response.is_successful(), response.json
310
+ return format_query_results(query, response)
311
+
312
+
313
+ def float_to_binary_embedding(float_query_embedding: dict) -> dict:
314
+ binary_query_embeddings = {}
315
+ for k, v in float_query_embedding.items():
316
+ binary_vector = (
317
+ np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
318
+ )
319
+ binary_query_embeddings[k] = binary_vector
320
+ if len(binary_query_embeddings) >= MAX_QUERY_TERMS:
321
+ print(f"Warning: Query has more than {MAX_QUERY_TERMS} terms. Truncating.")
322
+ break
323
+ return binary_query_embeddings
324
+
325
+
326
+ def create_nn_query_strings(
327
+ binary_query_embeddings: dict, target_hits_per_query_tensor: int = 20
328
+ ) -> Tuple[str, dict]:
329
+ # Query tensors for nearest neighbor calculations
330
+ nn_query_dict = {}
331
+ for i in range(len(binary_query_embeddings)):
332
+ nn_query_dict[f"input.query(rq{i})"] = binary_query_embeddings[i]
333
+ nn = " OR ".join(
334
+ [
335
+ f"({{targetHits:{target_hits_per_query_tensor}}}nearestNeighbor(embedding,rq{i}))"
336
+ for i in range(len(binary_query_embeddings))
337
+ ]
338
+ )
339
+ return nn, nn_query_dict
340
+
341
+
342
+ def format_q_embs(q_embs: torch.Tensor) -> dict:
343
+ float_query_embedding = {k: v.tolist() for k, v in enumerate(q_embs)}
344
+ return float_query_embedding
345
+
346
+
347
+ async def query_vespa_nearest_neighbor(
348
+ app: Vespa,
349
+ query: str,
350
+ q_emb: torch.Tensor,
351
+ target_hits_per_query_tensor: int = 20,
352
+ hits: int = 3,
353
+ timeout: str = "10s",
354
+ **kwargs,
355
+ ) -> dict:
356
+ # Hyperparameter for speed vs. accuracy
357
+ async with app.asyncio(connections=1, total_timeout=180) as session:
358
+ float_query_embedding = format_q_embs(q_emb)
359
+ binary_query_embeddings = float_to_binary_embedding(float_query_embedding)
360
+
361
+ # Mixed tensors for MaxSim calculations
362
+ query_tensors = {
363
+ "input.query(qtb)": binary_query_embeddings,
364
+ "input.query(qt)": float_query_embedding,
365
+ }
366
+ nn_string, nn_query_dict = create_nn_query_strings(
367
+ binary_query_embeddings, target_hits_per_query_tensor
368
+ )
369
+ query_tensors.update(nn_query_dict)
370
+ response: VespaQueryResponse = await session.query(
371
+ body={
372
+ **query_tensors,
373
+ "presentation.timing": True,
374
+ "yql": f"select id,title,text,url,image,page_number from pdf_page where {nn_string}",
375
+ "ranking.profile": "retrieval-and-rerank",
376
+ "timeout": timeout,
377
+ "hits": hits,
378
+ **kwargs,
379
+ },
380
+ )
381
+ assert response.is_successful(), response.json
382
+ return format_query_results(query, response)
383
+
384
+
385
+ def is_special_token(token: str) -> bool:
386
+ # Pattern for tokens that start with '<', numbers, whitespace, or single characters
387
+ pattern = re.compile(r"^<.*$|^\d+$|^\s+$|^.$")
388
+ if pattern.match(token):
389
+ return True
390
+ return False
391
+
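+ # Examples for the pattern above: "<bos>" (leading '<'), "2023" (digits only),
+ # " " (whitespace) and "a" (a single character) are treated as special and skipped,
+ # while a normal word piece such as "chart" is kept.
+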
392
+
393
+ async def get_result_from_query(
394
+ app: Vespa,
395
+ processor: ColPaliProcessor,
396
+ model: ColPali,
397
+ query: str,
398
+ nn=False,
399
+ gen_sim_map=False,
400
+ ):
401
+ # Get the query embeddings and token map
402
+ print(query)
403
+ q_embs, token_to_idx = get_query_embeddings_and_token_map(
404
+ processor, model, query, dummy_image
405
+ )
406
+ print(token_to_idx)
407
+ # Use the token map to choose a token randomly for now
408
+ # Dynamically select a token containing 'water'
409
+
410
+ if nn:
411
+ result = await query_vespa_nearest_neighbor(app, query, q_embs)
412
+ else:
413
+ result = await query_vespa_default(app, query, q_embs)
414
+ # Print score, title id and text of the results
415
+ for idx, child in enumerate(result["root"]["children"]):
416
+ print(
417
+ f"Result {idx+1}: {child['relevance']}, {child['fields']['title']}, {child['fields']['id']}"
418
+ )
419
+
420
+ if gen_sim_map:
421
+ for single_result in result["root"]["children"]:
422
+ img = single_result["fields"]["image"]
423
+ for token in token_to_idx:
424
+ if is_special_token(token):
425
+ print(f"Skipping special token: {token}")
426
+ continue
427
+ fig, ax = gen_similarity_map_new(
428
+ processor,
429
+ model,
430
+ model.device,
431
+ load_vit_config(model),
432
+ query,
433
+ q_embs,
434
+ token_to_idx,
435
+ token,
436
+ img,
437
+ )
438
+ sim_map = base64.b64encode(fig.canvas.tostring_rgb()).decode("utf-8")
439
+ single_result["fields"][f"sim_map_{token}"] = sim_map
440
+ return result
441
+
442
+
443
+ def get_result_dummy(query: str, nn: bool = False):
444
+ result = {}
445
+ result["timing"] = {}
446
+ result["timing"]["querytime"] = 0.23700000000000002
447
+ result["timing"]["summaryfetchtime"] = 0.001
448
+ result["timing"]["searchtime"] = 0.23900000000000002
449
+ result["root"] = {}
450
+ result["root"]["id"] = "toplevel"
451
+ result["root"]["relevance"] = 1
452
+ result["root"]["fields"] = {}
453
+ result["root"]["fields"]["totalCount"] = 59
454
+ result["root"]["coverage"] = {}
455
+ result["root"]["coverage"]["coverage"] = 100
456
+ result["root"]["coverage"]["documents"] = 155
457
+ result["root"]["coverage"]["full"] = True
458
+ result["root"]["coverage"]["nodes"] = 1
459
+ result["root"]["coverage"]["results"] = 1
460
+ result["root"]["coverage"]["resultsFull"] = 1
461
+ result["root"]["children"] = []
462
+ elt0 = {}
463
+ elt0["id"] = "index:colpalidemo_content/0/424c85e7dece761d226f060f"
464
+ elt0["relevance"] = 2354.050122871995
465
+ elt0["source"] = "colpalidemo_content"
466
+ elt0["fields"] = {}
467
+ elt0["fields"]["id"] = "a767cb1868be9a776cd56b768347b089"
468
+ elt0["fields"]["url"] = (
469
+ "https://static.conocophillips.com/files/resources/conocophillips-2023-sustainability-report.pdf"
470
+ )
471
+ elt0["fields"]["title"] = "ConocoPhillips 2023 Sustainability Report"
472
+ elt0["fields"]["page_number"] = 50
473
+ elt0["fields"]["image"] = "empty for now - is base64 encoded image"
474
+ result["root"]["children"].append(elt0)
475
+ elt1 = {}
476
+ elt1["id"] = "index:colpalidemo_content/0/b927c4979f0beaf0d7fab8e9"
477
+ elt1["relevance"] = 2313.7529950886965
478
+ elt1["source"] = "colpalidemo_content"
479
+ elt1["fields"] = {}
480
+ elt1["fields"]["id"] = "9f2fc0aa02c9561adfaa1451c875658f"
481
+ elt1["fields"]["url"] = (
482
+ "https://static.conocophillips.com/files/resources/conocophillips-2023-managing-climate-related-risks.pdf"
483
+ )
484
+ elt1["fields"]["title"] = "ConocoPhillips Managing Climate Related Risks"
485
+ elt1["fields"]["page_number"] = 44
486
+ elt1["fields"]["image"] = "empty for now - is base64 encoded image"
487
+ result["root"]["children"].append(elt1)
488
+ elt2 = {}
489
+ elt2["id"] = "index:colpalidemo_content/0/9632d72238829d6afefba6c9"
490
+ elt2["relevance"] = 2312.230182081461
491
+ elt2["source"] = "colpalidemo_content"
492
+ elt2["fields"] = {}
493
+ elt2["fields"]["id"] = "d638ded1ddcb446268b289b3f65430fd"
494
+ elt2["fields"]["url"] = (
495
+ "https://static.conocophillips.com/files/resources/24-0976-sustainability-highlights_nature.pdf"
496
+ )
497
+ elt2["fields"]["title"] = (
498
+ "ConocoPhillips Sustainability Highlights - Nature (24-0976)"
499
+ )
500
+ elt2["fields"]["page_number"] = 0
501
+ elt2["fields"]["image"] = "empty for now - is base64 encoded image"
502
+ result["root"]["children"].append(elt2)
503
+ return result
504
+
505
+
506
+ if __name__ == "__main__":
507
+ model, processor = load_model()
508
+ vit_config = load_vit_config(model)
509
+ query = "How many percent of source water is fresh water?"
510
+ image_filepath = (
511
+ Path(__file__).parent.parent
512
+ / "static"
513
+ / "assets"
514
+ / "ConocoPhillips Sustainability Highlights - Nature (24-0976).png"
515
+ )
516
+ gen_similarity_map(
517
+ model, processor, model.device, vit_config, query=query, image=image_filepath
518
+ )
519
+ result = get_result_dummy("dummy query")
520
+ print(result)
521
+ print("Done")
deploy_vespa_app.py ADDED
@@ -0,0 +1,208 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import argparse
4
+ from vespa.package import (
5
+ ApplicationPackage,
6
+ Field,
7
+ Schema,
8
+ Document,
9
+ HNSW,
10
+ RankProfile,
11
+ Function,
12
+ AuthClient,
13
+ Parameter,
14
+ FieldSet,
15
+ SecondPhaseRanking,
16
+ )
17
+ from vespa.deployment import VespaCloud
18
+ import os
19
+ from pathlib import Path
20
+
21
+
22
+ def main():
23
+ parser = argparse.ArgumentParser(description="Deploy Vespa application")
24
+ parser.add_argument("--tenant_name", required=True, help="Vespa Cloud tenant name")
25
+ parser.add_argument(
26
+ "--vespa_application_name", required=True, help="Vespa application name"
27
+ )
28
+ parser.add_argument(
29
+ "--token_id_write", required=True, help="Vespa Cloud token ID for write access"
30
+ )
31
+ parser.add_argument(
32
+ "--token_id_read", required=True, help="Vespa Cloud token ID for read access"
33
+ )
34
+
35
+ args = parser.parse_args()
36
+ tenant_name = args.tenant_name
37
+ vespa_app_name = args.vespa_application_name
38
+ token_id_write = args.token_id_write
39
+ token_id_read = args.token_id_read
40
+
41
+ # Define the Vespa schema
42
+ colpali_schema = Schema(
43
+ name="pdf_page",
44
+ document=Document(
45
+ fields=[
46
+ Field(
47
+ name="id",
48
+ type="string",
49
+ indexing=["summary", "index"],
50
+ match=["word"],
51
+ ),
52
+ Field(name="url", type="string", indexing=["summary", "index"]),
53
+ Field(
54
+ name="title",
55
+ type="string",
56
+ indexing=["summary", "index"],
57
+ match=["text"],
58
+ index="enable-bm25",
59
+ ),
60
+ Field(
61
+ name="page_number", type="int", indexing=["summary", "attribute"]
62
+ ),
63
+ Field(name="image", type="raw", indexing=["summary"]),
64
+ Field(name="full_image", type="raw", indexing=["summary"]),
65
+ Field(
66
+ name="text",
67
+ type="string",
68
+ indexing=["summary", "index"],
69
+ match=["text"],
70
+ index="enable-bm25",
71
+ ),
72
+ Field(
73
+ name="embedding",
74
+ type="tensor<int8>(patch{}, v[16])",
75
+ indexing=[
76
+ "attribute",
77
+ "index",
78
+ ], # adds HNSW index for candidate retrieval.
79
+ ann=HNSW(
80
+ distance_metric="hamming",
81
+ max_links_per_node=32,
82
+ neighbors_to_explore_at_insert=400,
83
+ ),
84
+ ),
85
+ ]
86
+ ),
87
+ fieldsets=[
88
+ FieldSet(name="default", fields=["title", "url", "page_number", "text"]),
89
+ FieldSet(name="image", fields=["image"]),
90
+ ],
91
+ )
92
+
93
+ # Define rank profiles
94
+ colpali_profile = RankProfile(
95
+ name="default",
96
+ inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
97
+ functions=[
98
+ Function(
99
+ name="max_sim",
100
+ expression="""
101
+ sum(
102
+ reduce(
103
+ sum(
104
+ query(qt) * unpack_bits(attribute(embedding)) , v
105
+ ),
106
+ max, patch
107
+ ),
108
+ querytoken
109
+ )
110
+ """,
111
+ ),
112
+ Function(name="bm25_score", expression="bm25(title) + bm25(text)"),
113
+ ],
114
+ first_phase="bm25_score",
115
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
116
+ )
117
+ colpali_schema.add_rank_profile(colpali_profile)
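+ # The max_sim function above is the ColPali late-interaction (MaxSim) score:
+ # unpack_bits expands each stored int8[16] patch vector back to its 128 binary values,
+ # the inner sum over v is the dot product between one float query token and one patch,
+ # reduce(..., max, patch) keeps the best-matching patch per query token, and the outer
+ # sum adds those maxima over all query tokens. In this "default" profile it only
+ # reranks the top BM25 hits (second phase, rerank_count=10).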
118
+
119
+ # Add retrieval-and-rerank rank profile
120
+ input_query_tensors = []
121
+ MAX_QUERY_TERMS = 64
122
+ for i in range(MAX_QUERY_TERMS):
123
+ input_query_tensors.append((f"query(rq{i})", "tensor<int8>(v[16])"))
124
+
125
+ input_query_tensors.append(("query(qt)", "tensor<float>(querytoken{}, v[128])"))
126
+ input_query_tensors.append(("query(qtb)", "tensor<int8>(querytoken{}, v[16])"))
127
+
128
+ colpali_retrieval_profile = RankProfile(
129
+ name="retrieval-and-rerank",
130
+ inputs=input_query_tensors,
131
+ functions=[
132
+ Function(
133
+ name="max_sim",
134
+ expression="""
135
+ sum(
136
+ reduce(
137
+ sum(
138
+ query(qt) * unpack_bits(attribute(embedding)) , v
139
+ ),
140
+ max, patch
141
+ ),
142
+ querytoken
143
+ )
144
+ """,
145
+ ),
146
+ Function(
147
+ name="max_sim_binary",
148
+ expression="""
149
+ sum(
150
+ reduce(
151
+ 1/(1 + sum(
152
+ hamming(query(qtb), attribute(embedding)) ,v)
153
+ ),
154
+ max,
155
+ patch
156
+ ),
157
+ querytoken
158
+ )
159
+ """,
160
+ ),
161
+ ],
162
+ first_phase="max_sim_binary",
163
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
164
+ )
165
+ colpali_schema.add_rank_profile(colpali_retrieval_profile)
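+ # Here max_sim_binary is the cheap first phase: it scores candidates with an inverted
+ # Hamming distance, 1 / (1 + hamming(qtb, embedding)), between the binary query tokens
+ # and the binary patch embeddings, again taking the max over patches and summing over
+ # query tokens; the top 10 candidates are then rescored with the full float max_sim.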
166
+
167
+ # Create the Vespa application package
168
+ vespa_application_package = ApplicationPackage(
169
+ name=vespa_app_name,
170
+ schema=[colpali_schema],
171
+ auth_clients=[
172
+ AuthClient(
173
+ id="mtls", # Note that you still need to include the mtls client.
174
+ permissions=["read", "write"],
175
+ parameters=[Parameter("certificate", {"file": "security/clients.pem"})],
176
+ ),
177
+ AuthClient(
178
+ id="token_write",
179
+ permissions=["read", "write"],
180
+ parameters=[Parameter("token", {"id": token_id_write})],
181
+ ),
182
+ AuthClient(
183
+ id="token_read",
184
+ permissions=["read"],
185
+ parameters=[Parameter("token", {"id": token_id_read})],
186
+ ),
187
+ ],
188
+ )
189
+ vespa_team_api_key = os.getenv("VESPA_TEAM_API_KEY")
190
+ # Deploy the application to Vespa Cloud
191
+ vespa_cloud = VespaCloud(
192
+ tenant=tenant_name,
193
+ application=vespa_app_name,
194
+ key_content=vespa_team_api_key,
195
+ application_root="colpali-with-snippets",
196
+ #application_package=vespa_application_package,
197
+ )
198
+
199
+ #app = vespa_cloud.deploy()
200
+ vespa_cloud.deploy_from_disk("default", "colpali-with-snippets")
201
+
202
+ # Output the endpoint URL
203
+ endpoint_url = vespa_cloud.get_token_endpoint()
204
+ print(f"Application deployed. Token endpoint URL: {endpoint_url}")
205
+
206
+
207
+ if __name__ == "__main__":
208
+ main()
feed_vespa.py ADDED
@@ -0,0 +1,209 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import argparse
4
+ import torch
5
+ from torch.utils.data import DataLoader
6
+ from tqdm import tqdm
7
+ from io import BytesIO
8
+ from typing import cast
9
+ import os
10
+ import json
11
+ import hashlib
12
+
13
+ from colpali_engine.models import ColPali, ColPaliProcessor
14
+ from colpali_engine.utils.torch_utils import get_torch_device
15
+ from vidore_benchmark.utils.image_utils import scale_image, get_base64_image
16
+ import requests
17
+ from pdf2image import convert_from_path
18
+ from pypdf import PdfReader
19
+ import numpy as np
20
+ from vespa.application import Vespa
21
+ from vespa.io import VespaResponse
22
+ from dotenv import load_dotenv
23
+
24
+ load_dotenv()
25
+
26
+
27
+ def main():
28
+ parser = argparse.ArgumentParser(description="Feed data into Vespa application")
29
+ parser.add_argument(
30
+ "--application_name",
31
+ required=True,
32
+ default="colpalidemo",
33
+ help="Vespa application name",
34
+ )
35
+ parser.add_argument(
36
+ "--vespa_schema_name",
37
+ required=True,
38
+ default="pdf_page",
39
+ help="Vespa schema name",
40
+ )
41
+ args = parser.parse_args()
42
+
43
+ vespa_app_url = os.getenv("VESPA_APP_URL")
44
+ vespa_cloud_secret_token = os.getenv("VESPA_CLOUD_SECRET_TOKEN")
45
+ # Set application and schema names
46
+ application_name = args.application_name
47
+ schema_name = args.vespa_schema_name
48
+ # Instantiate Vespa connection using token
49
+ app = Vespa(url=vespa_app_url, vespa_cloud_secret_token=vespa_cloud_secret_token)
50
+ app.get_application_status()
51
+ model_name = "vidore/colpali-v1.2"
52
+
53
+ device = get_torch_device("auto")
54
+ print(f"Using device: {device}")
55
+
56
+ # Load the model
57
+ model = cast(
58
+ ColPali,
59
+ ColPali.from_pretrained(
60
+ model_name,
61
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
62
+ device_map=device,
63
+ ),
64
+ ).eval()
65
+
66
+ # Load the processor
67
+ processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
68
+
69
+ # Define functions to work with PDFs
70
+ def download_pdf(url):
71
+ response = requests.get(url)
72
+ if response.status_code == 200:
73
+ return BytesIO(response.content)
74
+ else:
75
+ raise Exception(
76
+ f"Failed to download PDF: Status code {response.status_code}"
77
+ )
78
+
79
+ def get_pdf_images(pdf_url):
80
+ # Download the PDF
81
+ pdf_file = download_pdf(pdf_url)
82
+ # Save the PDF temporarily to disk (pdf2image requires a file path)
83
+ temp_file = "temp.pdf"
84
+ with open(temp_file, "wb") as f:
85
+ f.write(pdf_file.read())
86
+ reader = PdfReader(temp_file)
87
+ page_texts = []
88
+ for page_number in range(len(reader.pages)):
89
+ page = reader.pages[page_number]
90
+ text = page.extract_text()
91
+ page_texts.append(text)
92
+ images = convert_from_path(temp_file)
93
+ assert len(images) == len(page_texts)
94
+ return (images, page_texts)
95
+
96
+ # Define sample PDFs
97
+ sample_pdfs = [
98
+ {
99
+ "title": "ConocoPhillips Sustainability Highlights - Nature (24-0976)",
100
+ "url": "https://static.conocophillips.com/files/resources/24-0976-sustainability-highlights_nature.pdf",
101
+ },
102
+ {
103
+ "title": "ConocoPhillips Managing Climate Related Risks",
104
+ "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-managing-climate-related-risks.pdf",
105
+ },
106
+ {
107
+ "title": "ConocoPhillips 2023 Sustainability Report",
108
+ "url": "https://static.conocophillips.com/files/resources/conocophillips-2023-sustainability-report.pdf",
109
+ },
110
+ ]
111
+
112
+ # Check if vespa_feed.json exists
113
+ if os.path.exists("vespa_feed.json"):
114
+ print("Loading vespa_feed from vespa_feed.json")
115
+ with open("vespa_feed.json", "r") as f:
116
+ vespa_feed_saved = json.load(f)
117
+ vespa_feed = []
118
+ for doc in vespa_feed_saved:
119
+ put_id = doc["put"]
120
+ fields = doc["fields"]
121
+ # Extract document_id from put_id
122
+ # Format: 'id:application_name:schema_name::document_id'
123
+ parts = put_id.split("::")
124
+ document_id = parts[1] if len(parts) > 1 else ""
125
+ page = {"id": document_id, "fields": fields}
126
+ vespa_feed.append(page)
127
+ else:
128
+ print("Generating vespa_feed")
129
+ # Process PDFs
130
+ for pdf in sample_pdfs:
131
+ page_images, page_texts = get_pdf_images(pdf["url"])
132
+ pdf["images"] = page_images
133
+ pdf["texts"] = page_texts
134
+
135
+ # Generate embeddings
136
+ for pdf in sample_pdfs:
137
+ page_embeddings = []
138
+ dataloader = DataLoader(
139
+ pdf["images"],
140
+ batch_size=2,
141
+ shuffle=False,
142
+ collate_fn=lambda x: processor.process_images(x),
143
+ )
144
+ for batch_doc in tqdm(dataloader):
145
+ with torch.no_grad():
146
+ batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
147
+ embeddings_doc = model(**batch_doc)
148
+ page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
149
+ pdf["embeddings"] = page_embeddings
150
+
151
+ # Prepare Vespa feed
152
+ vespa_feed = []
153
+ for pdf in sample_pdfs:
154
+ url = pdf["url"]
155
+ title = pdf["title"]
156
+ for page_number, (page_text, embedding, image) in enumerate(
157
+ zip(pdf["texts"], pdf["embeddings"], pdf["images"])
158
+ ):
159
+ base_64_image = get_base64_image(
160
+ scale_image(image, 640), add_url_prefix=False
161
+ )
162
+ base_64_full_image = get_base64_image(image, add_url_prefix=False)
163
+ embedding_dict = dict()
164
+ for idx, patch_embedding in enumerate(embedding):
165
+ binary_vector = (
166
+ np.packbits(np.where(patch_embedding > 0, 1, 0))
167
+ .astype(np.int8)
168
+ .tobytes()
169
+ .hex()
170
+ )
171
+ embedding_dict[idx] = binary_vector
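+ # Each 128-dim patch embedding is binarized (thresholded at 0), packed into 16 bytes
+ # and hex-encoded, keyed by patch index, which is the feed format for the
+ # tensor<int8>(patch{}, v[16]) "embedding" field defined in deploy_vespa_app.py.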
172
+ # id_hash should be md5 hash of url and page_number
173
+ id_hash = hashlib.md5(f"{url}_{page_number}".encode()).hexdigest()
174
+ page = {
175
+ "id": id_hash,
176
+ "fields": {
177
+ "id": id_hash,
178
+ "url": url,
179
+ "title": title,
180
+ "page_number": page_number,
181
+ "image": base_64_image,
182
+ "full_image": base_64_full_image,
183
+ "text": page_text,
184
+ "embedding": embedding_dict,
185
+ },
186
+ }
187
+ vespa_feed.append(page)
188
+
189
+ # Save vespa_feed to vespa_feed.json in the specified format
190
+ vespa_feed_to_save = []
191
+ for page in vespa_feed:
192
+ document_id = page["id"]
193
+ put_id = f"id:{application_name}:{schema_name}::{document_id}"
194
+ vespa_feed_to_save.append({"put": put_id, "fields": page["fields"]})
195
+ with open("vespa_feed.json", "w") as f:
196
+ json.dump(vespa_feed_to_save, f)
197
+
198
+ def callback(response: VespaResponse, id: str):
199
+ if not response.is_successful():
200
+ print(
201
+ f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
202
+ )
203
+
204
+ # Feed data into Vespa
205
+ app.feed_iterable(vespa_feed, schema=schema_name, callback=callback)
206
+
207
+
208
+ if __name__ == "__main__":
209
+ main()
frontend/__init__.py ADDED
File without changes
frontend/app.py ADDED
@@ -0,0 +1,768 @@
1
+ from typing import Optional
2
+ from urllib.parse import quote_plus
3
+
4
+ from fasthtml.components import (
5
+ H1,
6
+ H2,
7
+ H3,
8
+ Br,
9
+ Div,
10
+ Form,
11
+ Img,
12
+ NotStr,
13
+ P,
14
+ Hr,
15
+ Span,
16
+ A,
17
+ Script,
18
+ Button,
19
+ Label,
20
+ RadioGroup,
21
+ RadioGroupItem,
22
+ Separator,
23
+ Ul,
24
+ Li,
25
+ Strong,
26
+ Iframe,
27
+ )
28
+ from fasthtml.xtend import A, Script
29
+ from lucide_fasthtml import Lucide
30
+ from shad4fast import Badge, Button, Input, Label, RadioGroup, RadioGroupItem, Separator
31
+
32
+ # JavaScript to check the input value and enable/disable the search button and radio buttons
33
+ check_input_script = Script(
34
+ """
35
+ window.onload = function() {
36
+ const input = document.getElementById('search-input');
37
+ const button = document.querySelector('[data-button="search-button"]');
38
+ const radioGroupItems = document.querySelectorAll('button[data-ref="radio-item"]'); // Get all radio buttons
39
+
40
+ function checkInputValue() {
41
+ const isInputEmpty = input.value.trim() === "";
42
+ button.disabled = isInputEmpty; // Disable the submit button
43
+ radioGroupItems.forEach(item => {
44
+ item.disabled = isInputEmpty; // Disable/enable the radio buttons
45
+ });
46
+ }
47
+
48
+ input.addEventListener('input', checkInputValue); // Listen for input changes
49
+ checkInputValue(); // Initial check when the page loads
50
+ };
51
+ """
52
+ )
53
+
54
+ # JavaScript to handle the image swapping, reset button, and active class toggling
55
+ image_swapping = Script(
56
+ """
57
+ document.addEventListener('click', function (e) {
58
+ if (e.target.classList.contains('sim-map-button') || e.target.classList.contains('reset-button')) {
59
+ const imgContainer = e.target.closest('.relative');
60
+ const overlayContainer = imgContainer.querySelector('.overlay-container');
61
+ const newSrc = e.target.getAttribute('data-image-src');
62
+
63
+ // If it's a reset button, remove the overlay image
64
+ if (e.target.classList.contains('reset-button')) {
65
+ overlayContainer.innerHTML = ''; // Clear the overlay container, showing only the full image
66
+ } else {
67
+ // Create a new overlay image
68
+ const img = document.createElement('img');
69
+ img.src = newSrc;
70
+ img.classList.add('overlay-image', 'absolute', 'top-0', 'left-0', 'w-full', 'h-full');
71
+ overlayContainer.innerHTML = ''; // Clear any previous overlay
72
+ overlayContainer.appendChild(img); // Add the new overlay image
73
+ }
74
+
75
+ // Toggle active class on buttons
76
+ const activeButton = document.querySelector('.sim-map-button.active');
77
+ if (activeButton) {
78
+ activeButton.classList.remove('active');
79
+ }
80
+ if (e.target.classList.contains('sim-map-button')) {
81
+ e.target.classList.add('active');
82
+ }
83
+ }
84
+ });
85
+ """
86
+ )
87
+
88
+ toggle_text_content = Script(
89
+ """
90
+ function toggleTextContent(idx) {
91
+ const textColumn = document.getElementById(`text-column-${idx}`);
92
+ const imageTextColumns = document.getElementById(`image-text-columns-${idx}`);
93
+ const toggleButton = document.getElementById(`toggle-button-${idx}`);
94
+
95
+ if (textColumn.classList.contains('md-grid-text-column')) {
96
+ // Hide the text column
97
+ textColumn.classList.remove('md-grid-text-column');
98
+ imageTextColumns.classList.remove('grid-image-text-columns');
99
+ toggleButton.innerText = `Show Text`;
100
+ } else {
101
+ // Show the text column
102
+ textColumn.classList.add('md-grid-text-column');
103
+ imageTextColumns.classList.add('grid-image-text-columns');
104
+ toggleButton.innerText = `Hide Text`;
105
+ }
106
+ }
107
+ """
108
+ )
109
+
110
+ autocomplete_script = Script(
111
+ """
112
+ document.addEventListener('DOMContentLoaded', function() {
113
+ const input = document.querySelector('#search-input');
114
+ const awesomplete = new Awesomplete(input, { minChars: 1, maxItems: 5 });
115
+
116
+ input.addEventListener('input', function() {
117
+ if (this.value.length >= 1) {
118
+ // Use template literals to insert the input value dynamically in the query parameter
119
+ fetch(`/suggestions?query=${encodeURIComponent(this.value)}`)
120
+ .then(response => response.json())
121
+ .then(data => {
122
+ // Update the Awesomplete list dynamically with fetched suggestions
123
+ awesomplete.list = data.suggestions;
124
+ })
125
+ .catch(err => console.error('Error fetching suggestions:', err));
126
+ }
127
+ });
128
+ });
129
+ """
130
+ )
131
+
132
+ dynamic_elements_scrollbars = Script(
133
+ """
134
+ (function () {
135
+ const { applyOverlayScrollbars, getScrollbarTheme } = OverlayScrollbarsManager;
136
+
137
+ function applyScrollbarsToDynamicElements() {
138
+ const scrollbarTheme = getScrollbarTheme();
139
+
140
+ // Apply scrollbars to dynamically loaded result-text-full and result-text-snippet elements
141
+ const resultTextFullElements = document.querySelectorAll('[id^="result-text-full"]');
142
+ const resultTextSnippetElements = document.querySelectorAll('[id^="result-text-snippet"]');
143
+
144
+ resultTextFullElements.forEach(element => {
145
+ applyOverlayScrollbars(element, scrollbarTheme);
146
+ });
147
+
148
+ resultTextSnippetElements.forEach(element => {
149
+ applyOverlayScrollbars(element, scrollbarTheme);
150
+ });
151
+ }
152
+
153
+ // Apply scrollbars after dynamic content is loaded (e.g., after search results)
154
+ applyScrollbarsToDynamicElements();
155
+
156
+ // Observe changes in the 'dark' class to adjust the theme dynamically if needed
157
+ const observer = new MutationObserver(applyScrollbarsToDynamicElements);
158
+ observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class'] });
159
+ })();
160
+ """
161
+ )
162
+
163
+ submit_form_on_radio_change = Script(
164
+ """
165
+ document.addEventListener('click', function (e) {
166
+ // if target has data-ref="radio-item" and type is button
167
+ if (e.target.getAttribute('data-ref') === 'radio-item' && e.target.type === 'button') {
168
+ console.log('Radio button clicked');
169
+ const form = e.target.closest('form');
170
+ form.submit();
171
+ }
172
+ });
173
+ """
174
+ )
175
+
176
+
177
+ def ShareButtons():
178
+ title = "Visual RAG over PDFs with Vespa and ColPali"
179
+ url = "https://huggingface.co/spaces/vespa-engine/colpali-vespa-visual-retrieval"
180
+ return Div(
181
+ A(
182
+ Img(src="/static/img/linkedin.svg", aria_hidden="true", cls="h-[21px]"),
183
+ "Share on LinkedIn",
184
+ href=f"https://www.linkedin.com/sharing/share-offsite/?url={quote_plus(url)}",
185
+ rel="noopener noreferrer",
186
+ target="_blank",
187
+ cls="bg-[#0A66C2] text-white inline-flex items-center gap-x-1.5 px-2.5 py-1.5 border rounded-md text-sm font-semibold",
188
+ ),
189
+ A(
190
+ Img(src="/static/img/x.svg", aria_hidden="true", cls="h-[21px]"),
191
+ "Share on X",
192
+ href=f"https://twitter.com/intent/tweet?text={quote_plus(title)}&url={quote_plus(url)}",
193
+ rel="noopener noreferrer",
194
+ target="_blank",
195
+ cls="bg-black text-white inline-flex items-center gap-x-1.5 px-2.5 py-1.5 border rounded-md text-sm font-semibold",
196
+ ),
197
+ cls="flex items-center justify-center space-x-8 mt-5",
198
+ )
199
+
200
+
201
+ def SearchBox(with_border=False, query_value="", ranking_value="hybrid"):
202
+ grid_cls = "grid gap-2 items-center p-3 bg-muted w-full"
203
+
204
+ if with_border:
205
+ grid_cls = "grid gap-2 p-3 rounded-md border border-input bg-muted w-full ring-offset-background focus-within:outline-none focus-within:ring-2 focus-within:ring-ring focus-within:ring-offset-2 focus-within:border-input"
206
+
207
+ return Form(
208
+ Div(
209
+ Lucide(
210
+ icon="search", cls="absolute left-2 top-2 text-muted-foreground z-10"
211
+ ),
212
+ Input(
213
+ placeholder="Enter your search query...",
214
+ name="query",
215
+ value=query_value,
216
+ id="search-input",
217
+ cls="text-base pl-10 border-transparent ring-offset-transparent ring-0 focus-visible:ring-transparent bg-white dark:bg-background awesomplete",
218
+ data_list="#suggestions",
219
+ style="font-size: 1rem",
220
+ autofocus=True,
221
+ ),
222
+ cls="relative",
223
+ ),
224
+ Div(
225
+ Div(
226
+ Span("Ranking by:", cls="text-muted-foreground text-xs font-semibold"),
227
+ RadioGroup(
228
+ Div(
229
+ RadioGroupItem(value="colpali", id="colpali"),
230
+ Label("ColPali", htmlFor="ColPali"),
231
+ cls="flex items-center space-x-2",
232
+ ),
233
+ Div(
234
+ RadioGroupItem(value="bm25", id="bm25"),
235
+ Label("BM25", htmlFor="BM25"),
236
+ cls="flex items-center space-x-2",
237
+ ),
238
+ Div(
239
+ RadioGroupItem(value="hybrid", id="hybrid"),
240
+ Label("Hybrid ColPali + BM25", htmlFor="Hybrid ColPali + BM25"),
241
+ cls="flex items-center space-x-2",
242
+ ),
243
+ name="ranking",
244
+ default_value=ranking_value,
245
+ cls="grid-flow-col gap-x-5 text-muted-foreground",
246
+ # Submit form when radio button is clicked
247
+ ),
248
+ cls="grid grid-flow-col items-center gap-x-3 border border-input px-3 rounded-sm",
249
+ ),
250
+ Button(
251
+ Lucide(icon="arrow-right", size="21"),
252
+ size="sm",
253
+ type="submit",
254
+ data_button="search-button",
255
+ disabled=True,
256
+ ),
257
+ cls="flex justify-between",
258
+ ),
259
+ check_input_script,
260
+ autocomplete_script,
261
+ submit_form_on_radio_change,
262
+ action=f"/search?query={quote_plus(query_value)}&ranking={quote_plus(ranking_value)}",
263
+ method="GET",
264
+ hx_get="/fetch_results", # As the component is a form, input components query and ranking are sent as query parameters automatically, see https://htmx.org/docs/#parameters
265
+ hx_trigger="load",
266
+ hx_target="#search-results",
267
+ hx_swap="outerHTML",
268
+ hx_indicator="#loading-indicator",
269
+ cls=grid_cls,
270
+ )
271
+
272
+
273
+ def SampleQueries():
274
+ sample_queries = [
275
+ "What percentage of the funds unlisted real estate investments were in Switzerland 2023?",
276
+ "Gender balance at level 4 or above in NY office 2023?",
277
+ "Number of graduate applications trend 2021-2023",
278
+ "Total amount of fixed salaries paid in 2023?",
279
+ "Proportion of female new hires 2021-2023?",
280
+ "child jumping over puddle",
281
+ "hula hoop kid",
282
+ ]
283
+
284
+ query_badges = []
285
+ for query in sample_queries:
286
+ query_badges.append(
287
+ A(
288
+ Badge(
289
+ Div(
290
+ Lucide(
291
+ icon="text-search", size="18", cls="text-muted-foreground"
292
+ ),
293
+ Span(query, cls="text-base font-normal"),
294
+ cls="flex gap-2 items-center",
295
+ ),
296
+ variant="outline",
297
+ cls="text-base font-normal text-muted-foreground hover:border-black dark:hover:border-white",
298
+ ),
299
+ href=f"/search?query={quote_plus(query)}",
300
+ cls="no-underline",
301
+ )
302
+ )
303
+
304
+ return Div(*query_badges, cls="grid gap-2 justify-items-center")
305
+
306
+
307
+ def Hero():
308
+ return Div(
309
+ H1(
310
+ "Visual RAG over PDFs",
311
+ cls="text-5xl md:text-6xl font-bold tracking-wide md:tracking-wider bg-clip-text text-transparent bg-gradient-to-r from-black to-slate-700 dark:from-white dark:to-slate-300 animate-fade-in",
312
+ ),
313
+ P(
314
+ "See how Vespa and ColPali can be used for Visual RAG in this demo",
315
+ cls="text-base md:text-2xl text-muted-foreground md:tracking-wide",
316
+ ),
317
+ cls="grid gap-5 text-center",
318
+ )
319
+
320
+
321
+ def Home():
322
+ return Div(
323
+ Div(
324
+ Hero(),
325
+ SearchBox(with_border=True),
326
+ SampleQueries(),
327
+ ShareButtons(),
328
+ cls="grid gap-8 content-start mt-[13vh]",
329
+ ),
330
+ cls="grid w-full h-full max-w-screen-md gap-4 mx-auto",
331
+ )
332
+
333
+
334
+ def LinkResource(text, href):
335
+ return Li(
336
+ A(
337
+ Lucide(icon="external-link", size="18"),
338
+ text,
339
+ href=href,
340
+ target="_blank",
341
+ cls="flex items-center gap-1.5 hover:underline bold text-md",
342
+ ),
343
+ )
344
+
345
+
346
+ def AboutThisDemo():
347
+ resources = [
348
+ {
349
+ "text": "Vespa Blog: How we built this demo",
350
+ "href": "https://blog.vespa.ai/visual-rag-in-practice",
351
+ },
352
+ {
353
+ "text": "Notebook to set up Vespa application and feed dataset",
354
+ "href": "https://pyvespa.readthedocs.io/en/latest/examples/visual_pdf_rag_with_vespa_colpali_cloud.html",
355
+ },
356
+ {
357
+ "text": "Web App (FastHTML) Code",
358
+ "href": "https://github.com/vespa-engine/sample-apps/tree/master/visual-retrieval-colpali",
359
+ },
360
+ {
361
+ "text": "Vespa Blog: Scaling ColPali to Billions",
362
+ "href": "https://blog.vespa.ai/scaling-colpali-to-billions/",
363
+ },
364
+ {
365
+ "text": "Vespa Blog: Retrieval with Vision Language Models",
366
+ "href": "https://blog.vespa.ai/retrieval-with-vision-language-models-colpali/",
367
+ },
368
+ ]
369
+ return Div(
370
+ H1(
371
+ "About This Demo",
372
+ cls="text-3xl md:text-5xl font-bold tracking-wide md:tracking-wider",
373
+ ),
374
+ P(
375
+ "This demo showcases a Visual Retrieval-Augmented Generation (RAG) application over PDFs using ColPali embeddings in Vespa, built entirely in Python, using FastHTML. The code is fully open source.",
376
+ cls="text-base",
377
+ ),
378
+ Img(
379
+ src="/static/img/colpali_child.png",
380
+ alt="Example of token level similarity map",
381
+ cls="w-full",
382
+ ),
383
+ H2("Resources", cls="text-2xl font-semibold"),
384
+ Ul(
385
+ *[
386
+ LinkResource(resource["text"], resource["href"])
387
+ for resource in resources
388
+ ],
389
+ cls="space-y-2 list-disc pl-5",
390
+ ),
391
+ H2("Architecture Overview", cls="text-2xl font-semibold"),
392
+ Img(
393
+ src="/static/img/visual-retrieval-demoapp-arch.png",
394
+ alt="Architecture Overview",
395
+ cls="w-full",
396
+ ),
397
+ Ul(
398
+ Li(
399
+ Strong("Vespa Application: "),
400
+ "Vespa Application that handles indexing, search, ranking and queries, leveraging features like phased ranking and multivector MaxSim calculations.",
401
+ ),
402
+ Li(
403
+ Strong("Frontend: "),
404
+ "Built with FastHTML, offering a professional and responsive user interface without the complexity of separate frontend frameworks.",
405
+ ),
406
+ Li(
407
+ Strong("Backend: "),
408
+ "Also built with FastHTML. Handles query embedding inference using ColPali, serves static files, and is responsible for orchestrating interactions between Vespa and the frontend.",
409
+ ),
410
+ Li(
411
+ Strong("Gemini API: "),
412
+ "VLM for the AI response, providing responses based on the top results from Vespa.",
413
+ cls="list-disc list-inside",
414
+ ),
415
+ H2("User Experience Highlights", cls="text-2xl font-semibold"),
416
+ Ul(
417
+ Li(
418
+ Strong("Fast and Responsive: "),
419
+ "Optimized for quick loading times, with phased content delivery to display essential information immediately while loading detailed data in the background.",
420
+ ),
421
+ Li(
422
+ Strong("Similarity Maps: "),
423
+ "Provides visual highlights of the most relevant parts of a page in response to a query, enhancing interpretability.",
424
+ ),
425
+ Li(
426
+ Strong("Type-Ahead Suggestions: "),
427
+ "Offers query suggestions to assist users in formulating effective searches.",
428
+ ),
429
+ cls="list-disc list-inside",
430
+ ),
431
+ cls="grid gap-5",
432
+ ),
433
+ H2("Dataset", cls="text-2xl font-semibold"),
434
+ P(
435
+ "The dataset used in this demo is retrieved from reports published by the Norwegian Government Pension Fund Global. It contains 6,992 pages from 116 PDF reports (2000–2024). The information is often presented in visual formats, making it an ideal dataset for visual retrieval applications.",
436
+ cls="text-base",
437
+ ),
438
+ Iframe(
439
+ src="https://huggingface.co/datasets/vespa-engine/gpfg-QA/embed/viewer",
440
+ frameborder="0",
441
+ width="100%",
442
+ height="500",
443
+ ),
444
+ Hr(), # Adds some bottom margin; there is probably a cleaner way, but the mb-[16vh] class doesn't seem to be applied
445
+ cls="w-full h-full max-w-screen-md gap-4 mx-auto mt-[8vh] mb-[16vh] grid gap-8 content-start",
446
+ )
447
+
448
+
449
+ def Search(request, search_results=[]):
450
+ query_value = request.query_params.get("query", "").strip()
451
+ ranking_value = request.query_params.get("ranking", "hybrid")
452
+ return Div(
453
+ Div(
454
+ Div(
455
+ SearchBox(query_value=query_value, ranking_value=ranking_value),
456
+ Div(
457
+ LoadingMessage(),
458
+ id="search-results", # This will be replaced by the search results
459
+ ),
460
+ cls="grid",
461
+ ),
462
+ cls="grid",
463
+ ),
464
+ )
465
+
466
+
467
+ def LoadingMessage(display_text="Retrieving search results"):
468
+ return Div(
469
+ Lucide(icon="loader-circle", cls="size-5 mr-1.5 animate-spin"),
470
+ Span(display_text, cls="text-base text-center"),
471
+ cls="p-10 text-muted-foreground flex items-center justify-center",
472
+ id="loading-indicator",
473
+ )
474
+
475
+
476
+ def LoadingSkeleton():
477
+ return Div(
478
+ Div(cls="h-5 bg-muted"),
479
+ Div(cls="h-5 bg-muted"),
480
+ Div(cls="h-5 bg-muted"),
481
+ cls="grid gap-2 animate-pulse",
482
+ )
483
+
484
+
485
+ def SimMapButtonReady(query_id, idx, token, token_idx, img_src):
486
+ return Button(
487
+ token.replace("\u2581", ""),
488
+ size="sm",
489
+ data_image_src=img_src,
490
+ id=f"sim-map-button-{query_id}-{idx}-{token_idx}-{token}",
491
+ cls="sim-map-button pointer-events-auto font-mono text-xs h-5 rounded-none px-2",
492
+ )
493
+
494
+
495
+ def SimMapButtonPoll(query_id, idx, token, token_idx):
496
+ return Button(
497
+ Lucide(icon="loader-circle", size="15", cls="animate-spin"),
498
+ size="sm",
499
+ disabled=True,
500
+ hx_get=f"/get_sim_map?query_id={query_id}&idx={idx}&token={token}&token_idx={token_idx}",
501
+ hx_trigger="every 0.5s",
502
+ hx_swap="outerHTML",
503
+ cls="pointer-events-auto text-xs h-5 rounded-none px-2",
504
+ )
505
+
506
+
507
+ def SearchInfo(search_time, total_count):
508
+ return (
509
+ Div(
510
+ Span(
511
+ "Retrieved ",
512
+ Strong(total_count),
513
+ Span(" results"),
514
+ Span(" in "),
515
+ Strong(f"{search_time:.3f}"), # 3 significant digits
516
+ Span(" seconds."),
517
+ ),
518
+ cls="grid bg-background border-t text-sm text-center p-3",
519
+ ),
520
+ )
521
+
522
+
523
+ def SearchResult(
524
+ results: list,
525
+ query: str,
526
+ query_id: Optional[str] = None,
527
+ search_time: float = 0,
528
+ total_count: int = 0,
529
+ ):
530
+ if not results:
531
+ return Div(
532
+ P(
533
+ "No results found for your query.",
534
+ cls="text-muted-foreground text-base text-center",
535
+ ),
536
+ cls="grid p-10",
537
+ )
538
+
539
+ doc_ids = []
540
+ # Otherwise, display the search results
541
+ result_items = []
542
+ for idx, result in enumerate(results):
543
+ fields = result["fields"] # Extract the 'fields' part of each result
544
+ doc_id = fields["id"]
545
+ doc_ids.append(doc_id)
546
+ blur_image_base64 = f"data:image/jpeg;base64,{fields['blur_image']}"
547
+
548
+ sim_map_fields = {
549
+ key: value
550
+ for key, value in fields.items()
551
+ if key.startswith(
552
+ "sim_map_"
553
+ ) # tokens were already filtered when these fields were created, via the 'should_filter_token' function
554
+ }
555
+
556
+ # Generate buttons for the sim_map fields
557
+ sim_map_buttons = []
558
+ for key, value in sim_map_fields.items():
559
+ token = key.split("_")[-2]
560
+ token_idx = int(key.split("_")[-1])
561
+ if value is not None:
562
+ sim_map_base64 = f"data:image/jpeg;base64,{value}"
563
+ sim_map_buttons.append(
564
+ SimMapButtonReady(
565
+ query_id=query_id,
566
+ idx=idx,
567
+ token=token,
568
+ token_idx=token_idx,
569
+ img_src=sim_map_base64,
570
+ )
571
+ )
572
+ else:
573
+ sim_map_buttons.append(
574
+ SimMapButtonPoll(
575
+ query_id=query_id,
576
+ idx=idx,
577
+ token=token,
578
+ token_idx=token_idx,
579
+ )
580
+ )
581
+
582
+ # Add "Reset Image" button to restore the full image
583
+ reset_button = Button(
584
+ "Reset",
585
+ variant="outline",
586
+ size="sm",
587
+ data_image_src=blur_image_base64,
588
+ cls="reset-button pointer-events-auto font-mono text-xs h-5 rounded-none px-2",
589
+ )
590
+
591
+ tokens_icon = Lucide(icon="images", size="15")
592
+
593
+ # Add "Tokens" button - this has no action, just a placeholder
594
+ tokens_button = Button(
595
+ tokens_icon,
596
+ "Tokens",
597
+ size="sm",
598
+ cls="tokens-button flex gap-[3px] font-bold pointer-events-none font-mono text-xs h-5 rounded-none px-2",
599
+ )
600
+
601
+ result_items.append(
602
+ Div(
603
+ Div(
604
+ Div(
605
+ Lucide(icon="file-text"),
606
+ H2(fields["title"], cls="text-xl md:text-2xl font-semibold"),
607
+ Separator(orientation="vertical"),
608
+ Badge(
609
+ f"Relevance score: {result['relevance']:.4f}",
610
+ cls="flex gap-1.5 items-center justify-center",
611
+ ),
612
+ cls="flex items-center gap-2",
613
+ ),
614
+ Div(
615
+ Button(
616
+ "Hide Text",
617
+ size="sm",
618
+ id=f"toggle-button-{idx}",
619
+ onclick=f"toggleTextContent({idx})",
620
+ cls="hidden md:block",
621
+ ),
622
+ ),
623
+ cls="flex flex-wrap items-center justify-between bg-background px-3 py-4",
624
+ ),
625
+ Div(
626
+ Div(
627
+ Div(
628
+ tokens_button,
629
+ *sim_map_buttons,
630
+ reset_button,
631
+ cls="flex flex-wrap gap-px w-full pointer-events-none",
632
+ ),
633
+ Div(
634
+ Div(
635
+ Div(
636
+ Img(
637
+ src=blur_image_base64,
638
+ hx_get=f"/full_image?doc_id={doc_id}",
639
+ style="backdrop-filter: blur(5px);",
640
+ hx_trigger="load",
641
+ hx_swap="outerHTML",
642
+ alt=fields["title"],
643
+ cls="result-image w-full h-full object-contain",
644
+ ),
645
+ Div(
646
+ cls="overlay-container absolute top-0 left-0 w-full h-full pointer-events-none"
647
+ ),
648
+ cls="relative w-full h-full",
649
+ ),
650
+ cls="grid bg-muted p-2",
651
+ ),
652
+ cls="block",
653
+ ),
654
+ id=f"image-column-{idx}",
655
+ cls="image-column relative bg-background px-3 py-5 grid-image-column",
656
+ ),
657
+ Div(
658
+ Div(
659
+ A(
660
+ Lucide(icon="external-link", size="18"),
661
+ f"PDF Source (Page {fields['page_number'] + 1})",
662
+ href=f"{fields['url']}#page={fields['page_number'] + 1}",
663
+ target="_blank",
664
+ cls="flex items-center gap-1.5 font-mono bold text-sm",
665
+ ),
666
+ cls="flex items-center justify-end",
667
+ ),
668
+ Div(
669
+ Div(
670
+ Div(
671
+ Div(
672
+ Div(
673
+ H3(
674
+ "Dynamic summary",
675
+ cls="text-base font-semibold",
676
+ ),
677
+ P(
678
+ NotStr(fields.get("snippet", "")),
679
+ cls="text-highlight text-muted-foreground",
680
+ ),
681
+ cls="grid grid-rows-[auto_0px] content-start gap-y-3",
682
+ ),
683
+ id=f"result-text-snippet-{idx}",
684
+ cls="grid gap-y-3 p-8 border border-dashed",
685
+ ),
686
+ Div(
687
+ Div(
688
+ Div(
689
+ H3(
690
+ "Full text",
691
+ cls="text-base font-semibold",
692
+ ),
693
+ Div(
694
+ P(
695
+ NotStr(fields.get("text", "")),
696
+ cls="text-highlight text-muted-foreground",
697
+ ),
698
+ Br(),
699
+ ),
700
+ cls="grid grid-rows-[auto_0px] content-start gap-y-3",
701
+ ),
702
+ id=f"result-text-full-{idx}",
703
+ cls="grid gap-y-3 p-8 border border-dashed",
704
+ ),
705
+ Div(
706
+ cls="absolute inset-x-0 bottom-0 bg-gradient-to-t from-[#fcfcfd] dark:from-[#1c2024] pt-[7%]"
707
+ ),
708
+ cls="relative grid",
709
+ ),
710
+ cls="grid grid-rows-[1fr_1fr] xl:grid-rows-[1fr_2fr] gap-y-8 p-8 text-sm",
711
+ ),
712
+ cls="grid bg-background",
713
+ ),
714
+ cls="grid bg-muted p-2",
715
+ ),
716
+ id=f"text-column-{idx}",
717
+ cls="text-column relative bg-background px-3 py-5 hidden md-grid-text-column",
718
+ ),
719
+ id=f"image-text-columns-{idx}",
720
+ cls="relative grid grid-cols-1 border-t grid-image-text-columns",
721
+ ),
722
+ cls="grid grid-cols-1 grid-rows-[auto_auto_1fr]",
723
+ ),
724
+ )
725
+
726
+ return [
727
+ Div(
728
+ SearchInfo(search_time, total_count),
729
+ *result_items,
730
+ image_swapping,
731
+ toggle_text_content,
732
+ dynamic_elements_scrollbars,
733
+ id="search-results",
734
+ cls="grid grid-cols-1 gap-px bg-border min-h-0",
735
+ ),
736
+ Div(
737
+ ChatResult(query_id=query_id, query=query, doc_ids=doc_ids),
738
+ hx_swap_oob="true",
739
+ id="chat_messages",
740
+ ),
741
+ ]
742
+
743
+
744
+ def ChatResult(query_id: str, query: str, doc_ids: Optional[list] = None):
745
+ messages = Div(LoadingSkeleton())
746
+
747
+ if doc_ids:
748
+ messages = Div(
749
+ LoadingSkeleton(),
750
+ hx_ext="sse",
751
+ sse_connect=f"/get-message?query_id={query_id}&doc_ids={','.join(doc_ids)}&query={quote_plus(query)}",
752
+ sse_swap="message",
753
+ sse_close="close",
754
+ hx_swap="innerHTML",
755
+ )
756
+
757
+ return Div(
758
+ Div("AI-response (Gemini-2.0)", cls="text-xl font-semibold p-5"),
759
+ Div(
760
+ Div(
761
+ messages,
762
+ ),
763
+ id="chat-messages",
764
+ cls="overflow-auto min-h-0 grid items-end px-5",
765
+ ),
766
+ id="chat_messages",
767
+ cls="h-full grid grid-rows-[auto_1fr_auto] min-h-0 gap-3",
768
+ )
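
ChatResult above wires the htmx SSE extension (`sse_connect`, `sse_swap="message"`, `sse_close="close"`) to the `/get-message` endpoint defined later in `main.py`. For orientation only, the two event shapes the frontend listens for look like this (illustrative strings, matching the event names used in the backend):

```python
# "message" events carry the (HTML-formatted) partial response text
sse_message = "event: message\ndata: Generating response based on 3 images...\n\n"
# a "close" event tells the SSE extension to stop listening
sse_close = "event: close\ndata: \n\n"
```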
frontend/layout.py ADDED
@@ -0,0 +1,171 @@
1
+ from fasthtml.components import Body, Div, Header, Img, Nav, Title
2
+ from fasthtml.xtend import A, Script
3
+ from lucide_fasthtml import Lucide
4
+ from shad4fast import Button, Separator
5
+
6
+ layout_script = Script(
7
+ """
8
+ document.addEventListener("DOMContentLoaded", function () {
9
+ const main = document.querySelector('main');
10
+ const aside = document.querySelector('aside');
11
+ const body = document.body;
12
+
13
+ if (main && aside && main.nextElementSibling === aside) {
14
+ // If we have both main and aside, adjust the layout for larger screens
15
+ body.classList.remove('grid-cols-1'); // Remove single-column layout
16
+ body.classList.add('md:grid-cols-[minmax(0,_45fr)_minmax(0,_15fr)]'); // Two-column layout on larger screens
17
+ } else if (main) {
18
+ // If only main, keep it full width
19
+ body.classList.add('grid-cols-1');
20
+ }
21
+ });
22
+ """
23
+ )
24
+
25
+ overlay_scrollbars_manager = Script(
26
+ """
27
+ (function () {
28
+ const { OverlayScrollbars } = OverlayScrollbarsGlobal;
29
+
30
+ function getPreferredTheme() {
31
+ return localStorage.theme === 'dark' || (!('theme' in localStorage) && window.matchMedia('(prefers-color-scheme: dark)').matches)
32
+ ? 'dark'
33
+ : 'light';
34
+ }
35
+
36
+ function applyOverlayScrollbars(element, scrollbarTheme) {
37
+ // Destroy existing OverlayScrollbars instance if it exists
38
+ const instance = OverlayScrollbars(element);
39
+ if (instance) {
40
+ instance.destroy();
41
+ }
42
+
43
+ // Reinitialize OverlayScrollbars with the correct theme and settings
44
+ OverlayScrollbars(element, {
45
+ overflow: {
46
+ x: 'hidden',
47
+ y: 'scroll'
48
+ },
49
+ scrollbars: {
50
+ theme: scrollbarTheme,
51
+ visibility: 'auto',
52
+ autoHide: 'leave',
53
+ autoHideDelay: 800
54
+ }
55
+ });
56
+ }
57
+
58
+ // Function to get the current scrollbar theme (light or dark)
59
+ function getScrollbarTheme() {
60
+ const isDarkMode = getPreferredTheme() === 'dark';
61
+ return isDarkMode ? 'os-theme-light' : 'os-theme-dark'; // Light theme in dark mode, dark theme in light mode
62
+ }
63
+
64
+ // Expose the common functions globally for reuse
65
+ window.OverlayScrollbarsManager = {
66
+ applyOverlayScrollbars: applyOverlayScrollbars,
67
+ getScrollbarTheme: getScrollbarTheme
68
+ };
69
+ })();
70
+ """
71
+ )
72
+
73
+ static_elements_scrollbars = Script(
74
+ """
75
+ (function () {
76
+ const { applyOverlayScrollbars, getScrollbarTheme } = OverlayScrollbarsManager;
77
+
78
+ function applyScrollbarsToStaticElements() {
79
+ const mainElement = document.querySelector('main');
80
+ const chatMessagesElement = document.querySelector('#chat-messages');
81
+
82
+ const scrollbarTheme = getScrollbarTheme();
83
+
84
+ if (mainElement) {
85
+ applyOverlayScrollbars(mainElement, scrollbarTheme);
86
+ }
87
+
88
+ if (chatMessagesElement) {
89
+ applyOverlayScrollbars(chatMessagesElement, scrollbarTheme);
90
+ }
91
+ }
92
+
93
+ // Apply the scrollbars on page load
94
+ applyScrollbarsToStaticElements();
95
+
96
+ // Observe changes in the 'dark' class on the <html> element to adjust the theme dynamically
97
+ const observer = new MutationObserver(applyScrollbarsToStaticElements);
98
+ observer.observe(document.documentElement, { attributes: true, attributeFilter: ['class'] });
99
+ })();
100
+ """
101
+ )
102
+
103
+
104
+ def Logo():
105
+ return Div(
106
+ Img(
107
+ src="https://assets.vespa.ai/logos/vespa-logo-black.svg",
108
+ alt="Vespa Logo",
109
+ cls="h-full dark:hidden",
110
+ ),
111
+ Img(
112
+ src="https://assets.vespa.ai/logos/vespa-logo-white.svg",
113
+ alt="Vespa Logo Dark Mode",
114
+ cls="h-full hidden dark:block",
115
+ ),
116
+ cls="h-[27px]",
117
+ )
118
+
119
+
120
+ def ThemeToggle(variant="ghost", cls=None, **kwargs):
121
+ return Button(
122
+ Lucide("sun", cls="dark:flex hidden"),
123
+ Lucide("moon", cls="dark:hidden"),
124
+ variant=variant,
125
+ size="icon",
126
+ cls=f"theme-toggle {cls}",
127
+ **kwargs,
128
+ )
129
+
130
+
131
+ def Links():
132
+ return Nav(
133
+ A(
134
+ Button("About this demo?", variant="link"),
135
+ href="/about-this-demo",
136
+ ),
137
+ Separator(orientation="vertical"),
138
+ A(
139
+ Button(Lucide(icon="github"), size="icon", variant="ghost"),
140
+ href="https://github.com/vespa-engine/vespa",
141
+ target="_blank",
142
+ ),
143
+ A(
144
+ Button(Lucide(icon="slack"), size="icon", variant="ghost"),
145
+ href="https://slack.vespa.ai",
146
+ target="_blank",
147
+ ),
148
+ Separator(orientation="vertical"),
149
+ ThemeToggle(),
150
+ cls="flex items-center space-x-2",
151
+ )
152
+
153
+
154
+ def Layout(*c, is_home=False, **kwargs):
155
+ return (
156
+ Title("Visual Retrieval ColPali"),
157
+ Body(
158
+ Header(
159
+ A(Logo(), href="/"),
160
+ Links(),
161
+ cls="min-h-[55px] h-[55px] w-full flex items-center justify-between px-4",
162
+ ),
163
+ *c,
164
+ **kwargs,
165
+ data_is_home=str(is_home).lower(),
166
+ cls="grid grid-rows-[minmax(0,55px)_minmax(0,1fr)] min-h-0",
167
+ ),
168
+ layout_script,
169
+ overlay_scrollbars_manager,
170
+ static_elements_scrollbars,
171
+ )
hello.py ADDED
@@ -0,0 +1,17 @@
1
+ from fasthtml.common import *
2
+ from importlib.util import find_spec
3
+
4
+ # Run find_spec for each module to check that it is available. (Unused imports would otherwise be removed by ruff; this check is temporary and should be removed.)
5
+ for module in ["torch", "einops", "PIL", "vidore_benchmark", "colpali_engine"]:
6
+ spec = find_spec(module)
7
+ assert spec is not None, f"Module {module} not found"
8
+
9
+ app, rt = fast_app()
10
+
11
+
12
+ @rt("/")
13
+ def get():
14
+ return Div(P("Hello World!"), hx_get="/change")
15
+
16
+
17
+ serve()
icons.py ADDED
@@ -0,0 +1 @@
1
+ ICONS = {"chevrons-right": "<path d=\"m6 17 5-5-5-5\"></path><path d=\"m13 17 5-5-5-5\"></path>", "moon": "<path d=\"M12 3a6 6 0 0 0 9 9 9 9 0 1 1-9-9Z\"></path>", "sun": "<circle cx=\"12\" cy=\"12\" r=\"4\"></circle><path d=\"M12 2v2\"></path><path d=\"M12 20v2\"></path><path d=\"m4.93 4.93 1.41 1.41\"></path><path d=\"m17.66 17.66 1.41 1.41\"></path><path d=\"M2 12h2\"></path><path d=\"M20 12h2\"></path><path d=\"m6.34 17.66-1.41 1.41\"></path><path d=\"m19.07 4.93-1.41 1.41\"></path>", "github": "<path d=\"M15 22v-4a4.8 4.8 0 0 0-1-3.5c3 0 6-2 6-5.5.08-1.25-.27-2.48-1-3.5.28-1.15.28-2.35 0-3.5 0 0-1 0-3 1.5-2.64-.5-5.36-.5-8 0C6 2 5 2 5 2c-.3 1.15-.3 2.35 0 3.5A5.403 5.403 0 0 0 4 9c0 3.5 3 5.5 6 5.5-.39.49-.68 1.05-.85 1.65-.17.6-.22 1.23-.15 1.85v4\"></path><path d=\"M9 18c-4.51 2-5-2-7-2\"></path>", "slack": "<rect height=\"8\" rx=\"1.5\" width=\"3\" x=\"13\" y=\"2\"></rect><path d=\"M19 8.5V10h1.5A1.5 1.5 0 1 0 19 8.5\"></path><rect height=\"8\" rx=\"1.5\" width=\"3\" x=\"8\" y=\"14\"></rect><path d=\"M5 15.5V14H3.5A1.5 1.5 0 1 0 5 15.5\"></path><rect height=\"3\" rx=\"1.5\" width=\"8\" x=\"14\" y=\"13\"></rect><path d=\"M15.5 19H14v1.5a1.5 1.5 0 1 0 1.5-1.5\"></path><rect height=\"3\" rx=\"1.5\" width=\"8\" x=\"2\" y=\"8\"></rect><path d=\"M8.5 5H10V3.5A1.5 1.5 0 1 0 8.5 5\"></path>", "settings": "<path d=\"M12.22 2h-.44a2 2 0 0 0-2 2v.18a2 2 0 0 1-1 1.73l-.43.25a2 2 0 0 1-2 0l-.15-.08a2 2 0 0 0-2.73.73l-.22.38a2 2 0 0 0 .73 2.73l.15.1a2 2 0 0 1 1 1.72v.51a2 2 0 0 1-1 1.74l-.15.09a2 2 0 0 0-.73 2.73l.22.38a2 2 0 0 0 2.73.73l.15-.08a2 2 0 0 1 2 0l.43.25a2 2 0 0 1 1 1.73V20a2 2 0 0 0 2 2h.44a2 2 0 0 0 2-2v-.18a2 2 0 0 1 1-1.73l.43-.25a2 2 0 0 1 2 0l.15.08a2 2 0 0 0 2.73-.73l.22-.39a2 2 0 0 0-.73-2.73l-.15-.08a2 2 0 0 1-1-1.74v-.5a2 2 0 0 1 1-1.74l.15-.09a2 2 0 0 0 .73-2.73l-.22-.38a2 2 0 0 0-2.73-.73l-.15.08a2 2 0 0 1-2 0l-.43-.25a2 2 0 0 1-1-1.73V4a2 2 0 0 0-2-2z\"></path><circle cx=\"12\" cy=\"12\" r=\"3\"></circle>", "arrow-right": "<path d=\"M5 12h14\"></path><path d=\"m12 5 7 7-7 7\"></path>", "search": "<circle cx=\"11\" cy=\"11\" r=\"8\"></circle><path d=\"m21 21-4.3-4.3\"></path>", "file-search": "<path d=\"M14 2v4a2 2 0 0 0 2 2h4\"></path><path d=\"M4.268 21a2 2 0 0 0 1.727 1H18a2 2 0 0 0 2-2V7l-5-5H6a2 2 0 0 0-2 2v3\"></path><path d=\"m9 18-1.5-1.5\"></path><circle cx=\"5\" cy=\"14\" r=\"3\"></circle>", "message-circle-question": "<path d=\"M7.9 20A9 9 0 1 0 4 16.1L2 22Z\"></path><path d=\"M9.09 9a3 3 0 0 1 5.83 1c0 2-3 3-3 3\"></path><path d=\"M12 17h.01\"></path>", "text-search": "<path d=\"M21 6H3\"></path><path d=\"M10 12H3\"></path><path d=\"M10 18H3\"></path><circle cx=\"17\" cy=\"15\" r=\"3\"></circle><path d=\"m21 19-1.9-1.9\"></path>", "maximize": "<path d=\"M8 3H5a2 2 0 0 0-2 2v3\"></path><path d=\"M21 8V5a2 2 0 0 0-2-2h-3\"></path><path d=\"M3 16v3a2 2 0 0 0 2 2h3\"></path><path d=\"M16 21h3a2 2 0 0 0 2-2v-3\"></path>", "expand": "<path d=\"m21 21-6-6m6 6v-4.8m0 4.8h-4.8\"></path><path d=\"M3 16.2V21m0 0h4.8M3 21l6-6\"></path><path d=\"M21 7.8V3m0 0h-4.8M21 3l-6 6\"></path><path d=\"M3 7.8V3m0 0h4.8M3 3l6 6\"></path>", "fullscreen": "<path d=\"M3 7V5a2 2 0 0 1 2-2h2\"></path><path d=\"M17 3h2a2 2 0 0 1 2 2v2\"></path><path d=\"M21 17v2a2 2 0 0 1-2 2h-2\"></path><path d=\"M7 21H5a2 2 0 0 1-2-2v-2\"></path><rect height=\"8\" rx=\"1\" width=\"10\" x=\"7\" y=\"8\"></rect>", "images": "<path d=\"M18 22H4a2 2 0 0 1-2-2V6\"></path><path d=\"m22 13-1.296-1.296a2.41 2.41 0 0 0-3.408 0L11 18\"></path><circle cx=\"12\" cy=\"8\" r=\"2\"></circle><rect height=\"16\" 
rx=\"2\" width=\"16\" x=\"6\" y=\"2\"></rect>", "circle": "<circle cx=\"12\" cy=\"12\" r=\"10\"></circle>", "loader-circle": "<path d=\"M21 12a9 9 0 1 1-6.219-8.56\"></path>", "file-text": "<path d=\"M15 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7Z\"></path><path d=\"M14 2v4a2 2 0 0 0 2 2h4\"></path><path d=\"M10 9H8\"></path><path d=\"M16 13H8\"></path><path d=\"M16 17H8\"></path>", "file-question": "<path d=\"M12 17h.01\"></path><path d=\"M15 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7z\"></path><path d=\"M9.1 9a3 3 0 0 1 5.82 1c0 2-3 3-3 3\"></path>", "external-link": "<path d=\"M15 3h6v6\"></path><path d=\"M10 14 21 3\"></path><path d=\"M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6\"></path>", "linkedin": "<path d=\"M16 8a6 6 0 0 1 6 6v7h-4v-7a2 2 0 0 0-2-2 2 2 0 0 0-2 2v7h-4v-7a6 6 0 0 1 6-6z\"></path><rect height=\"12\" width=\"4\" x=\"2\" y=\"9\"></rect><circle cx=\"4\" cy=\"4\" r=\"2\"></circle>"}
main.py ADDED
@@ -0,0 +1,420 @@
1
+ import asyncio
2
+ import base64
3
+ import os
4
+ import time
5
+ import uuid
6
+ import logging
7
+ import sys
8
+ from concurrent.futures import ThreadPoolExecutor
9
+ from pathlib import Path
10
+
11
+ import google.generativeai as genai
12
+ from fastcore.parallel import threaded
13
+ from fasthtml.common import (
14
+ Aside,
15
+ Div,
16
+ FileResponse,
17
+ HighlightJS,
18
+ Img,
19
+ JSONResponse,
20
+ Link,
21
+ Main,
22
+ P,
23
+ RedirectResponse,
24
+ Script,
25
+ StreamingResponse,
26
+ fast_app,
27
+ serve,
28
+ )
29
+ from PIL import Image
30
+ from shad4fast import ShadHead
31
+ from vespa.application import Vespa
32
+
33
+ from backend.colpali import SimMapGenerator
34
+ from backend.vespa_app import VespaQueryClient
35
+ from frontend.app import (
36
+ AboutThisDemo,
37
+ ChatResult,
38
+ Home,
39
+ Search,
40
+ SearchBox,
41
+ SearchResult,
42
+ SimMapButtonPoll,
43
+ SimMapButtonReady,
44
+ )
45
+ from frontend.layout import Layout
46
+
47
+ highlight_js_theme_link = Link(id="highlight-theme", rel="stylesheet", href="")
48
+ highlight_js_theme = Script(src="/static/js/highlightjs-theme.js")
49
+ highlight_js = HighlightJS(
50
+ langs=["python", "javascript", "java", "json", "xml"],
51
+ dark="github-dark",
52
+ light="github",
53
+ )
54
+
55
+ overlayscrollbars_link = Link(
56
+ rel="stylesheet",
57
+ href="https://cdnjs.cloudflare.com/ajax/libs/overlayscrollbars/2.10.0/styles/overlayscrollbars.min.css",
58
+ type="text/css",
59
+ )
60
+ overlayscrollbars_js = Script(
61
+ src="https://cdnjs.cloudflare.com/ajax/libs/overlayscrollbars/2.10.0/browser/overlayscrollbars.browser.es5.min.js"
62
+ )
63
+ awesomplete_link = Link(
64
+ rel="stylesheet",
65
+ href="https://cdnjs.cloudflare.com/ajax/libs/awesomplete/1.1.7/awesomplete.min.css",
66
+ type="text/css",
67
+ )
68
+ awesomplete_js = Script(
69
+ src="https://cdnjs.cloudflare.com/ajax/libs/awesomplete/1.1.7/awesomplete.min.js"
70
+ )
71
+ sselink = Script(src="https://unpkg.com/[email protected]/sse.js")
72
+
73
+ # Get log level from environment variable, default to INFO
74
+ LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
75
+ # Configure logger
76
+ logger = logging.getLogger("vespa_app")
77
+ handler = logging.StreamHandler(sys.stdout)
78
+ handler.setFormatter(
79
+ logging.Formatter(
80
+ "%(levelname)s: \t %(asctime)s \t %(message)s",
81
+ datefmt="%Y-%m-%d %H:%M:%S",
82
+ )
83
+ )
84
+ logger.addHandler(handler)
85
+ logger.setLevel(getattr(logging, LOG_LEVEL))
86
+
87
+ app, rt = fast_app(
88
+ htmlkw={"cls": "grid h-full"},
89
+ pico=False,
90
+ hdrs=(
91
+ highlight_js,
92
+ highlight_js_theme_link,
93
+ highlight_js_theme,
94
+ overlayscrollbars_link,
95
+ overlayscrollbars_js,
96
+ awesomplete_link,
97
+ awesomplete_js,
98
+ sselink,
99
+ ShadHead(tw_cdn=False, theme_handle=True),
100
+ ),
101
+ )
102
+ vespa_app: Vespa = VespaQueryClient(logger=logger)
103
+ thread_pool = ThreadPoolExecutor()
104
+ # Gemini config
105
+
106
+ genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
107
+ GEMINI_SYSTEM_PROMPT = """If the user query is a question, try your best to answer it based on the provided images.
108
+ If the user query can not be interpreted as a question, or if the answer to the query can not be inferred from the images,
109
+ answer with the exact phrase "I am sorry, I can't find enough relevant information on these pages to answer your question.".
110
+ Your response should be HTML formatted, but only simple tags such as <b>, <p>, <i>, <br>, <ul> and <li> are allowed. No HTML tables.
111
+ This means that newlines will be replaced with <br> tags, bold text will be enclosed in <b> tags, and so on.
112
+ Do NOT include backticks (`) in your response. Only simple HTML tags and text.
113
+ """
114
+ gemini_model = genai.GenerativeModel(
115
+ "gemini-2.0-flash", system_instruction=GEMINI_SYSTEM_PROMPT
116
+ )
117
+ STATIC_DIR = Path("static")
118
+ IMG_DIR = STATIC_DIR / "full_images"
119
+ SIM_MAP_DIR = STATIC_DIR / "sim_maps"
120
+ os.makedirs(IMG_DIR, exist_ok=True)
121
+ os.makedirs(SIM_MAP_DIR, exist_ok=True)
122
+
123
+
124
+ @app.on_event("startup")
125
+ def load_model_on_startup():
126
+ app.sim_map_generator = SimMapGenerator(logger=logger)
127
+ return
128
+
129
+
130
+ @app.on_event("startup")
131
+ async def keepalive():
132
+ asyncio.create_task(poll_vespa_keepalive())
133
+ return
134
+
135
+
136
+ def generate_query_id(query, ranking_value):
137
+ hash_input = (query + ranking_value).encode("utf-8")
138
+ return hash(hash_input)
139
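
A small caveat with `generate_query_id`: Python's built-in `hash()` is salted per process for `str`/`bytes` (hash randomization), so the id is only stable within a single running server. That is sufficient for naming the temporary sim-map files below, but if ids ever needed to survive restarts, a deterministic digest would be the safer choice. A sketch (hypothetical, not the app's code):

```python
import hashlib

def generate_stable_query_id(query: str, ranking_value: str) -> str:
    # md5 gives the same id across processes, unlike the salted built-in hash()
    return hashlib.md5((query + ranking_value).encode("utf-8")).hexdigest()
```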
+
140
+
141
+ @rt("/static/{filepath:path}")
142
+ def serve_static(filepath: str):
143
+ return FileResponse(STATIC_DIR / filepath)
144
+
145
+
146
+ @rt("/")
147
+ def get(session):
148
+ if "session_id" not in session:
149
+ session["session_id"] = str(uuid.uuid4())
150
+ return Layout(Main(Home()), is_home=True)
151
+
152
+
153
+ @rt("/about-this-demo")
154
+ def get():
155
+ return Layout(Main(AboutThisDemo()))
156
+
157
+
158
+ @rt("/search")
159
+ def get(request, query: str = "", ranking: str = "hybrid"):
160
+ logger.info(f"/search: Fetching results for query: {query}, ranking: {ranking}")
161
+
162
+ # Always render the SearchBox first
163
+ if not query:
164
+ # Show SearchBox and a message for missing query
165
+ return Layout(
166
+ Main(
167
+ Div(
168
+ SearchBox(query_value=query, ranking_value=ranking),
169
+ Div(
170
+ P(
171
+ "No query provided. Please enter a query.",
172
+ cls="text-center text-muted-foreground",
173
+ ),
174
+ cls="p-10",
175
+ ),
176
+ cls="grid",
177
+ )
178
+ )
179
+ )
180
+ # Generate a unique query_id based on the query and ranking value
181
+ query_id = generate_query_id(query, ranking)
182
+ # Show the loading message if a query is provided
183
+ return Layout(
184
+ Main(Search(request), data_overlayscrollbars_initialize=True, cls="border-t"),
185
+ Aside(
186
+ ChatResult(query_id=query_id, query=query),
187
+ cls="border-t border-l hidden md:block",
188
+ ),
189
+ ) # Show SearchBox and Loading message initially
190
+
191
+
192
+ @rt("/fetch_results")
193
+ async def get(session, request, query: str, ranking: str):
194
+ if "hx-request" not in request.headers:
195
+ return RedirectResponse("/search")
196
+
197
+ # Get the hash of the query and ranking value
198
+ query_id = generate_query_id(query, ranking)
199
+ logger.info(f"Query id in /fetch_results: {query_id}")
200
+ # Run the embedding and query against Vespa app
201
+ start_inference = time.perf_counter()
202
+ q_embs, idx_to_token = app.sim_map_generator.get_query_embeddings_and_token_map(
203
+ query
204
+ )
205
+ end_inference = time.perf_counter()
206
+ logger.info(
207
+ f"Inference time for query_id: {query_id} \t {end_inference - start_inference:.2f} seconds"
208
+ )
209
+
210
+ start = time.perf_counter()
211
+ # Fetch real search results from Vespa
212
+ result = await vespa_app.get_result_from_query(
213
+ query=query,
214
+ q_embs=q_embs,
215
+ ranking=ranking,
216
+ idx_to_token=idx_to_token,
217
+ )
218
+ end = time.perf_counter()
219
+ logger.info(
220
+ f"Search results fetched in {end - start:.2f} seconds. Vespa search time: {result['timing']['searchtime']}"
221
+ )
222
+ search_time = result["timing"]["searchtime"]
223
+ # Safely get total_count with a default of 0
224
+ total_count = result.get("root", {}).get("fields", {}).get("totalCount", 0)
225
+
226
+ search_results = vespa_app.results_to_search_results(result, idx_to_token)
227
+
228
+ get_and_store_sim_maps(
229
+ query_id=query_id,
230
+ query=query,
231
+ q_embs=q_embs,
232
+ ranking=ranking,
233
+ idx_to_token=idx_to_token,
234
+ doc_ids=[result["fields"]["id"] for result in search_results],
235
+ )
236
+ return SearchResult(search_results, query, query_id, search_time, total_count)
237
+
238
+
239
+ def get_results_children(result):
240
+ search_results = (
241
+ result["root"]["children"]
242
+ if "root" in result and "children" in result["root"]
243
+ else []
244
+ )
245
+ return search_results
246
+
247
+
248
+ async def poll_vespa_keepalive():
249
+ while True:
250
+ await asyncio.sleep(5)
251
+ await vespa_app.keepalive()
252
+ logger.debug(f"Vespa keepalive: {time.time()}")
253
+
254
+
255
+ @threaded
256
+ def get_and_store_sim_maps(
257
+ query_id, query: str, q_embs, ranking, idx_to_token, doc_ids
258
+ ):
259
+ ranking_sim = ranking + "_sim"
260
+ vespa_sim_maps = vespa_app.get_sim_maps_from_query(
261
+ query=query,
262
+ q_embs=q_embs,
263
+ ranking=ranking_sim,
264
+ idx_to_token=idx_to_token,
265
+ )
266
+ img_paths = [IMG_DIR / f"{doc_id}.jpg" for doc_id in doc_ids]
267
+ # The full images should already be downloaded, but wait up to 5 seconds to be safe
268
+ max_wait = 5
269
+ start_time = time.time()
270
+ while (
271
+ not all([os.path.exists(img_path) for img_path in img_paths])
272
+ and time.time() - start_time < max_wait
273
+ ):
274
+ time.sleep(0.2)
275
+ if not all([os.path.exists(img_path) for img_path in img_paths]):
276
+ logger.warning(f"Images not ready in 5 seconds for query_id: {query_id}")
277
+ return False
278
+ sim_map_generator = app.sim_map_generator.gen_similarity_maps(
279
+ query=query,
280
+ query_embs=q_embs,
281
+ token_idx_map=idx_to_token,
282
+ images=img_paths,
283
+ vespa_sim_maps=vespa_sim_maps,
284
+ )
285
+ for idx, token, token_idx, blended_img_base64 in sim_map_generator:
286
+ with open(SIM_MAP_DIR / f"{query_id}_{idx}_{token_idx}.png", "wb") as f:
287
+ f.write(base64.b64decode(blended_img_base64))
288
+ logger.debug(
289
+ f"Sim map saved to disk for query_id: {query_id}, idx: {idx}, token: {token}"
290
+ )
291
+ return True
292
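
Note that the writer above and the `/get_sim_map` endpoint below communicate only through a filename convention, `{query_id}_{idx}_{token_idx}.png` under `SIM_MAP_DIR`. A tiny hypothetical helper (not present in the code) would make the shared contract explicit:

```python
from pathlib import Path

def sim_map_file(query_id, idx: int, token_idx: int) -> Path:
    # Must match both the writer in get_and_store_sim_maps and the reader in /get_sim_map
    return SIM_MAP_DIR / f"{query_id}_{idx}_{token_idx}.png"
```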
+
293
+
294
+ @app.get("/get_sim_map")
295
+ async def get_sim_map(query_id: str, idx: int, token: str, token_idx: int):
296
+ """
297
+ Endpoint that each sim map button polls to fetch its similarity map image
298
+ once it is ready. If the image is not ready yet, it returns a SimMapButtonPoll,
299
+ which keeps polling every 0.5 seconds.
300
+ """
301
+ sim_map_path = SIM_MAP_DIR / f"{query_id}_{idx}_{token_idx}.png"
302
+ if not os.path.exists(sim_map_path):
303
+ logger.debug(
304
+ f"Sim map not ready for query_id: {query_id}, idx: {idx}, token: {token}"
305
+ )
306
+ return SimMapButtonPoll(
307
+ query_id=query_id, idx=idx, token=token, token_idx=token_idx
308
+ )
309
+ else:
310
+ return SimMapButtonReady(
311
+ query_id=query_id,
312
+ idx=idx,
313
+ token=token,
314
+ token_idx=token_idx,
315
+ img_src=sim_map_path,
316
+ )
317
+
318
+
319
+ @app.get("/full_image")
320
+ async def full_image(doc_id: str):
321
+ """
322
+ Endpoint to get the full quality image for a given result id.
323
+ """
324
+ img_path = IMG_DIR / f"{doc_id}.jpg"
325
+ if not os.path.exists(img_path):
326
+ image_data = await vespa_app.get_full_image_from_vespa(doc_id)
327
+ # image data is base 64 encoded string. Save it to disk as jpg.
328
+ with open(img_path, "wb") as f:
329
+ f.write(base64.b64decode(image_data))
330
+ logger.debug(f"Full image saved to disk for doc_id: {doc_id}")
331
+ else:
332
+ with open(img_path, "rb") as f:
333
+ image_data = base64.b64encode(f.read()).decode("utf-8")
334
+ return Img(
335
+ src=f"data:image/jpeg;base64,{image_data}",
336
+ alt="something",
337
+ cls="result-image w-full h-full object-contain",
338
+ )
339
+
340
+
341
+ @rt("/suggestions")
342
+ async def get_suggestions(query: str = ""):
343
+ """Endpoint to get suggestions as user types in the search box"""
344
+ query = query.lower().strip()
345
+
346
+ if query:
347
+ suggestions = await vespa_app.get_suggestions(query)
348
+ if len(suggestions) > 0:
349
+ return JSONResponse({"suggestions": suggestions})
350
+
351
+ return JSONResponse({"suggestions": []})
352
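
The `{"suggestions": [...]}` shape returned here is exactly what the Awesomplete hook in `frontend/app.py` reads from `data.suggestions`. As a usage sketch (assuming the app is running locally on port 7860; the suggestion strings depend on the fed data):

```python
import httpx

resp = httpx.get("http://localhost:7860/suggestions", params={"query": "gender"})
print(resp.json())  # e.g. {"suggestions": ["gender balance at level 4 ...", ...]}
```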
+
353
+
354
+ async def message_generator(query_id: str, query: str, doc_ids: list):
355
+ """Generator function to yield SSE messages for chat response"""
356
+ images = []
357
+ num_images = 3 # Number of images before firing chat request
358
+ max_wait = 10 # seconds
359
+ start_time = time.time()
360
+ # Check if full images are ready on disk
361
+ while (
362
+ len(images) < min(num_images, len(doc_ids))
363
+ and time.time() - start_time < max_wait
364
+ ):
365
+ images = []
366
+ for idx in range(min(num_images, len(doc_ids))):
367
+ image_filename = IMG_DIR / f"{doc_ids[idx]}.jpg"
368
+ if not os.path.exists(image_filename):
369
+ logger.debug(
370
+ f"Message generator: Full image not ready for query_id: {query_id}, idx: {idx}"
371
+ )
372
+ continue
373
+ else:
374
+ logger.debug(
375
+ f"Message generator: image ready for query_id: {query_id}, idx: {idx}"
376
+ )
377
+ images.append(Image.open(image_filename))
378
+ if len(images) < num_images:
379
+ await asyncio.sleep(0.2)
380
+
381
+ # yield message with number of images ready
382
+ yield f"event: message\ndata: Generating response based on {len(images)} images...\n\n"
383
+ if not images:
384
+ yield "event: message\ndata: Failed to send images to Gemini 2.0!\n\n"
385
+ yield "event: close\ndata: \n\n"
386
+ return
387
+
388
+ # Newlines inside an SSE data payload would end the message early, so replace them with <br> tags.
389
+ def replace_newline_with_br(text):
390
+ return text.replace("\n", "<br>")
391
+
392
+ response_text = ""
393
+ async for chunk in await gemini_model.generate_content_async(
394
+ images + ["\n\n Query: ", query], stream=True
395
+ ):
396
+ if chunk.text:
397
+ response_text += chunk.text
398
+ response_text = replace_newline_with_br(response_text)
399
+ yield f"event: message\ndata: {response_text}\n\n"
400
+ await asyncio.sleep(0.1)
401
+ yield "event: close\ndata: \n\n"
402
+
403
+
404
+ @app.get("/get-message")
405
+ async def get_message(query_id: str, query: str, doc_ids: str):
406
+ return StreamingResponse(
407
+ message_generator(query_id=query_id, query=query, doc_ids=doc_ids.split(",")),
408
+ media_type="text/event-stream",
409
+ )
410
+
411
+
412
+ @rt("/app")
413
+ def get():
414
+ return Layout(Main(Div(P(f"Connected to Vespa at {vespa_app.url}"), cls="p-4")))
415
+
416
+
417
+ if __name__ == "__main__":
418
+ HOT_RELOAD = os.getenv("HOT_RELOAD", "False").lower() == "true"
419
+ logger.info(f"Starting app with hot reload: {HOT_RELOAD}")
420
+ serve(port=7860, reload=HOT_RELOAD)
prepare_feed_deploy.py ADDED
@@ -0,0 +1,956 @@
1
+ # # Visual PDF Retrieval - demo application
2
+ #
3
+ # In this notebook, we will prepare the Vespa backend application for our visual retrieval demo.
4
+ # We will use ColPali as the model to extract patch vectors from images of pdf pages.
5
+ # At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.
6
+ #
7
+ # To see the application in action, visit TODO:
8
+ #
9
+ # The web application is written in FastHTML, meaning the complete application is written in python.
10
+ #
11
+ # The steps we will take in this notebook are:
12
+ #
13
+ # 0. Setup and configuration
14
+ # 1. Download the data
15
+ # 2. Prepare the data
16
+ # 3. Generate queries for evaluation and typeahead search suggestions
17
+ # 4. Deploy the Vespa application
18
+ # 5. Create the Vespa application
19
+ # 6. Feed the data to the Vespa application
20
+ #
21
+ # All the steps that are needed to provision the Vespa application, including feeding the data, can be done from this notebook.
22
+ # We have tried to make it easy for others to run this notebook, to create your own PDF Enterprise Search application using Vespa.
23
+ #
24
+
25
+ # ## 0. Setup and Configuration
26
+ #
27
+
28
+ # +
29
+ import os
30
+ import asyncio
31
+ import json
32
+ from typing import Tuple
33
+ import hashlib
34
+ import numpy as np
35
+
36
+ # Vespa
37
+ from vespa.package import (
38
+ ApplicationPackage,
39
+ Field,
40
+ Schema,
41
+ Document,
42
+ HNSW,
43
+ RankProfile,
44
+ Function,
45
+ FieldSet,
46
+ SecondPhaseRanking,
47
+ Summary,
48
+ DocumentSummary,
49
+ )
50
+ from vespa.deployment import VespaCloud
51
+ from vespa.application import Vespa
52
+ from vespa.io import VespaResponse
53
+
54
+ # Google Generative AI
55
+ import google.generativeai as genai
56
+
57
+ # Torch and other ML libraries
58
+ import torch
59
+ from torch.utils.data import DataLoader
60
+ from tqdm import tqdm
61
+ from pdf2image import convert_from_path
62
+ from pypdf import PdfReader
63
+
64
+ # ColPali model and processor
65
+ from colpali_engine.models import ColPali, ColPaliProcessor
66
+ from colpali_engine.utils.torch_utils import get_torch_device
67
+ from vidore_benchmark.utils.image_utils import scale_image, get_base64_image
68
+
69
+ # Other utilities
70
+ from bs4 import BeautifulSoup
71
+ import httpx
72
+ from urllib.parse import urljoin, urlparse
73
+
74
+ # Load environment variables
75
+ from dotenv import load_dotenv
76
+
77
+ load_dotenv()
78
+
79
+ # Avoid warning from huggingface tokenizers
80
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
81
+ # -
82
+
83
+ # ### Create a free trial in Vespa Cloud
84
+ #
85
+ # Create a tenant from [here](https://vespa.ai/free-trial/).
86
+ # The trial includes $300 credit.
87
+ # Take note of your tenant name.
88
+ #
89
+
90
+ VESPA_TENANT_NAME = "vespa-team"
91
+
92
+ # Here, set your desired application name. (Will be created in later steps)
93
+ # Note that you can not have hyphen `-` or underscore `_` in the application name.
94
+ #
95
+
96
+ VESPA_APPLICATION_NAME = "colpalidemo"
97
+ VESPA_SCHEMA_NAME = "pdf_page"
98
+
99
+ # Next, you need to create some tokens for feeding data, and querying the application.
100
+ # We recommend separate tokens for feeding and querying, (the former with write permission, and the latter with read permission).
101
+ # The tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section.
102
+ #
103
+
104
+ VESPA_TOKEN_ID_WRITE = "colpalidemo_write"
105
+
106
+ # We also need to set the value of the write token to be able to feed data to the Vespa application.
107
+ #
108
+
109
+ VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
110
+ "Enter Vespa cloud secret token: "
111
+ )
112
+
113
+ # We will also use the Gemini API to create sample queries for our images.
114
+ # You can also use other VLMs to create these queries.
115
+ # Create a Gemini API key from [here](https://aistudio.google.com/app/apikey).
116
+ #
117
+
118
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or input(
119
+ "Enter Google Generative AI API key: "
120
+ )
121
+
122
+ # +
123
+ MODEL_NAME = "vidore/colpali-v1.2"
124
+
125
+ # Configure Google Generative AI
126
+ genai.configure(api_key=GEMINI_API_KEY)
127
+
128
+ # Set device for Torch
129
+ device = get_torch_device("auto")
130
+ print(f"Using device: {device}")
131
+
132
+ # Load the ColPali model and processor
133
+ model = ColPali.from_pretrained(
134
+ MODEL_NAME,
135
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
136
+ device_map=device,
137
+ ).eval()
138
+
139
+ processor = ColPaliProcessor.from_pretrained(MODEL_NAME)
140
+ # -
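+
+ # As a quick sanity check of the loaded model and processor, we can embed a short test query.
+ # This is purely illustrative: the query string below is arbitrary, and the resulting embeddings are not used further in this notebook.
+ # Each query token gets its own 128-dimensional embedding, mirroring the per-patch embeddings we will generate for the page images later.
+ #
+
+ # +
+ with torch.no_grad():
+     test_batch = processor.process_queries(["oil fund responsible investment policy"])
+     test_batch = {k: v.to(model.device) for k, v in test_batch.items()}
+     test_query_embeddings = model(**test_batch)
+ print(test_query_embeddings.shape)  # (1, number of query tokens, 128)
+ # -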
141
+
142
+ # ## 1. Download PDFs
143
+ #
144
+ # We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund).
145
+ # The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance.
146
+ #
147
+ # These reports are the ones we are going to use for this showcase.
148
+ # Here are some sample images:
149
+ #
150
+ # ![Sample1](./static/img/gfpg-sample-1.png)
151
+ # ![Sample2](./static/img/gfpg-sample-2.png)
152
+ #
153
+
154
+ # As we can see, a lot of the information is in the form of tables, charts and numbers.
155
+ # These are not easily extracted using PDF readers or OCR tools.
156
+ #
157
+
158
+ # +
159
+ import requests
160
+
161
+ url = "https://www.nbim.no/en/publications/reports/"
162
+ response = requests.get(url)
163
+ response.raise_for_status()
164
+ html_content = response.text
165
+
166
+ # Parse with BeautifulSoup
167
+ soup = BeautifulSoup(html_content, "html.parser")
168
+
169
+ links = []
170
+ url_to_year = {}
171
+
172
+ # Find all 'div's with id starting with 'year-'
173
+ for year_div in soup.find_all("div", id=lambda x: x and x.startswith("year-")):
174
+ year_id = year_div.get("id", "")
175
+ year = year_id.replace("year-", "")
176
+
177
+ # Within this div, find all 'a' elements with the specific classes
178
+ for a_tag in year_div.select("a.button.button--download-secondary[href]"):
179
+ href = a_tag["href"]
180
+ full_url = urljoin(url, href)
181
+ links.append(full_url)
182
+ url_to_year[full_url] = year
183
+ links, url_to_year
184
+ # -
185
+
186
+ # Limit the number of PDFs to download
187
+ NUM_PDFS = 2 # Set to None to download all PDFs
188
+ links = links[:NUM_PDFS] if NUM_PDFS else links
189
+ links
190
+
191
+ # +
192
+ from nest_asyncio import apply
193
+ from typing import List
194
+
195
+ apply()
196
+
197
+ max_attempts = 3
198
+
199
+
200
+ async def download_pdf(session, url, filename):
201
+ attempt = 0
202
+ while attempt < max_attempts:
203
+ try:
204
+ response = await session.get(url)
205
+ response.raise_for_status()
206
+
207
+ # Use Content-Disposition header to get the filename if available
208
+ content_disposition = response.headers.get("Content-Disposition")
209
+ if content_disposition:
210
+ import re
211
+
212
+ fname = re.findall('filename="(.+)"', content_disposition)
213
+ if fname:
214
+ filename = fname[0]
215
+
216
+ # Ensure the filename is safe to use on the filesystem
217
+ safe_filename = filename.replace("/", "_").replace("\\", "_")
218
+ if not safe_filename or safe_filename == "_":
219
+ print(f"Invalid filename: {filename}")
220
+ return None  # An invalid filename will not improve on retry, so give up instead of looping
221
+
222
+ filepath = os.path.join("pdfs", safe_filename)
223
+ with open(filepath, "wb") as f:
224
+ f.write(response.content)
225
+ print(f"Downloaded {safe_filename}")
226
+ return filepath
227
+ except Exception as e:
228
+ print(f"Error downloading {filename}: {e}")
229
+ print(f"Retrying ({attempt})...")
230
+ await asyncio.sleep(1) # Wait a bit before retrying
231
+ attempt += 1
232
+ return None
233
+
234
+
235
+ async def download_pdfs(links: List[str]) -> List[dict]:
236
+ """Download PDFs from a list of URLs. Add the filename to the dictionary."""
237
+ async with httpx.AsyncClient() as client:
238
+ tasks = []
239
+
240
+ for idx, link in enumerate(links):
241
+ # Try to get the filename from the URL
242
+ path = urlparse(link).path
243
+ filename = os.path.basename(path)
244
+
245
+ # If the filename is empty, skip this link
246
+ if not filename:
247
+ continue
248
+ tasks.append(download_pdf(client, link, filename))
249
+
250
+ # Run the tasks concurrently
251
+ paths = await asyncio.gather(*tasks)
252
+ pdf_files = [
253
+ {"url": link, "path": path} for link, path in zip(links, paths) if path
254
+ ]
255
+ return pdf_files
256
+
257
+
258
+ # Create the pdfs directory if it doesn't exist
259
+ os.makedirs("pdfs", exist_ok=True)
260
+ # Now run the download_pdfs function with the URL
261
+ pdfs = asyncio.run(download_pdfs(links))
262
+ # -
263
+
264
+ pdfs
265
+
266
+ # ## 2. Convert PDFs to Images
267
+ #
268
+
269
+
270
+ # +
271
+ def get_pdf_images(pdf_path):
272
+ reader = PdfReader(pdf_path)
273
+ page_texts = []
274
+ for page_number in range(len(reader.pages)):
275
+ page = reader.pages[page_number]
276
+ text = page.extract_text()
277
+ page_texts.append(text)
278
+ images = convert_from_path(pdf_path)
279
+ # convert_from_path returns one PIL image per page; check that the counts match
280
+ assert len(images) == len(page_texts)
281
+ return images, page_texts
282
+
283
+
284
+ pdf_folder = "pdfs"
285
+ pdf_pages = []
286
+ for pdf in tqdm(pdfs):
287
+ pdf_file = pdf["path"]
288
+ title = os.path.splitext(os.path.basename(pdf_file))[0]
289
+ images, texts = get_pdf_images(pdf_file)
290
+ for page_no, (image, text) in enumerate(zip(images, texts)):
291
+ pdf_pages.append(
292
+ {
293
+ "title": title,
294
+ "year": int(url_to_year[pdf["url"]]),
295
+ "url": pdf["url"],
296
+ "path": pdf_file,
297
+ "image": image,
298
+ "text": text,
299
+ "page_no": page_no,
300
+ }
301
+ )
302
+ # -
303
+
304
+ len(pdf_pages)
305
+
306
+ # +
307
+ from collections import Counter
308
+
309
+ # Print statistics for the extracted text lengths: mean, max, min, median, and the number of empty pages
310
+ text_lengths = [len(page["text"]) for page in pdf_pages]
311
+ print(f"Mean text length: {np.mean(text_lengths)}")
312
+ print(f"Max text length: {np.max(text_lengths)}")
313
+ print(f"Min text length: {np.min(text_lengths)}")
314
+ print(f"Median text length: {np.median(text_lengths)}")
315
+ print(f"Number of text with length == 0: {Counter(text_lengths)[0]}")
316
+ # -
317
+
318
+ # ## 3. Generate Queries
319
+ #
320
+ # In this step, we want to generate queries for each page image.
321
+ # These will be useful for 2 reasons:
322
+ #
323
+ # 1. We can use these queries as typeahead suggestions in the search bar.
324
+ # 2. We can use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic.
325
+ #
326
+ # The prompt for generating queries is taken from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien.
327
+ #
328
+ # We will use the Gemini API to generate these queries, with `gemini-1.5-flash-8b` as the model.
329
+ #
330
+
331
+ # +
332
+ from pydantic import BaseModel
333
+
334
+
335
+ class GeneratedQueries(BaseModel):
336
+ broad_topical_question: str
337
+ broad_topical_query: str
338
+ specific_detail_question: str
339
+ specific_detail_query: str
340
+ visual_element_question: str
341
+ visual_element_query: str
342
+
343
+
344
+ def get_retrieval_prompt() -> Tuple[str, GeneratedQueries]:
345
+ prompt = """You are an investor, stock analyst and financial expert. You will be presented an image of a document page from a report published by the Norwegian Government Pension Fund Global (GPFG). The report may be annual or quarterly reports, or policy reports, on topics such as responsible investment, risk etc.
348
+ Your task is to generate retrieval queries and questions that you would use to retrieve this document (or ask based on this document) in a large corpus.
349
+ Please generate 3 different types of retrieval queries and questions.
350
+ A retrieval query is a keyword based query, made up of 2-5 words, that you would type into a search engine to find this document.
351
+ A question is a natural language question that you would ask, for which the document contains the answer.
352
+ The queries should be of the following types:
353
+ 1. A broad topical query: This should cover the main subject of the document.
354
+ 2. A specific detail query: This should cover a specific detail or aspect of the document.
355
+ 3. A visual element query: This should cover a visual element of the document, such as a chart, graph, or image.
356
+
357
+ Important guidelines:
358
+ - Ensure the queries are relevant for retrieval tasks, not just describing the page content.
359
+ - Use a fact-based natural language style for the questions.
360
+ - Frame the queries as if someone is searching for this document in a large corpus.
361
+ - Make the queries diverse and representative of different search strategies.
362
+
363
+ Format your response as a JSON object with the structure of the following example:
364
+ {
365
+ "broad_topical_question": "What was the Responsible Investment Policy in 2019?",
366
+ "broad_topical_query": "responsible investment policy 2019",
367
+ "specific_detail_question": "What is the percentage of investments in renewable energy?",
368
+ "specific_detail_query": "renewable energy investments percentage",
369
+ "visual_element_question": "What is the trend of total holding value over time?",
370
+ "visual_element_query": "total holding value trend"
371
+ }
372
+
373
+ If there are no relevant visual elements, provide an empty string for the visual element question and query.
374
+ Here is the document image to analyze:
375
+ Generate the queries based on this image and provide the response in the specified JSON format.
376
+ Only return JSON. Don't return any extra explanation text. """
377
+
378
+ return prompt, GeneratedQueries
379
+
380
+
381
+ prompt_text, pydantic_model = get_retrieval_prompt()
382
+
383
+ # +
384
+ gemini_model = genai.GenerativeModel("gemini-1.5-flash-8b")
385
+
386
+
387
+ def generate_queries(image, prompt_text, pydantic_model):
388
+ try:
389
+ response = gemini_model.generate_content(
390
+ [image, "\n\n", prompt_text],
391
+ generation_config=genai.GenerationConfig(
392
+ response_mime_type="application/json",
393
+ response_schema=pydantic_model,
394
+ ),
395
+ )
396
+ queries = json.loads(response.text)
397
+ except Exception as _e:
398
+ queries = {
399
+ "broad_topical_question": "",
400
+ "broad_topical_query": "",
401
+ "specific_detail_question": "",
402
+ "specific_detail_query": "",
403
+ "visual_element_question": "",
404
+ "visual_element_query": "",
405
+ }
406
+ return queries
407
+
408
+
409
+ # -
410
+
411
+ for pdf in tqdm(pdf_pages):
412
+ image = pdf.get("image")
413
+ pdf["queries"] = generate_queries(image, prompt_text, pydantic_model)
414
+
415
+ pdf_pages[46]["image"]
416
+
417
+ pdf_pages[46]["queries"]
418
+
419
+ # +
420
+ # Generate queries async - keeping this for now, as we will probably need it when applying the pipeline to the full dataset
421
+ # import asyncio
422
+ # from tenacity import retry, stop_after_attempt, wait_exponential
423
+ # import google.generativeai as genai
424
+ # from tqdm.asyncio import tqdm_asyncio
425
+
426
+ # max_in_flight = 200 # Maximum number of concurrent requests
427
+
428
+
429
+ # async def generate_queries_for_image_async(model, image, semaphore):
430
+ # @retry(stop=stop_after_attempt(3), wait=wait_exponential(), reraise=True)
431
+ # async def _generate():
432
+ # async with semaphore:
433
+ # result = await model.generate_content_async(
434
+ # [image, "\n\n", prompt_text],
435
+ # generation_config=genai.GenerationConfig(
436
+ # response_mime_type="application/json",
437
+ # response_schema=pydantic_model,
438
+ # ),
439
+ # )
440
+ # return json.loads(result.text)
441
+
442
+ # try:
443
+ # return await _generate()
444
+ # except Exception as e:
445
+ # print(f"Error generating queries for image: {e}")
446
+ # return None # Return None or handle as needed
447
+
448
+
449
+ # async def enrich_pdfs():
450
+ # gemini_model = genai.GenerativeModel("gemini-1.5-flash-8b")
451
+ # semaphore = asyncio.Semaphore(max_in_flight)
452
+ # tasks = []
453
+ # for pdf in pdf_pages:
454
+ # pdf["queries"] = []
455
+ # image = pdf.get("image")
456
+ # if image:
457
+ # task = generate_queries_for_image_async(gemini_model, image, semaphore)
458
+ # tasks.append((pdf, task))
459
+
460
+ # # Run the tasks concurrently using asyncio.gather()
461
+ # for pdf, task in tqdm_asyncio(tasks):
462
+ # result = await task
463
+ # if result:
464
+ # pdf["queries"] = result
465
+ # return pdf_pages
466
+
467
+
468
+ # pdf_pages = asyncio.run(enrich_pdfs())
469
+
470
+ # +
471
+ # Write title, url, page_no, text, and queries (but not the image) to JSON
472
+ with open("output/pdf_pages.json", "w") as f:
473
+ to_write = [{k: v for k, v in pdf.items() if k != "image"} for pdf in pdf_pages]
474
+ json.dump(to_write, f, indent=2)
475
+
476
+ # with open("pdfs/pdf_pages.json", "r") as f:
477
+ # saved_pdf_pages = json.load(f)
478
+ # for pdf, saved_pdf in zip(pdf_pages, saved_pdf_pages):
479
+ # pdf.update(saved_pdf)
480
+ # -
481
+
482
+ # ## 4. Generate Embeddings
483
+ #
484
+ # Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.
485
+ #
486
+
487
+
488
+ def generate_embeddings(images, model, processor, batch_size=2) -> np.ndarray:
489
+ """
490
+ Generate embeddings for a list of images.
491
+ Move to CPU only once per batch.
492
+
493
+ Args:
494
+ images (List[PIL.Image]): List of PIL images.
495
+ model (nn.Module): The model to generate embeddings.
496
+ processor: The processor to preprocess images.
497
+ batch_size (int, optional): Batch size for processing. Defaults to 2.
498
+
499
+ Returns:
500
+ np.ndarray: Embeddings for the images, shape
501
+ (len(images), processor.max_patch_length (1030 for ColPali), model.config.hidden_size (Patch embedding dimension - 128 for ColPali)).
502
+ """
503
+ embeddings_list = []
504
+
505
+ def collate_fn(batch):
506
+ # Batch is a list of images
507
+ return processor.process_images(batch) # Should return a dict of tensors
508
+
509
+ dataloader = DataLoader(
510
+ images,
511
+ batch_size=batch_size,  # actually use the batch_size argument
+ shuffle=False,
512
+ collate_fn=collate_fn,
513
+ )
514
+
515
+ for batch_doc in tqdm(dataloader, desc="Generating embeddings"):
516
+ with torch.no_grad():
517
+ # Move batch to the device
518
+ batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
519
+ embeddings_batch = model(**batch_doc)
520
+ embeddings_list.append(torch.unbind(embeddings_batch.to("cpu").float(), dim=0))  # cast to float32; numpy cannot convert bfloat16 tensors
521
+ # Concatenate all embeddings and create a numpy array
522
+ all_embeddings = np.concatenate(embeddings_list, axis=0)
523
+ return all_embeddings
524
+
525
+
526
+ # Generate embeddings for all images
527
+ images = [pdf["image"] for pdf in pdf_pages]
528
+ embeddings = generate_embeddings(images, model, processor)
529
+
530
+ embeddings.shape
531
+
532
+ # ## 5. Prepare Data in Vespa Format
533
+ #
534
+ # Now that we have all the data we need, all that remains is to make sure it is in the right format for Vespa.
535
+ #
536
+
537
+
538
+ def float_to_binary_embedding(float_query_embedding: dict) -> dict:
539
+ """Utility function to convert float query embeddings to binary query embeddings."""
540
+ binary_query_embeddings = {}
541
+ for k, v in float_query_embedding.items():
542
+ binary_vector = (
543
+ np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
544
+ )
545
+ binary_query_embeddings[k] = binary_vector
546
+ return binary_query_embeddings
547
+
548
+
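+
+ # To make the packing above concrete: each patch embedding is a 128-dimensional float vector.
+ # Thresholding at 0 gives 128 bits, which `np.packbits` packs into 16 bytes stored as int8 - matching the `v[16]` dimension of the embedding field in the schema below.
+ # The example uses a random vector purely for illustration.
+ #
+
+ # +
+ rng = np.random.default_rng(0)
+ float_vector = rng.standard_normal(128)  # stand-in for one ColPali patch embedding
+ packed = np.packbits(np.where(float_vector > 0, 1, 0)).astype(np.int8)
+ print(float_vector.shape, "->", packed.shape)  # (128,) -> (16,)
+ # -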
549
+ vespa_feed = []
550
+ for pdf, embedding in zip(pdf_pages, embeddings):
551
+ url = pdf["url"]
552
+ year = pdf["year"]
553
+ title = pdf["title"]
554
+ image = pdf["image"]
555
+ text = pdf.get("text", "")
556
+ page_no = pdf["page_no"]
557
+ query_dict = pdf["queries"]
558
+ questions = [v for k, v in query_dict.items() if "question" in k and v]
559
+ queries = [v for k, v in query_dict.items() if "query" in k and v]
560
+ base_64_image = get_base64_image(
561
+ scale_image(image, 32), add_url_prefix=False
562
+ ) # Scaled-down image so it can be returned quickly in search results (~1 kB)
563
+ base_64_full_image = get_base64_image(image, add_url_prefix=False)
564
+ embedding_dict = {k: v for k, v in enumerate(embedding)}
565
+ binary_embedding = float_to_binary_embedding(embedding_dict)
566
+ # id_hash should be md5 hash of url and page_number
567
+ id_hash = hashlib.md5(f"{url}_{page_no}".encode()).hexdigest()
568
+ page = {
569
+ "id": id_hash,
570
+ "fields": {
571
+ "id": id_hash,
572
+ "url": url,
573
+ "title": title,
574
+ "year": year,
575
+ "page_number": page_no,
576
+ "blur_image": base_64_image,
577
+ "full_image": base_64_full_image,
578
+ "text": text,
579
+ "embedding": binary_embedding,
580
+ "queries": queries,
581
+ "questions": questions,
582
+ },
583
+ }
584
+ vespa_feed.append(page)
585
+
586
+ # +
587
+ # The feed now contains everything Vespa needs for each page: metadata, base64 images, extracted text, binary embeddings, and the generated queries.
588
+
589
+
590
+ # Save vespa_feed to vespa_feed.json
591
+ os.makedirs("output", exist_ok=True)
592
+ with open("output/vespa_feed.json", "w") as f:
593
+ vespa_feed_to_save = []
594
+ for page in vespa_feed:
595
+ document_id = page["id"]
596
+ put_id = f"id:{VESPA_APPLICATION_NAME}:{VESPA_SCHEMA_NAME}::{document_id}"
597
+ vespa_feed_to_save.append({"put": put_id, "fields": page["fields"]})
598
+ json.dump(vespa_feed_to_save, f)
599
+
600
+ # +
601
+ # import json
602
+
603
+ # with open("output/vespa_feed.json", "r") as f:
604
+ # vespa_feed = json.load(f)
605
+ # -
606
+
607
+ len(vespa_feed)
608
+
609
+ # ## 6. Prepare Vespa Application
610
+ #
611
+
612
+ # +
613
+ # Define the Vespa schema
614
+ colpali_schema = Schema(
615
+ name=VESPA_SCHEMA_NAME,
616
+ document=Document(
617
+ fields=[
618
+ Field(
619
+ name="id",
620
+ type="string",
621
+ indexing=["summary", "index"],
622
+ match=["word"],
623
+ ),
624
+ Field(name="url", type="string", indexing=["summary", "index"]),
625
+ Field(name="year", type="int", indexing=["summary", "attribute"]),
626
+ Field(
627
+ name="title",
628
+ type="string",
629
+ indexing=["summary", "index"],
630
+ match=["text"],
631
+ index="enable-bm25",
632
+ ),
633
+ Field(name="page_number", type="int", indexing=["summary", "attribute"]),
634
+ Field(name="blur_image", type="raw", indexing=["summary"]),
635
+ Field(name="full_image", type="raw", indexing=["summary"]),
636
+ Field(
637
+ name="text",
638
+ type="string",
639
+ indexing=["summary", "index"],
640
+ match=["text"],
641
+ index="enable-bm25",
642
+ ),
643
+ Field(
644
+ name="embedding",
645
+ type="tensor<int8>(patch{}, v[16])",
646
+ indexing=[
647
+ "attribute",
648
+ "index",
649
+ ],
650
+ ann=HNSW(
651
+ distance_metric="hamming",
652
+ max_links_per_node=32,
653
+ neighbors_to_explore_at_insert=400,
654
+ ),
655
+ ),
656
+ Field(
657
+ name="questions",
658
+ type="array<string>",
659
+ indexing=["summary", "attribute"],
660
+ summary=Summary(fields=["matched-elements-only"]),
661
+ ),
662
+ Field(
663
+ name="queries",
664
+ type="array<string>",
665
+ indexing=["summary", "attribute"],
666
+ summary=Summary(fields=["matched-elements-only"]),
667
+ ),
668
+ ]
669
+ ),
670
+ fieldsets=[
671
+ FieldSet(
672
+ name="default",
673
+ fields=["title", "url", "blur_image", "page_number", "text"],
674
+ ),
675
+ FieldSet(
676
+ name="image",
677
+ fields=["full_image"],
678
+ ),
679
+ ],
680
+ document_summaries=[
681
+ DocumentSummary(
682
+ name="default",
683
+ summary_fields=[
684
+ Summary(
685
+ name="text",
686
+ fields=[("bolding", "on")],
687
+ ),
688
+ Summary(
689
+ name="snippet",
690
+ fields=[("source", "text"), "dynamic"],
691
+ ),
692
+ ],
693
+ from_disk=True,
694
+ ),
695
+ DocumentSummary(
696
+ name="suggestions",
697
+ summary_fields=[
698
+ Summary(name="questions"),
699
+ ],
700
+ from_disk=True,
701
+ ),
702
+ ],
703
+ )
704
+
705
+ # Define similarity functions used in all rank profiles
706
+ mapfunctions = [
707
+ Function(
708
+ name="similarities", # computes similarity scores between each query token and image patch
709
+ expression="""
710
+ sum(
711
+ query(qt) * unpack_bits(attribute(embedding)), v
712
+ )
713
+ """,
714
+ ),
715
+ Function(
716
+ name="normalized", # normalizes the similarity scores to [-1, 1]
717
+ expression="""
718
+ (similarities - reduce(similarities, min)) / (reduce((similarities - reduce(similarities, min)), max)) * 2 - 1
719
+ """,
720
+ ),
721
+ Function(
722
+ name="quantized", # quantizes the normalized similarity scores to signed 8-bit integers [-128, 127]
723
+ expression="""
724
+ cell_cast(normalized * 127.999, int8)
725
+ """,
726
+ ),
727
+ ]
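+
+ # To make the rank expressions above easier to follow, here is a small numpy analogue of the `normalized` and `quantized` functions:
+ # min-max normalize a (query token x patch) similarity matrix to [-1, 1] using the global min and max, then quantize to signed 8-bit integers.
+ # The matrix below is random and purely illustrative.
+ #
+
+ # +
+ sim = np.random.default_rng(0).standard_normal((5, 1030))  # (query tokens, patches)
+ shifted = sim - sim.min()                                   # subtract the global minimum
+ normalized_np = shifted / shifted.max() * 2 - 1             # scale to [-1, 1]
+ quantized_np = (normalized_np * 127.999).astype(np.int8)    # analogue of cell_cast(..., int8)
+ print(quantized_np.min(), quantized_np.max())
+ # -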
728
+
729
+ # Define the 'bm25' rank profile
730
+ colpali_bm25_profile = RankProfile(
731
+ name="bm25",
732
+ inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
733
+ first_phase="bm25(title) + bm25(text)",
734
+ functions=mapfunctions,
735
+ )
736
+
737
+
738
+ # A function to create an inherited rank profile which also returns quantized similarity scores
739
+ def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
740
+ return RankProfile(
741
+ name=f"{rank_profile.name}_sim",
742
+ first_phase=rank_profile.first_phase,
743
+ inherits=rank_profile.name,
744
+ summary_features=["quantized"],
745
+ )
746
+
747
+
748
+ colpali_schema.add_rank_profile(colpali_bm25_profile)
749
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_bm25_profile))
750
+
751
+ # Update the 'default' rank profile
752
+ colpali_profile = RankProfile(
753
+ name="default",
754
+ inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
755
+ first_phase="bm25_score",
756
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
757
+ functions=mapfunctions
758
+ + [
759
+ Function(
760
+ name="max_sim",
761
+ expression="""
762
+ sum(
763
+ reduce(
764
+ sum(
765
+ query(qt) * unpack_bits(attribute(embedding)), v
766
+ ),
767
+ max, patch
768
+ ),
769
+ querytoken
770
+ )
771
+ """,
772
+ ),
773
+ Function(name="bm25_score", expression="bm25(title) + bm25(text)"),
774
+ ],
775
+ )
776
+ colpali_schema.add_rank_profile(colpali_profile)
777
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_profile))
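+
+ # In plain numpy terms, the `max_sim` expression above computes the ColPali late-interaction score (MaxSim):
+ # for every query token, take the maximum dot product over all page patches, then sum those maxima over the query tokens.
+ # The sketch below uses random float arrays just to illustrate the computation.
+ #
+
+ # +
+ rng = np.random.default_rng(0)
+ query_embs = rng.standard_normal((6, 128))     # (query tokens, v)
+ patch_embs = rng.standard_normal((1030, 128))  # (patches, v), i.e. the unpacked float embeddings
+ scores = query_embs @ patch_embs.T             # sum over v -> (query tokens, patches)
+ max_sim_score = scores.max(axis=1).sum()       # max over patches, then sum over query tokens
+ print(max_sim_score)
+ # -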
778
+
779
+ # Define the 'retrieval-and-rerank' rank profile
780
+ input_query_tensors = []
781
+ MAX_QUERY_TERMS = 64
782
+ for i in range(MAX_QUERY_TERMS):
783
+ input_query_tensors.append((f"query(rq{i})", "tensor<int8>(v[16])"))
784
+
785
+ input_query_tensors.extend(
786
+ [
787
+ ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
788
+ ("query(qtb)", "tensor<int8>(querytoken{}, v[16])"),
789
+ ]
790
+ )
791
+
792
+ colpali_retrieval_profile = RankProfile(
793
+ name="retrieval-and-rerank",
794
+ inputs=input_query_tensors,
795
+ first_phase="max_sim_binary",
796
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
797
+ functions=mapfunctions
798
+ + [
799
+ Function(
800
+ name="max_sim",
801
+ expression="""
802
+ sum(
803
+ reduce(
804
+ sum(
805
+ query(qt) * unpack_bits(attribute(embedding)), v
806
+ ),
807
+ max, patch
808
+ ),
809
+ querytoken
810
+ )
811
+ """,
812
+ ),
813
+ Function(
814
+ name="max_sim_binary",
815
+ expression="""
816
+ sum(
817
+ reduce(
818
+ 1 / (1 + sum(
819
+ hamming(query(qtb), attribute(embedding)), v)
820
+ ),
821
+ max, patch
822
+ ),
823
+ querytoken
824
+ )
825
+ """,
826
+ ),
827
+ ],
828
+ )
829
+ colpali_schema.add_rank_profile(colpali_retrieval_profile)
830
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_retrieval_profile))
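+
+ # The `max_sim_binary` first phase works directly on the packed 16-byte representations:
+ # for each query token and patch it counts the number of differing bits (the hamming distance over the 16 int8 cells),
+ # turns that into a similarity via 1 / (1 + distance), and then applies the usual max-over-patches and sum-over-query-tokens.
+ # A small numpy analogue, with random bit vectors for illustration only:
+ #
+
+ # +
+ rng = np.random.default_rng(0)
+ q_bits = rng.integers(0, 256, size=(6, 16), dtype=np.uint8)     # packed binary query token embeddings
+ p_bits = rng.integers(0, 256, size=(1030, 16), dtype=np.uint8)  # packed binary patch embeddings
+ xor = np.bitwise_xor(q_bits[:, None, :], p_bits[None, :, :])    # differing bits per (query token, patch)
+ hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)              # (query tokens, patches) hamming distances
+ approx_max_sim = (1.0 / (1.0 + hamming)).max(axis=1).sum()      # max over patches, sum over query tokens
+ print(approx_max_sim)
+ # -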
831
+
832
+ # +
833
+ from vespa.configuration.services import (
834
+ services,
835
+ container,
836
+ search,
837
+ document_api,
838
+ document_processing,
839
+ clients,
840
+ client,
841
+ config,
842
+ content,
843
+ redundancy,
844
+ documents,
845
+ node,
846
+ certificate,
847
+ token,
848
+ document,
849
+ nodes,
850
+ )
851
+ from vespa.configuration.vt import vt
852
+ from vespa.package import ServicesConfiguration
853
+
854
+ service_config = ServicesConfiguration(
855
+ application_name=VESPA_APPLICATION_NAME,
856
+ services_config=services(
857
+ container(
858
+ search(),
859
+ document_api(),
860
+ document_processing(),
861
+ clients(
862
+ client(
863
+ certificate(file="security/clients.pem"),
864
+ id="mtls",
865
+ permissions="read,write",
866
+ ),
867
+ client(
868
+ token(id=f"{VESPA_TOKEN_ID_WRITE}"),
869
+ id="token_write",
870
+ permissions="read,write",
871
+ ),
872
+ ),
873
+ config(
874
+ vt("tag")(
875
+ vt("bold")(
876
+ vt("open", "<strong>"),
877
+ vt("close", "</strong>"),
878
+ ),
879
+ vt("separator", "..."),
880
+ ),
881
+ name="container.qr-searchers",
882
+ ),
883
+ id=f"{VESPA_APPLICATION_NAME}_container",
884
+ version="1.0",
885
+ ),
886
+ content(
887
+ redundancy("1"),
888
+ documents(document(type="pdf_page", mode="index")),
889
+ nodes(node(distribution_key="0", hostalias="node1")),
890
+ config(
891
+ vt("max_matches", "2", replace_underscores=False),
892
+ vt("length", "1000"),
893
+ vt("surround_max", "500", replace_underscores=False),
894
+ vt("min_length", "300", replace_underscores=False),
895
+ name="vespa.config.search.summary.juniperrc",
896
+ ),
897
+ id=f"{VESPA_APPLICATION_NAME}_content",
898
+ version="1.0",
899
+ ),
900
+ version="1.0",
901
+ ),
902
+ )
903
+ # -
904
+
905
+ # Create the Vespa application package
906
+ vespa_application_package = ApplicationPackage(
907
+ name=VESPA_APPLICATION_NAME,
908
+ schema=[colpali_schema],
909
+ services_config=service_config,
910
+ )
911
+
912
+ # ## 7. Deploy Vespa Application
913
+ #
914
+
915
+ VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY") or input(
916
+ "Enter Vespa team API key: "
917
+ )
918
+
919
+ # +
920
+ vespa_cloud = VespaCloud(
921
+ tenant=VESPA_TENANT_NAME,
922
+ application=VESPA_APPLICATION_NAME,
923
+ key_content=VESPA_TEAM_API_KEY,
924
+ application_package=vespa_application_package,
925
+ )
926
+
927
+ # Deploy the application
928
+ vespa_cloud.deploy()
929
+
930
+ # Output the endpoint URL
931
+ endpoint_url = vespa_cloud.get_token_endpoint()
932
+ print(f"Application deployed. Token endpoint URL: {endpoint_url}")
933
+ # -
934
+
935
+ # Make sure to take note of the token endpoint_url.
936
+ # You need to put this in your `.env` file - `VESPA_APP_URL=https://abcd.vespa-app.cloud` - to access the Vespa application from your web application.
937
+ #
938
+
939
+ # ## 8. Feed Data to Vespa
940
+ #
941
+
942
+ # Instantiate Vespa connection using token
943
+ app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
944
+ app.get_application_status()
945
+
946
+
947
+ # +
948
+ def callback(response: VespaResponse, id: str):
949
+ if not response.is_successful():
950
+ print(
951
+ f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
952
+ )
953
+
954
+
955
+ # Feed data into Vespa asynchronously
956
+ app.feed_async_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)
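+
+ # Once feeding completes, you can optionally verify that a document made it in by fetching it back by its id.
+ # This is just a minimal sanity check; the document id below is the one we generated for the first page in the feed.
+
+ # +
+ first_id = vespa_feed[0]["id"]
+ get_response = app.get_data(schema=VESPA_SCHEMA_NAME, data_id=first_id)
+ print(f"Fetched document {first_id}: {get_response.is_successful()}")
+ # -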
prepare_feed_deploy_v2.py ADDED
@@ -0,0 +1,956 @@
1
+ # # Visual PDF Retrieval - demo application
2
+ #
3
+ # In this notebook, we will prepare the Vespa backend application for our visual retrieval demo.
4
+ # We will use ColPali as the model to extract patch vectors from images of pdf pages.
5
+ # At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.
6
+ #
7
+ # To see the application in action, visit TODO:
8
+ #
9
+ # The web application is written in FastHTML, meaning the complete application is written in python.
10
+ #
11
+ # The steps we will take in this notebook are:
12
+ #
13
+ # 0. Setup and configuration
14
+ # 1. Download the data
15
+ # 2. Prepare the data
16
+ # 3. Generate queries for evaluation and typeahead search suggestions
17
+ # 4. Deploy the Vespa application
18
+ # 5. Create the Vespa application
19
+ # 6. Feed the data to the Vespa application
20
+ #
21
+ # All the steps that are needed to provision the Vespa application, including feeding the data, can be done from this notebook.
22
+ # We have tried to make it easy for others to run this notebook, to create your own PDF Enterprise Search application using Vespa.
23
+ #
24
+
25
+ # ## 0. Setup and Configuration
26
+ #
27
+
28
+ # +
29
+ import os
30
+ import asyncio
31
+ import json
32
+ from typing import Tuple
33
+ import hashlib
34
+ import numpy as np
35
+
36
+ # Vespa
37
+ from vespa.package import (
38
+ ApplicationPackage,
39
+ Field,
40
+ Schema,
41
+ Document,
42
+ HNSW,
43
+ RankProfile,
44
+ Function,
45
+ FieldSet,
46
+ SecondPhaseRanking,
47
+ Summary,
48
+ DocumentSummary,
49
+ )
50
+ from vespa.deployment import VespaCloud
51
+ from vespa.application import Vespa
52
+ from vespa.io import VespaResponse
53
+
54
+ # Google Generative AI
55
+ import google.generativeai as genai
56
+
57
+ # Torch and other ML libraries
58
+ import torch
59
+ from torch.utils.data import DataLoader
60
+ from tqdm import tqdm
61
+ from pdf2image import convert_from_path
62
+ from pypdf import PdfReader
63
+
64
+ # ColPali model and processor
65
+ from colpali_engine.models import ColPali, ColPaliProcessor
66
+ from colpali_engine.utils.torch_utils import get_torch_device
67
+ from vidore_benchmark.utils.image_utils import scale_image, get_base64_image
68
+
69
+ # Other utilities
70
+ from bs4 import BeautifulSoup
71
+ import httpx
72
+ from urllib.parse import urljoin, urlparse
73
+
74
+ # Load environment variables
75
+ from dotenv import load_dotenv
76
+
77
+ load_dotenv()
78
+
79
+ # Avoid warning from huggingface tokenizers
80
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
81
+ # -
82
+
83
+ # ### Create a free trial in Vespa Cloud
84
+ #
85
+ # Create a tenant from [here](https://vespa.ai/free-trial/).
86
+ # The trial includes $300 credit.
87
+ # Take note of your tenant name.
88
+ #
89
+
90
+ VESPA_TENANT_NAME = "vespa-team"
91
+
92
+ # Here, set your desired application name. (Will be created in later steps)
93
+ # Note that you can not have hyphen `-` or underscore `_` in the application name.
94
+ #
95
+
96
+ VESPA_APPLICATION_NAME = "colpalidemo"
97
+ VESPA_SCHEMA_NAME = "pdf_page"
98
+
99
+ # Next, you need to create some tokens for feeding data, and querying the application.
100
+ # We recommend separate tokens for feeding and querying, (the former with write permission, and the latter with read permission).
101
+ # The tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section.
102
+ #
103
+
104
+ VESPA_TOKEN_ID_WRITE = "colpalidemo_write"
105
+
106
+ # We also need to set the value of the write token to be able to feed data to the Vespa application.
107
+ #
108
+
109
+ VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
110
+ "Enter Vespa cloud secret token: "
111
+ )
112
+
113
+ # We will also use the Gemini API to create sample queries for our images.
114
+ # You can also use other VLM's to create these queries.
115
+ # Create a Gemini API key from [here](https://aistudio.google.com/app/apikey).
116
+ #
117
+
118
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or input(
119
+ "Enter Google Generative AI API key: "
120
+ )
121
+
122
+ # +
123
+ MODEL_NAME = "vidore/colpali-v1.2"
124
+
125
+ # Configure Google Generative AI
126
+ genai.configure(api_key=GEMINI_API_KEY)
127
+
128
+ # Set device for Torch
129
+ device = get_torch_device("auto")
130
+ print(f"Using device: {device}")
131
+
132
+ # Load the ColPali model and processor
133
+ model = ColPali.from_pretrained(
134
+ MODEL_NAME,
135
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
136
+ device_map=device,
137
+ ).eval()
138
+
139
+ processor = ColPaliProcessor.from_pretrained(MODEL_NAME)
140
+ # -
141
+
142
+ # ## 1. Download PDFs
143
+ #
144
+ # We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund).
145
+ # The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance.
146
+ #
147
+ # These reports are the ones we are going to use for this showcase.
148
+ # Here are some sample images:
149
+ #
150
+ # ![Sample1](./static/img/gfpg-sample-1.png)
151
+ # ![Sample2](./static/img/gfpg-sample-2.png)
152
+ #
153
+
154
+ # As we can see, a lot of the information is in the form of tables, charts and numbers.
155
+ # These are not easily extractable using pdf-readers or OCR tools.
156
+ #
157
+
158
+ # +
159
+ import requests
160
+
161
+ url = "https://www.nbim.no/en/publications/reports/"
162
+ response = requests.get(url)
163
+ response.raise_for_status()
164
+ html_content = response.text
165
+
166
+ # Parse with BeautifulSoup
167
+ soup = BeautifulSoup(html_content, "html.parser")
168
+
169
+ links = []
170
+ url_to_year = {}
171
+
172
+ # Find all 'div's with id starting with 'year-'
173
+ for year_div in soup.find_all("div", id=lambda x: x and x.startswith("year-")):
174
+ year_id = year_div.get("id", "")
175
+ year = year_id.replace("year-", "")
176
+
177
+ # Within this div, find all 'a' elements with the specific classes
178
+ for a_tag in year_div.select("a.button.button--download-secondary[href]"):
179
+ href = a_tag["href"]
180
+ full_url = urljoin(url, href)
181
+ links.append(full_url)
182
+ url_to_year[full_url] = year
183
+ links, url_to_year
184
+ # -
185
+
186
+ # Limit the number of PDFs to download
187
+ NUM_PDFS = 2 # Set to None to download all PDFs
188
+ links = links[:NUM_PDFS] if NUM_PDFS else links
189
+ links
190
+
191
+ # +
192
+ from nest_asyncio import apply
193
+ from typing import List
194
+
195
+ apply()
196
+
197
+ max_attempts = 3
198
+
199
+
200
+ async def download_pdf(session, url, filename):
201
+ attempt = 0
202
+ while attempt < max_attempts:
203
+ try:
204
+ response = await session.get(url)
205
+ response.raise_for_status()
206
+
207
+ # Use Content-Disposition header to get the filename if available
208
+ content_disposition = response.headers.get("Content-Disposition")
209
+ if content_disposition:
210
+ import re
211
+
212
+ fname = re.findall('filename="(.+)"', content_disposition)
213
+ if fname:
214
+ filename = fname[0]
215
+
216
+ # Ensure the filename is safe to use on the filesystem
217
+ safe_filename = filename.replace("/", "_").replace("\\", "_")
218
+ if not safe_filename or safe_filename == "_":
219
+ print(f"Invalid filename: {filename}")
220
+ continue
221
+
222
+ filepath = os.path.join("pdfs", safe_filename)
223
+ with open(filepath, "wb") as f:
224
+ f.write(response.content)
225
+ print(f"Downloaded {safe_filename}")
226
+ return filepath
227
+ except Exception as e:
228
+ print(f"Error downloading {filename}: {e}")
229
+ print(f"Retrying ({attempt})...")
230
+ await asyncio.sleep(1) # Wait a bit before retrying
231
+ attempt += 1
232
+ return None
233
+
234
+
235
+ async def download_pdfs(links: List[str]) -> List[dict]:
236
+ """Download PDFs from a list of URLs. Add the filename to the dictionary."""
237
+ async with httpx.AsyncClient() as client:
238
+ tasks = []
239
+
240
+ for idx, link in enumerate(links):
241
+ # Try to get the filename from the URL
242
+ path = urlparse(link).path
243
+ filename = os.path.basename(path)
244
+
245
+ # If filename is empty,skip
246
+ if not filename:
247
+ continue
248
+ tasks.append(download_pdf(client, link, filename))
249
+
250
+ # Run the tasks concurrently
251
+ paths = await asyncio.gather(*tasks)
252
+ pdf_files = [
253
+ {"url": link, "path": path} for link, path in zip(links, paths) if path
254
+ ]
255
+ return pdf_files
256
+
257
+
258
+ # Create the pdfs directory if it doesn't exist
259
+ os.makedirs("pdfs", exist_ok=True)
260
+ # Now run the download_pdfs function with the URL
261
+ pdfs = asyncio.run(download_pdfs(links))
262
+ # -
263
+
264
+ pdfs
265
+
266
+ # ## 2. Convert PDFs to Images
267
+ #
268
+
269
+
270
+ # +
271
+ def get_pdf_images(pdf_path):
272
+ reader = PdfReader(pdf_path)
273
+ page_texts = []
274
+ for page_number in range(len(reader.pages)):
275
+ page = reader.pages[page_number]
276
+ text = page.extract_text()
277
+ page_texts.append(text)
278
+ images = convert_from_path(pdf_path)
279
+ # Convert to PIL images
280
+ assert len(images) == len(page_texts)
281
+ return images, page_texts
282
+
283
+
284
+ pdf_folder = "pdfs"
285
+ pdf_pages = []
286
+ for pdf in tqdm(pdfs):
287
+ pdf_file = pdf["path"]
288
+ title = os.path.splitext(os.path.basename(pdf_file))[0]
289
+ images, texts = get_pdf_images(pdf_file)
290
+ for page_no, (image, text) in enumerate(zip(images, texts)):
291
+ pdf_pages.append(
292
+ {
293
+ "title": title,
294
+ "year": int(url_to_year[pdf["url"]]),
295
+ "url": pdf["url"],
296
+ "path": pdf_file,
297
+ "image": image,
298
+ "text": text,
299
+ "page_no": page_no,
300
+ }
301
+ )
302
+ # -
303
+
304
+ len(pdf_pages)
305
+
306
+ # +
307
+ from collections import Counter
308
+
309
+ # Print the length of the text fields - mean, max and min
310
+ text_lengths = [len(page["text"]) for page in pdf_pages]
311
+ print(f"Mean text length: {np.mean(text_lengths)}")
312
+ print(f"Max text length: {np.max(text_lengths)}")
313
+ print(f"Min text length: {np.min(text_lengths)}")
314
+ print(f"Median text length: {np.median(text_lengths)}")
315
+ print(f"Number of text with length == 0: {Counter(text_lengths)[0]}")
316
+ # -
317
+
318
+ # ## 3. Generate Queries
319
+ #
320
+ # In this step, we want to generate queries for each page image.
321
+ # These will be useful for 2 reasons:
322
+ #
323
+ # 1. We can use these queries as typeahead suggestions in the search bar.
324
+ # 2. We can use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic.
325
+ #
326
+ # The prompt for generating queries is taken from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien.
327
+ #
328
+ # We will use the Gemini API to generate these queries, with `gemini-1.5-flash-8b` as the model.
329
+ #
330
+
331
+ # +
332
+ from pydantic import BaseModel
333
+
334
+
335
+ class GeneratedQueries(BaseModel):
336
+ broad_topical_question: str
337
+ broad_topical_query: str
338
+ specific_detail_question: str
339
+ specific_detail_query: str
340
+ visual_element_question: str
341
+ visual_element_query: str
342
+
343
+
344
+ def get_retrieval_prompt() -> Tuple[str, GeneratedQueries]:
345
+ prompt = (
346
+ prompt
347
+ ) = """You are an investor, stock analyst and financial expert. You will be presented an image of a document page from a report published by the Norwegian Government Pension Fund Global (GPFG). The report may be annual or quarterly reports, or policy reports, on topics such as responsible investment, risk etc.
348
+ Your task is to generate retrieval queries and questions that you would use to retrieve this document (or ask based on this document) in a large corpus.
349
+ Please generate 3 different types of retrieval queries and questions.
350
+ A retrieval query is a keyword based query, made up of 2-5 words, that you would type into a search engine to find this document.
351
+ A question is a natural language question that you would ask, for which the document contains the answer.
352
+ The queries should be of the following types:
353
+ 1. A broad topical query: This should cover the main subject of the document.
354
+ 2. A specific detail query: This should cover a specific detail or aspect of the document.
355
+ 3. A visual element query: This should cover a visual element of the document, such as a chart, graph, or image.
356
+
357
+ Important guidelines:
358
+ - Ensure the queries are relevant for retrieval tasks, not just describing the page content.
359
+ - Use a fact-based natural language style for the questions.
360
+ - Frame the queries as if someone is searching for this document in a large corpus.
361
+ - Make the queries diverse and representative of different search strategies.
362
+
363
+ Format your response as a JSON object with the structure of the following example:
364
+ {
365
+ "broad_topical_question": "What was the Responsible Investment Policy in 2019?",
366
+ "broad_topical_query": "responsible investment policy 2019",
367
+ "specific_detail_question": "What is the percentage of investments in renewable energy?",
368
+ "specific_detail_query": "renewable energy investments percentage",
369
+ "visual_element_question": "What is the trend of total holding value over time?",
370
+ "visual_element_query": "total holding value trend"
371
+ }
372
+
373
+ If there are no relevant visual elements, provide an empty string for the visual element question and query.
374
+ Here is the document image to analyze:
375
+ Generate the queries based on this image and provide the response in the specified JSON format.
376
+ Only return JSON. Don't return any extra explanation text. """
377
+
378
+ return prompt, GeneratedQueries
379
+
380
+
381
+ prompt_text, pydantic_model = get_retrieval_prompt()
382
+
383
+ # +
384
+ gemini_model = genai.GenerativeModel("gemini-1.5-flash-8b")
385
+
386
+
387
+ def generate_queries(image, prompt_text, pydantic_model):
388
+ try:
389
+ response = gemini_model.generate_content(
390
+ [image, "\n\n", prompt_text],
391
+ generation_config=genai.GenerationConfig(
392
+ response_mime_type="application/json",
393
+ response_schema=pydantic_model,
394
+ ),
395
+ )
396
+ queries = json.loads(response.text)
397
+ except Exception as _e:
398
+ queries = {
399
+ "broad_topical_question": "",
400
+ "broad_topical_query": "",
401
+ "specific_detail_question": "",
402
+ "specific_detail_query": "",
403
+ "visual_element_question": "",
404
+ "visual_element_query": "",
405
+ }
406
+ return queries
407
+
408
+
409
+ # -
410
+
411
+ for pdf in tqdm(pdf_pages):
412
+ image = pdf.get("image")
413
+ pdf["queries"] = generate_queries(image, prompt_text, pydantic_model)
414
+
415
+ pdf_pages[46]["image"]
416
+
417
+ pdf_pages[46]["queries"]
418
+
419
+ # +
420
+ # Generate queries async - keeping for now as we probably need when applying to the full dataset
421
+ # import asyncio
422
+ # from tenacity import retry, stop_after_attempt, wait_exponential
423
+ # import google.generativeai as genai
424
+ # from tqdm.asyncio import tqdm_asyncio
425
+
426
+ # max_in_flight = 200 # Maximum number of concurrent requests
427
+
428
+
429
+ # async def generate_queries_for_image_async(model, image, semaphore):
430
+ # @retry(stop=stop_after_attempt(3), wait=wait_exponential(), reraise=True)
431
+ # async def _generate():
432
+ # async with semaphore:
433
+ # result = await model.generate_content_async(
434
+ # [image, "\n\n", prompt_text],
435
+ # generation_config=genai.GenerationConfig(
436
+ # response_mime_type="application/json",
437
+ # response_schema=pydantic_model,
438
+ # ),
439
+ # )
440
+ # return json.loads(result.text)
441
+
442
+ # try:
443
+ # return await _generate()
444
+ # except Exception as e:
445
+ # print(f"Error generating queries for image: {e}")
446
+ # return None # Return None or handle as needed
447
+
448
+
449
+ # async def enrich_pdfs():
450
+ # gemini_model = genai.GenerativeModel("gemini-1.5-flash-8b")
451
+ # semaphore = asyncio.Semaphore(max_in_flight)
452
+ # tasks = []
453
+ # for pdf in pdf_pages:
454
+ # pdf["queries"] = []
455
+ # image = pdf.get("image")
456
+ # if image:
457
+ # task = generate_queries_for_image_async(gemini_model, image, semaphore)
458
+ # tasks.append((pdf, task))
459
+
460
+ # # Run the tasks concurrently using asyncio.gather()
461
+ # for pdf, task in tqdm_asyncio(tasks):
462
+ # result = await task
463
+ # if result:
464
+ # pdf["queries"] = result
465
+ # return pdf_pages
466
+
467
+
468
+ # pdf_pages = asyncio.run(enrich_pdfs())
469
+
470
+ # +
471
+ # write title, url, page_no, text, queries, not image to JSON
472
+ with open("output/pdf_pages.json", "w") as f:
473
+ to_write = [{k: v for k, v in pdf.items() if k != "image"} for pdf in pdf_pages]
474
+ json.dump(to_write, f, indent=2)
475
+
476
+ # with open("pdfs/pdf_pages.json", "r") as f:
477
+ # saved_pdf_pages = json.load(f)
478
+ # for pdf, saved_pdf in zip(pdf_pages, saved_pdf_pages):
479
+ # pdf.update(saved_pdf)
480
+ # -
481
+
482
+ # ## 4. Generate embeddings
483
+ #
484
+ # Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.
485
+ #
486
+
487
+
488
+ def generate_embeddings(images, model, processor, batch_size=2) -> np.ndarray:
489
+ """
490
+ Generate embeddings for a list of images.
491
+ Move to CPU only once per batch.
492
+
493
+ Args:
494
+ images (List[PIL.Image]): List of PIL images.
495
+ model (nn.Module): The model to generate embeddings.
496
+ processor: The processor to preprocess images.
497
+ batch_size (int, optional): Batch size for processing. Defaults to 64.
498
+
499
+ Returns:
500
+ np.ndarray: Embeddings for the images, shape
501
+ (len(images), processor.max_patch_length (1030 for ColPali), model.config.hidden_size (Patch embedding dimension - 128 for ColPali)).
502
+ """
503
+ embeddings_list = []
504
+
505
+ def collate_fn(batch):
506
+ # Batch is a list of images
507
+ return processor.process_images(batch) # Should return a dict of tensors
508
+
509
+ dataloader = DataLoader(
510
+ images,
511
+ shuffle=False,
512
+ collate_fn=collate_fn,
513
+ )
514
+
515
+ for batch_doc in tqdm(dataloader, desc="Generating embeddings"):
516
+ with torch.no_grad():
517
+ # Move batch to the device
518
+ batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
519
+ embeddings_batch = model(**batch_doc)
520
+ embeddings_list.append(torch.unbind(embeddings_batch.to("cpu"), dim=0))
521
+ # Concatenate all embeddings and create a numpy array
522
+ all_embeddings = np.concatenate(embeddings_list, axis=0)
523
+ return all_embeddings
524
+
525
+
526
+ # Generate embeddings for all images
527
+ images = [pdf["image"] for pdf in pdf_pages]
528
+ embeddings = generate_embeddings(images, model, processor)
529
+
530
+ embeddings.shape
531
+
532
+ # ## 5. Prepare Data on Vespa Format
533
+ #
534
+ # Now, that we have all the data we need, all that remains is to make sure it is in the right format for Vespa.
535
+ #
536
+
537
+
538
+ def float_to_binary_embedding(float_query_embedding: dict) -> dict:
539
+ """Utility function to convert float query embeddings to binary query embeddings."""
540
+ binary_query_embeddings = {}
541
+ for k, v in float_query_embedding.items():
542
+ binary_vector = (
543
+ np.packbits(np.where(np.array(v) > 0, 1, 0)).astype(np.int8).tolist()
544
+ )
545
+ binary_query_embeddings[k] = binary_vector
546
+ return binary_query_embeddings
547
+
548
+
549
+ vespa_feed = []
550
+ for pdf, embedding in zip(pdf_pages, embeddings):
551
+ url = pdf["url"]
552
+ year = pdf["year"]
553
+ title = pdf["title"]
554
+ image = pdf["image"]
555
+ text = pdf.get("text", "")
556
+ page_no = pdf["page_no"]
557
+ query_dict = pdf["queries"]
558
+ questions = [v for k, v in query_dict.items() if "question" in k and v]
559
+ queries = [v for k, v in query_dict.items() if "query" in k and v]
560
+ base_64_image = get_base64_image(
561
+ scale_image(image, 32), add_url_prefix=False
562
+ ) # Scaled down image to return fast on search (~1kb)
563
+ base_64_full_image = get_base64_image(image, add_url_prefix=False)
564
+ embedding_dict = {k: v for k, v in enumerate(embedding)}
565
+ binary_embedding = float_to_binary_embedding(embedding_dict)
566
+ # id_hash should be md5 hash of url and page_number
567
+ id_hash = hashlib.md5(f"{url}_{page_no}".encode()).hexdigest()
568
+ page = {
569
+ "id": id_hash,
570
+ "fields": {
571
+ "id": id_hash,
572
+ "url": url,
573
+ "title": title,
574
+ "year": year,
575
+ "page_number": page_no,
576
+ "blur_image": base_64_image,
577
+ "full_image": base_64_full_image,
578
+ "text": text,
579
+ "embedding": binary_embedding,
580
+ "queries": queries,
581
+ "questions": questions,
582
+ },
583
+ }
584
+ vespa_feed.append(page)
585
+
586
+ # +
587
+ # We will prepare the Vespa feed data, including the embeddings and the generated queries
588
+
589
+
590
+ # Save vespa_feed to vespa_feed.json
591
+ os.makedirs("output", exist_ok=True)
592
+ with open("output/vespa_feed.json", "w") as f:
593
+ vespa_feed_to_save = []
594
+ for page in vespa_feed:
595
+ document_id = page["id"]
596
+ put_id = f"id:{VESPA_APPLICATION_NAME}:{VESPA_SCHEMA_NAME}::{document_id}"
597
+ vespa_feed_to_save.append({"put": put_id, "fields": page["fields"]})
598
+ json.dump(vespa_feed_to_save, f)
599
+
600
+ # +
601
+ # import json
602
+
603
+ # with open("output/vespa_feed.json", "r") as f:
604
+ # vespa_feed = json.load(f)
605
+ # -
606
+
607
+ len(vespa_feed)
608
+
609
+ # ## 5. Prepare Vespa Application
610
+ #
611
+
612
+ # +
613
+ # Define the Vespa schema
614
+ colpali_schema = Schema(
615
+ name=VESPA_SCHEMA_NAME,
616
+ document=Document(
617
+ fields=[
618
+ Field(
619
+ name="id",
620
+ type="string",
621
+ indexing=["summary", "index"],
622
+ match=["word"],
623
+ ),
624
+ Field(name="url", type="string", indexing=["summary", "index"]),
625
+ Field(name="year", type="int", indexing=["summary", "attribute"]),
626
+ Field(
627
+ name="title",
628
+ type="string",
629
+ indexing=["summary", "index"],
630
+ match=["text"],
631
+ index="enable-bm25",
632
+ ),
633
+ Field(name="page_number", type="int", indexing=["summary", "attribute"]),
634
+ Field(name="blur_image", type="raw", indexing=["summary"]),
635
+ Field(name="full_image", type="raw", indexing=["summary"]),
636
+ Field(
637
+ name="text",
638
+ type="string",
639
+ indexing=["summary", "index"],
640
+ match=["text"],
641
+ index="enable-bm25",
642
+ ),
643
+ Field(
644
+ name="embedding",
645
+ type="tensor<int8>(patch{}, v[16])",
646
+ indexing=[
647
+ "attribute",
648
+ "index",
649
+ ],
650
+ ann=HNSW(
651
+ distance_metric="hamming",
652
+ max_links_per_node=32,
653
+ neighbors_to_explore_at_insert=400,
654
+ ),
655
+ ),
656
+ Field(
657
+ name="questions",
658
+ type="array<string>",
659
+ indexing=["summary", "attribute"],
660
+ summary=Summary(fields=["matched-elements-only"]),
661
+ ),
662
+ Field(
663
+ name="queries",
664
+ type="array<string>",
665
+ indexing=["summary", "attribute"],
666
+ summary=Summary(fields=["matched-elements-only"]),
667
+ ),
668
+ ]
669
+ ),
670
+ fieldsets=[
671
+ FieldSet(
672
+ name="default",
673
+ fields=["title", "url", "blur_image", "page_number", "text"],
674
+ ),
675
+ FieldSet(
676
+ name="image",
677
+ fields=["full_image"],
678
+ ),
679
+ ],
680
+ document_summaries=[
681
+ DocumentSummary(
682
+ name="default",
683
+ summary_fields=[
684
+ Summary(
685
+ name="text",
686
+ fields=[("bolding", "on")],
687
+ ),
688
+ Summary(
689
+ name="snippet",
690
+ fields=[("source", "text"), "dynamic"],
691
+ ),
692
+ ],
693
+ from_disk=True,
694
+ ),
695
+ DocumentSummary(
696
+ name="suggestions",
697
+ summary_fields=[
698
+ Summary(name="questions"),
699
+ ],
700
+ from_disk=True,
701
+ ),
702
+ ],
703
+ )
704
+
705
+ # Define similarity functions used in all rank profiles
706
+ mapfunctions = [
707
+ Function(
708
+ name="similarities", # computes similarity scores between each query token and image patch
709
+ expression="""
710
+ sum(
711
+ query(qt) * unpack_bits(attribute(embedding)), v
712
+ )
713
+ """,
714
+ ),
715
+ Function(
716
+ name="normalized", # normalizes the similarity scores to [-1, 1]
717
+ expression="""
718
+ (similarities - reduce(similarities, min)) / (reduce((similarities - reduce(similarities, min)), max)) * 2 - 1
719
+ """,
720
+ ),
721
+ Function(
722
+ name="quantized", # quantizes the normalized similarity scores to signed 8-bit integers [-128, 127]
723
+ expression="""
724
+ cell_cast(normalized * 127.999, int8)
725
+ """,
726
+ ),
727
+ ]
728
+
729
+ # Define the 'bm25' rank profile
730
+ colpali_bm25_profile = RankProfile(
731
+ name="bm25",
732
+ inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
733
+ first_phase="bm25(title) + bm25(text)",
734
+ functions=mapfunctions,
735
+ )
736
+
737
+
738
+ # A function to create an inherited rank profile which also returns quantized similarity scores
739
+ def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
740
+ return RankProfile(
741
+ name=f"{rank_profile.name}_sim",
742
+ first_phase=rank_profile.first_phase,
743
+ inherits=rank_profile.name,
744
+ summary_features=["quantized"],
745
+ )
746
+
747
+
748
+ colpali_schema.add_rank_profile(colpali_bm25_profile)
749
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_bm25_profile))
750
+
751
+ # Update the 'default' rank profile
752
+ colpali_profile = RankProfile(
753
+ name="default",
754
+ inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
755
+ first_phase="bm25_score",
756
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
757
+ functions=mapfunctions
758
+ + [
759
+ Function(
760
+ name="max_sim",
761
+ expression="""
762
+ sum(
763
+ reduce(
764
+ sum(
765
+ query(qt) * unpack_bits(attribute(embedding)), v
766
+ ),
767
+ max, patch
768
+ ),
769
+ querytoken
770
+ )
771
+ """,
772
+ ),
773
+ Function(name="bm25_score", expression="bm25(title) + bm25(text)"),
774
+ ],
775
+ )
776
+ colpali_schema.add_rank_profile(colpali_profile)
777
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_profile))
778
+
779
+ # Update the 'retrieval-and-rerank' rank profile
780
+ input_query_tensors = []
781
+ MAX_QUERY_TERMS = 64
782
+ for i in range(MAX_QUERY_TERMS):
783
+ input_query_tensors.append((f"query(rq{i})", "tensor<int8>(v[16])"))
784
+
785
+ input_query_tensors.extend(
786
+ [
787
+ ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
788
+ ("query(qtb)", "tensor<int8>(querytoken{}, v[16])"),
789
+ ]
790
+ )
791
+
792
+ colpali_retrieval_profile = RankProfile(
793
+ name="retrieval-and-rerank",
794
+ inputs=input_query_tensors,
795
+ first_phase="max_sim_binary",
796
+ second_phase=SecondPhaseRanking(expression="max_sim", rerank_count=10),
797
+ functions=mapfunctions
798
+ + [
799
+ Function(
800
+ name="max_sim",
801
+ expression="""
802
+ sum(
803
+ reduce(
804
+ sum(
805
+ query(qt) * unpack_bits(attribute(embedding)), v
806
+ ),
807
+ max, patch
808
+ ),
809
+ querytoken
810
+ )
811
+ """,
812
+ ),
813
+ Function(
814
+ name="max_sim_binary",
815
+ expression="""
816
+ sum(
817
+ reduce(
818
+ 1 / (1 + sum(
819
+ hamming(query(qtb), attribute(embedding)), v)
820
+ ),
821
+ max, patch
822
+ ),
823
+ querytoken
824
+ )
825
+ """,
826
+ ),
827
+ ],
828
+ )
829
+ colpali_schema.add_rank_profile(colpali_retrieval_profile)
830
+ colpali_schema.add_rank_profile(with_quantized_similarity(colpali_retrieval_profile))
831
+
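The "max_sim_binary" first phase approximates the same score on bit-packed embeddings: a Hamming distance per token/patch pair, turned into a closeness via 1 / (1 + distance). A rough NumPy sketch on random bytes, again purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
q_bits = rng.integers(0, 256, size=(5, 16), dtype=np.uint8)        # query(qtb): 16 packed bytes per token
patch_bits = rng.integers(0, 256, size=(700, 16), dtype=np.uint8)  # binary patch embeddings

xor = q_bits[:, None, :] ^ patch_bits[None, :, :]
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)  # differing bits per token/patch pair
score = (1.0 / (1.0 + hamming)).max(axis=1).sum()   # max over patches, summed over tokens
print(round(float(score), 3))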
832
+ # +
833
+ from vespa.configuration.services import (
834
+ services,
835
+ container,
836
+ search,
837
+ document_api,
838
+ document_processing,
839
+ clients,
840
+ client,
841
+ config,
842
+ content,
843
+ redundancy,
844
+ documents,
845
+ node,
846
+ certificate,
847
+ token,
848
+ document,
849
+ nodes,
850
+ )
851
+ from vespa.configuration.vt import vt
852
+ from vespa.package import ServicesConfiguration
853
+
854
+ service_config = ServicesConfiguration(
855
+ application_name=VESPA_APPLICATION_NAME,
856
+ services_config=services(
857
+ container(
858
+ search(),
859
+ document_api(),
860
+ document_processing(),
861
+ clients(
862
+ client(
863
+ certificate(file="security/clients.pem"),
864
+ id="mtls",
865
+ permissions="read,write",
866
+ ),
867
+ client(
868
+ token(id=f"{VESPA_TOKEN_ID_WRITE}"),
869
+ id="token_write",
870
+ permissions="read,write",
871
+ ),
872
+ ),
873
+ config(
874
+ vt("tag")(
875
+ vt("bold")(
876
+ vt("open", "<strong>"),
877
+ vt("close", "</strong>"),
878
+ ),
879
+ vt("separator", "..."),
880
+ ),
881
+ name="container.qr-searchers",
882
+ ),
883
+ id=f"{VESPA_APPLICATION_NAME}_container",
884
+ version="1.0",
885
+ ),
886
+ content(
887
+ redundancy("1"),
888
+ documents(document(type="pdf_page", mode="index")),
889
+ nodes(node(distribution_key="0", hostalias="node1")),
890
+ config(
891
+ vt("max_matches", "2", replace_underscores=False),
892
+ vt("length", "1000"),
893
+ vt("surround_max", "500", replace_underscores=False),
894
+ vt("min_length", "300", replace_underscores=False),
895
+ name="vespa.config.search.summary.juniperrc",
896
+ ),
897
+ id=f"{VESPA_APPLICATION_NAME}_content",
898
+ version="1.0",
899
+ ),
900
+ version="1.0",
901
+ ),
902
+ )
903
+ # -
904
+
905
+ # Create the Vespa application package
906
+ vespa_application_package = ApplicationPackage(
907
+ name=VESPA_APPLICATION_NAME,
908
+ schema=[colpali_schema],
909
+ services_config=service_config,
910
+ )
911
+
912
+ # ## 6. Deploy Vespa Application
913
+ #
914
+
915
+ VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY") or input(
916
+ "Enter Vespa team API key: "
917
+ )
918
+
919
+ # +
920
+ vespa_cloud = VespaCloud(
921
+ tenant=VESPA_TENANT_NAME,
922
+ application=VESPA_APPLICATION_NAME,
923
+ key_content=VESPA_TEAM_API_KEY,
924
+ application_package=vespa_application_package,
925
+ )
926
+
927
+ # Deploy the application
928
+ vespa_cloud.deploy()
929
+
930
+ # Output the endpoint URL
931
+ endpoint_url = vespa_cloud.get_token_endpoint()
932
+ print(f"Application deployed. Token endpoint URL: {endpoint_url}")
933
+ # -
934
+
935
+ # Make sure to take note of the token endpoint_url.
936
+ # Add it to your `.env` file as `VESPA_APP_URL=https://abcd.vespa-app.cloud` so your web application can reach the deployed Vespa application.
937
+ #
938
+
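For reference, a minimal sketch of how the web application side can pick the endpoint back up from the .env file, assuming it contains VESPA_APP_URL and VESPA_CLOUD_SECRET_TOKEN (the variable names used elsewhere in this repository):

import os
from dotenv import load_dotenv
from vespa.application import Vespa

load_dotenv()
app = Vespa(
    url=os.environ["VESPA_APP_URL"],
    vespa_cloud_secret_token=os.environ["VESPA_CLOUD_SECRET_TOKEN"],
)
print(app.get_application_status())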
939
+ # ## 7. Feed Data to Vespa
940
+ #
941
+
942
+ # Instantiate Vespa connection using token
943
+ app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
944
+ app.get_application_status()
945
+
946
+
947
+ # +
948
+ def callback(response: VespaResponse, id: str):
949
+ if not response.is_successful():
950
+ print(
951
+ f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
952
+ )
953
+
954
+
955
+ # Feed data into Vespa asynchronously
956
+ app.feed_async_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)
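Each item yielded by vespa_feed is expected to follow pyvespa's feed format: a dict with an id and a fields payload matching the pdf_page schema defined earlier in this script. A hypothetical example is sketched below; all values are placeholders and the exact field set and embedding encoding come from the schema definition above.

example_feed_item = {
    "id": "sample-paper-page-0",
    "fields": {
        "url": "https://example.com/sample-paper.pdf",
        "title": "Sample Paper",
        "page_number": 0,
        "image": "<base64-encoded page image>",
        "text": "Extracted page text ...",
        "embedding": {0: "ffa1...", 1: "03d4..."},  # patch index -> packed binary vector (assumed hex format)
    },
}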
pyproject.toml ADDED
@@ -0,0 +1,119 @@
1
+ [project]
2
+ name = "visual-retrieval-colpali"
3
+ version = "0.1.0"
4
+ description = "Visual retrieval with ColPali"
5
+ readme = "README.md"
6
+ requires-python = ">=3.10, <3.13"
7
+ license = { text = "Apache-2.0" }
8
+ dependencies = [
9
+ "python-fasthtml",
10
+ "huggingface-hub",
11
+ "pyvespa>=0.50.0",
12
+ "vespacli",
13
+ "torch",
14
+ "vidore-benchmark[interpretability]>=4.0.0,<5.0.0",
15
+ "colpali-engine",
16
+ "einops",
17
+ "pypdf",
18
+ "setuptools",
19
+ "python-dotenv",
20
+ "shad4fast>=1.2.1",
21
+ "google-generativeai>=0.7.2",
22
+ "spacy",
23
+ "pip",
24
+ "matplotlib"
25
+ ]
26
+
27
+ # dev-dependencies
28
+ [project.optional-dependencies]
29
+ dev = [
30
+ "ruff",
31
+ "python-dotenv",
32
+ "huggingface_hub[cli]"
33
+ ]
34
+ feed = [
35
+ "ipykernel",
36
+ "jupytext",
37
+ "pydantic",
38
+ "beautifulsoup4",
39
+ "pdf2image",
40
+ "google-generativeai"
41
+ ]
42
+ [tool.ruff]
43
+ # Exclude a variety of commonly ignored directories.
44
+ exclude = [
45
+ ".bzr",
46
+ ".direnv",
47
+ ".eggs",
48
+ ".git",
49
+ ".git-rewrite",
50
+ ".hg",
51
+ ".ipynb_checkpoints",
52
+ ".mypy_cache",
53
+ ".nox",
54
+ ".pants.d",
55
+ ".pyenv",
56
+ ".pytest_cache",
57
+ ".pytype",
58
+ ".ruff_cache",
59
+ ".svn",
60
+ ".tox",
61
+ ".venv",
62
+ ".vscode",
63
+ "__pypackages__",
64
+ "_build",
65
+ "buck-out",
66
+ "build",
67
+ "dist",
68
+ "node_modules",
69
+ "site-packages",
70
+ "venv",
71
+ ]
72
+
73
+ # Same as Black.
74
+ line-length = 88
75
+ indent-width = 4
76
+
77
+ # Assume Python 3.8
78
+ target-version = "py38"
79
+
80
+ [tool.ruff.lint]
81
+ # Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
82
+ # Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
83
+ # McCabe complexity (`C901`) by default.
84
+ select = ["E4", "E7", "E9", "F"]
85
+ ignore = []
86
+
87
+ # Allow fix for all enabled rules (when `--fix`) is provided.
88
+ fixable = ["ALL"]
89
+ unfixable = []
90
+
91
+ # Allow unused variables when underscore-prefixed.
92
+ dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
93
+
94
+ [tool.ruff.format]
95
+ # Like Black, use double quotes for strings.
96
+ quote-style = "double"
97
+
98
+ # Like Black, indent with spaces, rather than tabs.
99
+ indent-style = "space"
100
+
101
+ # Like Black, respect magic trailing commas.
102
+ skip-magic-trailing-comma = false
103
+
104
+ # Like Black, automatically detect the appropriate line ending.
105
+ line-ending = "auto"
106
+
107
+ # Enable auto-formatting of code examples in docstrings. Markdown,
108
+ # reStructuredText code/literal blocks and doctests are all supported.
109
+ #
110
+ # This is currently disabled by default, but it is planned for this
111
+ # to be opt-out in the future.
112
+ docstring-code-format = false
113
+
114
+ # Set the line length limit used when formatting code snippets in
115
+ # docstrings.
116
+ #
117
+ # This only has an effect when the `docstring-code-format` setting is
118
+ # enabled.
119
+ docstring-code-line-length = "dynamic"
query_vespa.py ADDED
@@ -0,0 +1,193 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import os
4
+ import torch
5
+ from torch.utils.data import DataLoader
6
+ from PIL import Image
7
+ import numpy as np
8
+ from typing import cast
9
+ import asyncio
10
+
11
+ from colpali_engine.models import ColPali, ColPaliProcessor
12
+ from colpali_engine.utils.torch_utils import get_torch_device
13
+ from vespa.application import Vespa
14
+ from vespa.io import VespaQueryResponse
15
+ from dotenv import load_dotenv
16
+ from pathlib import Path
17
+
18
+ MAX_QUERY_TERMS = 64
19
+ SAVEDIR = Path(__file__).parent / "output" / "images"
20
+ load_dotenv()
21
+
22
+
23
+ def process_queries(processor, queries, image):
24
+ inputs = processor(
25
+ images=[image] * len(queries), text=queries, return_tensors="pt", padding=True
26
+ )
27
+ return inputs
28
+
29
+
30
+ def display_query_results(query, response, hits=5):
31
+ query_time = response.json.get("timing", {}).get("searchtime", -1)
32
+ query_time = round(query_time, 2)
33
+ count = response.json.get("root", {}).get("fields", {}).get("totalCount", 0)
34
+ result_text = f"Query text: '{query}', query time {query_time}s, count={count}, top results:\n"
35
+
36
+ for i, hit in enumerate(response.hits[:hits]):
37
+ title = hit["fields"]["title"]
38
+ url = hit["fields"]["url"]
39
+ page = hit["fields"]["page_number"]
40
+ image = hit["fields"]["image"]
41
+ _id = hit["id"]
42
+ score = hit["relevance"]
43
+
44
+ result_text += f"\nPDF Result {i + 1}\n"
45
+ result_text += f"Title: {title}, page {page+1} with score {score:.2f}\n"
46
+ result_text += f"URL: {url}\n"
47
+ result_text += f"ID: {_id}\n"
48
+ # Optionally, save or display the image
49
+ # img_data = base64.b64decode(image)
50
+ # img_path = SAVEDIR / f"{title}.png"
51
+ # with open(f"{img_path}", "wb") as f:
52
+ # f.write(img_data)
53
+ print(result_text)
54
+
55
+
56
+ async def query_vespa_default(app, queries, qs):
57
+ async with app.asyncio(connections=1, total_timeout=120) as session:
58
+ for idx, query in enumerate(queries):
59
+ query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])}
60
+ response: VespaQueryResponse = await session.query(
61
+ yql="select documentid,title,url,image,page_number from pdf_page where userInput(@userQuery)",
62
+ ranking="default",
63
+ userQuery=query,
64
+ timeout=120,
65
+ hits=3,
66
+ body={"input.query(qt)": query_embedding, "presentation.timing": True},
67
+ )
68
+ assert response.is_successful()
69
+ display_query_results(query, response)
70
+
71
+
72
+ async def query_vespa_nearest_neighbor(app, queries, qs):
73
+ # Using nearestNeighbor for retrieval
74
+ target_hits_per_query_tensor = (
75
+ 20 # this is a hyperparameter that can be tuned for speed versus accuracy
76
+ )
77
+ async with app.asyncio(connections=1, total_timeout=180) as session:
78
+ for idx, query in enumerate(queries):
79
+ float_query_embedding = {k: v.tolist() for k, v in enumerate(qs[idx])}
80
+ binary_query_embeddings = dict()
81
+ for k, v in float_query_embedding.items():
82
+ binary_vector = (
83
+ np.packbits(np.where(np.array(v) > 0, 1, 0))
84
+ .astype(np.int8)
85
+ .tolist()
86
+ )
87
+ binary_query_embeddings[k] = binary_vector
88
+ if len(binary_query_embeddings) >= MAX_QUERY_TERMS:
89
+ print(
90
+ f"Warning: Query has more than {MAX_QUERY_TERMS} terms. Truncating."
91
+ )
92
+ break
93
+
94
+ # The mixed tensors used in MaxSim calculations
95
+ # We use both binary and float representations
96
+ query_tensors = {
97
+ "input.query(qtb)": binary_query_embeddings,
98
+ "input.query(qt)": float_query_embedding,
99
+ }
100
+ # The query tensors used in the nearest neighbor calculations
101
+ for i in range(0, len(binary_query_embeddings)):
102
+ query_tensors[f"input.query(rq{i})"] = binary_query_embeddings[i]
103
+ nn = []
104
+ for i in range(0, len(binary_query_embeddings)):
105
+ nn.append(
106
+ f"({{targetHits:{target_hits_per_query_tensor}}}nearestNeighbor(embedding,rq{i}))"
107
+ )
108
+ # We use an OR operator to combine the nearest neighbor operators
109
+ nn = " OR ".join(nn)
110
+ response: VespaQueryResponse = await session.query(
111
+ body={
112
+ **query_tensors,
113
+ "presentation.timing": True,
114
+ "yql": f"select documentid, title, url, image, page_number from pdf_page where {nn}",
115
+ "ranking.profile": "retrieval-and-rerank",
116
+ "timeout": 120,
117
+ "hits": 3,
118
+ },
119
+ )
120
+ assert response.is_successful(), response.json
121
+ display_query_results(query, response)
122
+
123
+
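For illustration, the OR-combined where-clause built in query_vespa_nearest_neighbor looks as follows for just two binary query-token tensors and targetHits=20 (the snippet mirrors the f-string in the function above):

target_hits = 20
clause = " OR ".join(
    f"({{targetHits:{target_hits}}}nearestNeighbor(embedding,rq{i}))" for i in range(2)
)
print(clause)
# ({targetHits:20}nearestNeighbor(embedding,rq0)) OR ({targetHits:20}nearestNeighbor(embedding,rq1))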
124
+ def main():
125
+ vespa_app_url = os.environ.get(
126
+ "VESPA_APP_URL"
127
+ ) # Ensure this is set to your Vespa app URL
128
+ vespa_cloud_secret_token = os.environ.get("VESPA_CLOUD_SECRET_TOKEN")
129
+ if not vespa_app_url or not vespa_cloud_secret_token:
130
+ raise ValueError(
131
+ "Please set the VESPA_APP_URL and VESPA_CLOUD_SECRET_TOKEN environment variables"
132
+ )
133
+ # Instantiate Vespa connection
134
+ app = Vespa(url=vespa_app_url, vespa_cloud_secret_token=vespa_cloud_secret_token)
135
+ status_resp = app.get_application_status()
136
+ if status_resp.status_code != 200:
137
+ print(f"Failed to connect to Vespa at {vespa_app_url}")
138
+ return
139
+ else:
140
+ print(f"Connected to Vespa at {vespa_app_url}")
141
+ # Load the model
142
+ device = get_torch_device("auto")
143
+ print(f"Using device: {device}")
144
+
145
+ model_name = "vidore/colpali-v1.2"
146
+ processor_name = "google/paligemma-3b-mix-448"
147
+
148
+ model = cast(
149
+ ColPali,
150
+ ColPali.from_pretrained(
151
+ model_name,
152
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
153
+ device_map=device,
154
+ ),
155
+ ).eval()
156
+
157
+ processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(processor_name))
158
+
159
+ # Create dummy image
160
+ dummy_image = Image.new("RGB", (448, 448), (255, 255, 255))
161
+
162
+ # Define queries
163
+ queries = [
164
+ "Percentage of non-fresh water as source?",
165
+ "Policies related to nature risk?",
166
+ "How much of produced water is recycled?",
167
+ ]
168
+
169
+ # Obtain query embeddings
170
+ dataloader = DataLoader(
171
+ queries,
172
+ batch_size=1,
173
+ shuffle=False,
174
+ collate_fn=lambda x: process_queries(processor, x, dummy_image),
175
+ )
176
+ qs = []
177
+ for batch_query in dataloader:
178
+ with torch.no_grad():
179
+ batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
180
+ embeddings_query = model(**batch_query)
181
+ qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))
182
+
183
+ # Perform queries using default rank profile
184
+ print("Performing queries using default rank profile:")
185
+ asyncio.run(query_vespa_default(app, queries, qs))
186
+
187
+ # Perform queries using nearestNeighbor
188
+ print("Performing queries using nearestNeighbor:")
189
+ asyncio.run(query_vespa_nearest_neighbor(app, queries, qs))
190
+
191
+
192
+ if __name__ == "__main__":
193
+ main()
requirements.txt ADDED
@@ -0,0 +1,540 @@
1
+ # This file was autogenerated by uv via the following command:
2
+ # uv pip compile pyproject.toml -o src/requirements.txt
3
+ accelerate==0.34.2
4
+ # via peft
5
+ aiohappyeyeballs==2.4.3
6
+ # via aiohttp
7
+ aiohttp==3.10.11
8
+ # via
9
+ # datasets
10
+ # fsspec
11
+ # pyvespa
12
+ aiosignal==1.3.1
13
+ # via aiohttp
14
+ annotated-types==0.7.0
15
+ # via pydantic
16
+ anyio==4.6.0
17
+ # via
18
+ # httpx
19
+ # starlette
20
+ # watchfiles
21
+ async-timeout==4.0.3
22
+ # via aiohttp
23
+ attrs==24.2.0
24
+ # via aiohttp
25
+ beautifulsoup4==4.12.3
26
+ # via python-fasthtml
27
+ blis==0.7.11
28
+ # via thinc
29
+ cachetools==5.5.0
30
+ # via google-auth
31
+ catalogue==2.0.10
32
+ # via
33
+ # spacy
34
+ # srsly
35
+ # thinc
36
+ certifi==2024.8.30
37
+ # via
38
+ # httpcore
39
+ # httpx
40
+ # requests
41
+ cffi==1.17.1
42
+ # via cryptography
43
+ charset-normalizer==3.3.2
44
+ # via requests
45
+ click==8.1.7
46
+ # via
47
+ # typer
48
+ # uvicorn
49
+ cloudpathlib==0.20.0
50
+ # via weasel
51
+ colpali-engine==0.3.1
52
+ # via
53
+ # visual-retrieval-colpali (pyproject.toml)
54
+ # vidore-benchmark
55
+ confection==0.1.5
56
+ # via
57
+ # thinc
58
+ # weasel
59
+ contourpy==1.3.0
60
+ # via matplotlib
61
+ cryptography==43.0.1
62
+ # via pyvespa
63
+ cycler==0.12.1
64
+ # via matplotlib
65
+ cymem==2.0.8
66
+ # via
67
+ # preshed
68
+ # spacy
69
+ # thinc
70
+ datasets==2.21.0
71
+ # via
72
+ # mteb
73
+ # vidore-benchmark
74
+ dill==0.3.8
75
+ # via
76
+ # datasets
77
+ # multiprocess
78
+ docker==7.1.0
79
+ # via pyvespa
80
+ einops==0.8.0
81
+ # via
82
+ # visual-retrieval-colpali (pyproject.toml)
83
+ # vidore-benchmark
84
+ eval-type-backport==0.2.0
85
+ # via mteb
86
+ exceptiongroup==1.2.2
87
+ # via anyio
88
+ fastcore==1.7.11
89
+ # via
90
+ # fastlite
91
+ # python-fasthtml
92
+ # pyvespa
93
+ # sqlite-minutils
94
+ fastlite==0.0.11
95
+ # via python-fasthtml
96
+ filelock==3.16.1
97
+ # via
98
+ # datasets
99
+ # huggingface-hub
100
+ # torch
101
+ # transformers
102
+ fonttools==4.54.1
103
+ # via matplotlib
104
+ frozenlist==1.4.1
105
+ # via
106
+ # aiohttp
107
+ # aiosignal
108
+ fsspec==2024.6.1
109
+ # via
110
+ # datasets
111
+ # huggingface-hub
112
+ # torch
113
+ google-ai-generativelanguage==0.6.10
114
+ # via google-generativeai
115
+ google-api-core==2.21.0
116
+ # via
117
+ # google-ai-generativelanguage
118
+ # google-api-python-client
119
+ # google-generativeai
120
+ google-api-python-client==2.149.0
121
+ # via google-generativeai
122
+ google-auth==2.35.0
123
+ # via
124
+ # google-ai-generativelanguage
125
+ # google-api-core
126
+ # google-api-python-client
127
+ # google-auth-httplib2
128
+ # google-generativeai
129
+ google-auth-httplib2==0.2.0
130
+ # via google-api-python-client
131
+ google-generativeai==0.8.3
132
+ # via visual-retrieval-colpali (pyproject.toml)
133
+ googleapis-common-protos==1.65.0
134
+ # via
135
+ # google-api-core
136
+ # grpcio-status
137
+ gputil==1.4.0
138
+ # via
139
+ # colpali-engine
140
+ # vidore-benchmark
141
+ grpcio==1.67.0
142
+ # via
143
+ # google-api-core
144
+ # grpcio-status
145
+ grpcio-status==1.67.0
146
+ # via google-api-core
147
+ h11==0.14.0
148
+ # via
149
+ # httpcore
150
+ # uvicorn
151
+ h2==4.1.0
152
+ # via httpx
153
+ hpack==4.0.0
154
+ # via h2
155
+ httpcore==1.0.6
156
+ # via httpx
157
+ httplib2==0.22.0
158
+ # via
159
+ # google-api-python-client
160
+ # google-auth-httplib2
161
+ httptools==0.6.1
162
+ # via uvicorn
163
+ httpx==0.27.2
164
+ # via
165
+ # python-fasthtml
166
+ # pyvespa
167
+ huggingface-hub==0.25.1
168
+ # via
169
+ # visual-retrieval-colpali (pyproject.toml)
170
+ # accelerate
171
+ # datasets
172
+ # peft
173
+ # sentence-transformers
174
+ # tokenizers
175
+ # transformers
176
+ hyperframe==6.0.1
177
+ # via h2
178
+ idna==3.10
179
+ # via
180
+ # anyio
181
+ # httpx
182
+ # requests
183
+ # yarl
184
+ itsdangerous==2.2.0
185
+ # via python-fasthtml
186
+ jinja2==3.1.5
187
+ # via
188
+ # pyvespa
189
+ # spacy
190
+ # torch
191
+ joblib==1.4.2
192
+ # via scikit-learn
193
+ kiwisolver==1.4.7
194
+ # via matplotlib
195
+ langcodes==3.4.1
196
+ # via spacy
197
+ language-data==1.2.0
198
+ # via langcodes
199
+ loguru==0.7.2
200
+ # via vidore-benchmark
201
+ lucide-fasthtml==0.0.9
202
+ # via shad4fast
203
+ lxml==5.3.0
204
+ # via
205
+ # lucide-fasthtml
206
+ # pyvespa
207
+ marisa-trie==1.2.1
208
+ # via language-data
209
+ markdown-it-py==3.0.0
210
+ # via rich
211
+ markupsafe==2.1.5
212
+ # via jinja2
213
+ matplotlib==3.9.2
214
+ # via
215
+ # seaborn
216
+ # vidore-benchmark
217
+ mdurl==0.1.2
218
+ # via markdown-it-py
219
+ mpmath==1.3.0
220
+ # via sympy
221
+ mteb==1.15.3
222
+ # via vidore-benchmark
223
+ multidict==6.1.0
224
+ # via
225
+ # aiohttp
226
+ # yarl
227
+ multiprocess==0.70.16
228
+ # via datasets
229
+ murmurhash==1.0.10
230
+ # via
231
+ # preshed
232
+ # spacy
233
+ # thinc
234
+ networkx==3.3
235
+ # via torch
236
+ numpy==1.26.4
237
+ # via
238
+ # accelerate
239
+ # blis
240
+ # colpali-engine
241
+ # contourpy
242
+ # datasets
243
+ # matplotlib
244
+ # mteb
245
+ # pandas
246
+ # peft
247
+ # pyarrow
248
+ # scikit-learn
249
+ # scipy
250
+ # seaborn
251
+ # spacy
252
+ # thinc
253
+ # transformers
254
+ # vidore-benchmark
255
+ oauthlib==3.2.2
256
+ # via python-fasthtml
257
+ packaging==24.1
258
+ # via
259
+ # accelerate
260
+ # datasets
261
+ # fastcore
262
+ # huggingface-hub
263
+ # matplotlib
264
+ # peft
265
+ # spacy
266
+ # thinc
267
+ # transformers
268
+ # weasel
269
+ pandas==2.2.3
270
+ # via
271
+ # datasets
272
+ # seaborn
273
+ pdf2image==1.17.0
274
+ # via vidore-benchmark
275
+ peft==0.11.1
276
+ # via
277
+ # colpali-engine
278
+ # vidore-benchmark
279
+ pillow==10.4.0
280
+ # via
281
+ # colpali-engine
282
+ # matplotlib
283
+ # pdf2image
284
+ # sentence-transformers
285
+ # vidore-benchmark
286
+ pip==24.3.1
287
+ # via visual-retrieval-colpali (pyproject.toml)
288
+ polars==1.9.0
289
+ # via mteb
290
+ preshed==3.0.9
291
+ # via
292
+ # spacy
293
+ # thinc
294
+ proto-plus==1.24.0
295
+ # via
296
+ # google-ai-generativelanguage
297
+ # google-api-core
298
+ protobuf==5.28.3
299
+ # via
300
+ # google-ai-generativelanguage
301
+ # google-api-core
302
+ # google-generativeai
303
+ # googleapis-common-protos
304
+ # grpcio-status
305
+ # proto-plus
306
+ psutil==6.0.0
307
+ # via
308
+ # accelerate
309
+ # peft
310
+ pyarrow==17.0.0
311
+ # via datasets
312
+ pyasn1==0.6.1
313
+ # via
314
+ # pyasn1-modules
315
+ # rsa
316
+ pyasn1-modules==0.4.1
317
+ # via google-auth
318
+ pycparser==2.22
319
+ # via cffi
320
+ pydantic==2.9.2
321
+ # via
322
+ # confection
323
+ # google-generativeai
324
+ # mteb
325
+ # spacy
326
+ # thinc
327
+ # weasel
328
+ pydantic-core==2.23.4
329
+ # via pydantic
330
+ pygments==2.18.0
331
+ # via rich
332
+ pyparsing==3.1.4
333
+ # via
334
+ # httplib2
335
+ # matplotlib
336
+ pypdf==5.0.1
337
+ # via visual-retrieval-colpali (pyproject.toml)
338
+ python-dateutil==2.9.0.post0
339
+ # via
340
+ # matplotlib
341
+ # pandas
342
+ # python-fasthtml
343
+ # pyvespa
344
+ python-dotenv==1.0.1
345
+ # via
346
+ # visual-retrieval-colpali (pyproject.toml)
347
+ # uvicorn
348
+ # vidore-benchmark
349
+ python-fasthtml==0.6.9
350
+ # via
351
+ # visual-retrieval-colpali (pyproject.toml)
352
+ # lucide-fasthtml
353
+ # shad4fast
354
+ python-multipart==0.0.18
355
+ # via python-fasthtml
356
+ pytrec-eval-terrier==0.5.6
357
+ # via mteb
358
+ pytz==2024.2
359
+ # via pandas
360
+ pyvespa==0.50.0
361
+ # via visual-retrieval-colpali (pyproject.toml)
362
+ pyyaml==6.0.2
363
+ # via
364
+ # accelerate
365
+ # datasets
366
+ # huggingface-hub
367
+ # peft
368
+ # transformers
369
+ # uvicorn
370
+ regex==2024.9.11
371
+ # via transformers
372
+ requests==2.32.3
373
+ # via
374
+ # colpali-engine
375
+ # datasets
376
+ # docker
377
+ # google-api-core
378
+ # huggingface-hub
379
+ # lucide-fasthtml
380
+ # mteb
381
+ # pyvespa
382
+ # requests-toolbelt
383
+ # spacy
384
+ # transformers
385
+ # weasel
386
+ requests-toolbelt==1.0.0
387
+ # via pyvespa
388
+ rich==13.9.2
389
+ # via
390
+ # mteb
391
+ # typer
392
+ rsa==4.9
393
+ # via google-auth
394
+ safetensors==0.4.5
395
+ # via
396
+ # accelerate
397
+ # peft
398
+ # transformers
399
+ scikit-learn==1.5.2
400
+ # via
401
+ # mteb
402
+ # sentence-transformers
403
+ scipy==1.14.1
404
+ # via
405
+ # mteb
406
+ # scikit-learn
407
+ # sentence-transformers
408
+ seaborn==0.13.2
409
+ # via vidore-benchmark
410
+ sentence-transformers==3.1.1
411
+ # via
412
+ # mteb
413
+ # vidore-benchmark
414
+ sentencepiece==0.2.0
415
+ # via vidore-benchmark
416
+ setuptools==75.1.0
417
+ # via
418
+ # visual-retrieval-colpali (pyproject.toml)
419
+ # marisa-trie
420
+ # spacy
421
+ # thinc
422
+ shad4fast==1.2.1
423
+ # via visual-retrieval-colpali (pyproject.toml)
424
+ shellingham==1.5.4
425
+ # via typer
426
+ six==1.16.0
427
+ # via python-dateutil
428
+ smart-open==7.0.5
429
+ # via weasel
430
+ sniffio==1.3.1
431
+ # via
432
+ # anyio
433
+ # httpx
434
+ soupsieve==2.6
435
+ # via beautifulsoup4
436
+ spacy==3.7.5
437
+ # via visual-retrieval-colpali (pyproject.toml)
438
+ spacy-legacy==3.0.12
439
+ # via spacy
440
+ spacy-loggers==1.0.5
441
+ # via spacy
442
+ sqlite-minutils==3.37.0.post3
443
+ # via fastlite
444
+ srsly==2.4.8
445
+ # via
446
+ # confection
447
+ # spacy
448
+ # thinc
449
+ # weasel
450
+ starlette==0.39.2
451
+ # via python-fasthtml
452
+ sympy==1.13.3
453
+ # via torch
454
+ tenacity==9.0.0
455
+ # via pyvespa
456
+ thinc==8.2.5
457
+ # via spacy
458
+ threadpoolctl==3.5.0
459
+ # via scikit-learn
460
+ tokenizers==0.20.0
461
+ # via transformers
462
+ torch==2.4.1
463
+ # via
464
+ # visual-retrieval-colpali (pyproject.toml)
465
+ # accelerate
466
+ # colpali-engine
467
+ # mteb
468
+ # peft
469
+ # sentence-transformers
470
+ # vidore-benchmark
471
+ tqdm==4.66.5
472
+ # via
473
+ # datasets
474
+ # google-generativeai
475
+ # huggingface-hub
476
+ # mteb
477
+ # peft
478
+ # sentence-transformers
479
+ # spacy
480
+ # transformers
481
+ transformers==4.45.1
482
+ # via
483
+ # colpali-engine
484
+ # peft
485
+ # sentence-transformers
486
+ # vidore-benchmark
487
+ typer==0.12.5
488
+ # via
489
+ # spacy
490
+ # vidore-benchmark
491
+ # weasel
492
+ typing-extensions==4.12.2
493
+ # via
494
+ # anyio
495
+ # cloudpathlib
496
+ # google-generativeai
497
+ # huggingface-hub
498
+ # mteb
499
+ # multidict
500
+ # pydantic
501
+ # pydantic-core
502
+ # pypdf
503
+ # pyvespa
504
+ # rich
505
+ # torch
506
+ # typer
507
+ # uvicorn
508
+ tzdata==2024.2
509
+ # via pandas
510
+ uritemplate==4.1.1
511
+ # via google-api-python-client
512
+ urllib3==2.2.3
513
+ # via
514
+ # docker
515
+ # requests
516
+ uvicorn==0.31.0
517
+ # via python-fasthtml
518
+ uvloop==0.20.0
519
+ # via uvicorn
520
+ vespacli==8.391.23
521
+ # via visual-retrieval-colpali (pyproject.toml)
522
+ vidore-benchmark==4.0.0
523
+ # via visual-retrieval-colpali (pyproject.toml)
524
+ wasabi==1.1.3
525
+ # via
526
+ # spacy
527
+ # thinc
528
+ # weasel
529
+ watchfiles==0.24.0
530
+ # via uvicorn
531
+ weasel==0.4.1
532
+ # via spacy
533
+ websockets==13.1
534
+ # via uvicorn
535
+ wrapt==1.16.0
536
+ # via smart-open
537
+ xxhash==3.5.0
538
+ # via datasets
539
+ yarl==1.13.1
540
+ # via aiohttp
ruff.toml ADDED
@@ -0,0 +1,77 @@
1
+ # Exclude a variety of commonly ignored directories.
2
+ exclude = [
3
+ ".bzr",
4
+ ".direnv",
5
+ ".eggs",
6
+ ".git",
7
+ ".git-rewrite",
8
+ ".hg",
9
+ ".ipynb_checkpoints",
10
+ ".mypy_cache",
11
+ ".nox",
12
+ ".pants.d",
13
+ ".pyenv",
14
+ ".pytest_cache",
15
+ ".pytype",
16
+ ".ruff_cache",
17
+ ".svn",
18
+ ".tox",
19
+ ".venv",
20
+ ".vscode",
21
+ "__pypackages__",
22
+ "_build",
23
+ "buck-out",
24
+ "build",
25
+ "dist",
26
+ "node_modules",
27
+ "site-packages",
28
+ "venv",
29
+ ]
30
+
31
+ # Same as Black.
32
+ line-length = 88
33
+ indent-width = 4
34
+
35
+ # Assume Python 3.8
36
+ target-version = "py38"
37
+
38
+ [lint]
39
+ # Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
40
+ # Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
41
+ # McCabe complexity (`C901`) by default.
42
+ select = ["E4", "E7", "E9", "F"]
43
+ ignore = []
44
+
45
+ # Allow fix for all enabled rules (when `--fix`) is provided.
46
+ fixable = ["ALL"]
47
+ unfixable = []
48
+
49
+ # Allow unused variables when underscore-prefixed.
50
+ dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
51
+
52
+ [format]
53
+ # Like Black, use double quotes for strings.
54
+ quote-style = "double"
55
+
56
+ # Like Black, indent with spaces, rather than tabs.
57
+ indent-style = "space"
58
+
59
+ # Like Black, respect magic trailing commas.
60
+ skip-magic-trailing-comma = false
61
+
62
+ # Like Black, automatically detect the appropriate line ending.
63
+ line-ending = "auto"
64
+
65
+ # Enable auto-formatting of code examples in docstrings. Markdown,
66
+ # reStructuredText code/literal blocks and doctests are all supported.
67
+ #
68
+ # This is currently disabled by default, but it is planned for this
69
+ # to be opt-out in the future.
70
+ docstring-code-format = false
71
+
72
+ # Set the line length limit used when formatting code snippets in
73
+ # docstrings.
74
+ #
75
+ # This only has an effect when the `docstring-code-format` setting is
76
+ # enabled.
77
+ docstring-code-line-length = "dynamic"
setup.py ADDED
@@ -0,0 +1,104 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick setup script for ColPali-Vespa Visual Retrieval System
4
+ """
5
+
6
+ import os
7
+ import sys
8
+ from pathlib import Path
9
+
10
+
11
+ def create_env_file():
12
+ """Create a sample .env file if it doesn't exist"""
13
+ env_path = Path(".env")
14
+ if env_path.exists():
15
+ print("βœ… .env file already exists")
16
+ return
17
+
18
+ env_content = """# Vespa Configuration
19
+ # Choose one authentication method:
20
+
21
+ # Option 1: Token Authentication (Recommended)
22
+ VESPA_APP_TOKEN_URL=https://your-app.your-tenant.vespa-cloud.com
23
+ VESPA_CLOUD_SECRET_TOKEN=your_vespa_secret_token_here
24
+
25
+ # Option 2: mTLS Authentication
26
+ # USE_MTLS=true
27
+ # VESPA_APP_MTLS_URL=https://your-app.your-tenant.vespa-cloud.com
28
+ # VESPA_CLOUD_MTLS_KEY="-----BEGIN PRIVATE KEY-----
29
+ # Your private key content here
30
+ # -----END PRIVATE KEY-----"
31
+ # VESPA_CLOUD_MTLS_CERT="-----BEGIN CERTIFICATE-----
32
+ # Your certificate content here
33
+ # -----END CERTIFICATE-----"
34
+
35
+ # Google Gemini Configuration (Optional - for AI chat features)
36
+ GEMINI_API_KEY=your_gemini_api_key_here
37
+
38
+ # Application Configuration
39
+ LOG_LEVEL=INFO
40
+ HOT_RELOAD=false
41
+
42
+ # Development Configuration
43
+ # Uncomment for development mode
44
+ # HOT_RELOAD=true
45
+ # LOG_LEVEL=DEBUG
46
+ """
47
+
48
+ with open(env_path, "w") as f:
49
+ f.write(env_content)
50
+
51
+ print("βœ… Created .env file with sample configuration")
52
+ print(" Please edit .env with your actual credentials")
53
+
54
+
55
+ def create_directories():
56
+ """Create necessary directories"""
57
+ directories = ["static", "static/full_images", "static/sim_maps"]
58
+
59
+ for directory in directories:
60
+ Path(directory).mkdir(parents=True, exist_ok=True)
61
+
62
+ print("βœ… Created necessary directories")
63
+
64
+
65
+ def check_python_version():
66
+ """Check if Python version is compatible"""
67
+ version = sys.version_info
68
+ if version.major != 3 or version.minor < 10 or version.minor >= 13:
69
+ print("❌ Python 3.10, 3.11, or 3.12 is required")
70
+ print(f" Current version: {version.major}.{version.minor}.{version.micro}")
71
+ return False
72
+
73
+ print(
74
+ f"βœ… Python version {version.major}.{version.minor}.{version.micro} is compatible"
75
+ )
76
+ return True
77
+
78
+
79
+ def main():
80
+ """Main setup function"""
81
+ print("πŸš€ ColPali-Vespa Visual Retrieval Setup")
82
+ print("=" * 40)
83
+
84
+ # Check Python version
85
+ if not check_python_version():
86
+ sys.exit(1)
87
+
88
+ # Create directories
89
+ create_directories()
90
+
91
+ # Create .env file
92
+ create_env_file()
93
+
94
+ print("\nπŸ“‹ Next steps:")
95
+ print("1. Edit .env file with your Vespa and Gemini credentials")
96
+ print("2. Install dependencies: pip install -e .")
97
+ print("3. Deploy Vespa application: python deploy_vespa_app.py ...")
98
+ print("4. Upload documents: python feed_vespa.py ...")
99
+ print("5. Run the application: python main.py")
100
+ print("\nπŸ“– See README.md for detailed instructions")
101
+
102
+
103
+ if __name__ == "__main__":
104
+ main()