# ImageRAG

<font size=4><div align='center'> [[πŸ€— Checkpoint, cache and data](https://huggingface.co/omlab/ImageRAG)] [[πŸ“„ Paper](https://ieeexplore.ieee.org/document/11039502)] [[πŸ“ Blog (in Mandarin)](https://mp.weixin.qq.com/s/BcFejPAcxh4Rx_rh1JvRJA)]</div></font>

## ✨ Highlight

Ultrahigh resolution (UHR) remote sensing imagery (RSI) (e.g., 10,000 Γ— 10,000 pixels) poses a significant challenge for current remote sensing vision-language models (RSVLMs). Resizing a UHR image down to the standard input size discards the extensive spatial and contextual information such images contain; processing it at its original size, on the other hand, typically exceeds the token limits of standard RSVLMs, making it difficult to handle the entire image and capture the long-range dependencies needed to answer a query from the abundant visual context.

* Three crucial aspects for MLLMs to handle UHR RSI effectively are:

    * Managing small targets, ensuring that the model can accurately perceive and analyze fine details within images.

    * Processing the UHR image in a way that integrates with MLLMs without significantly increasing the number of image tokens, which would lead to high computational costs.

    * Achieving these goals while minimizing the need for additional training or specialized annotation.

* We contribute the ImageRAG framework, which offers several key advantages:

    * It retrieves and emphasizes relevant visual context from the UHR image based on the text query, allowing the MLLM to focus on important details, even tiny ones.

    * It integrates various external knowledge sources (stored in vector databases) to guide the model, enhancing its understanding of both the query and the UHR RSI.

    * ImageRAG requires only a small amount of training, making it a practical solution for efficiently handling UHR RSI.

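The retrieval idea described above can be sketched in a few lines: tile the UHR image into grid patches, score each patch against the text query, and keep the top-k. The sketch below is a toy illustration only; in ImageRAG the scores come from CLIP-style image/text encoders, and the 448-pixel patch size here is an assumption, not the framework's actual setting.

```python
from typing import List, Tuple

def patch_grid(width: int, height: int, patch: int) -> List[Tuple[int, int, int, int]]:
    """Tile a width x height image into non-overlapping patch x patch boxes."""
    boxes = []
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            boxes.append((x, y, min(x + patch, width), min(y + patch, height)))
    return boxes

def top_k_patches(scores, boxes, k=3):
    """Rank patch boxes by query-patch similarity score, highest first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]

# A 10,000 x 10,000 UHR image tiled into 448 px patches
boxes = patch_grid(10_000, 10_000, 448)
print(len(boxes))  # 23 x 23 = 529 patches
```

Only the k highest-scoring boxes are cropped and passed to the MLLM, which is what keeps the image-token count small.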
## πŸš€ Update
πŸ”₯πŸ”₯πŸ”₯ Last Updated on 2025.07.10 πŸ”₯πŸ”₯πŸ”₯

* **TODO**: Validate the codebase with the uploaded data in an isolated environment.

* **2025.07.10**: Upload checkpoint, cache and dataset.

* **2025.06.25**: Upload codebase and scripts.

* **2025.05.24**: ImageRAG is accepted by IEEE Geoscience and Remote Sensing Magazine πŸŽ‰
    * IEEE Early Access (we prefer this version): https://ieeexplore.ieee.org/document/11039502

    * Arxiv: https://arxiv.org/abs/2411.07688

## πŸ“– Setup Codebase and Data

* Clone this repo:
```bash
git clone https://github.com/om-ai-lab/ImageRAG.git
```

* Download the data, caches and checkpoints for ImageRAG from Hugging Face:
    * https://huggingface.co/omlab/ImageRAG

* Use the [hf mirror](https://hf-mirror.com/) if you encounter connection problems:
```bash
./hfd.sh omlab/ImageRAG --local-dir ImageRAG_hf
```
* Merge the two repos:
```bash
mv ImageRAG_hf/cache ImageRAG_hf/checkpoint ImageRAG_hf/data ImageRAG/
```
* Unzip all zip files:
    * cache/patch/mmerealworld.zip
    * cache/vector_database/crsd_vector_database.zip
    * cache/vector_database/lrsd_vector_database.zip

* The ImageRAG directory structure should look like this:
```bash
/training/zilun/ImageRAG

β”œβ”€β”€ codebase
    β”œβ”€β”€ inference
    β”œβ”€β”€ patchify
    β”œβ”€β”€ main_inference_mmerealworld_imagerag_preextract.py
    ......
β”œβ”€β”€ config
    β”œβ”€β”€ config_mmerealworld-baseline-zoom4kvqa10k2epoch_server.yaml
    β”œβ”€β”€ config_mmerealworld-detectiongt-zoom4kvqa10k2epoch_server.yaml
    ......
β”œβ”€β”€ data
    β”œβ”€β”€ dataset
        β”œβ”€β”€ MME-RealWorld
            β”œβ”€β”€ remote_sensing
                β”œβ”€β”€ remote_sensing
                    β”œβ”€β”€ 03553_Toronto.png
                    ......
        β”œβ”€β”€ crsd_clip_3M.pkl
        ......
β”œβ”€β”€ cache
    β”œβ”€β”€ patch
        β”œβ”€β”€ mmerealworld
            β”œβ”€β”€ vit
            β”œβ”€β”€ cc
            β”œβ”€β”€ grid
    β”œβ”€β”€ vector_database
        β”œβ”€β”€ crsd_vector_database
        β”œβ”€β”€ lrsd_vector_database
β”œβ”€β”€ checkpoint
    β”œβ”€β”€ InternVL2_5-8B_lora32_vqa10k_zoom4k_2epoch_merged
    ......
β”œβ”€β”€ script
    β”œβ”€β”€ clip_cc.sh
    ......
```

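After merging and unzipping, a quick sanity check that the expected top-level layout is in place can be sketched as follows (directory names are taken from the tree above; the root path is the example path used in this README and may differ on your machine):

```python
import os

def missing_dirs(root, required=("codebase", "config", "data", "cache", "checkpoint", "script")):
    """Return the required top-level directories that are absent under root."""
    return [d for d in required if not os.path.isdir(os.path.join(root, d))]

# After setup this should be an empty list:
# print(missing_dirs("/training/zilun/ImageRAG"))
```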
## πŸ“– Setup Env

```bash
conda create -n imagerag python=3.10
conda activate imagerag
cd /training/zilun/ImageRAG
export PYTHONPATH=$PYTHONPATH:/training/zilun/ImageRAG
# Install torch, torchvision and flash attention according to your CUDA version
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install ninja
MAX_JOBS=16 pip install flash-attn --no-build-isolation
```

```bash
pip install -r requirement.txt
```

```bash
# Download NLTK stopwords and the spaCy English model
python -c "import nltk; nltk.download('stopwords')"
python -m spacy download en_core_web_sm
```

+ ## πŸ“– Setup SGLang (Docker)
136
+ * Host Qwen2.5-32B-Instruct using SGLang for text parsing module
137
+
138
+ ```bash
139
+ # Pull sglang docker (we use mirror just for speeding up the download process)
140
+ docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/lmsysorg/sglang:latest
141
+ docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/lmsysorg/sglang:latest docker.io/lmsysorg/sglang:latest
142
+
143
+ # docker load -i sglang.tar
144
+
145
+ bash script/sglang_start.sh
146
+ ```
147
+
148
+
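SGLang exposes an OpenAI-compatible API, so a request to the hosted Qwen2.5-32B-Instruct for query parsing might look like the sketch below. The port (SGLang's default, 30000), the system prompt, and the model name are assumptions here; `script/sglang_start.sh` may use different settings.

```python
import json
import urllib.request

def build_chat_request(query: str, host: str = "http://localhost:30000"):
    """Build an OpenAI-compatible chat-completions request for the text parsing step."""
    payload = {
        "model": "Qwen/Qwen2.5-32B-Instruct",
        "messages": [
            {"role": "system", "content": "Extract the key phrases from the remote sensing query."},
            {"role": "user", "content": query},
        ],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# req = build_chat_request("How many airplanes are parked near the terminal?")
# resp = urllib.request.urlopen(req)  # requires the SGLang server to be running
```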
## πŸ“– Feature Extraction (Optional, uses Ray to parallelize the process)
* Necessary if you need to run ImageRAG on customized data.

```bash
ray start --head --port=6379
# Extract patch features (e.g. MMERealworld-RS)
python codebase/ray_feat_extract_patch.py --ray_mode auto --num_runner 8
# Extract image & text features for the vector database (Section V-D, external data)
python codebase/ray_feat_extract_vectorstore.py --ray_mode auto --num_runner 8
# ray stop (optional)
```

+ ## πŸ“– Inference
162
+
163
+ * See imagerag_result directory for result examples.
164
+
165
+ ### Run Baseline Inference (No ImageRAG, No GT, No Inference while detecting)
166
+ ```bash
167
+ # inference
168
+ CUDA_VISIBLE_DEVICES=0 python codebase/main_inference_mmerealworld_imagerag_preextract.py --cfg_path config/config_mmerealworld-baseline-zoom4kvqa10k2epoch_server.yaml
169
+
170
+ # eval inference result
171
+ python codebase/inference/MME-RealWorld-RS/eval_your_results.py --results_file data/eval/mmerealworld_zoom4kvqa10k2epoch_baseline.jsonl
172
+ ```
173
+
174
+ ### Run Regular VQA Task Inference in Parallel
175
+
176
+ ```bash
177
+ # {clip, georsclip, remoteclip, mcipclip} x {vit, cc, grid} x {rerank, mean, cluster} x {0, ... ,7}
178
+ bash script/georsclip_grid.sh rerank 0
179
+ ```
180
+
181
+ ### Run Inferring VQA Task Inference (No ImageRAG, No Inference while detecting, BBoxes are needed)
182
+ ```bash
183
+ # inference
184
+ CUDA_VISIBLE_DEVICES=0 python codebase/main_inference_mmerealworld_imagerag_preextract.py --cfg_path config/config_mmerealworld-detectiongt-zoom4kvqa10k2epoch_server.yaml
185
+
186
+ # eval inference result
187
+ python codebase/inference/MME-RealWorld-RS/eval_your_results.py --results_file data/eval/mmerealworld_zoom4kvqa10k2epoch_baseline.jsonl
188
+ ```
189
+
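`eval_your_results.py` scores the answers recorded in the result `.jsonl`; for a multiple-choice benchmark like MME-RealWorld-RS the core computation amounts to something like this sketch (the `prediction`/`answer` field names are assumptions, not the script's actual schema):

```python
def accuracy(records):
    """Fraction of records whose predicted choice matches the ground-truth letter."""
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if r["prediction"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)

# e.g. accuracy([{"prediction": "A", "answer": "A"},
#                {"prediction": "b", "answer": "C"}]) -> 0.5
```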
## πŸ‘¨β€πŸ« Contact

+ ## πŸ–ŠοΈ Citation
194
+ ```bash
195
+ @ARTICLE{11039502,
196
+ author={Zhang, Zilun and Shen, Haozhan and Zhao, Tiancheng and Guan, Zian and Chen, Bin and Wang, Yuhao and Jia, Xu and Cai, Yuxiang and Shang, Yongheng and Yin, Jianwei},
197
+ journal={IEEE Geoscience and Remote Sensing Magazine},
198
+ title={Enhancing Ultrahigh Resolution Remote Sensing Imagery Analysis With ImageRAG: A new framework},
199
+ year={2025},
200
+ volume={},
201
+ number={},
202
+ pages={2-27},
203
+ keywords={Image resolution;Visualization;Benchmark testing;Training;Image color analysis;Standards;Remote sensing;Analytical models;Accuracy;Vehicle dynamics},
204
+ doi={10.1109/MGRS.2025.3574742}}
205
+
206
+ ```