# ImageRAG

<font size=4><div align='center'> [[🤗 Checkpoint, cache and data](https://huggingface.co/omlab/ImageRAG)] [[📄 Paper](https://ieeexplore.ieee.org/document/11039502)] [[📝 Blog (in Mandarin)](https://mp.weixin.qq.com/s/BcFejPAcxh4Rx_rh1JvRJA)]</div></font>

## ✨ Highlight

Ultrahigh resolution (UHR) remote sensing imagery (RSI) (e.g., 10,000 × 10,000 pixels) poses a significant challenge for current remote sensing vision-language models (RSVLMs). If the UHR image is resized to the standard input size, the extensive spatial and contextual information it contains is neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSVLMs, making it difficult to process the entire image and capture the long-range dependencies needed to answer a query based on the abundant visual context.

* Three crucial aspects for MLLMs to effectively handle UHR RSI are:

  * Managing small targets, ensuring that the model can accurately perceive and analyze fine details within images.

  * Processing the UHR image in a way that integrates with MLLMs without significantly increasing the number of image tokens, which would lead to high computational costs.

  * Achieving these goals while minimizing the need for additional training or specialized annotation.

* We contribute the ImageRAG framework, which offers several key advantages (a retrieval sketch follows this list):

  * It retrieves and emphasizes relevant visual context from the UHR image based on the text query, allowing the MLLM to focus on important details, even tiny ones.

  * It integrates various external knowledge sources (stored in vector databases) to guide the model, enhancing its understanding of both the query and the UHR RSI.

  * ImageRAG requires only a small amount of training, making it a practical solution for efficiently handling UHR RSI.

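To make the retrieval idea concrete, here is a minimal sketch of query-guided patch retrieval. It is illustrative only: the ViT-B-32 CLIP encoder (assuming `pip install open_clip_torch`), the 448-pixel grid, and top-3 selection are assumptions for the example, not the exact ImageRAG pipeline, which also supports GeoRSCLIP/RemoteCLIP features and rerank/mean/cluster selection (see the scripts below).

```python
# Illustrative sketch of query-guided patch retrieval, not the repo's actual API.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def grid_patches(image, patch=448):
    """Cut a UHR image into non-overlapping grid patches."""
    w, h = image.size
    return [image.crop((x, y, min(x + patch, w), min(y + patch, h)))
            for y in range(0, h, patch) for x in range(0, w, patch)]

@torch.no_grad()
def retrieve(image_path, query, k=3):
    """Return the k patches most similar to the text query."""
    patches = grid_patches(Image.open(image_path).convert("RGB"))
    feats = model.encode_image(torch.stack([preprocess(p) for p in patches]))
    text = model.encode_text(tokenizer([query]))
    sims = torch.nn.functional.cosine_similarity(feats, text)
    return [patches[i] for i in sims.topk(min(k, len(patches))).indices]
```

The retrieved patches are then passed to the MLLM together with the query, so the model attends to relevant fine detail instead of a heavily downsampled global view.
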
## 🚀 Update

🔥🔥🔥 Last Updated on 2025.07.10 🔥🔥🔥

* **TODO**: Validate the codebase with the uploaded data in an isolated environment.

* **2025.07.10**: Upload checkpoint, cache, and dataset.

* **2025.06.25**: Upload codebase and scripts.

* **2025.05.24**: ImageRAG is accepted by IEEE Geoscience and Remote Sensing Magazine 🎉
  * IEEE Early Access (we prefer this version): https://ieeexplore.ieee.org/document/11039502
  * arXiv: https://arxiv.org/abs/2411.07688

## 🚀 Setup Codebase and Data

* Clone this repo:
```bash
git clone https://github.com/om-ai-lab/ImageRAG.git
```

* Download the data, caches, and checkpoints for ImageRAG from Hugging Face:
  * https://huggingface.co/omlab/ImageRAG

* Use the [hf mirror](https://hf-mirror.com/) if you encounter connection problems:
```bash
./hfd.sh omlab/ImageRAG --local-dir ImageRAG_hf
```
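
Alternatively, a minimal Python route via the official huggingface_hub client (assuming `pip install huggingface_hub`) fetches the same snapshot; set the `HF_ENDPOINT` environment variable to point at the mirror if needed:

```python
# Sketch: download the ImageRAG assets with huggingface_hub instead of hfd.sh.
# Set HF_ENDPOINT=https://hf-mirror.com before running to route through the mirror.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="omlab/ImageRAG", local_dir="ImageRAG_hf")
```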

* Merge the two repos:
```bash
mv ImageRAG_hf/cache ImageRAG_hf/checkpoint ImageRAG_hf/data ImageRAG/
```

* Unzip all zip files (a scripted alternative is sketched after this list):
  * cache/patch/mmerealworld.zip
  * cache/vector_database/crsd_vector_database.zip
  * cache/vector_database/lrsd_vector_database.zip
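
A scripted alternative using Python's standard zipfile module, extracting each archive next to itself:

```python
# Sketch: unzip the three cache archives in place.
import zipfile
from pathlib import Path

archives = [
    "cache/patch/mmerealworld.zip",
    "cache/vector_database/crsd_vector_database.zip",
    "cache/vector_database/lrsd_vector_database.zip",
]
for zp in archives:
    with zipfile.ZipFile(zp) as zf:
        zf.extractall(Path(zp).parent)
```
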
* The ImageRAG directory structure should look like this:
```bash
/training/zilun/ImageRAG
├── codebase
│   ├── inference
│   ├── patchify
│   ├── main_inference_mmerealworld_imagerag_preextract.py
│   └── ......
├── config
│   ├── config_mmerealworld-baseline-zoom4kvqa10k2epoch_server.yaml
│   ├── config_mmerealworld-detectiongt-zoom4kvqa10k2epoch_server.yaml
│   └── ......
├── data
│   └── dataset
│       ├── MME-RealWorld
│       │   └── remote_sensing
│       │       └── remote_sensing
│       │           ├── 03553_Toronto.png
│       │           └── ......
│       ├── crsd_clip_3M.pkl
│       └── ......
├── cache
│   ├── patch
│   │   └── mmerealworld
│   │       ├── vit
│   │       ├── cc
│   │       └── grid
│   └── vector_database
│       ├── crsd_vector_database
│       └── lrsd_vector_database
├── checkpoint
│   ├── InternVL2_5-8B_lora32_vqa10k_zoom4k_2epoch_merged
│   └── ......
└── script
    ├── clip_cc.sh
    └── ......
```
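
A quick optional sanity check (paths taken from the tree above) that the merge and unzip steps produced the expected layout:

```python
# Sketch: verify the expected ImageRAG layout before running anything.
from pathlib import Path

expected = [
    "codebase", "config", "data/dataset", "script", "checkpoint",
    "cache/patch/mmerealworld",
    "cache/vector_database/crsd_vector_database",
    "cache/vector_database/lrsd_vector_database",
]
missing = [p for p in expected if not Path(p).exists()]
print("missing:", missing or "none")
```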

## 🚀 Setup Env

```bash
conda create -n imagerag python=3.10
conda activate imagerag
cd /training/zilun/ImageRAG
export PYTHONPATH=$PYTHONPATH:/training/zilun/ImageRAG

# Install torch, torchvision, and flash attention according to your CUDA version
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install ninja
MAX_JOBS=16 pip install flash-attn --no-build-isolation
```
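
A short sanity check (just a sketch) that the CUDA wheels and flash-attn built correctly:

```python
# Sketch: confirm torch sees the GPU and flash-attn imports cleanly.
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

import flash_attn  # raises ImportError if the build failed
print(flash_attn.__version__)
```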

```bash
# Install the remaining Python dependencies
pip install -r requirement.txt
```

```bash
# Download NLTK stopwords and the spaCy English model
python -c "import nltk; nltk.download('stopwords')"
python -m spacy download en_core_web_sm
```

## 🚀 Setup SGLang (Docker)

* Host Qwen2.5-32B-Instruct with SGLang for the text parsing module:

```bash
# Pull the sglang docker image (we use a mirror to speed up the download)
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/lmsysorg/sglang:latest
docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/lmsysorg/sglang:latest docker.io/lmsysorg/sglang:latest

# Or load the image from a local archive:
# docker load -i sglang.tar

bash script/sglang_start.sh
```
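
Once the container is up, the model can be queried through SGLang's OpenAI-compatible endpoint. A minimal sketch; the port (SGLang's default is 30000) and served model name are assumptions, so check script/sglang_start.sh for the actual values:

```python
# Sketch: call the hosted Qwen2.5-32B-Instruct used by the text parsing module.
# Port 30000 and the model name are assumptions; see script/sglang_start.sh.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content":
               "Extract the key phrases from: 'How many planes are near the terminal?'"}],
)
print(resp.choices[0].message.content)
```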

## 🚀 Feature Extraction (Optional, use Ray to parallelize the process)

* Necessary if you need to run ImageRAG on customized data. A sketch of the parallelization pattern follows the commands below.

```bash
ray start --head --port=6379

# Extract patch features (e.g., MMERealworld-RS)
python codebase/ray_feat_extract_patch.py --ray_mode auto --num_runner 8

# Extract image & text features for the vector database (Section V-D, external data)
python codebase/ray_feat_extract_vectorstore.py --ray_mode auto --num_runner 8

# Stop Ray when done (optional)
ray stop
```
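
For customized data, the scripts above follow the usual Ray fan-out pattern: shard the image list across runners and extract features per shard. A stripped-down sketch of that pattern (the function and shard layout are illustrative, not the repo's actual API):

```python
# Sketch of the Ray fan-out used for parallel feature extraction (illustrative names).
import ray

ray.init(address="auto")  # join the cluster started by `ray start --head`

@ray.remote(num_gpus=1)
def extract_shard(image_paths):
    # In the real scripts this loads a CLIP-style encoder and returns patch features.
    return {p: None for p in image_paths}  # placeholder features

paths: list[str] = []  # fill with your UHR image paths
num_runner = 8
shards = [paths[i::num_runner] for i in range(num_runner)]
results = ray.get([extract_shard.remote(s) for s in shards])
```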

## 🚀 Inference

* See the imagerag_result directory for result examples.

### Run Baseline Inference (No ImageRAG, No GT, No Detection During Inference)
```bash
# Inference
CUDA_VISIBLE_DEVICES=0 python codebase/main_inference_mmerealworld_imagerag_preextract.py --cfg_path config/config_mmerealworld-baseline-zoom4kvqa10k2epoch_server.yaml

# Evaluate the inference result
python codebase/inference/MME-RealWorld-RS/eval_your_results.py --results_file data/eval/mmerealworld_zoom4kvqa10k2epoch_baseline.jsonl
```

### Run Regular VQA Task Inference in Parallel

```bash
# Arguments: {clip, georsclip, remoteclip, mcipclip} x {vit, cc, grid} x {rerank, mean, cluster} x GPU id {0, ..., 7}
bash script/georsclip_grid.sh rerank 0
```
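
The positional arguments map onto the script naming convention script/{model}_{patchify}.sh {fusion} {gpu}, so the whole grid can be driven from a small loop. A sequential sketch that cycles jobs through GPU ids 0-7:

```python
# Sketch: sweep the {model} x {patchify} x {fusion} grid, cycling GPU ids 0-7.
import itertools
import subprocess

models = ["clip", "georsclip", "remoteclip", "mcipclip"]
patchify = ["vit", "cc", "grid"]
fusion = ["rerank", "mean", "cluster"]

for i, (m, p, f) in enumerate(itertools.product(models, patchify, fusion)):
    subprocess.run(["bash", f"script/{m}_{p}.sh", f, str(i % 8)], check=True)
```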

### Run Inferring VQA Task Inference (No ImageRAG, No Detection During Inference; GT BBoxes Required)
```bash
# Inference
CUDA_VISIBLE_DEVICES=0 python codebase/main_inference_mmerealworld_imagerag_preextract.py --cfg_path config/config_mmerealworld-detectiongt-zoom4kvqa10k2epoch_server.yaml

# Evaluate the inference result
python codebase/inference/MME-RealWorld-RS/eval_your_results.py --results_file data/eval/mmerealworld_zoom4kvqa10k2epoch_baseline.jsonl
```

## 👨‍🏫 Contact

## 🖊️ Citation

```bibtex
@ARTICLE{11039502,
  author={Zhang, Zilun and Shen, Haozhan and Zhao, Tiancheng and Guan, Zian and Chen, Bin and Wang, Yuhao and Jia, Xu and Cai, Yuxiang and Shang, Yongheng and Yin, Jianwei},
  journal={IEEE Geoscience and Remote Sensing Magazine},
  title={Enhancing Ultrahigh Resolution Remote Sensing Imagery Analysis With ImageRAG: A new framework},
  year={2025},
  volume={},
  number={},
  pages={2-27},
  keywords={Image resolution;Visualization;Benchmark testing;Training;Image color analysis;Standards;Remote sensing;Analytical models;Accuracy;Vehicle dynamics},
  doi={10.1109/MGRS.2025.3574742}
}
```