joaodaniel committed 71c9293 (1 parent: da90651)

Update README.md

Files changed (1): README.md (+149 −1)

library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: mit
language:
- en
- zh
- pt
- es
- fr
- de
- nl
- it
- ko
- ru
base_model:
- laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
---
# RS-M-CLIP: Multilingual Vision-Language Pre-training for the Remote Sensing Domain

This is the official repository for the article “[Multilingual Vision-Language Pre-training for the Remote Sensing Domain](https://arxiv.org/abs/2410.23370)”.

#### Authors:

João Daniel Silva, INESC-ID, Instituto Superior Técnico, University of Lisbon

João Magalhães, NOVA-LINCS, Faculty of Science and Technology, NOVA University

Devis Tuia, ECEO, Ecole Polytechnique Fédérale de Lausanne

Bruno Martins, INESC-ID & LUMLIS, Instituto Superior Técnico, University of Lisbon

### Abstract

> Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by the use of automated machine translation into nine additional languages. We show that translated data is indeed helpful, e.g. improving performance also on English. Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval, or zero-shot image classification.

### Description

RS-M-CLIP (**R**emote **S**ensing **M**ultilingual **CLIP**) is a CLIP-based model for the remote sensing domain. We improve the performance of CLIP without scaling the amount of training data by aggregating available datasets of image-caption pairs, combining a self-distillation method with the contrastive learning objective, and using captions translated into different languages. Training started from a CLIP model with a multilingual text encoder and a ViT-B vision encoder: https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k. Our model achieves state-of-the-art results in cross-modal image-text retrieval and has multilingual abilities, being able to process languages other than English.

Languages supported: English, Portuguese, Spanish, French, German, Dutch, Italian, Chinese, Korean, and Russian.

You can download RS-M-CLIP from the Hugging Face Hub repository: https://huggingface.co/joaodaniel/RS-M-CLIP

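Loading the model through OpenCLIP (shown below) fetches the weights automatically, but if you prefer to download the checkpoint files explicitly first, a minimal sketch using the `huggingface_hub` client (an optional dependency, not mentioned on the original card) is:

```python
# Explicitly download the model repository to the local Hugging Face cache.
# `snapshot_download` returns the path of the cached snapshot.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="joaodaniel/RS-M-CLIP")
print(local_dir)
```
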
### Code Examples

To load the model, use the OpenCLIP library, which downloads the weights stored on the Hugging Face Hub:

```python
import torch
import open_clip

# Returns the model together with the train/val image preprocessing transforms,
# downloading the checkpoint from the Hub on first use.
model, preprocess, preprocess_val = open_clip.create_model_and_transforms('hf-hub:joaodaniel/RS-M-CLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:joaodaniel/RS-M-CLIP')
```

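The loading snippet runs on CPU by default. As an optional addition (not part of the original card), the model can be moved to a GPU when one is available; if you do this, the image and text tensors in the examples below must be moved to the same device before encoding:

```python
# Optional: run on GPU if available. Inputs created later (e.g. `image`, `text`)
# then need a matching `.to(device)` call before encode_image / encode_text.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```
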
#### Image Classification (in English)

The following is an example of image classification with text queries in English.

```python
model = model.eval()

from PIL import Image
image = preprocess(Image.open('figs/airplane_004.jpg')).unsqueeze(0)

text_queries = [
    "A residential area with houses.",
    "Blocks of buildings can be seen in the factory .",
    "Dense residential areas on both sides of the road .",
    "Many airplanes in the open area.",
    "A cute cat",
]
text = tokenizer(text_queries)

# Encode both modalities and L2-normalize the features.
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Softmax over the scaled similarities gives a probability per text query.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<40} {prob * 100:5.1f}%")
```

With the following output:
```
A residential area with houses.            0.0%
Blocks of buildings can be seen in the factory .   0.0%
Dense residential areas on both sides of the road .   0.0%
Many airplanes in the open area.         100.0%
A cute cat                                 0.0%
```

<div align="center">
  <img src="figs/airplane_004.jpg" alt="Figure with four airplanes parked." width="224">
</div>

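A common practice with CLIP-style models (not shown on the original card) is to build the text queries from a prompt template over a set of class names. The sketch below reuses `model`, `preprocess`, and `tokenizer` from above; the class names and the prompt template are illustrative assumptions, not a label set used in the paper:

```python
import torch
from PIL import Image

# Hypothetical label set and prompt template for zero-shot scene classification.
class_names = ["airport", "golf course", "dense residential area", "forest"]
text_queries = [f"A satellite image of a {name}." for name in class_names]

image = preprocess(Image.open('figs/airplane_004.jpg')).unsqueeze(0)
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Probability of each class name for the input image.
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]
for name, p in zip(class_names, probs.tolist()):
    print(f"{name:<25} {p * 100:5.1f}%")
```
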
#### Image Classification (in Spanish)

The same pipeline works with text queries in other languages, for example Spanish.

```python
from PIL import Image
image = preprocess(Image.open('figs/golf_course_004.jpg')).unsqueeze(0)

text_queries = [
    "Una zona residencial con casas.",
    "Se pueden ver bloques de edificios en la fábrica.",
    "Zonas residenciales densas a ambos lados de la carretera.",
    "Muchos aviones en el área abierta.",
    "Un lindo gato",
    "Un campo de golf con bunkers."
]
text = tokenizer(text_queries)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]

for query, prob in zip(text_queries, text_probs):
    print(f"{query:<60} {prob * 100:5.1f}%")
```

With the following output:
```
Una zona residencial con casas.                                 0.0%
Se pueden ver bloques de edificios en la fábrica.               0.0%
Zonas residenciales densas a ambos lados de la carretera.       0.0%
Muchos aviones en el área abierta.                              0.0%
Un lindo gato                                                   0.0%
Un campo de golf con bunkers.                                 100.0%
```

<div align="center">
  <img src="figs/golf_course_004.jpg" alt="Figure of a golf course with many bunkers." width="224">
</div>

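Besides zero-shot classification, the model description above highlights cross-modal image-text retrieval. The following is a minimal text-to-image retrieval sketch (not part of the original card), again reusing `model`, `preprocess`, and `tokenizer`; the image paths are placeholders for whatever collection you want to search:

```python
import torch
from PIL import Image

# Placeholder image collection and a free-form text query.
image_paths = ['figs/airplane_004.jpg', 'figs/golf_course_004.jpg']
query = "Many airplanes in the open area."

images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
text = tokenizer([query])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image, ranked from best match down.
similarities = (text_features @ image_features.T).squeeze(0)
for idx in similarities.argsort(descending=True).tolist():
    print(f"{image_paths[idx]}  similarity: {similarities[idx].item():.3f}")
```
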
### Citation

If you find our work useful 🙏, please cite us as:

```bibtex
@article{silva2024large,
  title={Multilingual Vision-Language Pre-training for the Remote Sensing Domain},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv:2410.23370},
  year={2024}
}
```