---
tags:
- vision
---

# Model Card: clip-rsicd

## Model Details

This model is a fine-tuned version of [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed to improve zero-shot image classification, text-to-image retrieval, and image-to-image retrieval specifically on remote sensing images.

### Model Date

July 2021

### Model Type

The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.

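For intuition, the contrastive objective can be sketched as a symmetric cross-entropy over the scaled similarity matrix of a batch of image and text embeddings. This is a minimal illustration of the general CLIP-style loss, not the exact training code used for this model:

```py
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix scaled by a learned temperature: entry (i, j)
    # compares image i with text j; matching pairs sit on the diagonal.
    logits = logit_scale * image_embeds @ text_embeds.t()
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```
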

### Model Version

We release several checkpoints for the `clip-rsicd` model. Refer to [our github repo](https://github.com/arampacha/CLIP-rsicd) for zero-shot classification results for each of them.

### Training

To reproduce the fine-tuning procedure, one can use the released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
The model was trained with a batch size of 1024 using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4, on a single TPU v3-8.
The full log of the training run used to produce this model can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/1ts243k3).

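As a rough illustration of that optimizer setup, the schedule can be sketched with `optax`. The warmup and total step counts below are placeholders, not the values from the actual run; see the training script and the WandB log for the exact configuration:

```py
import optax

peak_lr = 1e-4          # peak learning rate from the description above
warmup_steps = 1_000    # placeholder: exact value is in the training script
total_steps = 10_000    # placeholder: exact value is in the training script

# Linear warmup to the peak learning rate, then linear decay back to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adafactor(learning_rate=schedule)
```
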

### Demo

Check out the model's text-to-image and image-to-image retrieval capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).

### Documents

- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU]()

### Use with Transformers

```py
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives the label probabilities
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
[Try it on colab](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/clip_rsicd_zero_shot.ipynb)

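For the text-to-image and image-to-image retrieval use cases mentioned above, the same model can also produce standalone embeddings via `get_text_features` and `get_image_features`. A minimal sketch, where the query string and image file names are placeholders:

```py
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

# Placeholder inputs: a text query and a small set of candidate images.
query = "an aerial photo of an airport"
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and each candidate image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print(scores.argsort(descending=True))  # candidate indices ranked by similarity
```

Image-to-image retrieval works the same way, comparing one image embedding against the embeddings of the other images.
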

## Model Use

### Intended Use

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.

#### Primary intended uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

## Data

The model was trained on publicly available remote sensing image captioning datasets, namely [RSICD](), [UCM]() and [Sydney]().

## Performance and Limitations

### Performance

| Model name              | k=1   | k=3   | k=5   | k=10  |
| ----------------------- | ----- | ----- | ----- | ----- |
| original CLIP           | 0.572 | 0.745 | 0.837 | 0.939 |
| clip-rsicd (this model) | 0.843 | 0.958 | 0.977 | 0.993 |

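One common reading of the k columns is top-k accuracy: the fraction of queries whose correct answer appears among the k highest-scoring candidates. A minimal sketch of that computation, assuming per-query scores over all candidates are available (refer to the linked repo for the exact evaluation protocol):

```py
import torch

def top_k_accuracy(scores, target_idx, k):
    # scores: (num_queries, num_candidates); target_idx: (num_queries,)
    top_k = scores.topk(k, dim=-1).indices
    hits = (top_k == target_idx.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```
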

### Limitations

The model is fine-tuned on remote sensing image (RSI) data but may still carry some of the biases and limitations of the original CLIP model. Refer to the [CLIP model card](https://huggingface.co/openai/clip-vit-base-patch32#limitations) for details on those.