---
tags:
- vision
---

# Model Card: clip-rsicd

## Model Details

This model is a fine-tuned version of [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed to improve zero-shot image classification, text-to-image retrieval, and image-to-image retrieval specifically on remote sensing images.

### Model Date

July 2021

### Model Type

The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.

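For intuition, the contrastive objective can be sketched as a symmetric cross-entropy over the scaled similarity matrix of a batch of image and text embeddings. This is a minimal illustration of the general CLIP-style loss, not the exact training code used for this model:

```py
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix scaled by a learned temperature: entry (i, j)
    # compares image i with text j; matching pairs sit on the diagonal.
    logits = logit_scale * image_embeds @ text_embeds.t()
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```
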

### Model Version

We release several checkpoints for the `clip-rsicd` model. Refer to [our github repo](https://github.com/arampacha/CLIP-rsicd) for zero-shot classification results for each of them.

### Training

To reproduce the fine-tuning procedure, one can use the released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
The model was trained with a batch size of 1024 using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4, on a single TPU v3-8.
The full log of the training run used to produce this model can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/1ts243k3).

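As a rough illustration of that optimizer setup, the schedule can be sketched with `optax`. The warmup and total step counts below are placeholders, not the values from the actual run; see the training script and the WandB log for the exact configuration:

```py
import optax

peak_lr = 1e-4          # peak learning rate from the description above
warmup_steps = 1_000    # placeholder: exact value is in the training script
total_steps = 10_000    # placeholder: exact value is in the training script

# Linear warmup to the peak learning rate, then linear decay back to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adafactor(learning_rate=schedule)
```
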

### Demo

Check out the model's text-to-image and image-to-image retrieval capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).

### Documents

- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU]()

### Use with Transformers

```py
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over labels gives the label probabilities
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
[Try it on colab](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/clip_rsicd_zero_shot.ipynb)

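For the text-to-image and image-to-image retrieval use cases mentioned above, the same model can also produce standalone embeddings via `get_text_features` and `get_image_features`. A minimal sketch, where the query string and image file names are placeholders:

```py
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd")

# Placeholder inputs: a text query and a small set of candidate images.
query = "an aerial photo of an airport"
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and each candidate image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print(scores.argsort(descending=True))  # candidate indices ranked by similarity
```

Image-to-image retrieval works the same way, comparing one image embedding against the embeddings of the other images.
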

## Model Use

### Intended Use

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.

#### Primary intended uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

## Data

The model was trained on publicly available remote sensing image captioning datasets, namely [RSICD](), [UCM]() and [Sydney]().

## Performance and Limitations

### Performance

| Model name              | k=1   | k=3   | k=5   | k=10  |
| ----------------------- | ----- | ----- | ----- | ----- |
| original CLIP           | 0.572 | 0.745 | 0.837 | 0.939 |
| clip-rsicd (this model) | 0.843 | 0.958 | 0.977 | 0.993 |

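One common reading of the k columns is top-k accuracy: the fraction of queries whose correct answer appears among the k highest-scoring candidates. A minimal sketch of that computation, assuming per-query scores over all candidates are available (refer to the linked repo for the exact evaluation protocol):

```py
import torch

def top_k_accuracy(scores, target_idx, k):
    # scores: (num_queries, num_candidates); target_idx: (num_queries,)
    top_k = scores.topk(k, dim=-1).indices
    hits = (top_k == target_idx.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```
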

### Limitations

The model is fine-tuned on remote sensing image (RSI) data but may still carry some of the biases and limitations of the original CLIP model. Refer to the [CLIP model card](https://huggingface.co/openai/clip-vit-base-patch32#limitations) for details on those.