Model Details
This CLIP model is based on openai/clip-vit-base-patch32, which was developed to study what contributes to robustness in computer vision tasks.
The model can generalize to arbitrary image classification tasks in a zero-shot manner.
Top predictions:
- Saree: 64.89%
- Dupatta: 25.81%
- Lehenga: 7.51%
- Leggings and Salwar: 0.84%
- Women Kurta: 0.44%
Use with Transformers
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("samim2024/clip")
processor = CLIPProcessor.from_pretrained("samim2024/clip")
url = "https://www.istockphoto.com/photo/indian-saris-gm93355119-10451468"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a saree", "a photo of a blouse"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
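As a follow-up, here is a minimal sketch of how a ranked list like the top predictions above could be produced. It assumes the model, processor, and image variables from the snippet above are already defined; the candidate labels are taken from the prediction list in this card.

# Candidate garment labels, wrapped in the "a photo of a ..." prompt template
labels = ["saree", "dupatta", "lehenga", "leggings and salwar", "women kurta"]
inputs = processor(text=[f"a photo of a {label}" for label in labels], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]  # probabilities over the candidate labels
# Print the labels ranked by probability, highest first
for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{label}: {p:.2%}")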