Image embeddings are different from the official OpenAI clip model
The normalized image embeddings generated by this Hugging Face version of the CLIP model differ from those produced by the official OpenAI implementation.
I downloaded the following image: https://thumbs.dreamstime.com/b/lovely-cat-as-domestic-animal-view-pictures-182393057.jpg
I generated image embeddings using this model with the following code:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
_processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')
img = Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg').convert('RGB')
# Preprocess the image with the Hugging Face processor
inputs = _processor(images=img, return_tensors='pt', padding=True)
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]  # pooled output
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])
I get:
tensor([-0.0262, 0.0541, 0.0122, 0.0053, 0.0453, 0.0138, 0.0141, 0.0035,
0.0202, -0.0173])
When I use the official implementation with this code:
import clip
__model, __preprocess = clip.load("ViT-L/14", device='cpu')
# Preprocess the same image with the official pipeline
__image = __preprocess(img).unsqueeze(0)
with torch.no_grad():
    __image_features = __model.encode_image(__image)
    __image_features /= __image_features.norm(dim=-1, keepdim=True)
print(__image_features[0, :10])
I get:
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
You can see that the values are similar but slightly off.
If I calculate the cosine similarity / dot product I get:
image_embeds @ __image_features.t()
# tensor([[0.9971]])
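For reference, the dot product of two unit-normalized vectors is their cosine similarity. The sketch below recomputes it in plain Python over only the ten components printed above, so it illustrates the calculation and will not reproduce the full-vector value of 0.9971. The helper name `cosine_similarity` is an assumption made for this example.

```python
import math

# First ten components printed above (Hugging Face vs. official OpenAI CLIP).
hf_head = [-0.0262, 0.0541, 0.0122, 0.0053, 0.0453, 0.0138, 0.0141, 0.0035, 0.0202, -0.0173]
openai_head = [-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014, 0.0174, -0.0151]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Only a partial view: ten components out of the full embedding.
head_similarity = cosine_similarity(hf_head, openai_head)
max_abs_diff = max(abs(x - y) for x, y in zip(hf_head, openai_head))
print(f"cosine over first 10 components: {head_similarity:.4f}")
print(f"largest component difference:    {max_abs_diff:.4f}")
```

Reporting the largest per-component difference alongside the cosine makes it easier to tell a small numerical drift from a genuinely different input.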
I get the same result when I load up the official openai weights with the open_clip implementation also.
So, there's some subtle difference here.
I'm running transformers 4.20.0
Actually, I worked it out. The preprocessing in the Hugging Face CLIPProcessor differs from that of the default CLIP implementations, so the model was getting a slightly different version of the image.
From what I can tell so far, the center cropping is implemented differently, which changes the pixels that reach the model.
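To illustrate how two center-crop implementations can select different pixels, here is a minimal sketch comparing two rounding conventions for the crop offset. Both function names are hypothetical, and the two conventions are assumptions chosen for illustration rather than a confirmed description of either library. The point is only that when the margin is odd, the crop window can shift by one pixel.

```python
def crop_offset_rounded(image_size: int, crop_size: int) -> int:
    """Offset from rounding half the margin (Python rounds ties to even)."""
    return int(round((image_size - crop_size) / 2.0))


def crop_offset_floored(image_size: int, crop_size: int) -> int:
    """Offset from flooring half the margin."""
    return (image_size - crop_size) // 2


def offsets_differ(image_size: int, crop_size: int) -> bool:
    """True when the two conventions pick different crop windows."""
    return crop_offset_rounded(image_size, crop_size) != crop_offset_floored(image_size, crop_size)


# Hypothetical widths after resizing the shorter side to 224.
for width in (298, 299, 300, 301):
    rounded = crop_offset_rounded(width, 224)
    floored = crop_offset_floored(width, 224)
    print(width, rounded, floored, "shifted" if rounded != floored else "same")
```

A one-pixel shift moves the whole crop window along that axis, which would be enough to nudge the embedding slightly without changing it dramatically.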
TL;DR: if you need exactly the same input for a given image, use the OpenAI input processing pipeline like this:
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image

# Same steps as the OpenAI CLIP preprocessing
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
])
inputs = dict(pixel_values=image_processor(img).unsqueeze(0))
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
I found the text embedding differs quite a lot. Does this make sense?
@eugeneware
I am not able to get consistent results between the HF interface and my local model. I did what you did, but I am getting different scores.
This is my code.
Please let me know how it differs from the HF preprocessing.
from PIL import Image
import requests
import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from transformers import CLIPProcessor, CLIPModel
# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Load the image from a local file
image = Image.open('medicine_mistake_download/3edf2e67-2e49-4ef0-a5d6-fbe224c35bf9.jpg')
# Define the image preprocessing pipeline
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
])
# Preprocess the image
processed_image = image_processor(image).unsqueeze(0)
# Get image embeddings
with torch.no_grad():
    vision_outputs = model.vision_model(pixel_values=processed_image)
    # Project the pooled output (one vector per image), not the per-token last_hidden_state
    image_embeds = vision_outputs.pooler_output
    image_embeds = model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
# Print the first 10 dimensions of the first image embedding
print("First 10 dimensions of the image embedding:", image_embeds[0, :10])
# Process the text inputs
texts = ["other", "prescription document",'medicine image']
text_inputs = processor(text=texts, return_tensors="pt", padding=True)
# Combine text and image inputs
inputs = {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"],
    "pixel_values": processed_image,
}
# Get the model outputs
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # Image-text similarity scores
probs = logits_per_image.softmax(dim=1) # Probabilities
# Print the results
print("Logits per image:", logits_per_image)
print("Probabilities:", probs)
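Since the probabilities come from a softmax over scaled cosine similarities, small embedding differences can produce visibly different scores. The sketch below shows this in plain Python. The cosine values and the scale of 100 are made-up assumptions for illustration (CLIP multiplies the similarities by a learned logit scale before the softmax, and the exact value depends on the checkpoint).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


LOGIT_SCALE = 100.0  # assumed value, for illustration only

# Hypothetical image-text cosine similarities for three labels.
cosines_a = [0.20, 0.25, 0.22]
# The same similarities after a small hypothetical preprocessing-induced shift.
cosines_b = [0.20, 0.24, 0.23]

probs_a = softmax([LOGIT_SCALE * c for c in cosines_a])
probs_b = softmax([LOGIT_SCALE * c for c in cosines_b])
print([round(p, 3) for p in probs_a])
print([round(p, 3) for p in probs_b])
```

With a large scale, a shift of 0.01 in cosine similarity moves a noticeable amount of probability between labels, so comparing raw cosine similarities is usually a more stable way to check whether two pipelines agree.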
Presented in a more intuitive way:
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
# Image embedding: pass only the pixel values to the vision tower
vision_outputs = model.vision_model(pixel_values=inputs["pixel_values"])
image_embeds = vision_outputs.pooler_output
image_embeds = model.visual_projection(image_embeds)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
# Text embedding: pass only the token ids and attention mask to the text tower
text_output = model.text_model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
text_embeds = text_output.pooler_output
text_embeds = model.text_projection(text_embeds)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)