Image embeddings are different from the official OpenAI clip model

#1
by eugeneware - opened

The normalized image embeddings produced by this Hugging Face version of the CLIP model differ from those produced by the official OpenAI implementation.

I downloaded the following image: https://thumbs.dreamstime.com/b/lovely-cat-as-domestic-animal-view-pictures-182393057.jpg

I generated image embeddings using this model with the following code:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
_processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')
img = Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg').convert('RGB')
inputs = _processor(images=img, return_tensors='pt', padding=True)
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]  # pooled output
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)  # L2-normalize
print(image_embeds[0, :10])

I get:

tensor([-0.0262,  0.0541,  0.0122,  0.0053,  0.0453,  0.0138,  0.0141,  0.0035,
         0.0202, -0.0173])

When I use the official implementation with this code:

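import clip
import torch

__model, __preprocess = clip.load("ViT-L/14", device='cpu')
# Same PIL image as above, run through OpenAI's own preprocessing
__image = __preprocess(img).unsqueeze(0)
with torch.no_grad():
    __image_features = __model.encode_image(__image)
    __image_features /= __image_features.norm(dim=-1, keepdim=True)
print(__image_features[0, :10])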

I get:

tensor([-0.0192,  0.0559,  0.0147,  0.0041,  0.0461,  0.0098,  0.0115,  0.0014,
         0.0174, -0.0151])

You can see that the values are similar, but slightly off.

If I calculate the cosine similarity / dot product I get:

image_embeds @ __image_features.t()
# tensor([[0.9971]])
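
As a rough illustration of the comparison above: for unit-length vectors the dot product equals the cosine similarity, which is why a single matrix product is enough. The sketch below uses plain Python and only the two 10-value slices printed earlier, so it cannot reproduce the 0.9971 figure (that needs the full embeddings); the helper names are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# First 10 values of each embedding, copied from the outputs above.
hf_slice = [-0.0262, 0.0541, 0.0122, 0.0053, 0.0453,
            0.0138, 0.0141, 0.0035, 0.0202, -0.0173]
openai_slice = [-0.0192, 0.0559, 0.0147, 0.0041, 0.0461,
                0.0098, 0.0115, 0.0014, 0.0174, -0.0151]

largest_gap = max_abs_diff(hf_slice, openai_slice)
print(f"largest element-wise gap in the slice: {largest_gap:.4f}")
```

On the full tensors, a tolerance-based comparison tends to be more informative than eyeballing the printed values.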

I get the same result when I load up the official openai weights with the open_clip implementation also.

So, there's some subtle difference here.

I'm running transformers 4.20.0

Actually, I worked it out. The preprocessing in the Hugging Face CLIPProcessor differs from the one in the default clip implementation, so the model was getting a slightly different version of the image.

From what I can tell so far, the center cropping is implemented differently in the two pipelines, which changes some of the pixels.
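
To illustrate how two center-crop implementations can end up with different pixels, here is a minimal sketch. It assumes, purely for illustration, that one pipeline floors the crop offset while the other rounds it; which library does which is not established in this thread, and both function names are hypothetical.

```python
def crop_offset_floor(length, crop):
    """Crop start when the margin is halved with integer (floor) division."""
    return (length - crop) // 2

def crop_offset_round(length, crop):
    """Crop start when the halved margin is rounded (Python rounds halves to even)."""
    return int(round((length - crop) / 2.0))

# A side of 299 pixels cropped to 224 leaves an odd margin of 75.
floor_start = crop_offset_floor(299, 224)
round_start = crop_offset_round(299, 224)
print(floor_start, round_start)

# With an even margin the two conventions agree.
assert crop_offset_floor(300, 224) == crop_offset_round(300, 224)
```

A one-pixel shift of the crop window is enough to change the tensor the model sees, and therefore the embedding, even though the image looks identical.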

TL;DR if you need exactly the same input for a given image, then use the openai input processing pipeline like this:

from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
])
inputs=dict(pixel_values=image_processor(img).unsqueeze(0))
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])
# tensor([-0.0192,  0.0559,  0.0147,  0.0041,  0.0461,  0.0098,  0.0115,  0.0014,
#          0.0174, -0.0151])

cc @valhalla in case you hadn't seen this!

I found the text embedding differs quite a lot. Does this make sense?

@eugeneware
I am not able to get consistent results between the HF interface and my local model. I did what you did, but I am getting different scores.
This is my code.

Please let me know what differs between this and the HF preprocessing.

from PIL import Image
import requests
import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Load the image from a local file
image = Image.open('medicine_mistake_download/3edf2e67-2e49-4ef0-a5d6-fbe224c35bf9.jpg')

# Define the image preprocessing pipeline
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
])

# Preprocess the image
processed_image = image_processor(image).unsqueeze(0)

# Get image embeddings
with torch.no_grad():
    vision_outputs = model.vision_model(pixel_values=processed_image)
    # Use the pooled output; last_hidden_state holds per-token states instead
    image_embeds = vision_outputs.pooler_output
    image_embeds = model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Print the first 10 dimensions of the first image embedding
print("First 10 dimensions of the image embedding:", image_embeds[0, :10])

# Process the text inputs
texts = ["other", "prescription document", 'medicine image']
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

# Combine text and image inputs
inputs = {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"],
    "pixel_values": processed_image
}

# Get the model outputs
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # Image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # Probabilities

# Print the results
print("Logits per image:", logits_per_image)
print("Probabilities:", probs)

Here it is presented in a more intuitive way.

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

# Image embedding (pass only the image tensor to the vision tower)
vision_outputs = _model.vision_model(pixel_values=inputs["pixel_values"])
image_embeds = vision_outputs.pooler_output
image_embeds = _model.visual_projection(image_embeds)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Text embedding (pass only the token tensors to the text tower)
text_output = _model.text_model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
text_embeds = text_output.pooler_output
text_embeds = _model.text_projection(text_embeds)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
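
Once both embeddings are normalized, the image-text scores are just scaled cosine similarities passed through a softmax. The plain-Python sketch below shows that last step on made-up cosine values; the scale of 100.0 is an assumption standing in for the model's learned `logit_scale.exp()`, and `clip_probs` is a hypothetical helper, not part of transformers.

```python
import math

def clip_probs(cosines, logit_scale=100.0):
    """Softmax over cosine similarities multiplied by an assumed logit scale."""
    logits = [logit_scale * c for c in cosines]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities between one image and two captions.
probs = clip_probs([0.25, 0.20])
print(probs)
```

Because the scale is large, small differences in cosine similarity (such as those caused by a slightly different crop) can move the resulting probabilities noticeably.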
