BLIP2 for retrieval
Is there a way to use the Hugging Face model to do cross-modal retrieval tasks?
There's an effort to add it: https://github.com/huggingface/transformers/pull/29261
but it seems that there is no proper model on the Hugging Face Hub for Blip2ForImageTextRetrieval? The existing models could not be loaded correctly for retrieval tasks.
The PR above has been merged, so the Blip2ForImageTextRetrieval class is now available. There are two checkpoints available:
- https://huggingface.co/Salesforce/blip2-itm-vit-g
- https://huggingface.co/Salesforce/blip2-itm-vit-g-coco
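Loading works the same way for either checkpoint. Here is a minimal sketch with the COCO-finetuned one (just the loading step, not a full retrieval pipeline):
import torch
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

# load the COCO-finetuned ITM/ITC checkpoint in half precision
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g-coco")
model = Blip2ForImageTextRetrieval.from_pretrained(
    "Salesforce/blip2-itm-vit-g-coco", torch_dtype=torch.float16
)
model.to("cuda" if torch.cuda.is_available() else "cpu")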
As far as I know, blip2-itm-vit-g does not work well.
Here are the logs:
Some weights of the model checkpoint at ../Salesforce/blip2-itm-vit-g were not used when initializing Blip2ForImageTextRetrieval: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'qformer.embeddings.LayerNorm.bias', 'qformer.embeddings.LayerNorm.weight', 'qformer.embeddings.position_embeddings.weight', 'qformer.embeddings.word_embeddings.weight', 'temp', 'text_proj.bias', 'text_proj.weight', 'vision_proj.bias', 'vision_proj.weight']
- This IS expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Blip2ForImageTextRetrieval were not initialized from the model checkpoint at /home/pyr/pretrained_models/Salesforce/blip2-itm-vit-g and are newly initialized: ['embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'qformer.layernorm.bias', 'qformer.layernorm.weight', 'text_projection.bias', 'text_projection.weight', 'vision_projection.bias', 'vision_projection.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Hi,
It looks like you may need to update your Transformers version; the following code snippet works for me:
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2ForImageTextRetrieval
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats laying on a pink blanket"
inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)
itm_out = model(**inputs, use_image_text_matching_head=True)
logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print("Probs:", probs)
which prints:
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Probs: tensor([[0.2693, 0.7305]], dtype=torch.float16, grad_fn=<SoftmaxBackward0>)
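For actual retrieval (ranking several candidate captions against an image) you can also call the model without the image-text matching head, which returns the contrastive (ITC) similarity score for each image-text pair. A rough sketch building on the snippet above (I haven't benchmarked this, the candidate captions are made up for illustration, and scoring one pair per forward pass is not the most efficient setup for large-scale retrieval):
candidate_texts = ["two cats laying on a pink blanket", "a dog running on a beach"]  # hypothetical candidates
scores = []
for candidate in candidate_texts:
    inputs = processor(images=image, text=candidate, return_tensors="pt").to(device, torch.float16)
    with torch.no_grad():
        itc_out = model(**inputs)  # default head: contrastive image-text similarity
    scores.append(itc_out.logits_per_image.max().item())  # higher score = better match
print(dict(zip(candidate_texts, scores)))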
Thank you for your reply; it was very helpful to me. It indeed seems to be a version issue.
Furthermore, may I ask another question? In the examples of the LAVIS library, the extracted image features and text features each consist of multiple low-dimensional vectors.
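For reference, this is roughly the example I mean (a sketch based on the LAVIS feature-extraction tutorial; the shapes in the comments are what I would expect from the pretrain checkpoint, so they may be off):
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # any local image
caption = "two cats laying on a pink blanket"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")

# raw Q-Former outputs: one 768-d vector per image query token / text token
print(features_image.image_embeds.shape)        # expected (1, 32, 768)
print(features_text.text_embeds.shape)          # expected (1, num_text_tokens, 768)

# projected embeddings in the shared 256-d contrastive space used for retrieval
print(features_image.image_embeds_proj.shape)   # expected (1, 32, 256)
print(features_text.text_embeds_proj.shape)     # expected (1, num_text_tokens, 256)

# retrieval-style similarity: max over the query tokens vs the text [CLS] projection
similarity = (features_image.image_embeds_proj @ features_text.text_embeds_proj[:, 0, :].t()).max()
print(similarity)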
However, I noticed that using
image_emb = model.extract_features(sample, mode="image").image_embeds[:, 0, :]  # first query-token embedding, shape (batch_size, 768)
text_emb = model.extract_features(sample, mode="text").text_embeds[:, 0, :]  # [CLS] token embedding, shape (batch_size, 768)
also seems to work.
What is the difference between these two methods? Or is there detailed documentation available somewhere? Thank you.