BridgeTower from Hugging Face vs. BridgeTower from Prediction Guard

#4 by selili688 - opened

I am a beginner with Hugging Face and I need some help regarding BridgeTower.

I am taking a DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. Lesson 2 covers creating joint embeddings of images and text using BridgeTower.

The example code uses a Prediction Guard client to create BridgeTower embeddings:

# Helper function to compute the joint embedding of a prompt and a
# base64-encoded image through Prediction Guard.
# _getPredictionGuardClient and isBase64 are helpers from the course notebook.
def bt_embedding_from_prediction_guard(prompt, base64_image):
    # get the Prediction Guard client
    client = _getPredictionGuardClient()
    message = {"text": prompt}
    if base64_image is not None and base64_image != "":
        if not isBase64(base64_image):
            raise TypeError("image input must be in base64 encoding!")
        message["image"] = base64_image
    response = client.embeddings.create(
        model="bridgetower-large-itm-mlm-itc",
        input=[message]
    )
    return response["data"][0]["embedding"]
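
For context, a typical call looks something like this (a sketch; encode_image is a hypothetical helper and the image path is a placeholder):

import base64

def encode_image(image_path):
    # hypothetical helper: read an image file and return its base64 encoding
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

embedding = bt_embedding_from_prediction_guard("a motorcycle", encode_image("motorcycle_1.jpg"))
print(len(embedding))  # the sample code returns a 512-dimensional embedding (see below)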

However, the above requires a Prediction Guard API key, which is not easy to obtain; many other learners have run into the same issue.

As a workaround, I used the Hugging Face Transformers classes BridgeTowerProcessor and BridgeTowerModel and refactored the function as follows:

import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerModel

def bt_embedding_from_prediction_guard(prompt, base64_image):
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    # decode the base64 string into a PIL image, since the processor
    # expects images rather than base64 strings; note BridgeTower
    # always requires an image input
    image = None
    if base64_image:
        image = Image.open(BytesIO(base64.b64decode(base64_image))).convert("RGB")

    # preprocess the inputs
    processed_inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    # generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)

    # extract the pooled embedding (you can change which output to use depending on your task)
    embeddings = outputs.pooler_output

    return embeddings.tolist()  # return the embeddings as a list for easier use

The code runs and produces embeddings: I get a 2048-dimensional embedding, compared with the 512-dimensional embedding from the Prediction Guard sample code.

But when I calculate the cosine similarities between embeddings of different pictures, the similarity computed with the Hugging Face BridgeTower embeddings is very different from the one computed with the Prediction Guard embeddings (a sketch of the similarity computation follows the numbers below).

For example:
ex1_embeded (picture of a motorcycle)
ex2_embeded (picture of a motorcycle)
ex3_embeded (picture of a cat)

Results using Hugging Face BridgeTower (my code above):
Cosine similarity between ex1_embeded and ex2_embeded: 0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded: 0.8940821384304778

Results using Prediction Guard BridgeTower (the sample code above):
Cosine similarity between ex1_embeded and ex2_embeded: 0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded: 0.17133985252863604
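
For reference, the similarities above were computed with standard cosine similarity; here is a minimal NumPy sketch (the course notebook may implement this differently):

import numpy as np

def cosine_similarity(a, b):
    # standard cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))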

I have encountered the same problem. Has anyone managed to solve it?

BridgeTower org
edited Oct 21

Hi, thanks for following the course with BridgeTower.

For comparing embeddings you should use the model that includes the contrastive head, i.e., BridgeTowerForContrastiveLearning.

Here is an example:

from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

inputs  = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
# text_embeddings = outputs.text_embeds
# image_embeddings = outputs.image_embeds
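
For example, a sketch comparing two image-caption pairs (the file names are placeholders; processor and model are as loaded above):

import torch
import torch.nn.functional as F
from PIL import Image

images = [Image.open("motorcycle_1.jpg"), Image.open("cat.jpg")]  # placeholder files
texts = ["a motorcycle", "a cat"]

with torch.no_grad():
    outputs = model(**processor(images, texts, padding=True, return_tensors="pt"))

# cross_embeds holds one joint image-text embedding per input pair
sim = F.cosine_similarity(outputs.cross_embeds[0], outputs.cross_embeds[1], dim=-1)
print(sim.item())  # dissimilar pairs should score noticeably lower than similar ones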

Thank you, Shaoyent!

Hi shaoyent, when I try to run inference on the example image and text pair, I get these warnings:
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."

Hi, I am also taking the DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. I'm stuck on the retrieval part. Which model is used for text-only embeddings?

I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?

BridgeTower org

Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."

Hi @Parth376, this should not be an issue for inference: logit_scale only scales the logits for the contrastive loss during training, so the embeddings themselves are unaffected.

BridgeTower org

Hi, I am also taking the DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. I'm stuck on the retrieval part. Which model is used for text-only embeddings?

I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?

Hi @Heiner66 @Parth376,
For text-only embeddings you can refer to this code:


# processor and model as defined in the example above
inputs  = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
text_embeddings = outputs.text_embeds
image_embeddings = outputs.image_embeds

text_embeddings are independent of images, so you can pass a dummy image to get text-only embeddings.
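
A minimal sketch of the dummy-image approach (the blank 224x224 image is just one choice; any valid RGB image works):

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

dummy = Image.new("RGB", (224, 224))  # blank placeholder image

with torch.no_grad():
    outputs = model(**processor([dummy], ["which model is used for text embedding"], padding=True, return_tensors="pt"))

text_embedding = outputs.text_embeds[0]  # unaffected by the dummy image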


I followed your guidance and it works!!! This issue had blocked me for days. Thank you so much!! @shaoyent

I kept editing my function until it looked like this. Did I edit it correctly?

import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

def bt_embedding_from_prediction_guard(prompt, base64_image):
    if base64_image:
        if not isBase64(base64_image):  # helper from the course notebook
            raise TypeError("Image input must be in base64 encoding!")
        try:
            image_data = base64.b64decode(base64_image)
            image = Image.open(BytesIO(image_data)).convert("RGB")
        except Exception as e:
            raise ValueError("Invalid image data!") from e
    else:
        # BridgeTower always expects an image, so use a blank dummy image
        # for text-only embeddings (as suggested above)
        image = Image.new("RGB", (224, 224))

    texts = [prompt]
    images = [image]

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    cross_modal_embeddings = outputs.cross_embeds

    return cross_modal_embeddings.squeeze().tolist()
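
To test it, I'm planning something like this (encode_image is a hypothetical file-to-base64 helper, and the image files are placeholders):

import base64
import numpy as np

def encode_image(path):
    # hypothetical helper: read an image file and return its base64 encoding
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

e1 = np.asarray(bt_embedding_from_prediction_guard("a motorcycle", encode_image("motorcycle_1.jpg")))
e2 = np.asarray(bt_embedding_from_prediction_guard("a cat", encode_image("cat.jpg")))

# cosine similarity; similar image-text pairs should score higher
print(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))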

@selili688 Did you get it working? If so, would you mind sharing your code? I'm having trouble with the query part.
