BridgeTower from Hugging Face vs. BridgeTower from Prediction Guard

#4 by selili688 - opened

I am a beginner with Hugging Face and I need some help regarding BridgeTower.

I am taking a DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. Lesson 2 covers creating joint embeddings of images and text using BridgeTower.

The example code uses a Prediction Guard client to create BridgeTower embeddings:

# Helper function to compute the joint embedding of a prompt and a
# base64-encoded image through Prediction Guard.
# _getPredictionGuardClient and isBase64 are helpers from the course notebook.
def bt_embedding_from_prediction_guard(prompt, base64_image):
    # get the Prediction Guard client
    client = _getPredictionGuardClient()
    message = {"text": prompt}
    if base64_image is not None and base64_image != "":
        if not isBase64(base64_image):
            raise TypeError("image input must be in base64 encoding!")
        message["image"] = base64_image
    response = client.embeddings.create(
        model="bridgetower-large-itm-mlm-itc",
        input=[message]
    )
    return response["data"][0]["embedding"]
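
For context, a typical call looks something like this (a sketch; encode_image is a hypothetical helper and the image path is a placeholder):

import base64

def encode_image(image_path):
    # hypothetical helper: read an image file and return its base64 encoding
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

embedding = bt_embedding_from_prediction_guard("a motorcycle", encode_image("motorcycle_1.jpg"))
print(len(embedding))  # the sample code returns a 512-dimensional embedding (see below)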

However, the above requires a Prediction Guard API key, which is not easy to obtain; many other learners have run into the same issue.

As a workaround, I used the Hugging Face Transformers classes BridgeTowerProcessor and BridgeTowerModel and refactored the function as follows:

import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerModel

def bt_embedding_from_prediction_guard(prompt, base64_image):
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    # decode the base64 string into a PIL image, since the processor
    # expects images rather than base64 strings; note BridgeTower
    # always requires an image input
    image = None
    if base64_image:
        image = Image.open(BytesIO(base64.b64decode(base64_image))).convert("RGB")

    # preprocess the inputs
    processed_inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    # generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)

    # extract the pooled embedding (you can change which output to use depending on your task)
    embeddings = outputs.pooler_output

    return embeddings.tolist()  # return the embeddings as a list for easier use

The code runs and produces embeddings: I get a 2048-dimensional embedding, compared with the 512-dimensional embedding from the Prediction Guard sample code.

But when I calculate the cosine similarities between embeddings of different pictures, the similarity computed with the Hugging Face BridgeTower embeddings is very different from the one computed with the Prediction Guard embeddings (a sketch of the similarity computation follows the numbers below).

For example:
ex1_embeded (picture of a motorcycle)
ex2_embeded (picture of a motorcycle)
ex3_embeded (picture of a cat)

Results using Hugging Face BridgeTower (my code above):
Cosine similarity between ex1_embeded and ex2_embeded: 0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded: 0.8940821384304778

Results using Prediction Guard BridgeTower (the sample code above):
Cosine similarity between ex1_embeded and ex2_embeded: 0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded: 0.17133985252863604
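
For reference, the similarities above were computed with standard cosine similarity; here is a minimal NumPy sketch (the course notebook may implement this differently):

import numpy as np

def cosine_similarity(a, b):
    # standard cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))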

I have encountered the same problem. Has anyone managed to solve it?

BridgeTower org
edited Oct 21

Hi, thanks for following the course with BridgeTower.

For comparing embeddings you should use the model that includes the contrastive head, i.e., BridgeTowerForContrastiveLearning.

Here is an example:

from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

inputs  = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
# text_embeddings = outputs.text_embeds
# image_embeddings = outputs.image_embeds
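
For example, a sketch comparing two image-caption pairs (the file names are placeholders; processor and model are as loaded above):

import torch
import torch.nn.functional as F
from PIL import Image

images = [Image.open("motorcycle_1.jpg"), Image.open("cat.jpg")]  # placeholder files
texts = ["a motorcycle", "a cat"]

with torch.no_grad():
    outputs = model(**processor(images, texts, padding=True, return_tensors="pt"))

# cross_embeds holds one joint image-text embedding per input pair
sim = F.cosine_similarity(outputs.cross_embeds[0], outputs.cross_embeds[1], dim=-1)
print(sim.item())  # dissimilar pairs should score noticeably lower than similar ones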

Thank you, Shaoyent!

Hi shaoyent, when I try to run inference on the example image and text pair, I get these warnings:
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."

Hi, I am also taking the DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. I'm stuck on the retrieval part. Which model is used for text-only embeddings?

I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?

BridgeTower org

Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."

Hi @Parth376, this should not be an issue for inference: logit_scale only scales the logits for the contrastive loss during training, so the embeddings themselves are unaffected.

BridgeTower org

Hi, I am also taking the DeepLearning.AI course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on building multimodal RAG. I'm stuck on the retrieval part. Which model is used for text-only embeddings?

I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?

Hi @Heiner66 @Parth376,
For text-only embeddings you can refer to this code:


# processor and model as defined in the example above
inputs  = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
text_embeddings = outputs.text_embeds
image_embeddings = outputs.image_embeds

text_embeddings are independent of images, so you can pass a dummy image to get text-only embeddings.
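
A minimal sketch of the dummy-image approach (the blank 224x224 image is just one choice; any valid RGB image works):

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

dummy = Image.new("RGB", (224, 224))  # blank placeholder image

with torch.no_grad():
    outputs = model(**processor([dummy], ["which model is used for text embedding"], padding=True, return_tensors="pt"))

text_embedding = outputs.text_embeds[0]  # unaffected by the dummy image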


I followed your guidance and it works!!! This issue had blocked me for days. Thank you so much!! @shaoyent

I kept editing my function until it looked like this. Did I edit it correctly?

import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

def bt_embedding_from_prediction_guard(prompt, base64_image):
    if base64_image:
        if not isBase64(base64_image):  # helper from the course notebook
            raise TypeError("Image input must be in base64 encoding!")
        try:
            image_data = base64.b64decode(base64_image)
            image = Image.open(BytesIO(image_data)).convert("RGB")
        except Exception as e:
            raise ValueError("Invalid image data!") from e
    else:
        # BridgeTower always expects an image, so use a blank dummy image
        # for text-only embeddings (as suggested above)
        image = Image.new("RGB", (224, 224))

    texts = [prompt]
    images = [image]

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    cross_modal_embeddings = outputs.cross_embeds

    return cross_modal_embeddings.squeeze().tolist()
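
To test it, I'm planning something like this (encode_image is a hypothetical file-to-base64 helper, and the image files are placeholders):

import base64
import numpy as np

def encode_image(path):
    # hypothetical helper: read an image file and return its base64 encoding
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

e1 = np.asarray(bt_embedding_from_prediction_guard("a motorcycle", encode_image("motorcycle_1.jpg")))
e2 = np.asarray(bt_embedding_from_prediction_guard("a cat", encode_image("cat.jpg")))

# cosine similarity; similar image-text pairs should score higher
print(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))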

@selili688 Did you get it working? If so, would you mind sharing your code? I'm having trouble with the query part.
