Question about multimodal BLIP-2 inputs: text, image, or text+image embeddings need to end up in the same semantic space
Hello,
I have seen online posts, such as https://towardsdatascience.com/multimodal-search-engine-agents-powered-by-blip-2-and-gemini/, that describe routing text-only queries to BLIP-2's text model and image-only queries to its vision model, but where does the Q-Former sit in this approach?
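To be concrete about the components I am referring to, here is a minimal sketch using the Hugging Face transformers `Blip2Model` API; the checkpoint, image file, and example text are placeholders I chose for illustration:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Placeholder checkpoint choice; any BLIP-2 checkpoint would illustrate the same point.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("frame_grab.jpg").convert("RGB")   # hypothetical curated frame grab
text = "Adjusting the footrest on a manual wheelchair"  # hypothetical summary snippet

inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    # The text goes through BLIP-2's language model...
    text_out = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # ...the image goes through the ViT vision encoder...
    vision_out = model.get_image_features(pixel_values=inputs["pixel_values"])
    # ...and the Q-Former sits on top of the vision features.
    qformer_out = model.get_qformer_features(pixel_values=inputs["pixel_values"])
```

My question is which of these outputs (if any) such posts intend to use as the shared embedding for retrieval, given that the Q-Former is the piece that bridges vision and language.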
I have created an AI factory that treats vocational training videos in the wheelchair and mobility-device domain as multimodal input and ultimately produces a set of curated frame grabs (images) and structured textual summaries that draw on both the visual content and the transcribed audio.
Now I want to create an AI expert that keeps learning by expanding a knowledge base derived from this growing video corpus. The plan is to create BLIP-2 embeddings that capture both the textual summaries of video segments and their accompanying frame grabs.
As part of a RAG workflow, a user should be able to enter plain text, text+images, or images only. I then create a BLIP-2 embedding from that input, run a nearest-neighbor search in semantic space (I'm planning to use Pinecone), link the returned vectors back to the original data they were created from, and use that data to enhance the foundation model's context.
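For context, the retrieval step I have in mind looks roughly like this; it assumes the current Pinecone Python client, and the index name, metadata fields, and IDs are placeholders:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")        # placeholder credentials
index = pc.Index("wheelchair-training-kb")   # hypothetical index name

def upsert_segment(segment_id: str, embedding: list[float], metadata: dict) -> None:
    """Store one video-segment embedding together with pointers back to its source data."""
    index.upsert(vectors=[{"id": segment_id, "values": embedding, "metadata": metadata}])

def retrieve(query_embedding: list[float], top_k: int = 5):
    """Nearest-neighbor search; the metadata links each match back to the frames/summary."""
    return index.query(vector=query_embedding, top_k=top_k, include_metadata=True)

# Hypothetical usage -- the embeddings would come from BLIP-2 as described above:
# upsert_segment(
#     "video42_seg3",
#     blip2_vector,
#     {"video": "video42.mp4", "start_s": 81.5,
#      "summary": "Transfer from wheelchair to bed", "frame": "video42_seg3.jpg"},
# )
# matches = retrieve(query_vector, top_k=5)
```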
For text-only embeddings, some online recommendations suggest passing in a completely blank image alongside the text; others suggest creating the embedding from BLIP-2's text model alone (both options are sketched below). I need all vectors, whether created from text only, text+image, or image only, to end up in the same semantic space. What is the best approach to achieve this?
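To make the two recommendations concrete, here is how I currently understand them, reusing the `model` and `processor` from the first sketch. The blank-image size/color and the mean-pooling are my own assumptions, and whether the two results land in a comparable space is exactly what I am unsure about:

```python
import torch
from PIL import Image

def embed_text_via_text_model(text: str) -> torch.Tensor:
    """Option A: embed the text with BLIP-2's text (language) model only."""
    enc = processor(text=text, return_tensors="pt")
    with torch.no_grad():
        out = model.get_text_features(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            output_hidden_states=True,
        )
    # Mean-pool the final hidden states into one vector (pooling choice is my assumption).
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

def embed_text_with_blank_image(text: str) -> torch.Tensor:
    """Option B: pair the text with a completely blank image and run the full pipeline."""
    blank = Image.new("RGB", (224, 224), color=(255, 255, 255))  # assumed blank-image size/color
    enc = processor(images=blank, text=text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # Mean-pool the language-model hidden states, which see both the projected Q-Former
    # queries and the text tokens (again, my own pooling choice).
    return out.language_model_outputs.hidden_states[-1].mean(dim=1).squeeze(0)
```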
Many thanks!