Create custom_st.py
Based on the updated demo, I've created a custom SentenceTransformer model, following the Jasper implementation. However, I don't have the hardware to test it.
@tomaarsen, could you review it as well, please?
Currently, my main concern is whether I need to add the <|begin_of_text|> token after the image or not. The processor automatically adds <|begin_of_text|>, but it does not add <|end_of_text|>, and I'm not sure whether that is needed.
processor(text="<|image|><|begin_of_text|> Represent the given image.", images=[Image.open('/path/to/image')], return_tensors="pt")
# {'input_ids': tensor([[128000, 128256, 128000, 22717, 279, 2728, 2217, 13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]), ...
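For reference, one quick way to see exactly which special tokens the processor inserts is to decode the produced ids back into token strings. A small sketch (the repo id and attribute access are assumptions based on this discussion, not part of the original snippet):
from PIL import Image
from transformers import AutoProcessor

# Assumed: load the processor for this repository.
processor = AutoProcessor.from_pretrained("intfloat/mmE5-mllama-11b-instruct")
inputs = processor(
    text="<|image|><|begin_of_text|> Represent the given image.",
    images=[Image.open('/path/to/image')],
    return_tensors="pt",
)
# Map ids back to tokens to check for <|begin_of_text|> / <|end_of_text|>.
print(processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))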
Hi @Samoed,
Thanks for your contribution! I tested your code and fixed the model loading part. Here is the test code I used:
import requests
import torch

from custom_st import MultiModalTransformer  # assumes the class lives in the custom_st.py from this discussion
model = MultiModalTransformer("intfloat/mmE5-mllama-11b-instruct").cuda()
image_bytes = requests.get('https://github.com/haon-chen/mmE5/blob/main/figures/example.jpg?raw=true', stream=True).raw.read()
qry_input = [
[
{"type": "image_bytes", "content": image_bytes},
{"type": "text", "content": " Represent the given image with the following question: What is in the image"},
]
]
qry_features = model.tokenize(qry_input)
qry_output = model(features={k: v.cuda() for k, v in qry_features.items()})["sentence_embedding"][:1]
tgt_input = ["A cat and a dog", "A cat and a tiger"]
tgt_features = model.tokenize(tgt_input)
tgt_output = model(features={k: v.cuda() for k, v in tgt_features.items()})["sentence_embedding"]
scores = torch.matmul(qry_output, tgt_output.transpose(0, 1))
print(scores)
# tensor([[0.3965, 0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<MmBackward0>)
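In case the custom module does not already L2-normalize the sentence embeddings, the dot products above can also be turned into explicit cosine similarities; this is only an optional sketch, not part of the original test:
import torch.nn.functional as F

# Normalize both sides so the matmul yields cosine similarities.
scores = torch.matmul(F.normalize(qry_output, dim=-1), F.normalize(tgt_output, dim=-1).transpose(0, 1))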
Turns out there is no need to add <|end_of_text|>.
I think this should work with SentenceTransformers
import requests

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/mmE5-mllama-11b-instruct", trust_remote_code=True)
image_bytes = requests.get('https://github.com/haon-chen/mmE5/blob/main/figures/example.jpg?raw=true', stream=True).raw.read()
doc_list = [
"just some text",
[
{"type": "image_bytes", "content": image_bytes},
{"type": "text", "content": " Represent the given image with the following question: What is in the image"},
],
]
doc_vecs = model.encode(doc_list)
And if that works, you can add the sentence-transformers pipeline tag for better adoption.
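For illustration, that would roughly amount to the following model card metadata (a sketch only; the exact tags to use are an assumption, not something agreed on in this thread):
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
---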
Unfortunately it does not: MultiModalTransformer does not have an encode attribute. Also, MllamaProcessor requires that "if a batch of text is provided, there should be either no images or at least one image per sample".
I am still not quite clear on how sentence-transformers supports multi-modal embeddings.
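One way to sidestep the MllamaProcessor restriction in the meantime, reusing the objects from the test snippet above, is to keep text-only and image inputs in separate batches (just a sketch of that idea, not a fix for the SentenceTransformer integration):
# Text-only samples and image samples are tokenized and encoded separately,
# so the processor never sees a mixed batch.
text_features = model.tokenize(["just some text"])
text_emb = model(features={k: v.cuda() for k, v in text_features.items()})["sentence_embedding"]

image_features = model.tokenize(qry_input)  # qry_input as defined in the test snippet
image_emb = model(features={k: v.cuda() for k, v in image_features.items()})["sentence_embedding"]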
I've been working with a clone of this repository for the last little while to see if I can indeed help @Samoed introduce multi-modal embeddings for this model via Sentence Transformers. I believe some files are missing, but other than that, I suspect it should be possible. I'll update here, or make a PR if I get something working.
Thanks for the great work here @Samoed. I think we're missing a modules.json, which Sentence Transformers uses to figure out which "Modules" to use. See for example the modules.json for Jasper: https://huggingface.co/NovaSearch/jasper_en_vision_language_v1/blob/main/modules.json
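For this repository, a minimal modules.json might look roughly like the following (a sketch only; the class name, pooling, and normalization modules are assumptions to be adapted to custom_st.py):
[
  {"idx": 0, "name": "0", "path": "", "type": "custom_st.MultiModalTransformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]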
- Tom Aarsen