Create custom_st.py
Based on the updated demo, I've created a custom SentenceTransformer model, following the Jasper implementation. However, I don't have the hardware to test it.
@tomaarsen, could you review it as well, please?
Currently, my main concern is whether I need to add the <|begin_of_text|> token after the image or not. The processor automatically adds <|begin_of_text|>, but it does not add <|end_of_text|>, and I'm not sure whether that is needed.
processor(text="<|image|><|begin_of_text|> Represent the given image.", images=[Image.open('/path/to/image')], return_tensors="pt")
# {'input_ids': tensor([[128000, 128256, 128000, 22717, 279, 2728, 2217, 13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]), ...
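For reference, one quick way to see exactly which special tokens the processor inserts is to decode the produced ids back into token strings. A small sketch (the repo id and attribute access are assumptions based on this discussion, not part of the original snippet):
from PIL import Image
from transformers import AutoProcessor

# Assumed: load the processor for this repository.
processor = AutoProcessor.from_pretrained("intfloat/mmE5-mllama-11b-instruct")
inputs = processor(
    text="<|image|><|begin_of_text|> Represent the given image.",
    images=[Image.open('/path/to/image')],
    return_tensors="pt",
)
# Map ids back to tokens to check for <|begin_of_text|> / <|end_of_text|>.
print(processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))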
Hi @Samoed,
Thanks for your contribution! I tested your code and fixed the model loading part. Here is the test code I used:
import requests
import torch

from custom_st import MultiModalTransformer  # assumes the class lives in the custom_st.py from this discussion
model = MultiModalTransformer("intfloat/mmE5-mllama-11b-instruct").cuda()
image_bytes = requests.get('https://github.com/haon-chen/mmE5/blob/main/figures/example.jpg?raw=true', stream=True).raw.read()
qry_input = [
[
{"type": "image_bytes", "content": image_bytes},
{"type": "text", "content": " Represent the given image with the following question: What is in the image"},
]
]
qry_features = model.tokenize(qry_input)
qry_output = model(features={k: v.cuda() for k, v in qry_features.items()})["sentence_embedding"][:1]
tgt_input = ["A cat and a dog", "A cat and a tiger"]
tgt_features = model.tokenize(tgt_input)
tgt_output = model(features={k: v.cuda() for k, v in tgt_features.items()})["sentence_embedding"]
scores = torch.matmul(qry_output, tgt_output.transpose(0, 1))
print(scores)
# tensor([[0.3965, 0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=<MmBackward0>)
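In case the custom module does not already L2-normalize the sentence embeddings, the dot products above can also be turned into explicit cosine similarities; this is only an optional sketch, not part of the original test:
import torch.nn.functional as F

# Normalize both sides so the matmul yields cosine similarities.
scores = torch.matmul(F.normalize(qry_output, dim=-1), F.normalize(tgt_output, dim=-1).transpose(0, 1))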
Turns out there is no need to add <|end_of_text|>.
I think this should work with SentenceTransformers
import requests

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/mmE5-mllama-11b-instruct", trust_remote_code=True)
image_bytes = requests.get('https://github.com/haon-chen/mmE5/blob/main/figures/example.jpg?raw=true', stream=True).raw.read()
doc_list = [
"just some text",
[
{"type": "image_bytes", "content": image_bytes},
{"type": "text", "content": " Represent the given image with the following question: What is in the image"},
],
]
doc_vecs = model.encode(doc_list)
And if that works, you can add the sentence-transformers pipeline tag for better adoption.
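For illustration, that would roughly amount to the following model card metadata (a sketch only; the exact tags to use are an assumption, not something agreed on in this thread):
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
---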
Unfortunately it does not: MultiModalTransformer does not have an encode attribute. Also, MllamaProcessor requires that "if a batch of text is provided, there should be either no images or at least one image per sample".
I am still not quite clear on how sentence-transformers supports multi-modal embeddings.
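One way to sidestep the MllamaProcessor restriction in the meantime, reusing the objects from the test snippet above, is to keep text-only and image inputs in separate batches (just a sketch of that idea, not a fix for the SentenceTransformer integration):
# Text-only samples and image samples are tokenized and encoded separately,
# so the processor never sees a mixed batch.
text_features = model.tokenize(["just some text"])
text_emb = model(features={k: v.cuda() for k, v in text_features.items()})["sentence_embedding"]

image_features = model.tokenize(qry_input)  # qry_input as defined in the test snippet
image_emb = model(features={k: v.cuda() for k, v in image_features.items()})["sentence_embedding"]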
I've been working with a clone of this repository for the last little while to see if I can indeed help @Samoed introduce multi-modal embeddings for this model via Sentence Transformers. I believe some files are missing, but other than that, I suspect it should be possible. I'll update here, or make a PR if I get something working.
Thanks for the great work here @Samoed. I think we're missing a modules.json, which Sentence Transformers uses to figure out which "Modules" to use. See for example the modules.json for Jasper: https://huggingface.co/NovaSearch/jasper_en_vision_language_v1/blob/main/modules.json
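For this repository, a minimal modules.json might look roughly like the following (a sketch only; the class name, pooling, and normalization modules are assumptions to be adapted to custom_st.py):
[
  {"idx": 0, "name": "0", "path": "", "type": "custom_st.MultiModalTransformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]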
- Tom Aarsen