Batch inputs (image, prompt)
#10
by
jeeyungk
- opened
Can we use a batch of image as an input to LLaVA?
Hi! Yes Llava-1.5 can take batched inputs, see the code snippet below:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
prompts = [
"USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
"USER: <image>\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(output)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
Hi,
You need to place the inputs on the GPU as well, so the snippet above needs to add:
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda")
Why can it only recognize the first picture and not reply to the two pictures?
@ZIHANGDU18 the models was not trained with multi-image setting and thus may perform poorly without proper fine-tuning. Try out the new llava series, tuned with multi-image dataset :)
https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19