Incorrect processor behavior for multi-image inference
According to the paper, when multiple images are provided in one conversation, each image should remain whole instead of being cropped into patches, so that the number of visual tokens is kept low.
However, the HF implementation does not seem to handle this case. When given a prompt with six images, the processor outputs around 40k image placeholder tokens, which is not correct: it should output only 729 placeholder tokens per image. This suggests the implementation simply extends the single-image path to the multi-image case, so each image is still split into patches.
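For reference, here is a minimal sketch of the token accounting described above; the helper name and the patches_per_image default are hypothetical illustrations of the paper's strategy, not the HF API:

BASE_TOKENS = 729  # placeholder tokens for one whole (un-patched) image, as above

def expected_image_tokens(num_images: int, patches_per_image: int = 9) -> int:
    # Hypothetical sketch: anyres patching is applied only when there is
    # a single image; with multiple images, each image stays whole.
    if num_images == 1:
        # base view plus the anyres patches
        return BASE_TOKENS * (1 + patches_per_image)
    return BASE_TOKENS * num_images

print(expected_image_tokens(6))  # 4374, far fewer than ~40k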
This is the code I used:
import requests
from PIL import Image
from transformers import AutoProcessor

# Load the processor (assuming a LLaVA-OneVision checkpoint here;
# the exact checkpoint is an assumption on my part):
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing six image placeholders
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
        ] * 6,
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image] * 6, return_tensors="pt").to("cuda:0")

# 151646 is the <image> placeholder token id for this tokenizer
print("number of tokens:", (inputs["input_ids"] == 151646).sum().item())
And the output is:
number of tokens: 39306
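For comparison, 39306 / 6 = 6551 tokens per image, whereas multi-image inputs should use only 729 placeholder tokens each (6 × 729 = 4374 in total). That per-image count is consistent with each image still being routed through the single-image patching path.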