RuntimeError: Could not infer dtype of numpy.float32 when converting to PyTorch tensor

#8 · opened by Koshti10

Hello,
Thank you for releasing the Transformers-compatible version of this model. I am trying to run the base inference script provided on the model page, with just one change: I've added padding=True to the processor call. I tried with and without this argument, but the error below persists either way, for both idefics2-8b and idefics2-8b-chatty.
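For reference, the only line I changed relative to the model-card script is the processor call (I see the same error with or without the extra argument):

# Model-card version of the call:
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
# My modified call (same error either way):
inputs = processor(text=prompt, images=[image1, image2], padding=True, return_tensors="pt")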

System -
Linux clp-a100 6.5.0-26-generic #26~22.04.1-Ubuntu
transformers==4.43.1
torch==2.1.1
numpy==2.0.1
device=Nvidia A100 x4

Error -

Traceback (most recent call last):
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 183, in convert_to_tensors
    tensor = as_tensor(value)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 142, in as_tensor
    return torch.tensor(value)
RuntimeError: Could not infer dtype of numpy.float32

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/kkoshti/clembench/backends/multimodal_utils/idefics3_utils.py", line 45, in <module>
    inputs = processor(text=prompt, images=[image1, image2], padding=True, return_tensors="pt")
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py", line 230, in __call__
    image_inputs = self.image_processor(images, return_tensors=return_tensors)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 41, in __call__
    return self.preprocess(images, **kwargs)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/models/idefics2/image_processing_idefics2.py", line 596, in preprocess
    return BatchFeature(data=data, tensor_type=return_tensors)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 79, in __init__
    self.convert_to_tensors(tensor_type=tensor_type)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 189, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.

CODE -

import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']

Hey, we put a disclaimer on the model card:

/!!!!\ WARNING: Idefics2 will NOT work with Transformers version between 4.41.0 and 4.43.3 included. See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.

I'm not sure it's related to your bug, but in any case it might not help, or it might add another silent bug on top of it.
Maybe you can retry with version 4.40 first (or with a version of Transformers that includes the fix) to see if it helps?
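As a rough, untested sketch, something like this should tell you whether your installed version falls inside the broken range from the warning above (packaging already ships as a Transformers dependency):

import transformers
from packaging import version  # packaging is a transformers dependency

v = version.parse(transformers.__version__)
# Broken range per the model-card warning: 4.41.0 up to and including 4.43.3
broken = version.parse("4.41.0") <= v <= version.parse("4.43.3")
print(f"transformers {v}: {'inside' if broken else 'outside'} the broken 4.41.0-4.43.3 range")
# To get out of the range you could, for example, pin an older release such as
# `pip install "transformers==4.40.2"`, or install a build that already
# contains the fix from the linked PR.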
