8bit model always returns empty string

#26
by zwang2022 - opened

I tried the following code both on my personal computer and on Kaggle; it always returned an empty string.

I tried replacing the image and the prompt; in most cases it still returned an empty string, and only in a few cases did it return random words.

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Hi @zwang2022!
Hmm, interesting.
Can you try with the latest transformers & bitsandbytes? pip install -U transformers bitsandbytes
Also, do you face the same issue with 4-bit?

Hi @ybelkada ,
I also tried the full float16 version on a new machine with an NVIDIA A40. The same issue happened.
Finally, I found that if the image or the prompt was not set properly, BLIP-2 refused to output anything.

I was unlucky: my first image with its corresponding prompt did not get BLIP-2 to produce any output.

When I added "Answer: " to the end of the prompt, the model was more willing to answer, but still, around 50 out of 1000 images with the same prompt went unanswered.

Also, BLIP-2's text understanding and generation underperform, so I believe this is not an issue with the code example but rather the inherent chat ability of BLIP-2.
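The workaround above (re-prompting when the model returns nothing) can be sketched as a small fallback loop. Note that `answer_with_fallback` and `generate_answer` are hypothetical names of mine, not part of any library; `generate_answer` stands in for the processor + model.generate + decode call, and the prompt templates are the variants tried in this thread:

```python
def answer_with_fallback(generate_answer, question):
    """Try several prompt templates until the model returns a non-empty answer.

    `generate_answer` is a hypothetical callable wrapping processor +
    model.generate + decode; it takes a prompt string and returns decoded text.
    """
    templates = [
        "Question: {q} Answer:",  # template from the BLIP-2 paper
        "{q} Answer:",            # suffix-only variant tried in this thread
        "{q}",                    # raw question as a last resort
    ]
    for template in templates:
        answer = generate_answer(template.format(q=question)).strip()
        if answer:
            return answer
    return ""  # every template produced an empty string

# Usage with a stub that only answers the paper-style template:
stub = lambda p: "1" if p.startswith("Question:") else ""
print(answer_with_fallback(stub, "how many dogs are in the picture?"))  # 1
```

This doesn't fix the underlying quantization issue, but it recovers answers for the ~5% of images where a single template fails.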

Same issue. Is anyone looking into this?

I'm able to reproduce this. cc @ybelkada

Same with 4-bit too. It works perfectly for the same image at full precision, though. I noticed this in multiple cases.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_4bit=True, device_map="auto")

raw_image = Image.open("01256.png").convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())

Returns an empty string.

But at full precision, it returns this perfectly: dog with a caption that says when your debt card declines at the clinic and they have to put the baby back in
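One thing worth ruling out (an assumption on my part, not confirmed in this thread): in the quantized runs the model may be emitting EOS immediately, which decodes to an empty string under `generate`'s defaults. Forcing at least one new token is a cheap check; these are standard transformers `generate` kwargs, though whether they help these checkpoints is something to verify:

```python
# Hedged sketch: generation settings that rule out an immediate-EOS decode.
gen_kwargs = dict(
    max_new_tokens=30,  # cap the answer length
    min_new_tokens=1,   # force at least one non-EOS token
    num_beams=3,        # beam search often helps short VQA answers
)
# out = model.generate(**inputs, **gen_kwargs)
```

If the output is still empty with min_new_tokens set, the problem is more likely the quantized weights themselves rather than the decoding loop.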

import requests
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b", device_map="auto"
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Adjusting the prompt as per the paper:
question = "Question: how many dogs are in picture? Answer:"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

# Normal question
question = "how many dogs are in picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Gives me "1" and an empty string, respectively. It seems generation expects a specific prompt format.
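That prompt-format sensitivity can be wrapped in a tiny helper so every question goes through the template from the BLIP-2 paper. The name `format_blip2_prompt` is mine, not an API:

```python
def format_blip2_prompt(question: str) -> str:
    # Wrap a raw question in the "Question: ... Answer:" template,
    # which the OPT-based BLIP-2 checkpoints answer far more reliably
    # than a bare question (per the experiment above).
    q = question.strip()
    if not q.endswith("?"):
        q += "?"
    return f"Question: {q} Answer:"

print(format_blip2_prompt("how many dogs are in picture"))
# Question: how many dogs are in picture? Answer:
```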

I'm not using 8-bit, just fp16, but the same thing happens.

When I added "Answer: " to the end of the prompt, it said nothing.

When I added "Answer: " to the end AND "Please" at the start of the prompt, it actually gave an answer, but a very short one ("woman sitting on the beach").

When I said "Please", without "Answer: " at the end, the response was hilarious: "Don't just say its a photo."
