metadata

license: cc-by-nc-4.0
language:
  - en
pipeline_tag: image-text-to-text

Model description

BLIP-3 consists of 3 models: a CLIP-like image encoder, a VL connector, and a large language model.

Direct Use and Downstream Use

Bias, Risks, Limitations, and Ethical Considerations

How to use

The development version ("4.41.0.dev0") of the transformers library is required. As of 05/07/2024, it can be installed with pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers.

from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor, StoppingCriteria
import torch
import requests
from PIL import Image

# define the prompt template
def apply_prompt_template(prompt):
    s = (
            '<|system|>\nA chat between a curious user and an artificial intelligence assistant. '
            "The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n"
            f'<|user|>\n<image>\n{prompt}<|end|>\n<|assistant|>\n'
        )
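    return s


class EosListStoppingCriteria(StoppingCriteria):
    """Stop generation once a sequence in the batch ends with `eos_sequence`."""

    def __init__(self, eos_sequence=None):
        # avoid a mutable default argument; [32007] is the default stop id
        self.eos_sequence = [32007] if eos_sequence is None else eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # compare the trailing token ids of each row against the stop sequence
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids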

# load models
model_name_or_path = "Salesforce/blip3-phi3-3b-instruct-r-v1"
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=True, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)

# craft a test sample
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
query = "how many dogs are in the picture?"

model = model.cuda()
inputs = image_processor([raw_image], return_tensors="pt", image_aspect_ratio='anyres')
prompt = apply_prompt_template(query)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
generated_text = model.generate(**inputs, image_size=[raw_image.size],
                                pad_token_id=tokenizer.pad_token_id,
                                do_sample=False, max_new_tokens=768, top_p=None, num_beams=1,
                                stopping_criteria = [EosListStoppingCriteria()],
                                )
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediction: ", prediction)
# output: ==> prediction: There is one dog in the picture.
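
The stopping rule used above can be illustrated without loading the model. The following is a minimal sketch that uses plain Python lists instead of tensors; the helper name ends_with_stop_sequence is hypothetical and is not part of transformers. It mirrors the check in EosListStoppingCriteria: the trailing token ids of each row are compared against the stop sequence.

```python
def ends_with_stop_sequence(batch_ids, stop_sequence):
    """Return True if any row of `batch_ids` ends with `stop_sequence`."""
    n = len(stop_sequence)
    # keep only the last n ids of every row, as the slicing in __call__ does
    tails = [row[-n:] for row in batch_ids]
    return stop_sequence in tails


STOP = [32007]  # default stop id from the example above

print(ends_with_stop_sequence([[1, 15, 32007]], STOP))  # True
print(ends_with_stop_sequence([[1, 15, 42]], STOP))     # False
```

Because the check uses `in` over the whole batch, one finished row is enough to return True. That is fine for the single-sample example here, but with larger batches it may stop other rows early.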

License

Our code and weights are released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.

Troubleshoot

If you are missing any packages, consider installing the following:

pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
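
To see which of these packages are already present before reinstalling, a small check along the following lines may help. This is only a sketch: it assumes the installed distribution names match those used in the pip commands above (plus transformers), and the helper name report_installed is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above; adjust as needed.
PACKAGES = [
    "transformers",
    "torch",
    "torchvision",
    "torchaudio",
    "open_clip_torch",
    "einops",
    "einops-exts",
]


def report_installed(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in report_installed(PACKAGES).items():
        print(f"{name}: {ver if ver is not None else 'MISSING'}")
```

Any package reported as MISSING can then be installed with the corresponding command above.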