--- pipeline_tag: image-to-text tags: - image-captioning languages: - en license: bsd-3-clause --- # BLIP - Math Our model is fine-tuned on a mathematical multi-modal dataset, and it comprises two output heads: text generation and scoring. We provide the weight file for the text generation part of the model 'pytorch_model.bin.' You will need 4 input sources, including two text inputs and two image inputs: 'problem_body,' 'student_response,' 'question_image,' and 'student_image.' To perform conditional text generation: - Concatenating the text in the following manner: text = 'problem:' + ' ' + [problem_body] + ' ' + 'student:' + [student_response] + ' ' + 'response:' - Concatenating [question_image] and [student_image] vertically while keeping [question_image] on top and selecting the larger of the two image sizes. For all other usage, follow the same procedures as with the BLIP model. If you have any further questions or need assistance with specific code or implementation details, please feel free to ask. # BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Model card for image captioning pretrained on COCO dataset - base architecture (with ViT base backbone). | ![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) | |:--:| | Pull figure from BLIP official repo | Image source: https://github.com/salesforce/BLIP | ## TL;DR Authors from the [paper](https://arxiv.org/abs/2201.12086) write in the abstract: *Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.* ## Usage You can use this model for conditional and un-conditional image captioning ### Using the Pytorch model #### Running the model on CPU
Click to expand ```python import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base") model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # >>> a photography of a woman and her dog # unconditional image captioning inputs = processor(raw_image, return_tensors="pt") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) >>> a woman sitting on the beach with her dog ```
#### Running the model on GPU ##### In full precision
Click to expand ```python import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base") model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt").to("cuda") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # >>> a photography of a woman and her dog # unconditional image captioning inputs = processor(raw_image, return_tensors="pt").to("cuda") out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) >>> a woman sitting on the beach with her dog ```
##### In half precision (`float16`)
Click to expand ```python import torch import requests from PIL import Image from transformers import BlipProcessor, BlipForConditionalGeneration processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base") model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16).to("cuda") img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB') # conditional image captioning text = "a photography of" inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16) out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) # >>> a photography of a woman and her dog # unconditional image captioning inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16) out = model.generate(**inputs) print(processor.decode(out[0], skip_special_tokens=True)) >>> a woman sitting on the beach with her dog ```
## BibTex and citation info ``` @misc{https://doi.org/10.48550/arxiv.2201.12086, doi = {10.48550/ARXIV.2201.12086}, url = {https://arxiv.org/abs/2201.12086}, author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven}, keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, publisher = {arXiv}, year = {2022}, copyright = {Creative Commons Attribution 4.0 International} } ```