Fine-Tuned Image Captioning Model

This is a fine-tuned version of BLIP for visual answering on retail product images. This model is finetuned on custom dataset with images from online retail platform and annotated with product description.

This experimental model can be used for answering questions on product images in retail industry. Product meta data enrichment, Validation of human generated product description are some of the examples sue case.

Sample model predictions

Input Image Prediction
image/png kitchenaid artisann stand mixer
a bottle of milk sitting on a counter
image/jpeg dove sensitive skin lotion
bread bag bread bag with blue plastic handl
image/png bush ' s best white beans

How to use the model:

Click to expand
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

BibTex and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  
  url = {https://arxiv.org/abs/2201.12086},
  
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  
  publisher = {arXiv},
  
  year = {2022},
  
  copyright = {Creative Commons Attribution 4.0 International}
}
Downloads last month
96
Safetensors
Model size
470M params
Tensor type
F32
Β·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for quadranttechnologies/qhub-blip-image-captioning-finetuned

Finetuned
(11)
this model

Dataset used to train quadranttechnologies/qhub-blip-image-captioning-finetuned

Spaces using quadranttechnologies/qhub-blip-image-captioning-finetuned 2