---
language:
- en
base_model:
- Salesforce/blip-image-captioning-base
pipeline_tag: image-to-text
tags:
- art
license: apache-2.0
metrics:
- bleu
library_name: transformers
datasets:
- phiyodr/coco2017
---
### Fine-Tuned Image Captioning Model
This is a fine-tuned version of BLIP for visual question answering on retail product images. The model is fine-tuned on a custom dataset of images from an online retail platform, each annotated with a product description.
This experimental model can be used to answer questions about product images in the retail industry. Example use cases include product metadata enrichment and validation of human-generated product descriptions.
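
As an illustration of the description-validation use case, the sketch below compares a model caption with a human-written product description using simple token overlap. The helper names (`caption_overlap`, `needs_review`), the example description, and the 0.5 threshold are assumptions made for this example; they are not part of the model or of `transformers`, and a production check would likely need a more robust similarity measure.

```python
import re

# Hypothetical cut-off: descriptions sharing fewer than half of the caption's
# tokens are flagged for manual review. Tune this on your own data.
REVIEW_THRESHOLD = 0.5


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def caption_overlap(caption: str, description: str) -> float:
    """Return the fraction of caption tokens that also occur in the description."""
    caption_tokens = tokenize(caption)
    if not caption_tokens:
        return 0.0
    shared = caption_tokens & tokenize(description)
    return len(shared) / len(caption_tokens)


def needs_review(caption: str, description: str,
                 threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a description whose overlap with the caption is below the threshold."""
    return caption_overlap(caption, description) < threshold


# The caption is one of the sample predictions; the description is made up.
caption = "dove sensitive skin lotion"
description = "Dove Sensitive Skin Body Lotion, 400 ml"
print(caption_overlap(caption, description))
print(needs_review(caption, description))
```

In practice, `caption` would come from `processor.decode(...)` on the model output, and flagged items would be routed to a human reviewer rather than rejected automatically.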
# Sample model predictions
| Input Image | Prediction |
|-------------------------------------------|--------------------------------|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/KTnUTaTjrIG7dUyR1aMho.png" alt="image/png" width="100" height="100" /> | kitchenaid artisann stand mixer|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/Skt_sjYxbfQu056v2C1Ym.png" width="100" height="100" /> | a bottle of milk sitting on a counter |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/Zp1OMzO4BEs7s9k3O5ij7.jpeg" alt="image/jpeg" width="100" height="100" />| dove sensitive skin lotion |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/dYNo38En0M0WpKONS8StX.jpeg" alt="bread bag" width="100" height="100" /> | bread bag with blue plastic handl|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/oypT9482ysQjC0usEHGbT.png" alt="image/png" width="100" height="100" /> | bush ' s best white beans |
### How to use the model:
<details>
<summary> Click to expand </summary>
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

# Download a demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption continues the given text prompt
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is supplied
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>
## BibTeX and citation info
```
@misc{https://doi.org/10.48550/arxiv.2201.12086,
doi = {10.48550/ARXIV.2201.12086},
url = {https://arxiv.org/abs/2201.12086},
author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```