---
language:
- en
base_model:
- Salesforce/blip-image-captioning-base
pipeline_tag: image-to-text
tags:
- art
license: apache-2.0
metrics:
- bleu
library_name: transformers
datasets:
- phiyodr/coco2017
---

### Fine-Tuned Image Captioning Model

This is a fine-tuned version of BLIP for visual question answering on images. The model is fine-tuned on the Stanford Online Products dataset, which comprises 120k product images from an online retail platform. The dataset was enriched with LLM-generated answers and then used to fine-tune the model.

This experimental model can be used for answering questions about product images in the retail industry. Example use cases include product metadata enrichment and validation of human-generated product descriptions.

## Sample model predictions

| Input Image | Model Output |
|-------------------------------------|--------------------------------|
| ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/-Ux5mU-JDpZvdhNq-sSiw.jpeg) | chips nachos |
| ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/-Z87gp9zWg2FiLTUCu8Ir.jpeg) | a man in a suit walking across a crosswalk |
| ![image/png](https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/YcSs_CFcRj-Tb4woXIArC.png) | bush ' s best white beans |

## BibTeX and citation info

```
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
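
## How to use

A minimal usage sketch with the standard `transformers` BLIP classes; the repository id below is a placeholder, so substitute this model's actual id. Both plain captioning and a question-style prompt (conditional generation) are shown.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Placeholder: replace with this repository's model id (hypothetical name).
MODEL_ID = "your-username/blip-product-vqa"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Load a product image (any RGB image works).
url = "https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/YcSs_CFcRj-Tb4woXIArC.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Plain captioning: no text prompt.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Question-style prompt: the text conditions the generated output.
inputs = processor(images=image, text="what brand is this product?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```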