---
license: apache-2.0
inference: false
pipeline_tag: text-generation
tags:
- text-generation-inference
- llama2
- text-to-image
datasets:
- TIFA
language:
- en
---

This is the text parsing and question generation model for the ICCV 2023 paper [TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering](https://arxiv.org/abs/2303.11897).

We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Given a text input, we automatically generate several question-answer pairs with a language model. We then calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image.

This fine-tuned LLaMA 2 model serves as the substitute for the GPT-3 model used in the paper. It parses an arbitrary prompt into visual entities, attributes, relations, etc., and generates question-answer tuples for each of them. See the example below.

# QuickStart

All code is from the TIFA code repository. Clone that repository to use this model together with the other modules (e.g., the VQA models) provided in TIFA.

Please follow the prompt format below, which gives the best performance.

```python
import torch
import transformers

# prepare the LLaMA 2 model (this repository's checkpoint on the Hugging Face Hub)
model_name = "tifa-benchmark/llama2_tifa_question_generation"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# format the caption with the LLaMA 2 instruction template used for fine-tuning
# (see the TIFA repository for the exact system prompt text)
def create_qg_prompt(caption):
    intro = ("Given an image description, generate one or two multiple-choice questions "
             "that verify if the image description is correct.\n"
             "Classify each concept into a type (object, human, animal, food, activity, attribute, "
             "counting, color, material, spatial, location, shape, other), "
             "and then generate a question for each type.\n")
    return f"<s>[INST] <<SYS>>\n{intro}\n<</SYS>>\n\nDescription: {caption} [/INST] Entities:"

test_caption = "a blue rabbit and a red plane"
prompt = create_qg_prompt(test_caption)

# generate the parsed entities and question-answer tuples
sequences = pipeline(prompt, do_sample=False, num_return_sequences=1, max_length=512)
output = sequences[0]["generated_text"][len(prompt):]
print(output)
```

## Bibtex

```
@article{hu2023tifa,
  title={TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering},
  author={Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Smith, Noah A and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2303.11897},
  year={2023}
}
```
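
## Computing the TIFA score

For reference, TIFA defines the faithfulness score of an image as the fraction of the generated questions that a VQA model answers correctly. The snippet below is a minimal sketch of that final scoring step; the `QAPair` container and the `vqa_answer` callable are hypothetical placeholders for this model's parsed output and for whichever VQA module (e.g., one provided in the TIFA repository) you plug in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """Hypothetical container for one question-answer tuple parsed from this model's output."""
    question: str
    choices: List[str]
    answer: str


def tifa_score(image_path: str,
               qa_pairs: List[QAPair],
               vqa_answer: Callable[[str, str, List[str]], str]) -> float:
    """Fraction of generated questions the VQA model answers correctly (the TIFA score).

    `vqa_answer(image, question, choices)` is a placeholder for any VQA module
    that returns one of the given choices.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image_path, qa.question, qa.choices).strip().lower() == qa.answer.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


# usage sketch: score one generated image against the QA pairs produced above
# score = tifa_score("generated_image.png", parsed_qa_pairs, my_vqa_model)
```

Averaging per-question correctness keeps the score in [0, 1] and makes it comparable across prompts that yield different numbers of questions.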