---
license: apache-2.0
inference: false
pipeline_tag: text-generation
tags:
- text-generation-inference
- llama2
- text-to-image
datasets:
- TIFA
language:
- en
---
This is the text-parsing and question-generation model for the ICCV 2023 paper *TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering*.
We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image.
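As a concrete illustration, here is a minimal sketch of the faithfulness score, assuming a list of (question, answer) tuples produced by this model and any VQA callable; `tifa_score` and `vqa_answer` are hypothetical names for illustration, not the API of the TIFA repo.

```python
from typing import Callable, List, Tuple

def tifa_score(
    image,
    qa_pairs: List[Tuple[str, str]],
    vqa_answer: Callable[[object, str], str],
) -> float:
    # TIFA score: the fraction of generated questions that the VQA model
    # answers correctly on the generated image
    correct = sum(vqa_answer(image, q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)
```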
This fine-tuned LLaMA 2 model replaces the GPT-3 model used in the paper. It parses an arbitrary prompt into visual entities, attributes, relations, etc., and generates question-answer tuples for each of them. See the examples below.
## QuickStart
All code is from https://github.com/Yushi-Hu/tifa. Clone that repo to use this model together with the other modules (e.g. VQA) provided in TIFA.
Please follow the prompt format shown below, which gives the best performance.
```python
import torch
import transformers

# prepare the LLaMA 2 model (this repo's checkpoint on the Hugging Face Hub)
model_name = "tifa-benchmark/llama2_tifa_question_generation"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# an example caption to parse into question-answer tuples
test_caption = "a blue rabbit and a red plane"
```
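The model expects a LLaMA 2 chat-style input. The helper below is a minimal sketch assuming the standard `[INST] <<SYS>>` chat template; `create_qg_prompt` and its instruction text are illustrative placeholders, and the exact system prompt used for fine-tuning lives in the question-generation code of the TIFA repo.

```python
def create_qg_prompt(caption):
    # placeholder instruction -- substitute the exact system prompt
    # from the question-generation code in the TIFA repo
    instruction = (
        "Given an image description, generate question-answer pairs that "
        "verify whether the description is correct."
    )
    return f"<s>[INST] <<SYS>>\n{instruction}\n<</SYS>>\n\nDescription: {caption} [/INST]"

prompt = create_qg_prompt(test_caption)

# deterministic beam search; generation kwargs pass through the pipeline
sequences = pipeline(
    prompt,
    do_sample=False,
    num_beams=5,
    num_return_sequences=1,
    max_length=512,
)

# the pipeline returns the prompt plus the completion; keep only the completion
output = sequences[0]["generated_text"][len(prompt):]
print(output)
```

The completion lists the parsed elements (entities, attributes, relations, etc.) followed by a question-answer tuple for each one.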
## Bibtex
```bibtex
@article{hu2023tifa,
  title={TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering},
  author={Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Krishna, Ranjay and Smith, Noah A},
  journal={arXiv preprint arXiv:2303.11897},
  year={2023}
}
```