---
license: apache-2.0
inference: false
pipeline_tag: text-generation
tags:
- text-generation-inference
- llama2
- text-to-image
datasets:
- TIFA
language:
- en
---
This is the text parsing and question generation model for the ICCV 2023 paper [TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering](https://arxiv.org/abs/2303.11897).

We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. 
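As a minimal sketch of how the score is computed, assuming the question-answer pairs have already been generated and that `vqa_answer(image, question, choices)` is a hypothetical stand-in for one of the VQA models shipped with TIFA, the faithfulness score is simply the fraction of questions answered correctly:

```python
# Minimal sketch of the TIFA faithfulness score (not the repo's exact API).
# `vqa_answer` is a hypothetical stand-in for a VQA model from the TIFA toolkit.
def tifa_score(image, qa_pairs, vqa_answer):
    correct = sum(
        vqa_answer(image, qa["question"], qa["choices"]) == qa["answer"]
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```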

This fine-tuned LLaMA 2 model is the substitute for the GPT-3 model used in the paper. It parses an arbitrary prompt into visual entities, attributes, relations, etc., and generates question-answer tuples for each of them. See the examples below.


# QuickStart

All code is from <https://github.com/Yushi-Hu/tifa>. Clone that repo to use this model together with the other modules (e.g., VQA) provided in TIFA.

Please follow the prompt format below; it gives the best performance.


```python
import torch
import transformers

# prepare the LLaMA 2 model
# (the released checkpoint on the Hugging Face Hub, i.e. this model card)
model_name = "tifa-benchmark/llama2_tifa_question_generation"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# prompt formatting: wrap the caption in the LLaMA 2 chat template used for fine-tuning
# (prompt format from the TIFA repo: https://github.com/Yushi-Hu/tifa)
def create_qg_prompt(caption):
    INTRO_BLURB = (
        "Given an image description, generate one or two multiple-choice questions "
        "that verifies if the image description is correct.\n"
        "Classify each concept into a type (object, human, animal, food, activity, "
        "attribute, counting, color, material, spatial, location, shape, other), "
        "and then generate a question for each type.\n"
    )
    formatted_prompt = f"<s>[INST] <<SYS>>\n{INTRO_BLURB}\n<</SYS>>\n\n"
    formatted_prompt += f"Description: {caption} [/INST] Entities:"
    return formatted_prompt

test_caption = "a blue rabbit and a red plane"
prompt = create_qg_prompt(test_caption)

# text completion: deterministic beam search, then keep only the newly generated text
sequences = pipeline(
    prompt, do_sample=False, num_beams=5, num_return_sequences=1, max_length=512)
output = sequences[0]['generated_text'][len(prompt):]
output = output.split('\n\n')[0]

print(output)
```
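The TIFA repo provides the actual parser and VQA modules for working with this generation. As a rough sketch only, assuming each question block in the generated text contains `Q:`, `Choices:`, and `A:` lines (an assumption about the output format, not the repo's parser), the question-answer tuples could be collected like this:

```python
# Rough sketch: collect question-answer tuples from the generated text.
# ASSUMPTION: each block has "Q:", "Choices:", and "A:" lines; use the
# parser in the TIFA repo for the canonical implementation.
def parse_question_blocks(generated_text):
    qa_pairs, current = [], {}
    for line in generated_text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            current = {"question": line[2:].strip()}
        elif line.startswith("Choices:") and current:
            current["choices"] = [c.strip() for c in line[len("Choices:"):].split(",")]
        elif line.startswith("A:") and current:
            current["answer"] = line[2:].strip()
            qa_pairs.append(current)
            current = {}
    return qa_pairs

qa_pairs = parse_question_blocks(output)
```

These tuples can then be passed to a VQA model to compute the faithfulness score described above.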

## Bibtex
```
@article{hu2023tifa,
  title={TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering},
  author={Hu, Yushi and Liu, Benlin and Kasai, Jungo and Wang, Yizhong and Ostendorf, Mari and Krishna, Ranjay and Smith, Noah A},
  journal={arXiv preprint arXiv:2303.11897},
  year={2023}
}
```