---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
# Model Card for IAA: Inner-Adaptor Architecture
**GitHub**: https://github.com/360CVGroup/Inner-Adaptor-Architecture

**[IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities](https://www.arxiv.org/abs/2408.12902)**

Bin Wang*, Chunyu Xie*, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)

[![arXiv](https://img.shields.io/badge/arXiv-2408.12902-b31b1b.svg)](https://www.arxiv.org/abs/2408.12902)

We propose an MLLM based on the Inner-Adaptor Architecture (IAA). IAA demonstrates that training with a frozen language model can surpass models with fine-tuned LLMs on both multimodal comprehension and visual grounding tasks. Moreover, after deployment, our approach supports multiple workflows, thereby preserving the NLP proficiency of the language model. With a single download, the model can be fine-tuned to meet various task specifications. Enjoy the seamless experience of using our IAA model.
<p align="center">
<img src="overview.png" width=80%/>
</p>
## Model Performance
### Main Results on General Multimodal Benchmarks
<p align="center">
<img src="mmresult.png" width=90%/>
</p>
### Results on Visual Grounding Benchmarks
<!-- grounding_re -->
<p align="center">
<img src="grounding_re.png" width=90%/>
</p>
### Comparison on Text-only Question Answering
<!-- grounding_re -->
<p align="center">
<img src="NLPresult.png" width=90%/>
</p>
## Quick Start 🤗
### First, load our model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/Inner-Adaptor-Architecture"

# trust_remote_code=True lets Transformers run the custom model code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision encoder and move it to the same device and dtype as the language model.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor

# Reuse the EOS token for padding and stop generation at the <|eot_id|> token.
tokenizer.pad_token = tokenizer.eos_token
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
```
### Multimodal Workflow: task_type="MM"
```python
image = Image.open("readpanda.jpg").convert('RGB')
query = "What animal is in the picture?"

# Build the prompt token IDs and the preprocessed image tensor.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)
input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

# Greedy decoding with the multimodal workflow.
output_ids = model.generate(
    input_ids,
    task_type="MM",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
### Grounding Workflow: task_type="G"
```python
image = Image.open("COCO_train2014_000000014502.jpg").convert('RGB')
query = "Please provide the bounding box coordinate of the region this sentence describes: dude with black shirt says circa."

# Build the prompt token IDs and the preprocessed image tensor.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)
input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

# Greedy decoding with the grounding workflow.
output_ids = model.generate(
    input_ids,
    task_type="G",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
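
The grounding reply is plain text, so downstream code may need to turn it into numbers. The sketch below assumes, without confirmation from this model card, that the reply contains a box written as four bracketed numbers `[x1, y1, x2, y2]` normalized to the range 0 to 1. `parse_box` is a hypothetical helper, not part of the IAA repository, and the pattern should be adjusted if the actual output format differs.

```python
import re

# Assumed reply format (not confirmed by the model card): "[x1, y1, x2, y2]"
# with each value normalized to the range 0 to 1.
_NUM = r"(\d*\.?\d+)"
_BOX = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_box(reply, width, height):
    """Return (x1, y1, x2, y2) in pixels, or None if no box is found."""
    match = _BOX.search(reply)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    # Scale normalized coordinates to the pixel size of the input image.
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))
```

Under these assumptions, `parse_box(outputs, *image.size)` would map the reply onto the pixel grid of the PIL image used in the query.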
### Text-only Workflow: task_type="Text"
```python
query = "What is the approximate weight of an adult red panda?"

# No image is passed, so only the prompt token IDs are built.
inputs = model.build_conversation_input_ids(tokenizer, query=query)
input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = None

# Greedy decoding with the text-only workflow.
output_ids = model.generate(
    input_ids,
    task_type="Text",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
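
The three workflows above share the same steps and differ only in `task_type` and in whether an image is passed. As a minimal sketch, they could be wrapped in one helper. The name `run_iaa` and its signature are hypothetical and not part of the IAA repository; the helper only calls methods already shown above, and it assumes the loaded model exposes the usual `device` and `dtype` attributes of a Transformers model.

```python
def run_iaa(model, tokenizer, query, task_type, image=None,
            image_processor=None, terminators=None, max_new_tokens=512):
    """Hypothetical wrapper: run one IAA workflow and return the reply text."""
    if task_type not in ("MM", "G", "Text"):
        raise ValueError(f"unknown task_type: {task_type!r}")

    if task_type == "Text":
        # Text-only workflow: no image tensor is built or passed.
        inputs = model.build_conversation_input_ids(tokenizer, query=query)
        images = None
    else:
        if image is None or image_processor is None:
            raise ValueError("MM and G workflows need image and image_processor")
        inputs = model.build_conversation_input_ids(
            tokenizer, query=query, image=image, image_processor=image_processor)
        # Assumption: model.dtype / model.device match the vision tower setup.
        images = inputs["image"].to(
            dtype=model.dtype, device=model.device, non_blocking=True)

    input_ids = inputs["input_ids"].to(device=model.device, non_blocking=True)
    output_ids = model.generate(
        input_ids,
        task_type=task_type,
        images=images,
        do_sample=False,
        eos_token_id=terminators,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        use_cache=True)

    # Keep only the newly generated tokens, then decode and trim whitespace.
    new_tokens = output_ids[:, input_ids.shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

For example, `run_iaa(model, tokenizer, "What animal is in the picture?", "MM", image=image, image_processor=image_processor, terminators=terminators)` would correspond to the multimodal workflow, and passing `task_type="Text"` without an image would correspond to the text-only one.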
## We Are Hiring
We are seeking academic interns in the Multimodal field. If interested, please send your resume to [email protected].
## Citation
If you find IAA useful for your research and applications, please cite using this BibTeX:
```
@article{Wang2024IAA,
title={IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities},
author={Bin Wang and Chunyu Xie and Dawei Leng and Yuhui Yin},
journal={arXiv preprint arXiv:2408.12902},
year={2024},
}
```
## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
**Where to send questions or comments about the model:**
https://github.com/360CVGroup/Inner-Adaptor-Architecture
## Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [360VL](https://github.com/360CVGroup/360VL)