---
license: mit
language:
- en
library_name: transformers
---

# Model Card for MMICL

## Temporary Demo for MMICL

[Playground for MMICL-FLANT5XXL](https://bcd7bc41d42486e7c8.gradio.live/) supports multi-image input as well as video input.

## Model Details

**MMICL (Multi-Modal In-Context Learning)** is a multimodal vision-language model that builds on BLIP-2/InstructBLIP. It can analyze and understand multiple images and follow instructions.

### Model Description

MMICL outperforms vision-language models of the same size and performs exceptionally well on complex visual reasoning datasets. As of 21 Aug 2023, it achieves **state-of-the-art** performance on multimodal task leaderboards and a wide range of vision-language tasks. Furthermore, it showcases new capabilities in video understanding and multimodal in-context learning (M-ICL).

+ **Capability of multi-image referring and reasoning**
+ **Manually constructed in-context instruction tuning dataset**
+ As of 21 Aug 2023, **1st on [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) and 1st on [MMBench](https://opencompass.org.cn/leaderboard-multimodal)**
+ Visual Encoder: ViT-L from CLIP / ViT-G/14 from EVA-CLIP
+ Pre-trained LLM: FlanT5-XL / FlanT5-XXL / Vicuna-7B / Vicuna-13B

- **Developed by:** [More Information Needed]
- **License:** MIT
- **Finetuned from model:** [instructblip-flan-t5-xxl](https://huggingface.co/Salesforce/instructblip-flan-t5-xxl)
- **Repository:** [MMICL](https://github.com/HaozheZhao/MIC)

## How to Get Started with the Model

```python
# For the T5-based model
from model.instructblip import (
    InstructBlipConfig,
    InstructBlipModel,
    InstructBlipPreTrainedModel,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)
from PIL import Image
import torch

model_type = "instructblip"
model_ckpt = "BleachNick/MMICL-Instructblip-T5-xxl"
config_ckpt = "Salesforce/instructblip-flan-t5-xxl"
config = InstructBlipConfig.from_pretrained(config_ckpt)

if 'instructblip' in model_type:
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to('cuda:0', dtype=torch.bfloat16)

# Register the image placeholder tokens: "图" marks where visual features are
# inserted, and "<image{i}>" tags index the individual images in the prompt.
sp = ["图"] + [f"<image{i}>" for i in range(20)]
processor = InstructBlipProcessor.from_pretrained(model_ckpt)
sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})

prompt = ['Use the image 0: <image0>图, image 1: <image1>图 and image 2: <image2>图 as a visual aid to help you calculate the equation accurately. image 0 is 2+1=3.\nimage 1 is 5+6=11.\nimage 2 is"']
prompt = " ".join(prompt)

# Load the images referenced in the prompt as a list of PIL.Image objects
# (replace the placeholder paths with your own images).
images = [Image.open(f"images/image{i}.png") for i in range(3)]

inputs = processor(images=images, text=prompt, return_tensors="pt")

inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

inputs = inputs.to('cuda:0')
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
```

#### Training Hyperparameters

- **Training regime:** fp32, bf16 mixed precision, bf16 non-mixed precision
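
#### Example: Few-Shot Multimodal In-Context Prompt (Sketch)

Building on the multimodal in-context learning (M-ICL) capability and the placeholder convention used in the example above, a few-shot prompt can interleave exemplar images and answers before a final query image. The snippet below is a minimal sketch reusing the `model` and `processor` objects from the previous section; the image paths and exemplar texts are illustrative placeholders, not part of the original card.

```python
from PIL import Image
import torch

# Illustrative placeholder images: two labelled exemplars and one query image.
images = [
    Image.open("examples/dog.png"),
    Image.open("examples/cat.png"),
    Image.open("examples/query.png"),
]

# Each exemplar pairs an indexed tag ("<imageN>" + the visual token "图")
# with its answer; the final image is left for the model to complete.
prompt = (
    "image 0 is <image0>图. The animal in image 0 is a dog.\n"
    "image 1 is <image1>图. The animal in image 1 is a cat.\n"
    "image 2 is <image2>图. The animal in image 2 is a"
)

inputs = processor(images=images, text=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16).unsqueeze(0)
inputs['img_mask'] = torch.ones((1, len(images)), dtype=torch.long)
inputs = inputs.to('cuda:0')

outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask'],
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```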