|
<p align="center"> |
|
<img src="https://modelscope.cn/api/v1/models/codefuse-ai/CodeFuse-VLM-14B/repo?Revision=master&FilePath=LOGO.jpg&View=true" width="800"/> |
|
</p>
|
|
|
## CodeFuse-VLM |
|
CodeFuse-VLM is a Multimodal LLM (MLLM) framework that provides users with multiple vision encoders, multimodal alignment adapters, and LLMs. Through the CodeFuse-VLM framework, users can assemble their own MLLM and adapt it to their own tasks.
|
As more and more models are published to the Hugging Face community, the pool of open-source vision encoders and LLMs keeps growing. Each of these models has its own specialty, e.g. Code Llama is good at code-related tasks but performs poorly on Chinese tasks. We therefore built the CodeFuse-VLM framework to support multiple vision encoders, multimodal alignment adapters, and LLMs, so that it can be adapted to different types of tasks.
|
<p align="center"> |
|
<img src="./CodeFuse-VLM-arch.png" width="50%" /> |
|
</p> |
|
|
|
Under the CodeFuse-VLM framework, we use a cross-attention multimodal adapter, the Qwen-14B LLM, and Qwen-VL's vision encoder to train the CodeFuse-VLM-14B model. On multiple benchmarks, CodeFuse-VLM-14B outperforms Qwen-VL and LLaVA-1.5.
|
|
|
<p align="center"> |
|
<img src="./CodeFuse-VLM-14B-performance.png" width="50%" /> |
|
</p> |
|
|
|
Here is how different MLLMs perform on these benchmarks:
|
| Model | MMBench | MMBench-CN | VQAv2 | GQA | TextVQA | VizWiz |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| LLaVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 |
| Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 |
| CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 |
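
For intuition, the cross-attention adapter mentioned above can be pictured as a set of learnable query tokens that attend to the vision encoder's patch features and emit a fixed number of visual tokens in the LLM's embedding space. Below is a minimal, hypothetical PyTorch sketch of that idea; the class name, dimensions, and number of query tokens are illustrative assumptions and not the actual CodeFuse-VLM implementation.

```python
# Illustrative cross-attention adapter sketch (NOT the CodeFuse-VLM source).
import torch
import torch.nn as nn

class CrossAttnAdapter(nn.Module):
    def __init__(self, vision_dim=1664, llm_dim=5120, num_queries=256, num_heads=8):
        super().__init__()
        # Dimensions above are placeholders, not the real model config.
        # Learnable query tokens that summarize the image.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project vision-encoder features into the LLM embedding dimension.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Queries attend to the projected vision features.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        kv = self.vision_proj(vision_feats)            # (batch, num_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)            # queries attend to image patches
        return self.norm(out)                          # (batch, num_queries, llm_dim)
```

The resulting visual tokens are concatenated with the text embeddings and fed to the LLM, which is the general pattern this framework builds on.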
|
|
|
|
|
## Contents |
|
- [Install](#install)

- [Datasets](#datasets)

- [Multimodal Alignment](#multimodal-alignment)

- [Visual Instruction Tuning](#visual-instruction-tuning)

- [Evaluation](#evaluation)
|
|
|
## Install |
|
Please run `sh init_env.sh`.
|
|
|
## Datasets |
|
Here's the table of datasets we used to train CodeFuse-VLM-14B: |
|
|
|
| Dataset | Task Type | Number of Samples |
| ------------- | ------------- | ------------- |
| synthdog-en | OCR | 800,000 |
| synthdog-zh | OCR | 800,000 |
| cc3m (downsampled) | Image Caption | 600,000 |
| SBU | Image Caption | 850,000 |
| Visual Genome VQA (downsampled) | Visual Question Answering (VQA) | 500,000 |
| Visual Genome Region Descriptions (downsampled) | Reference Grounding | 500,000 |
| Visual Genome Objects (downsampled) | Grounded Caption | 500,000 |
| OCR VQA (downsampled) | OCR and VQA | 500,000 |
|
|
|
Please download these datasets from their official websites.
|
|
|
## Multimodal Alignment |
|
Please run `sh scripts/pretrain.sh` or `sh scripts/pretrain_multinode.sh`.
|
|
|
|
|
## Visual Instruction Tuning |
|
Please run `sh scripts/finetune.sh` or `sh scripts/finetune_multinode.sh`.
|
|
|
## Evaluation |
|
Please run the Python scripts in the `llava/eval/` directory. Our pre-trained CodeFuse-VLM-14B can be loaded with the following code:
|
|
|
```python
import os

from llava.model.builder import load_mixed_pretrained_model

model_path = '/pretrained/model/path'

# Load the tokenizer, model, image processor, and context length.
# The Qwen-VL vision encoder and the cross-attention projector weights
# are read from subdirectories of the pretrained model path.
tokenizer, model, image_processor, context_len = load_mixed_pretrained_model(
    model_path,
    None,
    'qwen-vl-14b',
    os.path.join(model_path, 'Qwen-VL-visual'),
    'cross_attn',
    os.path.join(model_path, 'mm_projector/mm_projector.bin'),
)
```
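
After loading, inference follows the usual LLaVA-style flow. The snippet below is a hedged sketch rather than code from this repo: it assumes LLaVA-style helpers (`tokenizer_image_token` and the `IMAGE_TOKEN_INDEX` / `DEFAULT_IMAGE_TOKEN` constants) are available in `llava/`, and that the returned `image_processor` accepts a CLIP-processor-style call. Please refer to the scripts in `llava/eval/` for the exact prompt format and preprocessing.

```python
# Hypothetical inference sketch; adapt to the actual llava/eval scripts.
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN  # assumed LLaVA-style constants
from llava.mm_utils import tokenizer_image_token                    # assumed LLaVA-style helper

image = Image.open('example.jpg').convert('RGB')
# Exact preprocessing depends on the returned image_processor; a CLIP-style call is assumed here.
image_tensor = image_processor(images=image, return_tensors='pt')['pixel_values'].to(model.device)

prompt = DEFAULT_IMAGE_TOKEN + '\nDescribe this image.'
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```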
|
|