CrystalChat-7B-MLLM: a fully-reproducible vision language model based on CrystalChat-7B

Model Description

CrystalChat-7B based multi-modal large language model (MLLM) mimics the training recipe used for Vicuna-7B based LLaVa-v1.5. CrystalChat-7B-MLLM models are entirely transparent, having open-sourced all materials, including code, data, model checkpoint, intermediate results, and more at TODO: Add paper link.

About CrystalChat-7B-MLLM:

7 billion parameter LLM
CLIP ViT-L/14-336px vision encoder
Languages: English
Models Released: CrystalChat-7B-MLLM
Trained in 2 stages
License: MIT

Evaluation

General Evaluation Metrics for MLLMs. MME serves as an extensive evaluative benchmark, aiming to assess perceptual and cognitive capability of MLLMs within 14 sub-tasks. Additionally, we also evaluate the performance of our models on text-oriented visual question answering tasks employing a diverse set of benchmark datasets including ScienceQA and TextVQA. Furthermore, we assess our models’ ability toward anti-hallucination through POPE.

LLM Backbone	MME-P	MME-C	POPE	SciQA	TextVQA
CrystalCoder-7B	1359.83	238.92	86.18	64.15	50.39
CrystalChat-7B	1456.53	308.21	86.96	67.77	57.84
Vicuna-7B	1481.12	302.85	87.17	67.97	56.49

Table 1: Comparison of different LLM backbones on visual language understanding benchmarks. All models are instruction-tuned on the general domain data (i.e. LLaVA)

Data and Training Details

Pretrain Data

LLaVA Visual Instruct Pretrain LCS-558K is a filtered subset of the LAION, CC, and SBU datasets, featuring a more balanced distribution of concept coverage. The file includes multimodal synthesized conversations generated from image-caption pairs by incorporating randomly selected instructions such as "Describe this image." It is used for pretraining in LLaVA, with the raw CC-3M caption serving as the default answer.

Finetune Data

The dataset chosen was created by LLaVA with academic-task-oriented VQA data mixture and data from ShareGPT. LLaVA Visual Instruct 150K is a dataset of GPT-generated multimodal instruction-following data. It is designed for visual instruction tuning and aims to develop large multimodal models with capabilities akin to GPT-4 in both vision and language.

Data	Size	Response formatting prompts
LLaVA [36]	158K	–
ShareGPT [46]	40K	–
VQAv2 [19]	83K	Answer the question using a single word or phrase.
GQA [21]	72K	Answer the question using a single word or phrase.
OKVQA [41]	9K	Answer the question using a single word or phrase.
OCRVQA [42]	80K	Answer the question using a single word or phrase.
A-OKVQA [45]	66K	Answer with the option’s letter from the given choices directly.
TextCaps [47]	22K	Provide a one-sentence caption for the provided image.
RefCOCO [24, 40]	48K	Note: randomly choose between the two formats. Provide a short description for this region.
VG [25]	86K	Provide the bounding box coordinate of the region this sentence describes.
Total	665K

Table 2. Instruction-following Data Mixture of LLaVA-1.5.

TODO: Check if we need to publish these 2

Stage 2 - Finetuning

Checkpoints
CrystalChat
CrystalCoder

Stage 1 - Pretraining

Checkpoints
CrystalChat
CrystalCoder

[to find all branches: git branch -a]

Examples

TODO: Add image as sample example

Loading Crystal

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
                    "LLM360/CrystalChat-7B-MLLM",
                    padding_side="right",
                    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/CrystalChat-7B-MLLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',
    low_cpu_mem_usage=True
)

LLM-360

LLM-360 is an open research lab enabling community-owned AGI through open-source large model research and development.

Crystal-based Models enables community-owned AGI by creating standards and tools to advance the bleeding edge of LLM capability and empower knowledge transfer, research, and development.

We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high-quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.

Visit us

Citation

BibTeX:

@article{yun2024web2code,
  title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Yun, Sukmin and Lin, Haokun and Thushara, Rusiru and Bhat, Mohammad Qazim and Wang, Yongxin and Jiang, Zutao and Deng, Mingkai and Wang, Jinhong and Tao, Tianhua and Li, Junbo and others},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}