---
license: llama2
language:
- en
- zh
tags:
- multimodal
datasets:
- liuhaotian/LLaVA-Pretrain
base_model:
- lmsys/vicuna-7b-v1.5
pipeline_tag: image-text-to-text
library_name: transformers
---
|
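## **Usage**

The card declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, so the checkpoint should load through the corresponding `transformers` pipeline. The snippet below is a minimal, unofficial sketch: the repo id and image URL are placeholders, and the Vicuna-style `USER: <image> ... ASSISTANT:` prompt template is an assumption based on the `lmsys/vicuna-7b-v1.5` base model.

```python
# Minimal usage sketch, not an official example. Assumptions: the checkpoint
# follows a standard LLaVA-style layout supported by transformers, and the
# Vicuna prompt template applies. Repo id and image URL are placeholders.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="<this-repo-id>")

result = pipe(
    images="https://example.com/sample.jpg",  # placeholder image
    text="USER: <image>\nDescribe this image. ASSISTANT:",
    max_new_tokens=128,
)
print(result)
```
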
## **Citation**
|
If you find this model useful, please cite the following paper:
|
```
@article{huang2024deciphering,
  title={Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate},
  author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},
  journal={arXiv preprint arXiv:2410.07167},
  year={2024}
}
```