Model Card: LLaVA_MORE-llama_3_1-8B-finetuning

In this model repository, you will find the stage-two (finetuning) weights of LLaVA-MORE with LLaMA 3.1 8B, as described in LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning.

Overview

Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available.

For more information, visit our LLaVA-MORE GitHub repository and the AImageLab Website.

News

For the latest updates regarding LLaVA-MORE, including new releases and publications, please refer to the News section on our GitHub repository.

Performance

Detailed performance benchmarks and comparisons of LLaVA-MORE variants across various multimodal datasets can be found in the Performance section on our GitHub repository.

Checkpoints

A comprehensive list of our released checkpoints, including pre-trained and finetuned models with different LLM and visual backbones, is available in the Checkpoints section on our GitHub repository.
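For reference, the checkpoint in this repository can also be downloaded programmatically with the huggingface_hub library. The snippet below is a minimal sketch: the repository id matches this model card, while the local directory name is an arbitrary choice.

# Minimal sketch: fetch this checkpoint from the Hugging Face Hub.
# The local_dir value is an arbitrary example, not a path required by LLaVA-MORE.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning",
    local_dir="checkpoints/LLaVA_MORE-llama_3_1-8B-finetuning",
)
print(f"Checkpoint downloaded to {local_path}")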

Installation

To set up the required environment and install dependencies, please follow the instructions in the Installation section on our GitHub repository.

Training

For detailed instructions on training LLaVA-MORE, including bash scripts for distributed training on HPC facilities, please refer to the Training section on our GitHub repository.

Inference

You can try LLaVA-MORE on the image-to-text task with the following script. If you run into out-of-memory errors, consider loading the model weights in 8-bit (load_in_8bit=True).

# Activate the project environment and move to the repository root
source activate more
cd ~/LLaVA-MORE
export PYTHONPATH=.

# Checkpoint on the Hugging Face Hub (also reused as the tokenizer path below)
model_path=aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning
model_architecture=llama_3_1
conversation=llama_3_1

# Replace hf_read_token with your Hugging Face read access token
export HF_TOKEN=hf_read_token
export TOKENIZER_PATH=$model_path

python -u src/llava/eval/run_llava.py \
    --model-path $model_path \
    --model-architecture $model_architecture \
    --conv-mode $conversation
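
If the command above runs out of memory, the weights can be loaded in 8-bit from Python. The snippet below is a minimal sketch, not the official loader: it assumes the LLaVA-MORE repository is on PYTHONPATH and that the model class follows the upstream LLaVA layout (llava.model.language_model.llava_llama); adjust the import if the codebase organizes modules differently.

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
# Assumed import path, following the upstream LLaVA layout; adapt to this repo if needed.
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM

model_path = "aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to reduce GPU memory
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

This only covers weight loading; image preprocessing and generation still go through the evaluation utilities shipped with the repository (e.g. src/llava/eval/run_llava.py).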

Citation

If you make use of our work, please cite our paper:

@inproceedings{cocchi2025llava,
      title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},
      author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
      year={2025}
}
Model details

Format: Safetensors
Model size: 8.35B params
Tensor type: F16