|
--- |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- liuhaotian/llava-v1.5-7b |
|
--- |
|
# LLaVA-3D |
|
|
|
## Table of Contents |
|
|
|
1. [Model Summary](#model-summary)

2. [Use](#use)

3. [Limitations](#limitations)

4. [Training](#training)

5. [License](#license)

6. [Citation](#citation)
|
|
|
## Model Summary |
|
|
|
LLaVA-3D is a 7B-parameter model trained on LLaVA-3D-Instruct-1M, built on top of LLaVA-v1.5-7B.
|
|
|
- **Repository:** [ZCMax/LLaVA-3D](https://github.com/ZCMax/LLaVA-3D) |
|
- **Project Website:** [zcmax.github.io/projects/LLaVA-3D](https://zcmax.github.io/projects/LLaVA-3D/) |
|
- **Paper:** [LLaVA-3D](https://arxiv.org/abs/2409.18125) |
|
- **Point of Contact:** [Chenming Zhu](mailto:[email protected]) |
|
- **Languages:** English |
|
|
|
|
|
## Use |
|
|
|
### Intended use |
|
|
|
The model was trained on LLaVA-3D-Instruct-1M and can take a single image as input for 2D tasks, or posed RGB-D images for 3D tasks.
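
Below is a minimal, single-image inference sketch. It is an illustrative example only: it assumes the `llava` package from the [ZCMax/LLaVA-3D](https://github.com/ZCMax/LLaVA-3D) repository is installed and that it keeps the loading and preprocessing interface of the original LLaVA codebase (`load_pretrained_model`, `process_images`, `tokenizer_image_token`); the checkpoint path, image path, and prompt are placeholders.

```python
# Minimal 2D (single-image) inference sketch.
# Assumes the llava package from ZCMax/LLaVA-3D is installed and follows the
# original LLaVA interface; checkpoint path, image, and prompt are placeholders.
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "<path-or-hub-id-of-this-checkpoint>"  # placeholder
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# Preprocess a single RGB image for the 2D pathway.
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Prepend the image token so the model knows where the visual features go.
prompt = DEFAULT_IMAGE_TOKEN + "\nDescribe this image."
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For 3D tasks on posed RGB-D sequences, the multi-view input format is specific to LLaVA-3D, so please follow the data preparation and inference scripts in the project repository rather than this sketch.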
|
|
|
**Feel free to share your generations in the Community tab!** |
|
|
|
## Training
|
|
|
### Model
|
|
|
- **Pretraining Stage:** scene-level and region-level caption data, 1 epoch, projector only

- **Instruction Tuning Stage:** a mixture of 1M high-quality 2D and 3D instruction data, 1 epoch, full model
|
- **Precision:** bfloat16 |
|
|
|
### Hardware & Software
|
|
|
- **GPUs:** 8 × NVIDIA A100 (for the whole model series training)
|
- **Orchestration:** [Huggingface Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) |
|
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch) |
|
|
|
## Citation
|
```
@article{zhu2024llava,
  title={LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness},
  author={Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui},
  journal={arXiv preprint arXiv:2409.18125},
  year={2024}
}
```