---
license: mit
---
|
# 3DGraphLLM |
|
|
|
3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks. |
|
|
|
<p align="center"> |
|
<img src="ga.png" width="80%"> |
|
</p> |
|
|
|
|
|
## Model Details |
|
|
|
We provide our best checkpoint, which uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) 3D instance segmentation to obtain scene-graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations, [Uni3D](https://github.com/baaivision/Uni3D) as the 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as the 2D object encoder.
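
For reference, below is a minimal conceptual sketch (in PyTorch) of how scene-graph tokens of this kind could be assembled: each object node contributes its own embedding plus (relation, neighbour) pairs for its k nearest neighbours, and the flattened sequence is projected into the LLM embedding space. All class, method, and dimension names here are hypothetical illustrations, not the repository's actual API.

```python
# Conceptual sketch only -- names and default dimensions are assumptions,
# not the 3DGraphLLM repository's actual interface.
import torch
import torch.nn as nn

class SceneGraphTokens(nn.Module):
    def __init__(self, obj_dim=1024, rel_dim=512, llm_dim=4096, k=2):
        # obj_dim / rel_dim are illustrative feature sizes; llm_dim=4096
        # matches the Llama-3-8B hidden size.
        super().__init__()
        self.k = k
        # Linear projectors mapping object / relation features into the LLM embedding space.
        self.obj_proj = nn.Linear(obj_dim, llm_dim)
        self.rel_proj = nn.Linear(rel_dim, llm_dim)

    def forward(self, obj_feats, rel_feats, centers):
        """
        obj_feats: (N, obj_dim)     per-object features (e.g. from a 3D encoder such as Uni3D)
        rel_feats: (N, N, rel_dim)  pairwise semantic-relation features (e.g. from VL-SAT)
        centers:   (N, 3)           object centroids, used to pick k nearest neighbours
        Returns a (N * (1 + 2k), llm_dim) sequence of soft tokens for the LLM.
        """
        n = obj_feats.size(0)
        dists = torch.cdist(centers, centers)            # (N, N) pairwise distances
        dists.fill_diagonal_(float("inf"))               # exclude self as a neighbour
        knn = dists.topk(self.k, largest=False).indices  # (N, k) nearest neighbours

        tokens = []
        for i in range(n):
            tokens.append(self.obj_proj(obj_feats[i]))          # the object's own token
            for j in knn[i]:
                tokens.append(self.rel_proj(rel_feats[i, j]))   # relation: object -> neighbour
                tokens.append(self.obj_proj(obj_feats[j]))      # the neighbour's token
        return torch.stack(tokens)
```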
|
|
|
## Citation |
|
If you find 3DGraphLLM helpful, please consider citing our work as: |
|
```
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```