---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - 3d-scene-understanding
  - scene-graph
  - multimodal
  - vlm
  - llama
  - vision-language-model
---

# 3DGraphLLM

3DGraphLLM is a model that combines semantic graphs and large language models for 3D scene understanding. It aims to improve 3D vision-language tasks by explicitly incorporating semantic relationships into a learnable representation of a 3D scene graph, which is then used as input to LLMs.

This model was presented in the paper:
[**3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding**](https://huggingface.co/papers/2412.18450)

The official code is publicly available at: [https://github.com/CognitiveAISystems/3DGraphLLM](https://github.com/CognitiveAISystems/3DGraphLLM)

<p align="center">
<img src="ga.png" width="80%">
</p>

## Abstract
A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects.

## Model Details

We provide our best checkpoint, which uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) for 3D instance segmentation to obtain the scene graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations, [Uni3D](https://github.com/baaivision/Uni3D) as the 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as the 2D object encoder.
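
To make the scene graph structure concrete, here is a minimal sketch of a graph whose nodes are objects and whose edges carry semantic relations, flattened into (subject, relation, object) triplets. It is only an illustration with plain strings: the class names, the object labels, and the relation labels are hypothetical, and the actual model works with learned embeddings produced by the encoders listed above rather than with text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """A scene graph node: one segmented object instance (hypothetical class)."""
    obj_id: int
    label: str


@dataclass
class SceneGraph:
    """Objects (nodes) plus directed semantic relations (edges)."""
    objects: dict[int, SceneObject] = field(default_factory=dict)
    # Maps (subject_id, object_id) -> relation label, e.g. "standing on".
    relations: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_object(self, obj_id: int, label: str) -> None:
        self.objects[obj_id] = SceneObject(obj_id, label)

    def add_relation(self, subj_id: int, obj_id: int, relation: str) -> None:
        # Both endpoints must already be nodes of the graph.
        if subj_id not in self.objects or obj_id not in self.objects:
            raise KeyError("both endpoints must be added before the relation")
        self.relations[(subj_id, obj_id)] = relation

    def triplets(self) -> list[tuple[str, str, str]]:
        """Flatten the graph into (subject, relation, object) triplets."""
        return [
            (self.objects[s].label, rel, self.objects[o].label)
            for (s, o), rel in self.relations.items()
        ]


# Illustrative scene; labels and relations are made up for this example.
graph = SceneGraph()
graph.add_object(0, "chair")
graph.add_object(1, "floor")
graph.add_object(2, "table")
graph.add_relation(0, 1, "standing on")
graph.add_relation(0, 2, "close by")

for subject, relation, obj in graph.triplets():
    print(f"{subject} -- {relation} --> {obj}")
```

A geometry-only representation would keep just the nodes and their coordinates; the edges are the extra information that a scene graph adds.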

## Performance
Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks.

|      	| [ScanRefer](https://github.com/daveredrum/ScanRefer) 	|         	| [Multi3DRefer](https://github.com/3dlg-hcvc/M3DRef-CLIP)	|        	|  [Scan2Cap](https://github.com/daveredrum/Scan2Cap) 	|            	| [ScanQA](https://github.com/ATR-DBI/ScanQA)  	|        	| [SQA3D](https://github.com/SilongYong/SQA3D) 	|
|:----:	|:---------:	|:-------:	|:------:	|:------:	|:---------:	|:----------:	|:------------:	|:------:	|:-----:	|
|      	|  Acc@0.25 	| Acc@0.5 	|    F1@0.25   	| F1@0.5 	|  CIDEr@0.5 	| B-4@0.5 	| CIDEr 	| B-4 	|    EM  	|
| [Chat-Scene](https://github.com/ZzZZCHS/Chat-Scene/tree/dev) 	|    55.5   	|   50.2  	|  57.1 	|  52.3  	|    77.1  	|    <ins>36.3</ins>    	|     87.7      	|    <ins>14.3</ins>  	|   54.6   	|
| <ins>3DGraphLLM Vicuna-1.5</ins> 	|   <ins>58.6</ins>    	|  <ins>53.0</ins>  	|  <ins>61.9</ins>  	|  <ins>57.3</ins>  	|   <ins>79.2</ins>    	|    34.7    	|    **91.2**     	|  13.7 	| <ins>55.1</ins>  	|
| **3DGraphLLM LLAMA3-8B**	|   **62.4**    	|  **56.6**   	|  **64.7**  	|  **59.9**  	|   **81.0**    	|    **36.5**    	|     <ins>88.8</ins>     	|  **15.9**   	| **55.9**  	|
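
For readers unfamiliar with the grounding metrics, Acc@0.25 and Acc@0.5 report the share of predictions whose 3D bounding box overlaps the ground-truth box with an IoU that reaches the given threshold. The sketch below assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and counts a prediction as correct when its IoU is greater than or equal to the threshold. The function names and the boxes are illustrative and are not taken from the official evaluation code of any benchmark.

```python
from typing import Sequence

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed format.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Three made-up predictions against the same ground-truth box.
gts = [(0, 0, 0, 2, 2, 2)] * 3
preds = [
    (0, 0, 0, 2, 2, 1.5),  # IoU 0.75
    (1, 0, 0, 3, 2, 2),    # IoU 1/3
    (5, 5, 5, 6, 6, 6),    # IoU 0 (disjoint)
]

print(f"Acc@0.25 = {accuracy_at_iou(preds, gts, 0.25):.3f}")
print(f"Acc@0.5  = {accuracy_at_iou(preds, gts, 0.5):.3f}")
```

The table reports these fractions as percentages.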

## Usage

For detailed instructions on environment preparation, downloading LLM backbones, data preprocessing, training, and inference, please refer to the [official GitHub repository](https://github.com/CognitiveAISystems/3DGraphLLM).

You can run the interactive demo by following the instructions in the GitHub repository, or try the simplified command below:
```bash
bash demo/run_demo.sh
```
This will prompt you to ask different queries about Scene 435 of ScanNet.

## Citation
If you find 3DGraphLLM helpful, please consider citing our work as:
```bibtex
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding}, 
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450}, 
}
```

## Acknowledgement
We thank the authors of the following open-source project:

[Chat-Scene](https://github.com/ZzZZCHS/Chat-Scene/tree/dev)

## Contact
If you have any questions about the project, please open an issue in the [GitHub repository](https://github.com/CognitiveAISystems/3DGraphLLM) or send an email to [Tatiana Zemskova](mailto:zemskova@airi.net).