---
license: mit
---
|
# 3DGraphLLM |
|
|
|
3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks. |
|
|
|
<p align="center"> |
|
<img src="ga.png" width="80%"> |
|
</p> |
|
|
|
|
|
## Model Details |
|
|
|
We provide our best checkpoint, which uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) 3D instance segmentation to obtain scene-graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations, [Uni3D](https://github.com/baaivision/Uni3D) as the 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as the 2D object encoder.
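
For reference, below is a minimal conceptual sketch (in PyTorch) of how scene-graph tokens of this kind could be assembled: each object node contributes its own embedding plus (relation, neighbour) pairs for its k nearest neighbours, and the flattened sequence is projected into the LLM embedding space. All class, method, and dimension names here are hypothetical illustrations, not the repository's actual API.

```python
# Conceptual sketch only -- names and default dimensions are assumptions,
# not the 3DGraphLLM repository's actual interface.
import torch
import torch.nn as nn

class SceneGraphTokens(nn.Module):
    def __init__(self, obj_dim=1024, rel_dim=512, llm_dim=4096, k=2):
        # obj_dim / rel_dim are illustrative feature sizes; llm_dim=4096
        # matches the Llama-3-8B hidden size.
        super().__init__()
        self.k = k
        # Linear projectors mapping object / relation features into the LLM embedding space.
        self.obj_proj = nn.Linear(obj_dim, llm_dim)
        self.rel_proj = nn.Linear(rel_dim, llm_dim)

    def forward(self, obj_feats, rel_feats, centers):
        """
        obj_feats: (N, obj_dim)     per-object features (e.g. from a 3D encoder such as Uni3D)
        rel_feats: (N, N, rel_dim)  pairwise semantic-relation features (e.g. from VL-SAT)
        centers:   (N, 3)           object centroids, used to pick k nearest neighbours
        Returns a (N * (1 + 2k), llm_dim) sequence of soft tokens for the LLM.
        """
        n = obj_feats.size(0)
        dists = torch.cdist(centers, centers)            # (N, N) pairwise distances
        dists.fill_diagonal_(float("inf"))               # exclude self as a neighbour
        knn = dists.topk(self.k, largest=False).indices  # (N, k) nearest neighbours

        tokens = []
        for i in range(n):
            tokens.append(self.obj_proj(obj_feats[i]))          # the object's own token
            for j in knn[i]:
                tokens.append(self.rel_proj(rel_feats[i, j]))   # relation: object -> neighbour
                tokens.append(self.obj_proj(obj_feats[j]))      # the neighbour's token
        return torch.stack(tokens)
```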
|
|
|
## Citation |
|
If you find 3DGraphLLM helpful, please consider citing our work as: |
|
```
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```