MLCD-Embodied-7B / README.md

Update README.md

fc4dd9f verified 2 months ago

6.56 kB

	---
	license: apache-2.0
	language:
	- zho
	- eng
	- fra
	- spa
	- por
	- deu
	- ita
	- rus
	- jpn
	- kor
	- vie
	- tha
	- ara
	metrics:
	- bleu
	base_model:
	- Qwen/Qwen2.5-7B-Instruct
	---


	[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)


	## Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA



	\| \| \| MLCD <br> Embodied-7B \| LLaVA <br> OneVision-7B \| GPT-4v \| RoboMamba \|
	:-- \| :-- \| :-: \| :-: \| :-: \| :-: \|
	\| RoboVQA \| BLEU1 \| <span style="color:red">73.16</span> \| 38.12 \| - \| 54.9 \|
	\| \| BLEU2 \| <span style="color:red">66.39</span> \| 33.56 \| - \| 44.2 \|
	\| \| BLEU3 \| <span style="color:red">60.61</span> \| 31.76 \| - \| 39.5 \|
	\| \| BLEU4 \| <span style="color:red">56.56</span> \| 30.97 \| - \| 36.3 \|
	\| OpenEQA \| Object State Recognition \| <span style="color:red">71.83</span> \| - \| 63.2 \| - \|
	\| \| Object Recognition \| <span style="color:red">49.46</span> \| - \| 43.4 \| - \|
	\| \| Functional Reasoning \| 54.38 \| - \| <span style="color:red">57.4</span> \| - \|
	\| \| Spatial Understanding \| <span style="color:red">48.64</span> \| - \| 33.6 \| - \|
	\| \| Attribute Recognition \| <span style="color:red">67.08</span> \| - \| 57.2 \| - \|
	\| \| World Knowledge \| <span style="color:red">53.87</span> \| - \| 50.7 \| - \|
	\| \| Object Localization \| <span style="color:red">43.06</span> \| - \| 42.0 \| - \|




	## General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

	\| Dataset \| Split \| MLCD<br>Embodied-7B \| LLaVA<br>OneVision-7B \| GPT-4v \| GPT-4o \|
	\| :-- \| :-: \| :-: \| :-: \| :-: \| :-: \|
	\| A12D \| test \| 79.9 \| 81.4 \| 78.2 \| 94.2 \|
	\| ChartQA \| test \| 83.0 \| 80.0 \| 78.5 \| 85.7 \|
	\| DocVQA \| test \| 91.6 \| 87.5 \| 88.4 \| 92.8 \|
	\| InfoVQA \| val \| 73.9 \| 70.7 \| - \| - \|
	\| InfoVQA \| test \| 70.0 \| 68.8 \| - \| - \|
	\| MMMU \| val \| 47.3 \| 48.8 \| 56.8 \| 69.1 \|
	\| MMStar \| test \| 58.5 \| 61.7 \| 57.1 \| 63.9 \|
	\| OCRBench \| - \| 749.0 \| 697.0 \| 656.0 \| 805.0 \|
	\| RealWorldQA \| test \| 68.9 \| 66.3 \| 61.4 \| 58.6 \|
	\| SeedBench \| image \| 74.9 \| 75.4 \| 49.9 \| 76.2 \|
	\| MMbench \| en-dev \| 81.1 \| 83.2 \| 81.3 \| 83.4 \|
	\| MMbench \| en-test \| 80.1 \| 80.8 \| 75.0 \| - \|
	\| MME \| test \| 578/1603 \| 418/1580 \| 517/1409 \| - \|

	## Usage

	### A. Installation

	```bash
	git clone https://github.com/deepglint/unicom
	cd unicom/mlcd_vl

	docker build -t train_mlcd_llava .

	docker run --gpus all \
	-v /vlm:/vlm \
	-v /mnt:/mnt \
	-v $(pwd):/workspace \
	--rm \
	-w /workspace \
	--shm-size=64g -it train_mlcd_llava bash

	pip install flash-attn==2.3.3 --no-build-isolation
	```


	### B. Inference

	```bash
	CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

	# example:
	# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
	# >> Enter image file paths (comma-separated): ../_static/images/logo.png
	# >> User: <image>What kind of animal is it in this picture?
	# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
	# >> User: What color is this cat?
	# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
	```



	### C. Evaluation for Embodied Ability

	#### Step 1

	Download raw data following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa)(val part)

	#### Step 2

	Converting raw data into the format required for model evaluation.
	```bash
	# convert OpenEQA benchmark. Note: replace the paths with your own.
	python llava/benchmark/make_openeqa_bmk.py

	# convert RoboVQA benchmark. Note: replace the paths with your own.
	python llava/benchmark/make_robovqa_bmk.py
	```

	#### Step 3

	Make sure that your top-level directory structure should look like this:
	```
	\|--/path/to/your/benchmarks
	\| \|--OpenEQA
	\| \| \|--openeqa_scannet.parquet
	\| \| \|--openeqa_hm3d.parquet
	\| \|--RoboVQA
	\| \|--robovqa.parquet
	\|--/path/to/your/images
	\|--openeqa_val
	\| \|--scannet-v0
	\| \| \|--002-scannet-scene0709_00
	\| \| \|--xxx-scannet-scenexxxx_xx
	\| \|--hm3d-v0
	\| \|--000-hm3d-BFRyYbPCCPE
	\| \|--xxx-hm3d-xxxxxxxxxxx
	\|--robovqa_val
	\|--robovqa_221911
	\|--robovqa_xxxxxx
	```

	#### Step 4

	Run script for evaluation
	```bash
	# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
	bash scripts/eval/eval_robo.sh /path/to/your/model
	```

	### D. Evaluation for General Ability

	Install the evaluation tool and execute the evaluation script:
	```bash
	pip install lmms-eval==0.2.0
	PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
	--main_process_port=12444 \
	--num_processes=8 \
	-m lmms_eval \
	--model llava \
	--model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
	--tasks mme \
	--batch_size 1 \
	--log_samples \
	--log_samples_suffix mlcd \
	--output_path ./eval_log/
	```

	We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), [Yin Xie](https://huggingface.co/Yin-Xie) for his significant contributions to the experimental validation in MLLMs.