---
license: apache-2.0
datasets:
- declare-lab/Emma-X-GCOT
metrics:
- accuracy
base_model:
- openvla/openvla-7b
pipeline_tag: image-text-to-text
---
<h1 align="center">✨
<br/>
Meet Emma-X, an Embodied Multimodal Action Model
<br/>
✨✨✨
</h1>
<div align="center">
<img src="https://raw.githubusercontent.com/declare-lab/Emma-X/main/Emma-X.png" alt="Emma-X" width="300" />
<br/>
[![arXiv](https://img.shields.io/badge/arxiv-2412.11974-b31b1b)](https://arxiv.org/abs/2412.11974) [![Emma-X](https://img.shields.io/badge/Huggingface-Emma--X-brightgreen?style=flat&logo=huggingface&color=violet)](https://huggingface.co/declare-lab/Emma-X) [![Static Badge](https://img.shields.io/badge/Demos-declare--lab-brightred?style=flat)](https://declare-lab.github.io/Emma-X/)
</div>
## Model Overview
EMMA-X is an Embodied Multimodal Action (VLA) model designed to bridge the gap between Visual-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions, and excels at long-horizon spatial reasoning and grounded task planning thanks to a novel trajectory segmentation strategy. It relies on:
- Hierarchical Embodiment Dataset: Emma-X is trained on a dataset derived from BridgeV2 containing 60,000 robot manipulation trajectories, annotated hierarchically with visually grounded chain-of-thought reasoning. As a result, EMMA-X's output includes the following components:
  - Grounded Chain-of-Thought Reasoning: Breaks tasks down into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.
  - Gripper Position Guidance: An affordance point inside the image.
  - Look-Ahead Spatial Reasoning: Lets the model plan actions with spatial guidance, enhancing long-horizon task performance.

It generates:
- Action: An action policy as a 7-dimensional vector that controls the robot ([WidowX-6DoF](https://www.trossenrobotics.com/widowx-250)); see the sketch after this list.
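
The sketch below shows one way to unpack such a 7-dimensional action vector into named fields. The layout (end-effector deltas followed by a gripper command) follows the common BridgeV2/OpenVLA convention and is an assumption for illustration, not a statement of Emma-X's exact ordering; the `WidowXAction` class and `parse_action` helper are hypothetical names introduced here.

```python
# Illustrative sketch only: assumes the common BridgeV2/OpenVLA layout
# [dx, dy, dz, droll, dpitch, dyaw, gripper]; verify against the repository.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class WidowXAction:
    dx: float       # end-effector translation deltas
    dy: float
    dz: float
    droll: float    # end-effector rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # gripper command (e.g., open vs. closed)

def parse_action(vec: Sequence[float]) -> WidowXAction:
    """Unpack a 7-dimensional action vector into named fields."""
    assert len(vec) == 7, f"expected a 7-D action, got {len(vec)}-D"
    return WidowXAction(*vec)
```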
## Model Card
- **Developed by:** SUTD Declare Lab
- **Model type:** Vision-language-action (language, image => reasoning, robot actions)
- **Language(s) (NLP):** en
- **License:** Apache-2.0
- **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
- **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/); see the repository below for details.
- **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
- **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning](https://arxiv.org/pdf/2412.11974)
- **Project Page & Videos:** [https://declare-lab.github.io/Emma-X/](https://declare-lab.github.io/Emma-X/)
## Getting Started
```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
task_label = "put carrot in pot" # Change to your desired task label
image: Image.Image = get_from_camera(...) # Placeholder: replace with your own camera capture
# Load Emma-X
vla = AutoModelForVision2Seq.from_pretrained(
"declare-lab/Emma-X",
attn_implementation="flash_attention_2", # [Optional] Requires `flash_attn`
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to("cuda:0")
processor = AutoProcessor.from_pretrained("declare-lab/Emma-X", trust_remote_code=True)
prompt, image = processor.get_prompt(task_label, image)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Predict Action (action is a 7 dimensional vector to control the robot)
action, reasoning = vla.generate_actions(inputs, processor.tokenizer, do_sample=False, max_new_tokens=512)
print("action", action)
# Execute...
robot.act(action, ...)
```
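
Building on the snippet above, a closed-loop rollout repeats the same predict-then-act cycle on fresh camera frames. The sketch below is a minimal illustration: `get_from_camera` and `robot.act` are the placeholders from the original example, and the step budget and stopping logic are assumptions.

```python
# Minimal closed-loop rollout sketch using the API shown above.
# `get_from_camera` and `robot.act` are placeholders; adapt to your setup.
import torch

def run_episode(vla, processor, robot, task_label: str, max_steps: int = 50):
    for step in range(max_steps):
        image = get_from_camera(...)  # current RGB observation
        prompt, image = processor.get_prompt(task_label, image)
        inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        # Predict the next 7-D action together with the model's reasoning trace
        action, reasoning = vla.generate_actions(
            inputs, processor.tokenizer, do_sample=False, max_new_tokens=512
        )
        print(f"[step {step}] reasoning: {reasoning}")
        robot.act(action, ...)  # apply the predicted action on the robot
```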
## Citation
```
@article{sun2024emma,
title={Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning},
author={Sun, Qi and Hong, Pengfei and Pala, Tej Deep and Toh, Vernon and Tan, U-Xuan and Ghosal, Deepanway and Poria, Soujanya},
journal={arXiv preprint arXiv:2412.11974},
year={2024}
}
``` |