---
license: apache-2.0
datasets:
- declare-lab/Emma-X-GCOT
metrics:
- accuracy
base_model:
- openvla/openvla-7b
pipeline_tag: image-text-to-text
---
<h1 align="center">✨
<br/>
Meet Emma-X, an Embodied Multimodal Action Model
<br/>
✨✨✨
</h1>
<div align="center">
<img src="https://raw.githubusercontent.com/declare-lab/Emma-X/main/Emma-X.png" alt="Emma-X" width="300" />
<br/>
[![arXiv](https://img.shields.io/badge/arxiv-2412.11974-b31b1b)](https://arxiv.org/abs/2412.11974) [![Emma-X](https://img.shields.io/badge/Huggingface-Emma--X-brightgreen?style=flat&logo=huggingface&color=violet)](https://huggingface.co/declare-lab/Emma-X) [![Static Badge](https://img.shields.io/badge/Demos-declare--lab-brightred?style=flat)](https://declare-lab.github.io/Emma-X/)
</div>
## Model Overview
EMMA-X is an Embodied Multimodal Action (VLA) model designed to bridge the gap between Visual-Language Models (VLMs) and robotic control tasks. EMMA-X generalizes effectively across diverse environments, objects, and instructions, and excels at long-horizon spatial reasoning and grounded task planning thanks to a novel trajectory segmentation strategy. It relies on:
- Hierarchical Embodiment Dataset: Emma-X is trained on a dataset derived from BridgeV2 containing 60,000 robot manipulation trajectories, annotated hierarchically with visually grounded chain-of-thought reasoning. As a result, EMMA-X's output includes the following components:
  - Grounded Chain-of-Thought Reasoning: Breaks tasks down into smaller, manageable subtasks, ensuring accurate task execution by mitigating hallucination in reasoning.
  - Gripper Position Guidance: An affordance point inside the image.
  - Look-Ahead Spatial Reasoning: Lets the model plan actions with spatial guidance, enhancing long-horizon task performance.

It generates:
- Action: An action policy as a 7-dimensional vector that controls the robot ([WidowX-6DoF](https://www.trossenrobotics.com/widowx-250)); see the sketch after this list.
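
The sketch below shows one way to unpack such a 7-dimensional action vector into named fields. The layout (end-effector deltas followed by a gripper command) follows the common BridgeV2/OpenVLA convention and is an assumption for illustration, not a statement of Emma-X's exact ordering; the `WidowXAction` class and `parse_action` helper are hypothetical names introduced here.

```python
# Illustrative sketch only: assumes the common BridgeV2/OpenVLA layout
# [dx, dy, dz, droll, dpitch, dyaw, gripper]; verify against the repository.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class WidowXAction:
    dx: float       # end-effector translation deltas
    dy: float
    dz: float
    droll: float    # end-effector rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # gripper command (e.g., open vs. closed)

def parse_action(vec: Sequence[float]) -> WidowXAction:
    """Unpack a 7-dimensional action vector into named fields."""
    assert len(vec) == 7, f"expected a 7-D action, got {len(vec)}-D"
    return WidowXAction(*vec)
```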
## Model Card
- **Developed by:** SUTD Declare Lab
- **Model type:** Vision-language-action (language, image => reasoning, robot actions)
- **Language(s) (NLP):** en
- **License:** Apache-2.0
- **Finetuned from:** [`openvla-7B`](https://huggingface.co/openvla/openvla-7b/)
- **Pretraining Dataset:** Augmented version of [Bridge V2](https://rail-berkeley.github.io/bridgedata/); see the repository below for details.
- **Repository:** [https://github.com/declare-lab/Emma-X/](https://github.com/declare-lab/Emma-X/)
- **Paper:** [Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning](https://arxiv.org/pdf/2412.11974)
- **Project Page & Videos:** [https://declare-lab.github.io/Emma-X/](https://declare-lab.github.io/Emma-X/)
## Getting Started
```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
task_label = "put carrot in pot" # Change to your desired task label
image: Image.Image = get_from_camera(...) # Placeholder: replace with your own camera capture
# Load Emma-X
vla = AutoModelForVision2Seq.from_pretrained(
"declare-lab/Emma-X",
attn_implementation="flash_attention_2", # [Optional] Requires `flash_attn`
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to("cuda:0")
processor = AutoProcessor.from_pretrained("declare-lab/Emma-X", trust_remote_code=True)
prompt, image = processor.get_prompt(task_label, image)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Predict Action (action is a 7 dimensional vector to control the robot)
action, reasoning = vla.generate_actions(inputs, processor.tokenizer, do_sample=False, max_new_tokens=512)
print("action", action)
# Execute...
robot.act(action, ...)
```
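
Building on the snippet above, a closed-loop rollout repeats the same predict-then-act cycle on fresh camera frames. The sketch below is a minimal illustration: `get_from_camera` and `robot.act` are the placeholders from the original example, and the step budget and stopping logic are assumptions.

```python
# Minimal closed-loop rollout sketch using the API shown above.
# `get_from_camera` and `robot.act` are placeholders; adapt to your setup.
import torch

def run_episode(vla, processor, robot, task_label: str, max_steps: int = 50):
    for step in range(max_steps):
        image = get_from_camera(...)  # current RGB observation
        prompt, image = processor.get_prompt(task_label, image)
        inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        # Predict the next 7-D action together with the model's reasoning trace
        action, reasoning = vla.generate_actions(
            inputs, processor.tokenizer, do_sample=False, max_new_tokens=512
        )
        print(f"[step {step}] reasoning: {reasoning}")
        robot.act(action, ...)  # apply the predicted action on the robot
```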
## Citation
```
@article{sun2024emma,
title={Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning},
author={Sun, Qi and Hong, Pengfei and Pala, Tej Deep and Toh, Vernon and Tan, U-Xuan and Ghosal, Deepanway and Poria, Soujanya},
journal={arXiv preprint arXiv:2412.11974},
year={2024}
}
``` |