LLaRA Model Card

This model is released with the paper "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy".

Xiang Li¹, Cristina Mata¹, Jongwoo Park¹, Kumara Kahatapitiya¹, Yoo Sung Jang¹, Jinghuan Shang¹, Kanchana Ranasinghe¹, Ryan Burgert¹, Mu Cai², Yong Jae Lee², and Michael S. Ryoo¹

¹Stony Brook University, ²University of Wisconsin-Madison

Model details

Model type: D-RT2-Style is one of the baselines in our LLaRA paper and follows the style of RT2. It is an open-source visuomotor policy, trained by fine-tuning LLaVA-7b-v1.5 on the D-RT2-Style instruction-following data, which was converted from VIMA-Data. For the conversion code, please refer to convert_vima.ipynb.

Model date: llava-1.5-7b-llara-D-RT2-Style-VIMA-80k was trained in June 2024.

Paper or resources for more information: https://github.com/LostXine/LLaRA

Where to send questions or comments about the model: https://github.com/LostXine/LLaRA/issues
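
For readers who want the weights on disk, the following is a minimal sketch, assuming the `huggingface_hub` package is installed. The repository id is the one shown on this model's Hugging Face page; `fetch_checkpoint` is a hypothetical helper name and is not part of the LLaRA codebase. Running the policy itself is expected to need the code in the LLaRA repository linked above, which is not shown here.

```python
from pathlib import Path
from typing import Optional

# Repository id as listed on this model's Hugging Face page.
REPO_ID = "variante/llava-1.5-7b-llara-D-RT2-Style-VIMA-80k"


def fetch_checkpoint(repo_id: str = REPO_ID, cache_dir: Optional[str] = None) -> Path:
    """Download all files of the model repository and return the local folder.

    Hypothetical helper for illustration; it only wraps
    huggingface_hub.snapshot_download.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, cache_dir=cache_dir))


if __name__ == "__main__":
    # Downloads the full checkpoint, so expect a multi-gigabyte transfer.
    local_dir = fetch_checkpoint()
    print(f"Checkpoint files are in: {local_dir}")
```

The download is cached by `huggingface_hub`, so calling `fetch_checkpoint()` again should reuse the local copy rather than fetching the files a second time.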

Intended use

Primary intended uses: The primary use of LLaRA is research on large multimodal models for robotics.

Primary intended users: The primary intended users of the model are researchers and hobbyists in robotics, computer vision, natural language processing, machine learning, and artificial intelligence.

Model size: 7.06B params (Safetensors, tensor type BF16)