SpaceVLMs
Collection
Features VLMs fine-tuned for enhanced spatial reasoning using a synthetic data pipeline similar to Spatial VLM.
•
11 items
•
Updated
•
6
SpaceQwen2.5-VL-3B-Instruct uses LoRA to fine-tune Qwen2.5-VL-3B-Instruct on a dataset designed with VQASynth to enhance spatial reasoning as in SpatialVLM
This model uses data synthesis techniques and publically available models to reproduce the work described in SpatialVLM to enhance the spatial reasoning of multimodal models. With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create VQA dataset for spatial reasoning.
Install qwen dependencies:
pip install qwen-vl-utils[decord]==0.0.8
To run inference with this model, run:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"remyxai/SpaceQwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct")
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg",
},
{"type": "text", "text": "What is the height of the man in the red hat in feet?"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
You can also try it on Discord or the HF space.
@article{chen2024spatialvlm,
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
journal = {arXiv preprint arXiv:2401.12168},
year = {2024},
url = {https://arxiv.org/abs/2401.12168},
}
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
Unable to build the model tree, the base model loops to the model itself. Learn more.