Multimodal R1

#10

by salma-remyx - opened Feb 1

salma-remyx

Feb 1

"... we don’t want to stop at math datasets."

Right on. I'm experimenting with VLM fine-tunes that use R1 distillations as the base LLM, to see whether its CoT reasoning can improve spatial reasoning.

This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces

Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use the relationships between different objects in a scene to reason about the best answer to the question?

User: What is the distance between the lamp and the chair?

Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>

Any thoughts from the community on restructuring the dataset samples so that the context of 4 QA pairs is used to reason about the last one?
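To make the restructuring idea concrete, here is a minimal sketch that holds out one QA pair as the target and keeps the others as context. It assumes each sample's QA pairs are available as a list of `(question, answer)` tuples; that layout, the function names, and the prompt wording are hypothetical and not taken from the OpenSpaces schema.

```python
def split_context_and_target(qa_pairs, target_index=-1):
    """Hold out one QA pair as the target; the rest become context facts."""
    pairs = list(qa_pairs)
    if len(pairs) < 2:
        raise ValueError("need at least two QA pairs per sample")
    target_question, target_answer = pairs.pop(target_index)
    return {
        "context": pairs,  # remaining (question, answer) tuples
        "question": target_question,
        "answer": target_answer,
    }


def render_prompt(sample):
    """Format the context pairs and the held-out question as one prompt string."""
    facts = "\n".join(f"- Q: {q} A: {a}" for q, a in sample["context"])
    return (
        "Known facts about the scene:\n"
        f"{facts}\n\n"
        f"Question: {sample['question']}"
    )
```

With 5 QA pairs per image, this yields 4 context pairs and 1 held-out question per sample; passing a different `target_index` would rotate which pair is held out.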

salma-remyx

Feb 24

Here, I generate R1-style reasoning with tags by using another AI model to rephrase the information so that it justifies the provided answer, after resolving the set of facts from the remaining QA pairs.
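When generating samples this way, it can help to check that every output actually follows the tag format before it goes into the dataset. The sketch below is one possible check, assuming the `<think>` and `<answer>` tags from the example earlier in the thread; the function name `parse_r1` is hypothetical, and this is not the pipeline used to build the dataset.

```python
import re

# A <think> block followed by an <answer> block that ends the text.
# Text before <think> (such as "Let me solve this step by step.") is allowed.
R1_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def parse_r1(text):
    """Return (reasoning, answer) if the text is well formed, otherwise None."""
    match = R1_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

Samples for which `parse_r1` returns `None`, for example because of a misspelled tag, could be dropped or regenerated.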

salma-remyx

12 days ago

Here's a little over 12K samples as described above at a cost of $50 to generate.
https://huggingface.co/datasets/remyxai/OpenSpaces_MC_R1

salma-remyx

12 days ago

May be worth experimenting with filtering the dataset further using prometheus's VLM-as-a-Judge as documented here:
https://huggingface.co/datasets/remyxai/SpaceJudgeDataset

taesiri

2 days ago

Hey @salma-remyx

I would love to learn more about your experiments. Are you still working on this?
We recently created a dataset of hands with varying numbers of fingers. Do you think GRPO can help a model count fingers better?

salma-remyx

about 6 hours ago

Hey @taesiri, thank you, and yes! I'm still actively adding to VQASynth.

And the taesiri/FluxHands-FingerCount dataset looks awesome! Building an expert pipeline to verify finger count could be tricky, especially since you have special instances where there are 5+ fingers. If you could get a pipeline of expert models, with a heuristic to catch the special cases, you could build a verification engine and then use it to train a VLM to do this task.
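As a rough illustration of the verification idea, the sketch below combines counts from several expert models and applies a stricter rule to unusual hands. Everything here is an assumption for illustration: the expert models are hypothetical and represented only by their integer outputs, and both the `typical_max` threshold and the voting rule are arbitrary choices.

```python
from collections import Counter


def verify_finger_count(expert_counts, typical_max=5):
    """Combine per-expert finger counts into a label plus a verification flag.

    expert_counts: integer counts, one per (hypothetical) expert model.
    typical_max: counts above this are treated as special cases.
    """
    if not expert_counts:
        raise ValueError("need at least one expert count")
    count, votes = Counter(expert_counts).most_common(1)[0]
    unusual = count > typical_max
    # Heuristic: unusual counts need unanimous agreement,
    # typical counts only need a strict majority.
    needed = len(expert_counts) if unusual else len(expert_counts) // 2 + 1
    return {"count": count, "verified": votes >= needed, "unusual": unusual}
```

Only samples marked `verified` would then be used as ground truth; the rest could be sent for manual review.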
