Multimodal R1

#10
by salma-remyx - opened

"... we don’t want to stop at math datasets."

Right on, I'm experimenting with VLM fine-tunes that use R1 distillations as the base LLM, to see if its CoT reasoning can improve spatial reasoning.

This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces

Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use the relationships between different objects in a scene to reason about the best answer to the question?

User: What is the distance between the lamp and the chair?

Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>
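
A response in this layout can be checked mechanically, which matters if the tags are later used for reward-based training. Below is a minimal sketch, not the pipeline's actual code: the function names, the assumption that answers are distances in meters, and the 10% relative tolerance are all hypothetical choices for illustration.

```python
import re

# Matches one <think>...</think> block followed by one <answer>...</answer> block.
RESPONSE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_response(text):
    """Return (reasoning, answer) if the tags are well formed, else None."""
    match = RESPONSE_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def distance_reward(text, target_meters, rel_tol=0.1):
    """Hypothetical reward: 1.0 if the first number in the answer is within
    rel_tol of the target distance, 0.0 otherwise (including malformed tags).
    Assumes the answer is already expressed in meters."""
    parsed = parse_response(text)
    if parsed is None:
        return 0.0
    number = NUMBER_RE.search(parsed[1])
    if number is None:
        return 0.0
    value = float(number.group())
    return 1.0 if abs(value - target_meters) <= rel_tol * abs(target_meters) else 0.0
```

A misspelled or missing tag fails the pattern and scores 0.0, so the same check doubles as a format filter for generated samples.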

Thoughts from the community about restructuring the dataset samples to use the context of 4 QA pairs to reason about the last one?

Here, I generate R1-style reasoning with tags: the remaining QA pairs are first resolved into a fact set, then another model rephrases those facts into a justification of the provided answer.
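
The restructuring step could look roughly like the sketch below. It is only an illustration under assumptions: QA pairs are taken to be (question, answer) tuples, the held-out question is picked at random, and the `reasoning` string stands in for whatever the rephrasing model returns. The function names are made up for this example and are not part of VQASynth.

```python
import random


def build_context_sample(qa_pairs, seed=None):
    """Hold out one QA pair as the target and keep the other answers as facts."""
    if len(qa_pairs) < 2:
        raise ValueError("need at least two QA pairs")
    rng = random.Random(seed)
    pairs = list(qa_pairs)
    question, answer = pairs.pop(rng.randrange(len(pairs)))
    facts = [fact for _, fact in pairs]
    return {"facts": facts, "question": question, "answer": answer}


def format_r1_sample(sample, reasoning):
    """Render the sample in the tagged layout.

    `reasoning` is assumed to be the rephrasing model's justification,
    written from sample["facts"]; producing it is outside this sketch.
    """
    return (
        f"User: {sample['question']}\n\n"
        f"Assistant: <think>\n{reasoning}\n</think>\n"
        f"<answer>{sample['answer']}</answer>"
    )
```

With 5 QA pairs per image, `build_context_sample` yields 4 facts plus one question to answer, and the facts are what the rephrasing model would be prompted with.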

Here are a little over 12K samples built as described above, at a cost of $50 to generate.
https://huggingface.co/datasets/remyxai/OpenSpaces_MC_R1

It may be worth experimenting with filtering the dataset further using Prometheus's VLM-as-a-Judge, as documented here:
https://huggingface.co/datasets/remyxai/SpaceJudgeDataset

Hey @salma-remyx

I would love to learn more about your experiments. Are you still working on this?
We recently created a dataset of hands with varying numbers of fingers. Do you think GRPO can help a model count fingers better?

Hey @taesiri, thank you, and yes! Still actively adding to VQASynth.

And the taesiri/FluxHands-FingerCount dataset looks awesome! Building an expert pipeline to verify finger count could be tricky, especially since you have special instances where there are 5+ fingers. If you could combine a pipeline of expert models with a heuristic that catches the special cases, you could build a verification engine and then use it to train a VLM for this task.
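
As a rough sketch of that idea: take the counts returned by several expert models, accept a label only when enough of them agree, and flag unusual counts for review. Everything here is hypothetical, including the function names, the agreement threshold, and the choice to treat anything above five fingers as a special case. The exact-match reward at the end is one simple way such a verified label could feed GRPO-style training.

```python
from collections import Counter


def verify_finger_count(expert_counts, min_agreement=2, typical_max=5):
    """Return (label, needs_review) from the counts given by expert models.

    label is None when fewer than min_agreement experts agree.
    needs_review is True for rejected samples and for counts above
    typical_max, which this sketch treats as special cases.
    """
    if not expert_counts:
        return None, True
    count, votes = Counter(expert_counts).most_common(1)[0]
    if votes < min_agreement:
        return None, True
    return count, count > typical_max


def count_reward(predicted, verified):
    """Exact-match reward against a verified label; unverified samples earn 0."""
    return 1.0 if verified is not None and predicted == verified else 0.0
```

Samples where the experts disagree come back with no label, so they can be dropped or sent for manual checking instead of producing a noisy reward.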
