Hi, thanks a lot for sharing. I tried a similar approach to make the VLM point to objects in an image, in (x, y) coordinates, using the PixMo points dataset. But in spite of training on a subset of around 20k samples, the model just produces random x, y values, and the reward stops improving beyond a certain point. I am using a format reward similar to yours, plus a reward based on the distance between the predicted point and the ground truth, i.e. exp(-distance). It just doesn't work! Do you have any insights into why it doesn't work for pointing? I used Qwen2-VL 2B.
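For reference, here is a minimal sketch of the two rewards described above. It is not the actual training code: the `<answer>(x, y)</answer>` output format, the function names, and the `scale` parameter are all assumptions made for illustration. With `scale=1.0` the distance reward is exactly exp(-distance).

```python
import math
import re

# Hypothetical output format "<answer>(x, y)</answer>"; the post does not say
# which format the model was asked to produce.
ANSWER_RE = re.compile(
    r"<answer>\s*\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)\s*</answer>"
)


def parse_point(completion):
    """Return (x, y) as floats, or None if the completion does not match the format."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def format_reward(completion):
    """1.0 if a well-formed point is present, else 0.0."""
    return 1.0 if parse_point(completion) is not None else 0.0


def distance_reward(completion, target, scale=1.0):
    """exp(-distance / scale) between predicted and ground-truth point.

    `scale` is an assumed knob, not part of the original setup; scale=1.0
    reproduces the plain exp(-distance) reward from the post.
    """
    point = parse_point(completion)
    if point is None:
        return 0.0
    dist = math.hypot(point[0] - target[0], point[1] - target[1])
    return math.exp(-dist / scale)
```

The units of the distance matter a lot with this reward shape: exp(-10) is already about 4.5e-5, so if distances are measured in pixels, any prediction more than a few pixels off gets a reward very close to zero. Whether that is related to the stalled reward here is not something this sketch can establish.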

updated a model 17 days ago

mbiswas/qwen2vl-point-checkpoint600

Updated 17 days ago • 19

published a model 17 days ago

mbiswas/qwen2vl-point-checkpoint600

Updated 17 days ago • 19

updated a model 17 days ago

mbiswas/qwen2vl-point-checkpoint280

Updated 17 days ago • 17

published a model 17 days ago

mbiswas/qwen2vl-point-checkpoint280

Updated 17 days ago • 17

reacted to tianchez's post with 🚀 17 days ago

Post

Introducing VLM-R1!

GRPO has helped DeepSeek R1 learn reasoning. Can it also help VLMs perform better on general computer vision tasks?

The answer is YES, and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and evaluated on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1

3 replies

reacted to Jaward's post with 🔥❤️ 20 days ago

Post

Finally, here it is: a faster, custom, scalable GRPO trainer for smaller models with < 500M params. It can train on a CPU with 8 GB of RAM and also supports GPU for sanity's sake (includes support for vLLM + flash attention). It uses the SmolLM2-135M/360M-Instruct models as ref and base models. Experience your own “aha” moment 🐳 on 8 GB of RAM.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/smollm2_360M_135M_grpo_gsm8k.ipynb

2 replies

updated a model 21 days ago

mbiswas/smolvlm-points-merged

Updated 21 days ago • 18

published a model 21 days ago