[Experiment] Applying Open-R1-Math-220k to smolThinking models.
For anyone struggling to get GRPO working, I have a notebook pipeline that will publish a working model to HF in about 2 hours.
The experiment starts from a base reasoning model, Qwen/Qwen2.5-0.5B-Instruct fine-tuned on GSM8K, and then trains on the newly released Open-R1-Math-220k.
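For reference, a GRPO loop of this shape built on TRL's GRPOTrainer looks roughly like the sketch below. This is not the notebook's exact code: the checkpoint ID, dataset column name, and toy reward function are illustrative assumptions.

```python
# Minimal GRPO sketch with TRL (assumes trl>=0.14 and datasets installed).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Open-R1-Math-220k problems; GRPOTrainer expects a "prompt" column,
# so map the dataset's "problem" field onto it (column name assumed).
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")
dataset = dataset.map(lambda x: {"prompt": x["problem"]})

def reward_format(completions, **kwargs):
    # Placeholder reward: favor completions that emit a boxed final answer.
    # A real run would verify the answer against the dataset's solution.
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="qwen-math-grpo",
    logging_steps=10,
    report_to="wandb",  # mirrors the live run linked at the end of this post
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # swap in the GSM8K-tuned checkpoint
    reward_funcs=reward_format,
    args=args,
    train_dataset=dataset,
)
trainer.train()
trainer.push_to_hub()  # publish the trained model to HF
```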
This is a test of how smol an Agent (smolAgents: https://huggingface.co/blog/smolagents) needs to be, paired with how smol a thinker, to still do a particular task. If this works, we may see a future where thinker agents interact directly with finely tuned model endpoints designed for exact tasks, instead of with applications or APIs (a minimal sketch of that pattern follows below).
Google search becomes a ranking of which endpoints can do a specific task best. An interesting idea as the smolAgents course kicks off. (https://huggingface.co/learn/agents-course/en/unit0/introduction)
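To make the endpoint-first idea concrete, here is a minimal smolagents sketch, assuming a hosted model behind the Hugging Face Inference API. The model ID is a stand-in, not this experiment's actual endpoint.

```python
# Sketch: a smolagents agent talking directly to a small, task-tuned endpoint.
# The model ID below is a placeholder for a task-specific checkpoint.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel(model_id="Qwen/Qwen2.5-0.5B-Instruct")
agent = CodeAgent(tools=[], model=model)  # no extra tools: the endpoint is the capability
print(agent.run("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```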
Live training (Feb 12) is underway and viewable here: https://wandb.ai/christian-cooper-us/qwen-math-grpo/runs/g5ij05u5?nw=nwuserchristiancooperus