[Experiment] Applying Open-R1-Math-220k to smolThinking models.
For anyone struggling to get GRPO working, I have a notebook pipeline that will publish a working model to HF in about 2 hours.
The experiment starts from a base reasoning model, Qwen/Qwen2.5-0.5B-Instruct fine-tuned on GSM8K, and then trains on the newly released Open-R1-Math-220k.
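For reference, a GRPO loop of this shape built on TRL's GRPOTrainer looks roughly like the sketch below. This is not the notebook's exact code: the checkpoint ID, dataset column name, and toy reward function are illustrative assumptions.

```python
# Minimal GRPO sketch with TRL (assumes trl>=0.14 and datasets installed).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Open-R1-Math-220k problems; GRPOTrainer expects a "prompt" column,
# so map the dataset's "problem" field onto it (column name assumed).
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")
dataset = dataset.map(lambda x: {"prompt": x["problem"]})

def reward_format(completions, **kwargs):
    # Placeholder reward: favor completions that emit a boxed final answer.
    # A real run would verify the answer against the dataset's solution.
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="qwen-math-grpo",
    logging_steps=10,
    report_to="wandb",  # mirrors the live run linked at the end of this post
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # swap in the GSM8K-tuned checkpoint
    reward_funcs=reward_format,
    args=args,
    train_dataset=dataset,
)
trainer.train()
trainer.push_to_hub()  # publish the trained model to HF
```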
This is a test of how smol an Agent (smolAgents: https://huggingface.co/blog/smolagents) needs to be, paired with how smol a thinker, to still do a particular task. If this works, we may see a future where thinker agents interact directly with finely tuned model endpoints designed for exact tasks, instead of with applications or APIs (a minimal sketch of that pattern follows below).
Google search becomes a ranking of which endpoints can do a specific task best. An interesting idea as the smolAgents course kicks off. (https://huggingface.co/learn/agents-course/en/unit0/introduction)
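To make the endpoint-first idea concrete, here is a minimal smolagents sketch, assuming a hosted model behind the Hugging Face Inference API. The model ID is a stand-in, not this experiment's actual endpoint.

```python
# Sketch: a smolagents agent talking directly to a small, task-tuned endpoint.
# The model ID below is a placeholder for a task-specific checkpoint.
from smolagents import CodeAgent, HfApiModel

model = HfApiModel(model_id="Qwen/Qwen2.5-0.5B-Instruct")
agent = CodeAgent(tools=[], model=model)  # no extra tools: the endpoint is the capability
print(agent.run("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```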
Live training (Feb 12) is underway and viewable here: https://wandb.ai/christian-cooper-us/qwen-math-grpo/runs/g5ij05u5?nw=nwuserchristiancooperus