Model Issue
The current merged version struggles with long-chain reasoning and tends to provide immediate answers directly. Would it be possible to explore re-merging the model to address this limitation?
We are aware of this problem and are working to fix it. It may stem from the significantly different parameter spaces of Qwen2.5-Coder-32B and DeepSeek-R1-32B.
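As a rough way to quantify "significantly different parameter spaces", one could compare corresponding weight tensors of the two checkpoints, e.g. by cosine similarity. A minimal sketch with toy stand-in weights (the parameter names and values below are hypothetical, not taken from either model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def compare_state_dicts(sd_a, sd_b):
    """Per-parameter cosine similarity for tensors present in both models."""
    return {name: cosine_similarity(sd_a[name], sd_b[name])
            for name in sd_a if name in sd_b}

# Toy stand-ins for two checkpoints' flattened weights.
model_a = {"layers.0.q_proj": [0.1, 0.2, 0.3], "layers.0.k_proj": [1.0, 0.0, 0.0]}
model_b = {"layers.0.q_proj": [0.1, 0.2, 0.3], "layers.0.k_proj": [0.0, 1.0, 0.0]}

sims = compare_state_dicts(model_a, model_b)
# Identical tensors score 1.0; orthogonal ones score 0.0.
```

Low similarities across many layers would suggest the checkpoints sit far apart in weight space, which tends to make naive merging lossy.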
I see a new version has been uploaded, @Wanfq. Any comments on the changes in this release? Does it fix the issue discussed here?
Yes, we changed the base pretrained model from Qwen2.5-32B to Qwen2.5-Coder-32B, which indeed fixes this issue. Results are shown below:
Models | LiveCodeBench | LiveCodeBench-Easy | LiveCodeBench-Medium | LiveCodeBench-Hard |
---|---|---|---|---|
OpenAI o1 | 63.4 | 98.5 | 80.9 | 31.7 |
OpenAI o1-preview | 42.7 | 97.0 | 47.2 | 9.8 |
OpenAI o1-mini | 52.0 | 91.0 | 67.4 | 19.5 |
DeepSeek R1 | 62.8 | 98.4 | 78.3 | 32.2 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 56.1 | 93.6 | 73.1 | 23.4 |
Qwen/QwQ-32B-Preview | 44.4 | 94.9 | 53.8 | 10.0 |
NovaSky-AI/Sky-T1-32B-Preview | 37.3 | 89.7 | 40.4 | 6.6 |
FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview | 56.4 | 92.9 | 73.5 | 24.2 |
FuseAI/FuseO1-DeepSeekR1-QwQ-32B-Preview | 54.8 | 93.9 | 71.7 | 21.3 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview | 58.2 | 94.3 | 77.1 | 25.0 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview | 57.9 | 93.6 | 76.0 | 25.5 |