Model Issue
The current merged version struggles with long-chain reasoning and tends to provide immediate answers directly. Would it be possible to explore re-merging the model to address this limitation?
We are aware of this problem and are working to fix it. It may stem from the significantly different parameter spaces of Qwen2.5-Coder-32B and DeepSeek-R1-32B.
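As a rough way to quantify "significantly different parameter spaces", one could compare corresponding weight tensors of the two checkpoints, e.g. by cosine similarity. A minimal sketch with toy stand-in weights (the parameter names and values below are hypothetical, not taken from either model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def compare_state_dicts(sd_a, sd_b):
    """Per-parameter cosine similarity for tensors present in both models."""
    return {name: cosine_similarity(sd_a[name], sd_b[name])
            for name in sd_a if name in sd_b}

# Toy stand-ins for two checkpoints' flattened weights.
model_a = {"layers.0.q_proj": [0.1, 0.2, 0.3], "layers.0.k_proj": [1.0, 0.0, 0.0]}
model_b = {"layers.0.q_proj": [0.1, 0.2, 0.3], "layers.0.k_proj": [0.0, 1.0, 0.0]}

sims = compare_state_dicts(model_a, model_b)
# Identical tensors score 1.0; orthogonal ones score 0.0.
```

Low similarities across many layers would suggest the checkpoints sit far apart in weight space, which tends to make naive merging lossy.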
I see a new version has been uploaded, @Wanfq. Any comments on the changes in this release? Does it fix the issue discussed here?
Yes, we changed the base pretrained model from Qwen2.5-32B to Qwen2.5-Coder-32B, which indeed fixes this issue. Results are shown below:
Models | LiveCodeBench | LiveCodeBench-Easy | LiveCodeBench-Medium | LiveCodeBench-Hard |
---|---|---|---|---|
OpenAI o1 | 63.4 | 98.5 | 80.9 | 31.7 |
OpenAI o1-preview | 42.7 | 97.0 | 47.2 | 9.8 |
OpenAI o1-mini | 52.0 | 91.0 | 67.4 | 19.5 |
DeepSeek R1 | 62.8 | 98.4 | 78.3 | 32.2 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 56.1 | 93.6 | 73.1 | 23.4 |
Qwen/QwQ-32B-Preview | 44.4 | 94.9 | 53.8 | 10.0 |
NovaSky-AI/Sky-T1-32B-Preview | 37.3 | 89.7 | 40.4 | 6.6 |
FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview | 56.4 | 92.9 | 73.5 | 24.2 |
FuseAI/FuseO1-DeepSeekR1-QwQ-32B-Preview | 54.8 | 93.9 | 71.7 | 21.3 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview | 58.2 | 94.3 | 77.1 | 25.0 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview | 57.9 | 93.6 | 76.0 | 25.5 |