jiangchengchengNLP/qwen2.5-distill-QWQ

Model Details

This model is a smaller model distilled from Qwen/QwQ-32B-Preview. It was fine-tuned on 60,000 math-related samples from amphora/QwQ-LongCoT-130K, using Qwen2.5-Coder-7B-Instruct as the base model and LoRA as the fine-tuning method. The tokenizer is kept consistent with QwQ's, and the model reasons best in English.

Uses

To use the model, follow the code below:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jiangchengchengNLP/qwen2.5-distill-QWQ")
base_model = AutoModelForCausalLM.from_pretrained(r"Qwen/Qwen2.5-Coder-7B-Instruct", device_map='auto', torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base_model, "jiangchengchengNLP/qwen2.5-distill-QWQ")

prompt = "how many `r` in `strawberry`?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    top_p=0.8,
    temperature=0.2
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Evaluation

Thanks to the teacher (QwQ), qwen2.5-distill-QWQ reaches 67.01% on MATH, surpassing the 66.8% of Qwen2.5-Math-72B.

Apply Simple Test Time Scaling to strengthen COT of Model

from huggingface_hub import snapshot_download
thinker_lora_path=snapshot_download(repo_id="jiangchengchengNLP/qwen2.5-distill-QWQ")
from vllm import LLM,SamplingParams
from vllm.lora.request import LoRARequest
from transformers import AutoTokenizer

# Decide on a token limit for thinking. As the model's max tokens is 32768, 32000 usually leaves enough space for the model to still answer.
MAX_TOKENS_THINKING = 32000
# Decide how often to ignore the end-of-thinking token
NUM_IGNORE = 1
# Load the base model with LoRA support; the adapter is attached per request below
model=LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",enable_lora=True)
tokenizer=AutoTokenizer.from_pretrained(
    "jiangchengchengNLP/qwen2.5-distill-QWQ"
)
stop_token_ids=tokenizer("<|im_end|>")['input_ids']
sampling_params=SamplingParams(
    max_tokens=MAX_TOKENS_THINKING,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
    skip_special_tokens=True,
    temperature=0.0
)
# For the math sample: match the "**Final Answer**" marker and everything after it
import re
pattn=re.compile(r"\*\*Final Answer\*\*.*",re.S)

prompts=[
    r"""Given positive real numbers $a$ and $b$ satisfy $a+b=1$, then $M=$ $\sqrt{1+a^{2}}+\sqrt{1+2 b}$ the integer part is ? """,
]
for i,p in enumerate(prompts):
    messages=[
        {'role':'user','content':p}
    ]
    prompt=tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )
    o=model.generate(
        prompt,
        sampling_params=sampling_params,
        lora_request=LoRARequest("thinker_adapter", 1, thinker_lora_path)
    )
    ignore_str="Wait a minute"
    max_tokens_thinking_tmp = MAX_TOKENS_THINKING
    if max_tokens_thinking_tmp>0:
        for _ in range(NUM_IGNORE):  # number of times to skip the stop token
            max_tokens_thinking_tmp-=len(o[0].outputs[0].token_ids)
            generate_text=o[0].outputs[0].text
            drop_text=pattn.findall(generate_text)
            if drop_text:
                generate_text=generate_text.replace(drop_text[0],"")
            prompt+=generate_text+ignore_str
            sampling_params = SamplingParams(
                max_tokens=max_tokens_thinking_tmp,
                min_tokens=1,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=True,
                temperature=0.0,
            )
            o = model.generate(
                prompt,
                sampling_params=sampling_params,
                lora_request=LoRARequest("thinker_adapter", 1, thinker_lora_path)
            )
    # Final answer stage: drop any premature answer, then ask for it explicitly
    generate_text=o[0].outputs[0].text
    drop_text=pattn.findall(generate_text)
    if drop_text:
        generate_text=generate_text.replace(drop_text[0],"")
    prompt+=generate_text+"\n"+"**Final Answer**"
    sampling_params = SamplingParams(
        max_tokens=max_tokens_thinking_tmp,
        min_tokens=1,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    o=model.generate(
        prompt,
        sampling_params=sampling_params,
        lora_request=LoRARequest("thinker_adapter", 1, thinker_lora_path)
    )
    print("With budget forcing:")
    print(prompt + o[0].outputs[0].text)
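
The control flow of the script above can be hard to follow with the vLLM calls mixed in. The following is a minimal sketch of the same budget-forcing loop with the model call abstracted away. The `budget_force` function, the `Generate` callable signature, and the `fake_generate` stand-in are all hypothetical names introduced here for illustration; they are not part of vLLM or of this repository.

```python
import re
from typing import Callable, Tuple

# Matches the "**Final Answer**" marker and everything after it, as in the script above.
FINAL_ANSWER = re.compile(r"\*\*Final Answer\*\*.*", re.S)

# Hypothetical generator signature: (prompt, max_tokens) -> (text, tokens_used).
Generate = Callable[[str, int], Tuple[str, int]]


def budget_force(generate: Generate, prompt: str, max_thinking_tokens: int,
                 num_ignore: int = 1, ignore_str: str = "Wait a minute") -> str:
    """Return the full transcript after forcing extra reasoning rounds."""
    remaining = max_thinking_tokens
    text, used = generate(prompt, remaining)
    for _ in range(num_ignore):
        remaining -= used
        if remaining <= 0:
            break  # thinking budget exhausted; go straight to the answer
        # Drop any premature final answer, then nudge the model to keep thinking.
        prompt += FINAL_ANSWER.sub("", text) + ignore_str
        text, used = generate(prompt, remaining)
    # Final stage: strip a trailing answer once more and ask for it explicitly.
    prompt += FINAL_ANSWER.sub("", text) + "\n**Final Answer**"
    answer, _ = generate(prompt, max(remaining, 1))
    return prompt + answer


def fake_generate(prompt: str, max_tokens: int) -> Tuple[str, int]:
    """Stand-in for a model call; counts whitespace-separated words as tokens."""
    if prompt.endswith("**Final Answer**"):
        text = " 2"
    elif prompt.endswith("Wait a minute"):
        text = ", let me recheck. Still 2."
    else:
        text = "M is between 2 and 3.\n**Final Answer** 3"
    return text, len(text.split())


transcript = budget_force(fake_generate, "Q: integer part of M?\n", max_thinking_tokens=100)
print(transcript)
```

In this toy run the first answer is discarded, "Wait a minute" is appended to trigger one more round of reasoning, and only then is the final answer requested. With a real model, `generate` would wrap a call such as `model.generate(...)` from the script above.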

If you want to combine lora weights into one model then use the following code

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jiangchengchengNLP/qwen2.5-distill-QWQ")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct",device_map='cpu',torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base_model, "jiangchengchengNLP/qwen2.5-distill-QWQ")
mergemodel = model.merge_and_unload()
mergemodel.save_pretrained("./merge_model")
tokenizer.save_pretrained("./merge_model")
print("model has been merged!")
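
For intuition about what merging does, here is a small sketch of the standard LoRA merge arithmetic, W' = W + (alpha / r) * B A, on toy matrices in plain Python. It is not PEFT's actual implementation, and the shapes, rank, and alpha below are made-up values for illustration, not this adapter's hyperparameters. The helper names `matmul` and `merge_lora` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, fine for toy sizes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, the merged weight."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy values (assumptions): d_out = 2, d_in = 3, rank r = 1, alpha = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[1.0, 2.0, 3.0]]      # r x d_in
B = [[1.0], [-1.0]]        # d_out x r
x = [[1.0], [1.0], [1.0]]  # d_in x 1 input column

merged = merge_lora(W, A, B, alpha=2.0, r=1)

# Unmerged forward pass: base path plus the scaled adapter path.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + 2.0 * l[0]] for b, l in zip(base_out, lora_out)]

print(merged)
print(matmul(merged, x) == unmerged)
```

The merged weight gives the same output as running the base weight and the adapter side by side, which is why the merged checkpoint can be loaded without PEFT at inference time.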

PEFT 0.14.0
