Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

🖥️Code | 🤗Data | 📄Paper

This repo contains the Qwen1.5-32B-SFT-Step-DPO model. It is obtained by performing Step-DPO on Qwen1.5-32B-SFT.
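
A minimal loading sketch follows. It assumes the repository id is xinlai/Qwen1.5-32B-SFT-Step-DPO, that the checkpoint loads through the standard Hugging Face transformers causal-LM classes, and that the accelerate package is installed for device_map="auto". The prompt wording in build_prompt is a hypothetical placeholder, not the official template; check the code repository for the exact format.

```python
MODEL_ID = "xinlai/Qwen1.5-32B-SFT-Step-DPO"  # assumed repository id


def build_prompt(question: str) -> str:
    # Hypothetical plain instruction prompt; the official template may differ.
    return (
        f"{question}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Heavy imports are kept local so the helper above works without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # the weights are published in BF16
        device_map="auto",           # shards the model across available devices
    )
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling generate_answer("What is 12 * 7?") downloads the full checkpoint, so a 32.5B-parameter model in BF16 needs roughly 65 GB of accelerator memory for the weights alone.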

Step-DPO is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, when applied to Qwen2-72B-Instruct, Step-DPO achieves 70.8% and 94.0% on the MATH and GSM8K test sets, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
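
To illustrate what "step-wise" preference optimization means, here is a rough sketch of a DPO-style loss computed on a single reasoning step instead of a whole answer. It assumes the standard DPO objective (negative log-sigmoid of the scaled log-ratio margin) applied to a preferred and a dispreferred next step that share the same problem and preceding steps. The function name, arguments, and the default beta are illustrative choices, not taken from the official code.

```python
import math


def step_dpo_loss(logp_win: float, logp_lose: float,
                  ref_logp_win: float, ref_logp_lose: float,
                  beta: float = 0.1) -> float:
    """DPO-style loss for one pair of candidate reasoning steps.

    Each argument is the summed token log-probability of a candidate next step,
    conditioned on the problem and the shared preceding steps, under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; raising the probability of the preferred step relative to the dispreferred one pushes the loss below that value.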

Contact

Please submit an issue here or send me an email here.

Safetensors · Model size: 32.5B params · Tensor type: BF16