# Korean S1K-1.1
- Generation Result: werty1248/qwen-32b-s1.1-Ko-Native-result
- AIME2024 (Korean): 30% (9/30) with `max_tokens = 8192`
- Chinese appears in the think tokens (not hallucination or failure behavior: the model solves correctly in Chinese, then answers in Korean)
## Training Details
- Trained with the official s1k code; key settings (a hedged training sketch follows this list):
  - `block_size = 20000`
  - `gradient_checkpointing = True`
- Training Dataset: werty1248/s1k-1.1-Ko-ReGenerated-Formatted
  - Translated the questions from s1k-1.1 into Korean, then regenerated the reasoning traces with DeepSeek-R1.
  - The `text` column needs to be converted to the Qwen chat format (a conversion sketch follows this list).
- 8x H200 SXM, 2 hours; $63.84 (not counting the cost of trial and error :( )
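
Below is a hedged sketch of the `text`-column conversion mentioned above. The column names (`question`, `deepseek_thinking_trajectory`, `deepseek_attempt`) and the think/answer markers follow s1K-1.1 conventions but are assumptions here; check them against the actual dataset before use:

```python
# Hedged sketch of rebuilding the "text" column in Qwen chat format.
# Column names and think/answer markers are assumptions, not confirmed.
from datasets import load_dataset

QWEN_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<|im_start|>think\n{thinking}\n"
    "<|im_start|>answer\n{answer}<|im_end|>"
)

def to_qwen_text(example):
    example["text"] = QWEN_TEMPLATE.format(
        question=example["question"],                      # assumed column
        thinking=example["deepseek_thinking_trajectory"],  # assumed column
        answer=example["deepseek_attempt"],                # assumed column
    )
    return example

ds = load_dataset("werty1248/s1k-1.1-Ko-ReGenerated-Formatted", split="train")
ds = ds.map(to_qwen_text)
```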
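
Given the converted dataset, the training settings above could map onto a `trl`-based SFT run roughly as follows. This is an assumption-laden approximation, not the actual s1k training script; the base model id and batch settings are guesses:

```python
# Hedged sketch, NOT the official s1k training code: shows where
# block_size (as max sequence length) and gradient_checkpointing fit.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("werty1248/s1k-1.1-Ko-ReGenerated-Formatted", split="train")

config = SFTConfig(
    output_dir="qwen2.5-32b-s1.1-ko-native",
    dataset_text_field="text",      # the column discussed above
    max_seq_length=20000,           # block_size = 20000
    gradient_checkpointing=True,
    per_device_train_batch_size=1,  # assumption for 20k-token sequences
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # assumed base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```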
## Evaluation
- HRM8K (Korean math benchmark)
  - Generated and evaluated with my own code; accuracy may differ from the official numbers.
| Model | GSM8K | KSM | MATH | OMNI_MATH |
|---|---|---|---|---|
| Qwen2.5-32B-s1.1-Ko-Native | 89.92 | 39.85 | 87.73 | 42.06 |
| *GPT-4o | 91.21 | 22.83 | 74.45 | 30.75 |
| *GPT-4o-mini | 87.57 | 19.40 | 70.68 | 26.45 |
| EXAONE-3.5-7.8B-Stratos-Ko | 83.02 | 15.97 | 67.49 | 24.62 |
| Qwen2.5-7B-s1.1-Ko-Native | 76.27 | 15.48 | 66.45 | 23.57 |
| EXAONE-3.5-7.8B-Instruct | 81.58 | 14.71 | 63.50 | 21.69 |
| *Qwen2.5-14B-Instruct | 66.34 | 15.55 | 53.38 | 20.64 |
| *Llama-3.1-8B-Instruct | 77.79 | 7.21 | 49.01 | 15.92 |
| *Qwen2.5-7B-Instruct | 58.38 | 13.10 | 48.04 | 16.55 |
| *EXAONE-3.0-7.8B-Instruct | 72.33 | 7.98 | 46.79 | 15.35 |
| *Ko-R1-1.5B-preview | 43.3 | ? | 73.1 | 29.8 |
\* Reported by the HRM8K authors
- Generation
  - `temperature = 0.7`
  - `top_p = 0.95`
  - `max_tokens = 8192`
  - If generation exceeded `max_tokens` without emitting the `</think>` token, `</think>` was appended and up to 512 additional tokens were generated (see the sketch after this list).
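
As a concrete illustration of the settings and the `</think>`-forcing step above, here is a minimal sketch assuming vLLM; the model id, example prompt, and loop structure are illustrative assumptions, not the author's actual script:

```python
# Hedged sketch of generation plus </think> forcing, assuming vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="werty1248/Qwen2.5-32B-s1.1-Ko-Native")  # assumed model id
prompts = ["1부터 100까지의 정수의 합을 구하시오."]  # hypothetical example prompt

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8192)
completions = []
for prompt, out in zip(prompts, llm.generate(prompts, params)):
    text = out.outputs[0].text
    if "</think>" not in text:
        # Hit the budget without closing the think block: force-close it
        # and let the model produce the final answer in 512 more tokens.
        cont = llm.generate(
            [prompt + text + "</think>"],
            SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512),
        )[0].outputs[0].text
        text += "</think>" + cont
    completions.append(text)
```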
- Evaluation
  - Custom parser and evaluation code were used; some answers may have been parsed incorrectly (a sketch of the kind of parsing involved follows).
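
The evaluation code itself is not shown in this card; the sketch below illustrates the kind of `\boxed{}` answer extraction such a parser typically performs. The function and its fallback behavior are hypothetical:

```python
def extract_boxed_answer(text: str) -> str | None:
    """Hypothetical parser: pull the last \\boxed{...} span, handling
    nested braces; returns None when nothing parseable is found."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out) if depth == 0 else None

# Example: prints "42"
print(extract_boxed_answer(r"... the answer is \boxed{42}."))
```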
## Why Qwen? Why not EXAONE?