Korean S1K-1.1

  • Generation Result: werty1248/qwen-32b-s1.1-Ko-Native-result
    • AIME2024 (Korean): 30% (9/30) with max_tokens=8192
  • Chinese appears in the think tokens (not a hallucination or failure mode: the model solves the problem correctly in Chinese, then answers in Korean)

Training Details

  • Trained with the official s1k code
    • block_size = 20000
    • gradient_checkpointing=True
  • Training Dataset: werty1248/s1k-1.1-Ko-ReGenerated-Formatted
    • Questions from s1k-1.1 were translated into Korean, then regenerated with DeepSeek-R1.
    • The "text" column needs to be converted to the Qwen chat format (see the sketch after this list)
  • 8x H200 SXM, 2 hours; $63.84, not counting the cost of trial and error :(
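
The preprocessing script is not included in this card, so the snippet below is only a sketch of how the "text" column could be rebuilt with the Qwen chat template; the column names ("question", "thinking", "solution") are assumptions and may not match the actual dataset schema.

```python
# Sketch only: rebuild the "text" column in Qwen chat format.
# Column names ("question", "thinking", "solution") are assumptions about the
# schema of werty1248/s1k-1.1-Ko-ReGenerated-Formatted, not guaranteed to match.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
ds = load_dataset("werty1248/s1k-1.1-Ko-ReGenerated-Formatted", split="train")

def to_qwen_text(example):
    # Wrap the reasoning trace in <think> ... </think>, followed by the final solution.
    messages = [
        {"role": "user", "content": example["question"]},
        {
            "role": "assistant",
            "content": f"<think>\n{example['thinking']}\n</think>\n{example['solution']}",
        },
    ]
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

ds = ds.map(to_qwen_text)
```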

Evaluation

  • HRM8K - Korean Math Benchmark

Generated and evaluated with my own code, so scores may differ from officially reported numbers.

| Model | GSM8K | KSM | MATH | OMNI_MATH |
|---|---|---|---|---|
| Qwen2.5-32B-s1.1-Ko-Native | 89.92 | 39.85 | 87.73 | 42.06 |
| *GPT-4o | 91.21 | 22.83 | 74.45 | 30.75 |
| *GPT-4o-mini | 87.57 | 19.40 | 70.68 | 26.45 |
| EXAONE-3.5-7.8B-Stratos-Ko | 83.02 | 15.97 | 67.49 | 24.62 |
| Qwen2.5-7B-s1.1-Ko-Native | 76.27 | 15.48 | 66.45 | 23.57 |
| EXAONE-3.5-7.8B-Instruct | 81.58 | 14.71 | 63.50 | 21.69 |
| *Qwen2.5-14B-Instruct | 66.34 | 15.55 | 53.38 | 20.64 |
| *Llama-3.1-8B-Instruct | 77.79 | 7.21 | 49.01 | 15.92 |
| *Qwen2.5-7B-Instruct | 58.38 | 13.10 | 48.04 | 16.55 |
| *EXAONE-3.0-7.8B-Instruct | 72.33 | 7.98 | 46.79 | 15.35 |
| *Ko-R1-1.5B-preview | 43.3 | ? | 73.1 | 29.8 |

* Reported by HRM8K authors

  • Generation
    • temperature = 0.7
    • top_p = 0.95
    • max_tokens = 8192
    • If generation hit the maximum token limit without emitting a </think> token, a </think> token was appended and up to 512 additional tokens were generated (see the sketch after this list)
  • Evaluation
    • Custom parser and evaluation code; some answers may have been parsed incorrectly
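
The actual generation and scoring scripts are not published in this card, so the following is only a rough sketch of the procedure described above, assuming vLLM for inference, the Qwen chat template for prompt construction, and a naive \boxed{...} regex as the answer parser. All function names here are illustrative.

```python
# Rough sketch of the generation/evaluation loop described above (assumptions:
# vLLM inference, Qwen chat template, \boxed{...} regex parser).
import re
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "werty1248/Qwen2.5-32B-s1.1-Ko-Native"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, tensor_parallel_size=8)

first_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=8192)
answer_pass = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

def generate(question: str) -> str:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    text = llm.generate([prompt], first_pass)[0].outputs[0].text
    if "</think>" not in text:
        # Token budget exhausted before the reasoning trace closed:
        # force-close it and let the model produce a short final answer.
        text += "\n</think>\n"
        text += llm.generate([prompt + text], answer_pass)[0].outputs[0].text
    return text

def extract_answer(text: str) -> str | None:
    # Naive parser: take the last \boxed{...}; nested braces are not handled,
    # which is one way parsing can go wrong.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None
```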

Why Qwen? Why couldn't EXAONE?
