Update README.md
README.md
CHANGED
@@ -276,10 +276,9 @@ per_device_train_batch_size: 4
 The evaluation of this model is based on HuggingFace's [OpenR1](https://github.com/huggingface/open-r1) instructions.
 
 ```
-
-MODEL=
-
-TASK=aime24
+MODEL=keeeeenw/Llama-3.2-1B-Instruct-Open-R1-Distill
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
+TASK=math_500
 OUTPUT_DIR=data/evals/$MODEL
 
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
@@ -289,5 +288,18 @@ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --output-dir $OUTPUT_DIR
 ```
 
-
+```
+|      Task       |Version|     Metric     |Value|   |Stderr|
+|-----------------|------:|----------------|----:|---|-----:|
+|all              |       |extractive_match|0.216|±  |0.0184|
+|custom:math_500:0|      1|extractive_match|0.216|±  |0.0184|
+```
+
+For comparison, **DeepSeek-R1-Distill-Qwen-1.5B** scores 81.6 when computed with the same evaluation script (as reported by HuggingFace),
+which is close to the official number of 83.9 reported by **DeepSeek**.
 
+There is still a long way to go on score improvements:
+1. Distill with actual math data instead of **HuggingFaceH4/Bespoke-Stratos-17k**. Data is likely the real bottleneck here, and we can potentially collect and filter more data ourselves.
+2. Test a few other checkpoints to see whether this particular one achieves the best results.
+3. The model tends to be wordy. We should try to make it more concise, because the model length limit is only 32768 tokens.
+4. Try out GRPO and/or a combination of GRPO and SFT.
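As a quick sanity check (not part of the diff above), the ±0.0184 in the new results table is consistent with the binomial standard error of a 0.216 accuracy over the 500 problems in math_500:

```python
import math

def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a mean over n Bernoulli (pass/fail) trials."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# 0.216 extractive_match over the 500 problems in MATH-500
se = binomial_stderr(0.216, 500)
print(round(se, 4))  # 0.0184, matching the Stderr column above
```

This only confirms the table's internal consistency; it says nothing about the score itself.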