Update README.md
README.md
CHANGED
@@ -276,10 +276,9 @@ per_device_train_batch_size: 4
 The evaluation of this model is based on HuggingFace's [OpenR1](https://github.com/huggingface/open-r1) instructions.
 
 ```
-
-MODEL=
-
-TASK=aime24
+MODEL=keeeeenw/Llama-3.2-1B-Instruct-Open-R1-Distill
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
+TASK=math_500
 OUTPUT_DIR=data/evals/$MODEL
 
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
@@ -289,5 +288,18 @@ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --output-dir $OUTPUT_DIR
 ```
 
-
+```
+|      Task       |Version|     Metric     |Value|   |Stderr|
+|-----------------|------:|----------------|----:|---|-----:|
+|all              |       |extractive_match|0.216|±  |0.0184|
+|custom:math_500:0|      1|extractive_match|0.216|±  |0.0184|
+```
+
+For comparison, **DeepSeek-R1-Distill-Qwen-1.5B** scores 81.6 when computed with the same evaluation script (as reported by HuggingFace),
+which is close to the official number of 83.9 reported by **DeepSeek**.
 
+There is still a long way to go on score improvements:
+1. Distill with actual math data instead of **HuggingFaceH4/Bespoke-Stratos-17k**. Data is likely the real bottleneck here, and we can potentially collect and filter more data ourselves.
+2. Test a few other checkpoints to see whether this particular one achieves the best results.
+3. The model tends to be wordy. We should try to make it more concise, because the model length limit is only 32768 tokens.
+4. Try out GRPO and/or a combination of GRPO and SFT.
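As a quick sanity check (not part of the diff above), the ±0.0184 in the new results table is consistent with the binomial standard error of a 0.216 accuracy over the 500 problems in math_500:

```python
import math

def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a mean over n Bernoulli (pass/fail) trials."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# 0.216 extractive_match over the 500 problems in MATH-500
se = binomial_stderr(0.216, 500)
print(round(se, 4))  # 0.0184, matching the Stderr column above
```

This only confirms the table's internal consistency; it says nothing about the score itself.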