Safetensors · llama

keeeeenw committed commit 1d8ba20 (verified) · 1 Parent(s): 74040d7

Update README.md

Files changed (1):
  1. README.md +17 -5
README.md CHANGED
@@ -276,10 +276,9 @@ per_device_train_batch_size: 4
 The evaluation of this model is based on HuggingFace's instructions [OpenR1](https://github.com/huggingface/open-r1)
 
 ```
-NUM_GPUS=4
-MODEL="/root/open-r1/data/meta-llama/Llama-3.2-1B-Instruct"
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
-TASK=aime24
+MODEL=keeeeenw/Llama-3.2-1B-Instruct-Open-R1-Distill
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
+TASK=math_500
 OUTPUT_DIR=data/evals/$MODEL
 
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
@@ -289,5 +288,18 @@ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --output-dir $OUTPUT_DIR
 ```
 
-Results: To be added. I don't have CUDA-12.1 on the rental GPU server so I will run evaluation later.
+```
+|Task             |Version|     Metric     |Value|   |Stderr|
+|-----------------|------:|----------------|----:|---|-----:|
+|all              |       |extractive_match|0.216|±  |0.0184|
+|custom:math_500:0|      1|extractive_match|0.216|±  |0.0184|
+```
+
+For comparison, **DeepSeek-R1-Distill-Qwen-1.5B** scores 81.6 (versus 21.6 here) when computed with the same evaluation script (as reported by HuggingFace),
+which is close to the official number of 83.9 reported by **DeepSeek**.
+
+There is still a long way to go on score improvements:
+1. Distill with actual math data instead of **HuggingFaceH4/Bespoke-Stratos-17k**. Data is likely the real bottleneck here, and we can potentially collect and filter more data ourselves.
+2. Test a few other checkpoints to see whether this particular checkpoint achieves the best results.
+3. This model tends to be wordy. We should try to make it more concise, because the model length limit is only 32768 tokens.
+4. Try out GRPO and/or a combination of GRPO and SFT.
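
As a sanity check on the added results table, the reported Stderr of 0.0184 is consistent with a plain binomial standard error over the 500 MATH-500 problems. This is a minimal sketch under that assumption; lighteval's exact estimator may differ slightly.

```python
import math

# Reported MATH-500 result from this commit: extractive_match = 0.216 ± 0.0184.
# Assumption: Stderr is the binomial standard error sqrt(p * (1 - p) / n)
# over the n = 500 problems in MATH-500.
p = 0.216   # reported extractive_match accuracy
n = 500     # number of MATH-500 problems
stderr = math.sqrt(p * (1 - p) / n)
print(round(stderr, 4))  # -> 0.0184, matching the reported Stderr
```

The match suggests the table was produced from a single pass over all 500 problems (0.216 · 500 = 108 correct answers).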