distily_TinyStories-33M

This student model is distilled from the teacher model roneneldan/TinyStories-33M using an unspecified dataset.

The Distily library was used for this distillation.
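
The distilled student can be loaded with the standard transformers API. A minimal usage sketch follows; the repository id is taken from this card and may need adjusting for your copy, and the sampling settings are only illustrative.

```python
# Minimal usage sketch for the distilled student (standard transformers API).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "distily/distily_TinyStories-33M_freeze_emb"  # id as listed on this card; adjust if needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```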

It achieves the following results on the evaluation set (a hedged sketch of the perplexity computation follows this list):

  • eval_enwikippl: 86.0272
  • eval_frwikippl: 9172.2910
  • eval_zhwikippl: 31986.0898
  • eval_loss: 0.9611
  • eval_runtime: 27.2508
  • eval_samples_per_second: 91.741
  • eval_steps_per_second: 11.486
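
The enwikippl, frwikippl, and zhwikippl values are perplexities of the student on English, French, and Chinese Wikipedia text. The sketch below shows one way such a figure could be computed; it is not the Distily evaluation code, and the corpus handling (per-text truncation, token counting) is an assumption.

```python
# Hedged sketch: perplexity as exp(mean token negative log-likelihood).
# Not the Distily evaluation code; corpus handling here is an assumption.
import torch

def perplexity(model, tokenizer, texts, device="cpu", max_length=1024):
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].size(1) - 1  # predicted tokens (labels are shifted internally)
            total_nll += out.loss.item() * n
            total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```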

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=5000.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=500.0, loss_fn=jsd, layer_mapper=None, projector=None)) (an illustrative sketch of this combined objective follows this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
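
The distillation_objective above combines a KL term on the teacher/student logits (weight 1), an MSE term on the hidden states (weight 5000.0), and a Jensen-Shannon term on the attention maps (weight 500.0), with no layer mapper or projector. Below is an illustrative PyTorch reconstruction of such a combined loss; it is not the Distily implementation, and the helper names, tensor shapes, and per-layer averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def jsd(p, q, eps=1e-9):
    # Jensen-Shannon divergence between two probability distributions
    # (e.g. attention maps), taken along the last dimension.
    m = (0.5 * (p + q)).clamp_min(eps).log()
    return 0.5 * (F.kl_div(m, p, reduction="batchmean")
                  + F.kl_div(m, q, reduction="batchmean"))

def distillation_loss(student_out, teacher_out,
                      logits_w=1.0, hs_w=5000.0, attn_w=500.0):
    # Both forward passes are assumed to have been run with
    # output_hidden_states=True and output_attentions=True.
    logits_loss = F.kl_div(            # KL(teacher || student) over the vocabulary
        student_out.logits.log_softmax(-1),
        teacher_out.logits.softmax(-1),
        reduction="batchmean",
    )
    hs_loss = torch.stack([            # MSE on hidden states, averaged over layers
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    attn_loss = torch.stack([          # JSD on attention maps, averaged over layers
        jsd(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return logits_w * logits_loss + hs_w * hs_loss + attn_w * attn_loss
```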

Resource Usage

Peak GPU Memory: 8.2940 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 174.1653 | 48148.2734 | | | | | 4930.5806 |
| 0 | 0 | 42788.8555 | 63779.7148 | 13.4382 | 27.2438 | 91.764 | 11.489 | 57958.3359 |
| 1000 | 0.0323 | 176.2009 | 44333.6016 | 1.6774 | 27.3162 | 91.521 | 11.458 | 457143.6562 |
| 2000 | 0.0646 | 128.0956 | 24798.3691 | 1.5142 | 27.266 | 91.689 | 11.48 | 119591.0781 |
| 3000 | 0.0970 | 109.3945 | 15041.4014 | 1.3719 | 27.4573 | 91.051 | 11.4 | 68749.8828 |
| 4000 | 0.1293 | 103.1060 | 11736.2949 | 1.2548 | 27.2393 | 91.779 | 11.491 | 52875.8438 |
| 5000 | 0.1616 | 112.2423 | 11673.6494 | 1.1644 | 27.3226 | 91.499 | 11.456 | 45928.1172 |
| 6000 | 0.1939 | 98.1303 | 11178.0225 | 1.0962 | 27.294 | 91.595 | 11.468 | 43252.2148 |
| 7000 | 0.2263 | 93.0121 | 9680.7031 | 1.0394 | 27.3697 | 91.342 | 11.436 | 36992.1562 |
| 8000 | 0.2586 | 90.4050 | 9906.2393 | 1.0005 | 27.424 | 91.161 | 11.413 | 34836.8906 |
| 9000 | 0.2909 | 86.0272 | 9172.2910 | 0.9611 | 27.2508 | 91.741 | 11.486 | 31986.0898 |
| 10000 | 0.3232 | 86.3193 | 8911.2168 | 0.9344 | 27.4195 | 91.176 | 11.415 | 33114.9648 |
| 11000 | 0.3555 | 85.1883 | 9004.5898 | 0.9170 | 27.6131 | 90.537 | 11.335 | 28466.0332 |
| 12000 | 0.3879 | 82.4485 | 8789.0557 | 0.8952 | 27.5622 | 90.704 | 11.356 | 26171.4727 |
| 13000 | 0.4202 | 86.4648 | 11200.8799 | 0.8819 | 27.2915 | 91.603 | 11.469 | 28254.1816 |
| 14000 | 0.4525 | 83.4509 | 8846.1875 | 0.8756 | 27.288 | 91.615 | 11.47 | 24126.1836 |
| 15000 | 0.4848 | 83.4380 | 8696.6904 | 0.8562 | 27.2967 | 91.586 | 11.467 | 22347.7852 |
| 16000 | 0.5172 | 84.3804 | 9052.9209 | 0.8506 | 27.5838 | 90.633 | 11.347 | 26039.1504 |
| 17000 | 0.5495 | 92.4088 | 9267.0918 | 0.8451 | 27.2622 | 91.702 | 11.481 | 24745.4961 |
| 18000 | 0.5818 | 92.4374 | 9366.8291 | 0.8401 | 27.5177 | 90.851 | 11.375 | 23503.5566 |
| 19000 | 0.6141 | 87.0512 | 8318.6689 | 0.8306 | 27.185 | 91.963 | 11.514 | 23050.2109 |
| 20000 | 0.6465 | 93.4635 | 10036.1631 | 0.8266 | 27.3179 | 91.515 | 11.458 | 26122.6484 |
| 21000 | 0.6788 | 82.3464 | 9078.4600 | 0.8196 | 27.3629 | 91.365 | 11.439 | 28156.3516 |
| 22000 | 0.7111 | 81.6666 | 9332.5889 | 0.8155 | 27.6142 | 90.533 | 11.335 | 32020.2734 |
| 23000 | 0.7434 | 84.7325 | 9831.8672 | 0.8086 | 27.2205 | 91.843 | 11.499 | 33488.1289 |
| 24000 | 0.7757 | 81.2596 | 8868.6484 | 0.8074 | 27.307 | 91.552 | 11.462 | 30275.5918 |
| 25000 | 0.8081 | 81.1778 | 8258.5459 | 0.8051 | 27.3489 | 91.411 | 11.445 | 26269.4199 |
| 26000 | 0.8404 | 84.4753 | 9221.5127 | 0.8007 | 27.3172 | 91.517 | 11.458 | 31739.5938 |
| 27000 | 0.8727 | 81.3541 | 9123.3232 | 0.7995 | 27.2848 | 91.626 | 11.472 | 36992.1562 |
| 28000 | 0.9050 | 85.5785 | 9260.5635 | 0.7973 | 27.1686 | 92.018 | 11.521 | 34531.5234 |
| 29000 | 0.9374 | 92.4553 | 8333.3262 | 0.7944 | 27.2956 | 91.59 | 11.467 | 41878.25 |
| 30000 | 0.9697 | 92.4625 | 8644.1758 | 0.7925 | 27.2757 | 91.657 | 11.475 | 49319.1836 |
| 30938 | 1.0 | 91.8841 | 8440.8330 | 0.7884 | 27.314 | 91.528 | 11.459 | 49928.1523 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0