distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

  • eval_enwikippl: 84.0
  • eval_frwikippl: 336.0
  • eval_zhwikippl: 143.0
  • eval_tinystoriesppl: 68.0
  • eval_loss: 0.6821
  • eval_runtime: 16.9876
  • eval_samples_per_second: 58.866
  • eval_steps_per_second: 7.358
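
The runtime and throughput figures above can be cross-checked with a short calculation. The sketch below multiplies each reported rate by the runtime to recover the approximate number of evaluation samples and steps. The variable names are illustrative, and rounding to whole counts assumes the logged values were rounded for display.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 16.9876          # seconds
samples_per_second = 58.866
steps_per_second = 7.358

# rate * time recovers the approximate counts behind the reported rates.
n_samples = round(samples_per_second * eval_runtime)
n_steps = round(steps_per_second * eval_runtime)

# Samples per step; this equals the evaluation batch size only if no other
# batching factor (for example multiple devices) is involved, which is an
# assumption here.
samples_per_step = n_samples / n_steps

print(n_samples, n_steps, samples_per_step)
```

Under these assumptions the evaluation covers about 1000 samples in 125 steps, which agrees with the eval_batch_size of 8 listed under the training hyperparameters.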

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.2
  • num_epochs: 1.0
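
The distillation_objective above combines a KL term on the logits (weight 1) with an MSE term on the last hidden state (weight 1.0), and disables the attention term (weight 0). The following is a minimal pure-Python sketch of such a weighted sum for a single token. It is not Distily's implementation: the function names are hypothetical, and the KL direction (teacher relative to student), the mean reduction, and the absence of a temperature are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token; the direction is an assumption."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_hidden, student_hidden):
    """Mean squared error between two hidden-state vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)

def distillation_loss(t_logits, s_logits, t_hidden, s_hidden,
                      logits_weight=1.0, hs_weight=1.0):
    """Weighted sum of the logits and hidden-state components.

    The attention component has weight 0 in the card, so it is omitted.
    """
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + hs_weight * mse(t_hidden, s_hidden))

# Toy inputs (made up for illustration, not taken from the model).
loss = distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3],
                         [0.5, -0.5], [0.4, -0.3])
print(loss)
```

When the student reproduces the teacher's logits and hidden state exactly, both terms vanish and the loss is zero.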

Resource Usage

Peak GPU Memory: 7.7252 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 20.7680 | 16.9731 | 58.917 | 7.365 | 4060086272.0 | 71468255805440.0 |
| 1000 | 0.0404 | 328.0 | 1488.0 | 1.5338 | 17.0565 | 58.629 | 7.329 | 243.0 | 608.0 |
| 2000 | 0.0808 | 229.0 | 792.0 | 1.3211 | 16.9911 | 58.854 | 7.357 | 190.0 | 260.0 |
| 3000 | 0.1212 | 178.0 | 624.0 | 1.1600 | 17.0018 | 58.817 | 7.352 | 149.0 | 176.0 |
| 4000 | 0.1616 | 147.0 | 580.0 | 1.0397 | 17.0215 | 58.749 | 7.344 | 119.0 | 161.0 |
| 5000 | 0.2020 | 128.0 | 516.0 | 0.9532 | 17.0153 | 58.771 | 7.346 | 102.0 | 159.0 |
| 6000 | 0.2424 | 111.0 | 410.0 | 0.8655 | 17.0046 | 58.808 | 7.351 | 90.0 | 147.0 |
| 7000 | 0.2828 | 104.5 | 410.0 | 0.8083 | 16.9742 | 58.913 | 7.364 | 82.0 | 145.0 |
| 8000 | 0.3232 | 97.5 | 382.0 | 0.7412 | 16.9735 | 58.915 | 7.364 | 74.0 | 128.0 |
| 9000 | 0.3636 | 84.0 | 336.0 | 0.6821 | 16.9876 | 58.866 | 7.358 | 68.0 | 143.0 |
| 10000 | 0.4040 | 77.5 | 312.0 | 0.6396 | 16.9771 | 58.903 | 7.363 | 65.0 | 140.0 |
| 11000 | 0.4444 | 75.5 | 280.0 | 0.5964 | 17.02 | 58.754 | 7.344 | 60.75 | 122.5 |
| 12000 | 0.4848 | 74.5 | 268.0 | 0.5797 | 16.9985 | 58.829 | 7.354 | 58.0 | 152.0 |
| 13000 | 0.5253 | 71.5 | 274.0 | 0.5537 | 16.9566 | 58.974 | 7.372 | 58.25 | 134.0 |
| 14000 | 0.5657 | 72.0 | 252.0 | 0.5429 | 16.9325 | 59.058 | 7.382 | 58.0 | 99.0 |
| 15000 | 0.6061 | 69.0 | 229.0 | 0.5308 | 16.9917 | 58.852 | 7.357 | 51.25 | 94.0 |
| 16000 | 0.6465 | 67.0 | 223.0 | 0.5209 | 16.9686 | 58.932 | 7.367 | 52.5 | 108.0 |
| 17000 | 0.6869 | 67.5 | 227.0 | 0.5046 | 16.979 | 58.896 | 7.362 | 54.25 | 118.0 |
| 18000 | 0.7273 | 67.5 | 244.0 | 0.5024 | 16.994 | 58.844 | 7.356 | 50.5 | 128.0 |
| 19000 | 0.7677 | 66.0 | 212.0 | 0.4931 | 16.9719 | 58.921 | 7.365 | 49.25 | 88.0 |
| 20000 | 0.8081 | 64.5 | 202.0 | 0.4925 | 17.0171 | 58.764 | 7.346 | 49.75 | 169.0 |
| 21000 | 0.8485 | 67.0 | 222.0 | 0.4839 | 16.9754 | 58.909 | 7.364 | 47.75 | 126.0 |
| 22000 | 0.8889 | 66.0 | 227.0 | 0.4759 | 16.9314 | 59.062 | 7.383 | 48.0 | 100.0 |
| 23000 | 0.9293 | 61.75 | 208.0 | 0.4704 | 16.9662 | 58.941 | 7.368 | 47.25 | 125.5 |
| 24000 | 0.9697 | 66.0 | 210.0 | 0.4706 | 17.0394 | 58.688 | 7.336 | 47.5 | 173.0 |
| 24750 | 1.0 | 63.75 | 218.0 | 0.4686 | 16.9798 | 58.894 | 7.362 | 46.75 | 82.5 |
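
The table can be used to quantify how far the student remains from the teacher at the end of training, and to check how the epoch column relates to the step column. The sketch below does both. It assumes the four values in the teacher row are, in order, enwikippl, frwikippl, tinystoriesppl and zhwikippl, and that the epoch value is simply the step divided by the final step count; the dictionary names are illustrative.

```python
# Perplexities copied from the table above.
# Assumption: the teacher row's four values map to these columns in this order.
teacher = {
    "enwikippl": 43.75,
    "frwikippl": 61.75,
    "tinystoriesppl": 11.8125,
    "zhwikippl": 19.125,
}
# Final row of the table (step 24750).
student_final = {
    "enwikippl": 63.75,
    "frwikippl": 218.0,
    "tinystoriesppl": 46.75,
    "zhwikippl": 82.5,
}

# Ratio > 1 means the student's perplexity is higher (worse) than the teacher's.
ratios = {name: student_final[name] / teacher[name] for name in teacher}
for name, ratio in ratios.items():
    print(f"{name}: student/teacher = {ratio:.2f}")

# Epoch column check: step / total steps, rounded to four decimals.
total_steps = 24750
epoch_at_1000 = round(1000 / total_steps, 4)
print(epoch_at_1000)
```

Under the assumed column mapping, the student is worse than the teacher on all four perplexity metrics, with the smallest gap on enwikippl.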

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0

Model files

  • Format: Safetensors
  • Model size: 124M params
  • Tensor type: BF16