lapp0 committed (verified)
Commit ad1087f · Parent(s): a18f4f5

End of training
README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 280.0
-- eval_frwikippl: 1392.0
-- eval_zhwikippl: 2576.0
-- eval_tinystoriesppl: 207.0
-- eval_loss: 1.4458
-- eval_runtime: 12.6656
-- eval_samples_per_second: 47.372
-- eval_steps_per_second: 11.843
+- eval_enwikippl: 249.0
+- eval_frwikippl: 600.0
+- eval_zhwikippl: 186.0
+- eval_tinystoriesppl: 220.0
+- eval_loss: 0.9819
+- eval_runtime: 12.7319
+- eval_samples_per_second: 47.126
+- eval_steps_per_second: 11.781
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
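The eval_*ppl values are perplexities on held-out text (English, French, and Chinese Wikipedia plus TinyStories, going by the metric names): the exponential of the mean next-token cross-entropy, so lower is better. A minimal sketch of that metric, assuming a Hugging Face causal LM; the wiring is illustrative, not Distily's evaluation harness:

```python
# Sketch: perplexity = exp(mean next-token cross-entropy) for a causal LM.
# Illustrative only; Distily's eval harness may batch and mask differently.
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    # HF causal LMs return the mean shifted cross-entropy when labels are passed
    loss = model(input_ids=input_ids, labels=input_ids).loss
    return torch.exp(loss).item()
```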
@@ -48,8 +48,8 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.0001
-- train_batch_size: 1
+- learning_rate: 4e-05
+- train_batch_size: 4
 - eval_batch_size: 4
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
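The distillation_objective above trains the student on a single logits term: KL divergence against the teacher's token distribution (weight 1), with the hidden-state and attention components disabled (weight 0). A minimal sketch of such a logits-only KL loss, assuming PyTorch; this illustrates the objective, not Distily's actual implementation:

```python
# Minimal sketch of the logits-only KL objective (logits weight=1, loss_fn=kl;
# hidden-state and attention components disabled at weight=0).
# Illustrative reimplementation, not Distily's code; names are assumed.
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean"
    # averages the per-token KL(teacher || student) over all tokens.
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```

With weight 1 on this term and no hidden-state or attention matching, the student's total training loss is just this KL term on each batch.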
@@ -64,47 +64,17 @@ Peak GPU Memory: 4.1856 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
-| 0 | 0 | 1821066133504.0 | 158329674399744.0 | 19.3254 | 12.6923 | 47.273 | 11.818 | 12079595520.0 | 98956046499840.0 |
-| 1500 | 0.0253 | 3616.0 | 28416.0 | 3.2424 | 12.6627 | 47.383 | 11.846 | 2560.0 | 606208.0 |
-| 3000 | 0.0505 | 836.0 | 5888.0 | 2.2048 | 12.6773 | 47.329 | 11.832 | 652.0 | 21376.0 |
-| 4500 | 0.0758 | 482.0 | 3152.0 | 1.8732 | 12.5614 | 47.766 | 11.941 | 352.0 | 580.0 |
-| 6000 | 0.1010 | 352.0 | 1520.0 | 1.5840 | 12.7029 | 47.233 | 11.808 | 272.0 | 330.0 |
-| 7500 | 0.1263 | 308.0 | 1272.0 | 1.5038 | 12.8562 | 46.67 | 11.668 | 241.0 | 294.0 |
-| 9000 | 0.1515 | 280.0 | 1392.0 | 1.4458 | 12.6656 | 47.372 | 11.843 | 207.0 | 2576.0 |
-| 10500 | 0.1768 | 238.0 | 936.0 | 1.3311 | 12.6721 | 47.348 | 11.837 | 185.0 | 286.0 |
-| 12000 | 0.2020 | 215.0 | 836.0 | 1.2527 | 12.6564 | 47.407 | 11.852 | 163.0 | 280.0 |
-| 13500 | 0.2273 | 197.0 | 772.0 | 1.2033 | 12.5546 | 47.791 | 11.948 | 150.0 | 266.0 |
-| 15000 | 0.2525 | 187.0 | 792.0 | 1.1701 | 12.5678 | 47.741 | 11.935 | 154.0 | 250.0 |
-| 16500 | 0.2778 | 171.0 | 704.0 | 1.1274 | 12.6162 | 47.558 | 11.889 | 150.0 | 152.0 |
-| 18000 | 0.3030 | 167.0 | 748.0 | 1.1101 | 12.5787 | 47.7 | 11.925 | 141.0 | 314.0 |
-| 19500 | 0.3283 | 164.0 | 684.0 | 1.1063 | 12.6722 | 47.348 | 11.837 | 142.0 | 254.0 |
-| 21000 | 0.3535 | 162.0 | 676.0 | 1.0552 | 12.6941 | 47.266 | 11.816 | 135.0 | 242.0 |
-| 22500 | 0.3788 | 146.0 | 628.0 | 1.0061 | 12.5812 | 47.69 | 11.923 | 120.0 | 226.0 |
-| 24000 | 0.4040 | 141.0 | 636.0 | 0.9924 | 12.5431 | 47.835 | 11.959 | 115.5 | 218.0 |
-| 25500 | 0.4293 | 138.0 | 776.0 | 0.9653 | 12.619 | 47.547 | 11.887 | 110.5 | 362.0 |
-| 27000 | 0.4545 | 129.0 | 572.0 | 0.9293 | 12.5466 | 47.822 | 11.955 | 113.0 | 245.0 |
-| 28500 | 0.4798 | 136.0 | 592.0 | 0.9387 | 12.658 | 47.401 | 11.85 | 117.5 | 204.0 |
-| 30000 | 0.5051 | 132.0 | 704.0 | 0.9436 | 12.5944 | 47.64 | 11.91 | 112.0 | 255.0 |
-| 31500 | 0.5303 | 120.0 | 596.0 | 0.9097 | 12.5827 | 47.684 | 11.921 | 104.5 | 262.0 |
-| 33000 | 0.5556 | 119.5 | 548.0 | 0.8672 | 12.5736 | 47.719 | 11.93 | 103.5 | 264.0 |
-| 34500 | 0.5808 | 114.0 | 544.0 | 0.8406 | 12.6869 | 47.293 | 11.823 | 95.0 | 300.0 |
-| 36000 | 0.6061 | 107.5 | 410.0 | 0.8157 | 12.6245 | 47.527 | 11.882 | 95.5 | 199.0 |
-| 37500 | 0.6313 | 102.5 | 478.0 | 0.8011 | 12.7459 | 47.074 | 11.769 | 92.0 | 312.0 |
-| 39000 | 0.6566 | 106.0 | 454.0 | 0.7952 | 12.6026 | 47.609 | 11.902 | 93.0 | 262.0 |
-| 40500 | 0.6818 | 102.5 | 448.0 | 0.7747 | 12.5828 | 47.684 | 11.921 | 85.5 | 249.0 |
-| 42000 | 0.7071 | 88.5 | 366.0 | 0.6942 | 12.5833 | 47.682 | 11.921 | 76.0 | 207.0 |
-| 43500 | 0.7323 | 79.5 | 326.0 | 0.6337 | 12.6842 | 47.303 | 11.826 | 66.5 | 160.0 |
-| 45000 | 0.7576 | 78.0 | 270.0 | 0.6070 | 12.684 | 47.304 | 11.826 | 65.5 | 159.0 |
-| 46500 | 0.7828 | 76.5 | 260.0 | 0.5937 | 12.5713 | 47.728 | 11.932 | 62.0 | 127.5 |
-| 48000 | 0.8081 | 76.0 | 272.0 | 0.5848 | 12.6783 | 47.325 | 11.831 | 63.5 | 127.0 |
-| 49500 | 0.8333 | 74.0 | 253.0 | 0.5771 | 12.5825 | 47.685 | 11.921 | 62.25 | 132.0 |
-| 51000 | 0.8586 | 72.5 | 252.0 | 0.5649 | 12.6897 | 47.283 | 11.821 | 59.5 | 101.0 |
-| 52500 | 0.8838 | 71.5 | 241.0 | 0.5513 | 12.6795 | 47.32 | 11.83 | 58.25 | 105.0 |
-| 54000 | 0.9091 | 71.0 | 238.0 | 0.5457 | 12.657 | 47.404 | 11.851 | 57.5 | 102.5 |
-| 55500 | 0.9343 | 70.5 | 237.0 | 0.5425 | 12.6411 | 47.464 | 11.866 | 56.75 | 95.5 |
-| 57000 | 0.9596 | 71.0 | 236.0 | 0.5401 | 12.772 | 46.978 | 11.744 | 56.75 | 92.0 |
-| 58500 | 0.9848 | 70.0 | 234.0 | 0.5385 | 12.5709 | 47.729 | 11.932 | 56.75 | 93.0 |
-| 59400 | 1.0 | 70.0 | 234.0 | 0.5385 | 12.5868 | 47.669 | 11.917 | 56.75 | 93.0 |
+| 0 | 0 | 837518622720.0 | 78065325572096.0 | 19.8108 | 12.6525 | 47.421 | 11.855 | 2667577344.0 | 36009005809664.0 |
+| 1500 | 0.1010 | 1472.0 | 8832.0 | 2.5979 | 12.6262 | 47.52 | 11.88 | 1056.0 | 19200.0 |
+| 3000 | 0.2020 | 500.0 | 3040.0 | 1.8976 | 12.7775 | 46.958 | 11.739 | 354.0 | 552.0 |
+| 4500 | 0.3030 | 312.0 | 1320.0 | 1.5456 | 12.7017 | 47.238 | 11.809 | 249.0 | 260.0 |
+| 6000 | 0.4040 | 234.0 | 940.0 | 1.3441 | 12.5854 | 47.674 | 11.919 | 204.0 | 158.0 |
+| 7500 | 0.5051 | 190.0 | 656.0 | 1.1277 | 12.5936 | 47.643 | 11.911 | 164.0 | 152.0 |
+| 9000 | 0.6061 | 249.0 | 600.0 | 0.9819 | 12.7319 | 47.126 | 11.781 | 220.0 | 186.0 |
+| 10500 | 0.7071 | 141.0 | 436.0 | 0.8717 | 12.5874 | 47.667 | 11.917 | 121.0 | 128.0 |
+| 12000 | 0.8081 | 193.0 | 482.0 | 0.8292 | 12.6439 | 47.454 | 11.863 | 163.0 | 135.0 |
+| 13500 | 0.9091 | 202.0 | 504.0 | 0.8078 | 12.5913 | 47.652 | 11.913 | 176.0 | 136.0 |
+| 14850 | 1.0 | 196.0 | 490.0 | 0.8045 | 12.677 | 47.33 | 11.832 | 170.0 | 135.0 |
 
 ### Framework versions
 - Distily 0.2.0
 
logs/learning_rate=0.0001, per_device_train_batch_size=1, warmup_ratio=0.5/completed.flag ADDED
File without changes
logs/learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5/events.out.tfevents.1724101218.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c83bef75426d8f36fee9612191e1524a303c4ec35e4c3226eb7190dc941e2abf
+size 7019562
logs/learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5/events.out.tfevents.1724104558.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:78a1ed94a6137eddf091018a4a99e4468088d8d44717f46db1c06c8d23804ce9
+size 578
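The directory names under logs/ encode each run's key settings (here learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5). As a rough sketch, these map onto Hugging Face TrainingArguments as follows; the mapping and output_dir are assumptions, since Distily's trainer wrapper isn't shown here:

```python
# Sketch: the settings encoded in the log directory name, expressed as
# transformers.TrainingArguments. The output_dir is a placeholder and the
# mapping is an assumption; Distily may construct its trainer differently.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./distily_run",        # placeholder path
    learning_rate=4e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_ratio=0.5,                  # LR warmup over the first half of training
    seed=42,
    num_train_epochs=1,                # the eval table above stops at epoch 1.0
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the defaults
)
```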
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:974fc1972a602545efa3d6652649817b5c0627110efb26a9da1aafedbf6d789f
+oid sha256:c3bfb2f23a0aff41ecb489c04a3d5e9bfaf2ffd767eefbc536e0e41070b52993
 size 248894656
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9c4d041aa095b908edefa6f0a3a491706fee792245eaef3ec5562e85525f3c8
+oid sha256:18c6ef1b1efa5ca7e43951548646fd9ea110a6ce03086c641425405532f6e868
 size 1017899144
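model.safetensors and training_args.bin are stored through Git LFS, so the diffs above only touch the pointer files: a version line, the object's sha256 oid, and its byte size. A standard-library sketch for checking a downloaded blob against the new pointer's oid (the local path is an example):

```python
# Verify a downloaded LFS object against the sha256 oid in its pointer file.
# Pure standard library; the path below is an example, not a fixed layout.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("model.safetensors") == (
    "c3bfb2f23a0aff41ecb489c04a3d5e9bfaf2ffd767eefbc536e0e41070b52993"
)
```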