lapp0 committed (verified)
Commit ad1087f · Parent(s): a18f4f5

End of training
README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 280.0
-- eval_frwikippl: 1392.0
-- eval_zhwikippl: 2576.0
-- eval_tinystoriesppl: 207.0
-- eval_loss: 1.4458
-- eval_runtime: 12.6656
-- eval_samples_per_second: 47.372
-- eval_steps_per_second: 11.843
+- eval_enwikippl: 249.0
+- eval_frwikippl: 600.0
+- eval_zhwikippl: 186.0
+- eval_tinystoriesppl: 220.0
+- eval_loss: 0.9819
+- eval_runtime: 12.7319
+- eval_samples_per_second: 47.126
+- eval_steps_per_second: 11.781
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
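The eval_*ppl values are perplexities on held-out text (English, French, and Chinese Wikipedia plus TinyStories, going by the metric names): the exponential of the mean next-token cross-entropy, so lower is better. A minimal sketch of that metric, assuming a Hugging Face causal LM; the wiring is illustrative, not Distily's evaluation harness:

```python
# Sketch: perplexity = exp(mean next-token cross-entropy) for a causal LM.
# Illustrative only; Distily's eval harness may batch and mask differently.
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    # HF causal LMs return the mean shifted cross-entropy when labels are passed
    loss = model(input_ids=input_ids, labels=input_ids).loss
    return torch.exp(loss).item()
```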
@@ -48,8 +48,8 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.0001
-- train_batch_size: 1
+- learning_rate: 4e-05
+- train_batch_size: 4
 - eval_batch_size: 4
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
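The distillation_objective above trains the student on a single logits term: KL divergence against the teacher's token distribution (weight 1), with the hidden-state and attention components disabled (weight 0). A minimal sketch of such a logits-only KL loss, assuming PyTorch; this illustrates the objective, not Distily's actual implementation:

```python
# Minimal sketch of the logits-only KL objective (logits weight=1, loss_fn=kl;
# hidden-state and attention components disabled at weight=0).
# Illustrative reimplementation, not Distily's code; names are assumed.
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean"
    # averages the per-token KL(teacher || student) over all tokens.
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```

With weight 1 on this term and no hidden-state or attention matching, the student's total training loss is just this KL term on each batch.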
@@ -64,47 +64,17 @@ Peak GPU Memory: 4.1856 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
-| 0 | 0 | 1821066133504.0 | 158329674399744.0 | 19.3254 | 12.6923 | 47.273 | 11.818 | 12079595520.0 | 98956046499840.0 |
-| 1500 | 0.0253 | 3616.0 | 28416.0 | 3.2424 | 12.6627 | 47.383 | 11.846 | 2560.0 | 606208.0 |
-| 3000 | 0.0505 | 836.0 | 5888.0 | 2.2048 | 12.6773 | 47.329 | 11.832 | 652.0 | 21376.0 |
-| 4500 | 0.0758 | 482.0 | 3152.0 | 1.8732 | 12.5614 | 47.766 | 11.941 | 352.0 | 580.0 |
-| 6000 | 0.1010 | 352.0 | 1520.0 | 1.5840 | 12.7029 | 47.233 | 11.808 | 272.0 | 330.0 |
-| 7500 | 0.1263 | 308.0 | 1272.0 | 1.5038 | 12.8562 | 46.67 | 11.668 | 241.0 | 294.0 |
-| 9000 | 0.1515 | 280.0 | 1392.0 | 1.4458 | 12.6656 | 47.372 | 11.843 | 207.0 | 2576.0 |
-| 10500 | 0.1768 | 238.0 | 936.0 | 1.3311 | 12.6721 | 47.348 | 11.837 | 185.0 | 286.0 |
-| 12000 | 0.2020 | 215.0 | 836.0 | 1.2527 | 12.6564 | 47.407 | 11.852 | 163.0 | 280.0 |
-| 13500 | 0.2273 | 197.0 | 772.0 | 1.2033 | 12.5546 | 47.791 | 11.948 | 150.0 | 266.0 |
-| 15000 | 0.2525 | 187.0 | 792.0 | 1.1701 | 12.5678 | 47.741 | 11.935 | 154.0 | 250.0 |
-| 16500 | 0.2778 | 171.0 | 704.0 | 1.1274 | 12.6162 | 47.558 | 11.889 | 150.0 | 152.0 |
-| 18000 | 0.3030 | 167.0 | 748.0 | 1.1101 | 12.5787 | 47.7 | 11.925 | 141.0 | 314.0 |
-| 19500 | 0.3283 | 164.0 | 684.0 | 1.1063 | 12.6722 | 47.348 | 11.837 | 142.0 | 254.0 |
-| 21000 | 0.3535 | 162.0 | 676.0 | 1.0552 | 12.6941 | 47.266 | 11.816 | 135.0 | 242.0 |
-| 22500 | 0.3788 | 146.0 | 628.0 | 1.0061 | 12.5812 | 47.69 | 11.923 | 120.0 | 226.0 |
-| 24000 | 0.4040 | 141.0 | 636.0 | 0.9924 | 12.5431 | 47.835 | 11.959 | 115.5 | 218.0 |
-| 25500 | 0.4293 | 138.0 | 776.0 | 0.9653 | 12.619 | 47.547 | 11.887 | 110.5 | 362.0 |
-| 27000 | 0.4545 | 129.0 | 572.0 | 0.9293 | 12.5466 | 47.822 | 11.955 | 113.0 | 245.0 |
-| 28500 | 0.4798 | 136.0 | 592.0 | 0.9387 | 12.658 | 47.401 | 11.85 | 117.5 | 204.0 |
-| 30000 | 0.5051 | 132.0 | 704.0 | 0.9436 | 12.5944 | 47.64 | 11.91 | 112.0 | 255.0 |
-| 31500 | 0.5303 | 120.0 | 596.0 | 0.9097 | 12.5827 | 47.684 | 11.921 | 104.5 | 262.0 |
-| 33000 | 0.5556 | 119.5 | 548.0 | 0.8672 | 12.5736 | 47.719 | 11.93 | 103.5 | 264.0 |
-| 34500 | 0.5808 | 114.0 | 544.0 | 0.8406 | 12.6869 | 47.293 | 11.823 | 95.0 | 300.0 |
-| 36000 | 0.6061 | 107.5 | 410.0 | 0.8157 | 12.6245 | 47.527 | 11.882 | 95.5 | 199.0 |
-| 37500 | 0.6313 | 102.5 | 478.0 | 0.8011 | 12.7459 | 47.074 | 11.769 | 92.0 | 312.0 |
-| 39000 | 0.6566 | 106.0 | 454.0 | 0.7952 | 12.6026 | 47.609 | 11.902 | 93.0 | 262.0 |
-| 40500 | 0.6818 | 102.5 | 448.0 | 0.7747 | 12.5828 | 47.684 | 11.921 | 85.5 | 249.0 |
-| 42000 | 0.7071 | 88.5 | 366.0 | 0.6942 | 12.5833 | 47.682 | 11.921 | 76.0 | 207.0 |
-| 43500 | 0.7323 | 79.5 | 326.0 | 0.6337 | 12.6842 | 47.303 | 11.826 | 66.5 | 160.0 |
-| 45000 | 0.7576 | 78.0 | 270.0 | 0.6070 | 12.684 | 47.304 | 11.826 | 65.5 | 159.0 |
-| 46500 | 0.7828 | 76.5 | 260.0 | 0.5937 | 12.5713 | 47.728 | 11.932 | 62.0 | 127.5 |
-| 48000 | 0.8081 | 76.0 | 272.0 | 0.5848 | 12.6783 | 47.325 | 11.831 | 63.5 | 127.0 |
-| 49500 | 0.8333 | 74.0 | 253.0 | 0.5771 | 12.5825 | 47.685 | 11.921 | 62.25 | 132.0 |
-| 51000 | 0.8586 | 72.5 | 252.0 | 0.5649 | 12.6897 | 47.283 | 11.821 | 59.5 | 101.0 |
-| 52500 | 0.8838 | 71.5 | 241.0 | 0.5513 | 12.6795 | 47.32 | 11.83 | 58.25 | 105.0 |
-| 54000 | 0.9091 | 71.0 | 238.0 | 0.5457 | 12.657 | 47.404 | 11.851 | 57.5 | 102.5 |
-| 55500 | 0.9343 | 70.5 | 237.0 | 0.5425 | 12.6411 | 47.464 | 11.866 | 56.75 | 95.5 |
-| 57000 | 0.9596 | 71.0 | 236.0 | 0.5401 | 12.772 | 46.978 | 11.744 | 56.75 | 92.0 |
-| 58500 | 0.9848 | 70.0 | 234.0 | 0.5385 | 12.5709 | 47.729 | 11.932 | 56.75 | 93.0 |
-| 59400 | 1.0 | 70.0 | 234.0 | 0.5385 | 12.5868 | 47.669 | 11.917 | 56.75 | 93.0 |
+| 0 | 0 | 837518622720.0 | 78065325572096.0 | 19.8108 | 12.6525 | 47.421 | 11.855 | 2667577344.0 | 36009005809664.0 |
+| 1500 | 0.1010 | 1472.0 | 8832.0 | 2.5979 | 12.6262 | 47.52 | 11.88 | 1056.0 | 19200.0 |
+| 3000 | 0.2020 | 500.0 | 3040.0 | 1.8976 | 12.7775 | 46.958 | 11.739 | 354.0 | 552.0 |
+| 4500 | 0.3030 | 312.0 | 1320.0 | 1.5456 | 12.7017 | 47.238 | 11.809 | 249.0 | 260.0 |
+| 6000 | 0.4040 | 234.0 | 940.0 | 1.3441 | 12.5854 | 47.674 | 11.919 | 204.0 | 158.0 |
+| 7500 | 0.5051 | 190.0 | 656.0 | 1.1277 | 12.5936 | 47.643 | 11.911 | 164.0 | 152.0 |
+| 9000 | 0.6061 | 249.0 | 600.0 | 0.9819 | 12.7319 | 47.126 | 11.781 | 220.0 | 186.0 |
+| 10500 | 0.7071 | 141.0 | 436.0 | 0.8717 | 12.5874 | 47.667 | 11.917 | 121.0 | 128.0 |
+| 12000 | 0.8081 | 193.0 | 482.0 | 0.8292 | 12.6439 | 47.454 | 11.863 | 163.0 | 135.0 |
+| 13500 | 0.9091 | 202.0 | 504.0 | 0.8078 | 12.5913 | 47.652 | 11.913 | 176.0 | 136.0 |
+| 14850 | 1.0 | 196.0 | 490.0 | 0.8045 | 12.677 | 47.33 | 11.832 | 170.0 | 135.0 |
 
 ### Framework versions
 - Distily 0.2.0
 
logs/learning_rate=0.0001, per_device_train_batch_size=1, warmup_ratio=0.5/completed.flag ADDED
File without changes
logs/learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5/events.out.tfevents.1724101218.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c83bef75426d8f36fee9612191e1524a303c4ec35e4c3226eb7190dc941e2abf
+size 7019562
logs/learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5/events.out.tfevents.1724104558.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:78a1ed94a6137eddf091018a4a99e4468088d8d44717f46db1c06c8d23804ce9
+size 578
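The directory names under logs/ encode each run's key settings (here learning_rate=4e-05, per_device_train_batch_size=4, warmup_ratio=0.5). As a rough sketch, these map onto Hugging Face TrainingArguments as follows; the mapping and output_dir are assumptions, since Distily's trainer wrapper isn't shown here:

```python
# Sketch: the settings encoded in the log directory name, expressed as
# transformers.TrainingArguments. The output_dir is a placeholder and the
# mapping is an assumption; Distily may construct its trainer differently.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./distily_run",        # placeholder path
    learning_rate=4e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_ratio=0.5,                  # LR warmup over the first half of training
    seed=42,
    num_train_epochs=1,                # the eval table above stops at epoch 1.0
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the defaults
)
```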
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:974fc1972a602545efa3d6652649817b5c0627110efb26a9da1aafedbf6d789f
+oid sha256:c3bfb2f23a0aff41ecb489c04a3d5e9bfaf2ffd767eefbc536e0e41070b52993
 size 248894656
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9c4d041aa095b908edefa6f0a3a491706fee792245eaef3ec5562e85525f3c8
+oid sha256:18c6ef1b1efa5ca7e43951548646fd9ea110a6ce03086c641425405532f6e868
 size 1017899144
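model.safetensors and training_args.bin are stored through Git LFS, so the diffs above only touch the pointer files: a version line, the object's sha256 oid, and its byte size. A standard-library sketch for checking a downloaded blob against the new pointer's oid (the local path is an example):

```python
# Verify a downloaded LFS object against the sha256 oid in its pointer file.
# Pure standard library; the path below is an example, not a fixed layout.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("model.safetensors") == (
    "c3bfb2f23a0aff41ecb489c04a3d5e9bfaf2ffd767eefbc536e0e41070b52993"
)
```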