End of training
- README.md +17 -52
- logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724116219.5f530b1cf724 +3 -0
- logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724118620.5f530b1cf724 +3 -0
- logs/learning_rate=4e-05, per_device_train_batch_size=1, warmup_ratio=0.5/completed.flag +0 -0
- model.safetensors +1 -1
- training_args.bin +1 -1
README.md
CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_tinystoriesppl:
-- eval_loss:
-- eval_runtime: 12.
-- eval_samples_per_second: 47.
-- eval_steps_per_second: 11.
+- eval_enwikippl: 111.0
+- eval_frwikippl: 400.0
+- eval_zhwikippl: 122.5
+- eval_tinystoriesppl: 91.0
+- eval_loss: 0.8789
+- eval_runtime: 12.6655
+- eval_samples_per_second: 47.373
+- eval_steps_per_second: 11.843

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
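The eval_*ppl metrics above appear to be perplexities on the respective corpora (enwiki, frwiki, zhwiki, TinyStories). A minimal sketch of how such a number is computed, assuming the standard definition of perplexity as exp of the mean next-token cross-entropy; Distily's exact evaluation protocol may differ:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """exp of the mean next-token cross-entropy over a batch of token ids."""
    logits = model(input_ids).logits[:, :-1]   # prediction for each next token
    targets = input_ids[:, 1:]                 # labels shifted by one position
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    return math.exp(loss.item())
```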
@@ -48,8 +48,8 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate:
-- train_batch_size:
+- learning_rate: 0.0001
+- train_batch_size: 8
 - eval_batch_size: 4
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
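The distillation_objective above enables only the logits component (weight 1, loss_fn=kl); the hidden-state and attention components are weighted 0. A minimal sketch of what a logits-only KL objective computes, not Distily's actual implementation (the temperature parameter is illustrative):

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """Mean per-token KL(teacher || student) over (batch, seq, vocab) logits."""
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # log_target=True: both arguments are log-probabilities;
    # "batchmean" averages the summed KL over the flattened token dimension.
    return F.kl_div(s, t, reduction="batchmean", log_target=True) * temperature**2
```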
@@ -58,53 +58,18 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0

 ### Resource Usage
-Peak GPU Memory:
+Peak GPU Memory: 7.9381 GB

 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
-| 0 | 0 |
-| 1500 | 0.
-| 3000 | 0.
-| 4500 | 0.
-| 6000 | 0.
-|
-| 9000 | 0.1515 | 344.0 | 1568.0 | 1.5858 | 12.5697 | 47.734 | 11.933 | 238.0 | 390.0 |
-| 10500 | 0.1768 | 316.0 | 1448.0 | 1.5377 | 12.5933 | 47.644 | 11.911 | 262.0 | 464.0 |
-| 12000 | 0.2020 | 314.0 | 1320.0 | 1.5134 | 12.5641 | 47.755 | 11.939 | 244.0 | 548.0 |
-| 13500 | 0.2273 | 272.0 | 1248.0 | 1.4329 | 12.5786 | 47.7 | 11.925 | 211.0 | 243.0 |
-| 15000 | 0.2525 | 244.0 | 1168.0 | 1.3673 | 12.5803 | 47.694 | 11.923 | 202.0 | 180.0 |
-| 16500 | 0.2778 | 219.0 | 976.0 | 1.3144 | 12.5889 | 47.661 | 11.915 | 194.0 | 167.0 |
-| 18000 | 0.3030 | 218.0 | 1016.0 | 1.3081 | 12.6999 | 47.245 | 11.811 | 185.0 | 294.0 |
-| 19500 | 0.3283 | 208.0 | 796.0 | 1.2611 | 20.0133 | 29.98 | 7.495 | 179.0 | 191.0 |
-| 21000 | 0.3535 | 211.0 | 908.0 | 1.2402 | 12.674 | 47.341 | 11.835 | 171.0 | 182.0 |
-| 22500 | 0.3788 | 192.0 | 720.0 | 1.2178 | 12.581 | 47.691 | 11.923 | 158.0 | 195.0 |
-| 24000 | 0.4040 | 189.0 | 764.0 | 1.1770 | 12.6729 | 47.345 | 11.836 | 148.0 | 215.0 |
-| 25500 | 0.4293 | 171.0 | 740.0 | 1.1165 | 12.6124 | 47.572 | 11.893 | 139.0 | 237.0 |
-| 27000 | 0.4545 | 162.0 | 640.0 | 1.0755 | 12.5788 | 47.699 | 11.925 | 137.0 | 204.0 |
-| 28500 | 0.4798 | 154.0 | 604.0 | 1.0288 | 12.6014 | 47.614 | 11.903 | 125.5 | 143.0 |
-| 30000 | 0.5051 | 145.0 | 632.0 | 1.0105 | 12.6861 | 47.296 | 11.824 | 111.0 | 180.0 |
-| 31500 | 0.5303 | 135.0 | 624.0 | 0.9842 | 12.6101 | 47.581 | 11.895 | 110.0 | 161.0 |
-| 33000 | 0.5556 | 135.0 | 532.0 | 0.9620 | 12.6688 | 47.36 | 11.84 | 102.0 | 148.0 |
-| 34500 | 0.5808 | 127.5 | 564.0 | 0.9313 | 12.6938 | 47.267 | 11.817 | 108.5 | 200.0 |
-| 36000 | 0.6061 | 129.0 | 506.0 | 0.9064 | 12.5848 | 47.677 | 11.919 | 98.0 | 215.0 |
-| 37500 | 0.6313 | 115.5 | 464.0 | 0.8361 | 12.5738 | 47.718 | 11.93 | 91.0 | 164.0 |
-| 39000 | 0.6566 | 107.0 | 410.0 | 0.7831 | 12.5036 | 47.986 | 11.997 | 86.0 | 169.0 |
-| 40500 | 0.6818 | 102.5 | 402.0 | 0.7615 | 12.5289 | 47.889 | 11.972 | 82.0 | 127.0 |
-| 42000 | 0.7071 | 100.5 | 394.0 | 0.7506 | 12.5392 | 47.85 | 11.963 | 80.5 | 128.0 |
-| 43500 | 0.7323 | 100.5 | 386.0 | 0.7381 | 12.6177 | 47.552 | 11.888 | 78.5 | 129.0 |
-| 45000 | 0.7576 | 100.5 | 376.0 | 0.7307 | 12.6297 | 47.507 | 11.877 | 80.0 | 141.0 |
-| 46500 | 0.7828 | 100.0 | 364.0 | 0.7279 | 12.6428 | 47.458 | 11.864 | 78.5 | 122.0 |
-| 48000 | 0.8081 | 100.0 | 384.0 | 0.7256 | 12.5268 | 47.897 | 11.974 | 78.5 | 130.0 |
-| 49500 | 0.8333 | 96.5 | 366.0 | 0.7072 | 12.6353 | 47.486 | 11.871 | 76.0 | 129.0 |
-| 51000 | 0.8586 | 94.0 | 358.0 | 0.6995 | 12.5977 | 47.628 | 11.907 | 75.0 | 125.5 |
-| 52500 | 0.8838 | 94.5 | 354.0 | 0.6960 | 12.6273 | 47.516 | 11.879 | 75.5 | 119.0 |
-| 54000 | 0.9091 | 94.0 | 354.0 | 0.6935 | 12.6419 | 47.461 | 11.865 | 75.5 | 118.0 |
-| 55500 | 0.9343 | 94.5 | 354.0 | 0.6914 | 12.5538 | 47.794 | 11.949 | 75.5 | 117.5 |
-| 57000 | 0.9596 | 94.0 | 352.0 | 0.6900 | 12.6293 | 47.509 | 11.877 | 75.0 | 117.5 |
-| 58500 | 0.9848 | 94.0 | 352.0 | 0.6898 | 12.622 | 47.536 | 11.884 | 75.0 | 117.5 |
-| 59400 | 1.0 | 94.0 | 352.0 | 0.6898 | 12.6709 | 47.353 | 11.838 | 75.0 | 117.5 |
+| 0 | 0 | 828928688128.0 | 52226802319360.0 | 21.0583 | 12.4569 | 48.166 | 12.042 | 5167382528.0 | 20753281974272.0 |
+| 1500 | 0.2020 | 512.0 | 3472.0 | 1.8762 | 12.4942 | 48.022 | 12.006 | 344.0 | 868.0 |
+| 3000 | 0.4040 | 237.0 | 944.0 | 1.4192 | 12.543 | 47.835 | 11.959 | 207.0 | 223.0 |
+| 4500 | 0.6061 | 148.0 | 532.0 | 1.1068 | 12.5192 | 47.926 | 11.982 | 135.0 | 158.0 |
+| 6000 | 0.8081 | 118.0 | 430.0 | 0.9155 | 12.5398 | 47.848 | 11.962 | 98.0 | 122.0 |
+| 7425 | 1.0 | 111.0 | 400.0 | 0.8789 | 12.6655 | 47.373 | 11.843 | 91.0 | 122.5 |

 ### Framework versions
 - Distily 0.2.0
logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724116219.5f530b1cf724
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:84991b98cdd115b5865b3f3d72395e916728091ed9bcbb99d6552b6e35a2dd33
+size 3512274
logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724118620.5f530b1cf724
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a969bb9ff141beb753d2e02ab5b00470ce28bb2ec9b8ef575e5447e44de4ff4e
+size 578
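The two files above are TensorBoard event logs for the learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5 run. A minimal sketch for inspecting them locally with TensorBoard's EventAccumulator; the scalar tag names depend on what the Trainer actually logged:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at the run directory containing the events.out.tfevents.* files.
run_dir = "logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5"
acc = EventAccumulator(run_dir)
acc.Reload()  # parse the event files on disk

for tag in acc.Tags()["scalars"]:       # tag names are run-dependent
    points = [(e.step, e.value) for e in acc.Scalars(tag)]
    print(tag, points[:3])              # first few (step, value) pairs
```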
logs/learning_rate=4e-05, per_device_train_batch_size=1, warmup_ratio=0.5/completed.flag
ADDED
File without changes
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:44c225267def37ca71584d3beff29b20501933b79fbecef253613f1b35f4a73d
 size 248894656
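The entries above and below are Git LFS pointer files: the oid field records the SHA-256 of the actual file contents. A small sketch (assuming the repo is cloned with the LFS objects pulled) to check the downloaded model.safetensors against the pointer's oid:

```python
import hashlib

def lfs_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents, which is what an LFS pointer's oid records."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "44c225267def37ca71584d3beff29b20501933b79fbecef253613f1b35f4a73d"
assert lfs_sha256("model.safetensors") == expected
```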
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:77bf0e293d306f9ced0e580e530e228cdb6b58e30b5f6999d1d162bfa633f029
 size 1017899144
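training_args.bin is the Trainer's pickled arguments object. A sketch for inspecting it, assuming it unpickles to a TrainingArguments-style object with the standard attribute names (recent PyTorch versions need weights_only=False for pickled objects):

```python
import torch

# Saved by the HF Trainer via torch.save(self.args, "training_args.bin").
args = torch.load("training_args.bin", weights_only=False)
print(args.learning_rate)                # 0.0001 per the README diff above
print(args.per_device_train_batch_size)  # 8
```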