distily_TinyStories-33M

This student model is distilled from the teacher model roneneldan/TinyStories-33M using an unspecified dataset.

The Distily library was used for this distillation.
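
The distilled student can be loaded with the standard transformers API. A minimal usage sketch follows; the repository id is taken from this card and may need adjusting for your copy, and the sampling settings are only illustrative.

```python
# Minimal usage sketch for the distilled student (standard transformers API).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "distily/distily_TinyStories-33M_freeze_emb"  # id as listed on this card; adjust if needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```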

It achieves the following results on the evaluation set (a hedged sketch of the perplexity computation follows this list):

  • eval_enwikippl: 86.0272
  • eval_frwikippl: 9172.2910
  • eval_zhwikippl: 31986.0898
  • eval_loss: 0.9611
  • eval_runtime: 27.2508
  • eval_samples_per_second: 91.741
  • eval_steps_per_second: 11.486
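
The enwikippl, frwikippl, and zhwikippl values are perplexities of the student on English, French, and Chinese Wikipedia text. The sketch below shows one way such a figure could be computed; it is not the Distily evaluation code, and the corpus handling (per-text truncation, token counting) is an assumption.

```python
# Hedged sketch: perplexity as exp(mean token negative log-likelihood).
# Not the Distily evaluation code; corpus handling here is an assumption.
import torch

def perplexity(model, tokenizer, texts, device="cpu", max_length=1024):
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].size(1) - 1  # predicted tokens (labels are shifted internally)
            total_nll += out.loss.item() * n
            total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```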

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=5000.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=500.0, loss_fn=jsd, layer_mapper=None, projector=None)) (an illustrative sketch of this combined objective follows this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
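
The distillation_objective above combines a KL term on the teacher/student logits (weight 1), an MSE term on the hidden states (weight 5000.0), and a Jensen-Shannon term on the attention maps (weight 500.0), with no layer mapper or projector. Below is an illustrative PyTorch reconstruction of such a combined loss; it is not the Distily implementation, and the helper names, tensor shapes, and per-layer averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def jsd(p, q, eps=1e-9):
    # Jensen-Shannon divergence between two probability distributions
    # (e.g. attention maps), taken along the last dimension.
    m = (0.5 * (p + q)).clamp_min(eps).log()
    return 0.5 * (F.kl_div(m, p, reduction="batchmean")
                  + F.kl_div(m, q, reduction="batchmean"))

def distillation_loss(student_out, teacher_out,
                      logits_w=1.0, hs_w=5000.0, attn_w=500.0):
    # Both forward passes are assumed to have been run with
    # output_hidden_states=True and output_attentions=True.
    logits_loss = F.kl_div(            # KL(teacher || student) over the vocabulary
        student_out.logits.log_softmax(-1),
        teacher_out.logits.softmax(-1),
        reduction="batchmean",
    )
    hs_loss = torch.stack([            # MSE on hidden states, averaged over layers
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    attn_loss = torch.stack([          # JSD on attention maps, averaged over layers
        jsd(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return logits_w * logits_loss + hs_w * hs_loss + attn_w * attn_loss
```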

Resource Usage

Peak GPU Memory: 8.2940 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 174.1653 | 48148.2734 | | | | | 4930.5806 |
| 0 | 0 | 42788.8555 | 63779.7148 | 13.4382 | 27.2438 | 91.764 | 11.489 | 57958.3359 |
| 1000 | 0.0323 | 176.2009 | 44333.6016 | 1.6774 | 27.3162 | 91.521 | 11.458 | 457143.6562 |
| 2000 | 0.0646 | 128.0956 | 24798.3691 | 1.5142 | 27.266 | 91.689 | 11.48 | 119591.0781 |
| 3000 | 0.0970 | 109.3945 | 15041.4014 | 1.3719 | 27.4573 | 91.051 | 11.4 | 68749.8828 |
| 4000 | 0.1293 | 103.1060 | 11736.2949 | 1.2548 | 27.2393 | 91.779 | 11.491 | 52875.8438 |
| 5000 | 0.1616 | 112.2423 | 11673.6494 | 1.1644 | 27.3226 | 91.499 | 11.456 | 45928.1172 |
| 6000 | 0.1939 | 98.1303 | 11178.0225 | 1.0962 | 27.294 | 91.595 | 11.468 | 43252.2148 |
| 7000 | 0.2263 | 93.0121 | 9680.7031 | 1.0394 | 27.3697 | 91.342 | 11.436 | 36992.1562 |
| 8000 | 0.2586 | 90.4050 | 9906.2393 | 1.0005 | 27.424 | 91.161 | 11.413 | 34836.8906 |
| 9000 | 0.2909 | 86.0272 | 9172.2910 | 0.9611 | 27.2508 | 91.741 | 11.486 | 31986.0898 |
| 10000 | 0.3232 | 86.3193 | 8911.2168 | 0.9344 | 27.4195 | 91.176 | 11.415 | 33114.9648 |
| 11000 | 0.3555 | 85.1883 | 9004.5898 | 0.9170 | 27.6131 | 90.537 | 11.335 | 28466.0332 |
| 12000 | 0.3879 | 82.4485 | 8789.0557 | 0.8952 | 27.5622 | 90.704 | 11.356 | 26171.4727 |
| 13000 | 0.4202 | 86.4648 | 11200.8799 | 0.8819 | 27.2915 | 91.603 | 11.469 | 28254.1816 |
| 14000 | 0.4525 | 83.4509 | 8846.1875 | 0.8756 | 27.288 | 91.615 | 11.47 | 24126.1836 |
| 15000 | 0.4848 | 83.4380 | 8696.6904 | 0.8562 | 27.2967 | 91.586 | 11.467 | 22347.7852 |
| 16000 | 0.5172 | 84.3804 | 9052.9209 | 0.8506 | 27.5838 | 90.633 | 11.347 | 26039.1504 |
| 17000 | 0.5495 | 92.4088 | 9267.0918 | 0.8451 | 27.2622 | 91.702 | 11.481 | 24745.4961 |
| 18000 | 0.5818 | 92.4374 | 9366.8291 | 0.8401 | 27.5177 | 90.851 | 11.375 | 23503.5566 |
| 19000 | 0.6141 | 87.0512 | 8318.6689 | 0.8306 | 27.185 | 91.963 | 11.514 | 23050.2109 |
| 20000 | 0.6465 | 93.4635 | 10036.1631 | 0.8266 | 27.3179 | 91.515 | 11.458 | 26122.6484 |
| 21000 | 0.6788 | 82.3464 | 9078.4600 | 0.8196 | 27.3629 | 91.365 | 11.439 | 28156.3516 |
| 22000 | 0.7111 | 81.6666 | 9332.5889 | 0.8155 | 27.6142 | 90.533 | 11.335 | 32020.2734 |
| 23000 | 0.7434 | 84.7325 | 9831.8672 | 0.8086 | 27.2205 | 91.843 | 11.499 | 33488.1289 |
| 24000 | 0.7757 | 81.2596 | 8868.6484 | 0.8074 | 27.307 | 91.552 | 11.462 | 30275.5918 |
| 25000 | 0.8081 | 81.1778 | 8258.5459 | 0.8051 | 27.3489 | 91.411 | 11.445 | 26269.4199 |
| 26000 | 0.8404 | 84.4753 | 9221.5127 | 0.8007 | 27.3172 | 91.517 | 11.458 | 31739.5938 |
| 27000 | 0.8727 | 81.3541 | 9123.3232 | 0.7995 | 27.2848 | 91.626 | 11.472 | 36992.1562 |
| 28000 | 0.9050 | 85.5785 | 9260.5635 | 0.7973 | 27.1686 | 92.018 | 11.521 | 34531.5234 |
| 29000 | 0.9374 | 92.4553 | 8333.3262 | 0.7944 | 27.2956 | 91.59 | 11.467 | 41878.25 |
| 30000 | 0.9697 | 92.4625 | 8644.1758 | 0.7925 | 27.2757 | 91.657 | 11.475 | 49319.1836 |
| 30938 | 1.0 | 91.8841 | 8440.8330 | 0.7884 | 27.314 | 91.528 | 11.459 | 49928.1523 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0