distily_bench_gpt2_attn

This student model was distilled from the teacher model gpt2; the training dataset is not specified.

The Distily library was used for this distillation.
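
As a usage sketch (assuming the checkpoint is published on the Hugging Face Hub under lapp0/distily_bench_gpt2_attn and ships a GPT-2-compatible tokenizer), the student can be loaded like any causal language model:

```python
# Hedged usage sketch: the repo id and bundled tokenizer are assumptions
# based on this card, not verified behavior.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_gpt2_attn"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Distillation compresses a teacher model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```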

It achieves the following results on the evaluation set (a generic perplexity-computation sketch follows the list):

  • eval_enwikippl: 208.9635
  • eval_frwikippl: 1351.4938
  • eval_zhwikippl: 781.2166
  • eval_loss: 19.7940
  • eval_runtime: 17.3332 (seconds)
  • eval_samples_per_second: 57.693
  • eval_steps_per_second: 7.212
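
The enwikippl, frwikippl, and zhwikippl values are perplexities on English, French, and Chinese Wikipedia text. The exact corpus slices, sequence lengths, and batching used by Distily are not documented here; the sketch below shows a generic token-level perplexity computation of the kind these metrics represent (the sample text is illustrative only).

```python
# Generic perplexity sketch: exp(mean next-token cross-entropy).
# The evaluation corpus and windowing behind the numbers above are
# assumptions not documented in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_gpt2_attn"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

text = "Wikipedia is a free online encyclopedia."  # placeholder sample
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean next-token loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```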

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None)) — a rough sketch of this combined loss follows the list
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
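
Reading off the objective above: the total loss is a weighted sum of a KL term on the logits (weight 1) and a cross-entropy term on the attention maps (weight 2.0), with no layer mapping or projection, and a zeroed-out hidden-state term. The sketch below is a rough PyTorch rendering of that combination under stated assumptions (softmax temperature, reductions, and how Distily aggregates per-layer attention losses are not documented here); it is not Distily's implementation.

```python
# Rough sketch of the configured objective, NOT Distily's actual code:
#   loss = 1.0 * KL(student_logits || teacher_logits)
#        + 2.0 * CE(student attention maps vs. teacher attention maps)
# Reductions and per-layer aggregation are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attns, teacher_attns,
                      logits_weight=1.0, attn_weight=2.0):
    # Logits term: KL divergence between student and teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Attention term: cross-entropy of student attention rows against the
    # teacher's attention distributions, averaged over layers/heads/positions.
    ce_terms = []
    for s_attn, t_attn in zip(student_attns, teacher_attns):
        ce_terms.append(-(t_attn * torch.log(s_attn.clamp_min(1e-9))).sum(-1).mean())
    attn_ce = torch.stack(ce_terms).mean()
    return logits_weight * kl + attn_weight * attn_ce
```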

Resource Usage

Peak GPU Memory: 8.2195 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 24.5150 | 17.4179 | 57.412 | 7.177 | 56988.9141 |
| 1000 | 0.0808 | 702.9320 | 4403.8062 | 20.5050 | 17.3512 | 57.633 | 7.204 | 19095.4688 |
| 2000 | 0.1616 | 507.8192 | 3252.5339 | 20.3170 | 17.3451 | 57.653 | 7.207 | 2454.8054 |
| 3000 | 0.2424 | 418.1162 | 2743.4949 | 20.2070 | 17.3188 | 57.741 | 7.218 | 1193.5658 |
| 4000 | 0.3232 | 372.6640 | 2567.2002 | 20.1200 | 17.2361 | 58.018 | 7.252 | 1026.6641 |
| 5000 | 0.4040 | 320.0249 | 2154.7588 | 20.0340 | 17.3151 | 57.753 | 7.219 | 1183.4081 |
| 6000 | 0.4848 | 278.3867 | 1778.2332 | 19.9610 | 17.3435 | 57.658 | 7.207 | 869.0625 |
| 7000 | 0.5657 | 251.7534 | 1568.9419 | 19.9040 | 17.4023 | 57.464 | 7.183 | 807.5215 |
| 8000 | 0.6465 | 230.5502 | 1399.7903 | 19.8380 | 17.3855 | 57.519 | 7.19 | 816.4125 |
| 9000 | 0.7273 | 208.9635 | 1351.4938 | 19.7940 | 17.3332 | 57.693 | 7.212 | 781.2166 |
| 10000 | 0.8081 | 192.8560 | 1211.9225 | 19.7530 | 17.3032 | 57.793 | 7.224 | 608.5041 |
| 11000 | 0.8889 | 179.3916 | 1140.7820 | 19.6930 | 17.2721 | 57.897 | 7.237 | 624.0573 |
| 12000 | 0.9697 | 161.3999 | 997.4732 | 19.6480 | 17.21 | 58.106 | 7.263 | 560.2280 |
| 12375 | 1.0 | 158.3214 | 948.9705 | 19.6380 | 17.3071 | 57.78 | 7.222 | 575.3149 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0

Model size: 124M parameters (BF16, Safetensors)