metadata

base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_miles_projector_experiment
    results: []

Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.

Model Architecture:

Architecture: GPT2LMHeadModel
Total Parameters: 124,439,808
Data Type (dtype): torch.bfloat16
Model Size: 0.24 GB
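
As a rough sanity check on the reported size, the parameter count can be multiplied by the storage width of the dtype. This is a minimal sketch, assuming the figure counts only parameters at 2 bytes each (bfloat16); the card does not state how the size was rounded or whether buffers are included.

```python
# Rough size estimate: parameter count times bytes per parameter.
TOTAL_PARAMS = 124_439_808   # "Total Parameters" from this card
BYTES_PER_PARAM = 2          # torch.bfloat16 stores each value in 16 bits

size_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

print(f"{size_gb:.3f} GB (decimal), {size_gib:.3f} GiB (binary)")
```

Both estimates land close to the reported 0.24 GB, one slightly above and one slightly below.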

Evaluation Metrics Comparison

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		36.25	77.0					11.75	21.375
0	0	1486058684416.0	34084860461056.0	20.1302	40.0525	62.418	7.815	2281701376.0	15874199126016.0
2500	0.0404	756.0	3440.0	2.4552	40.0832	62.37	7.809	404.0	1560.0
5000	0.0808	352.0	1288.0	1.7734	42.1208	59.353	7.431	246.0	290.0
7500	0.1212	227.0	688.0	1.4859	44.2818	56.457	7.068	177.0	214.0
10000	0.1616	176.0	624.0	1.2995	40.5384	61.67	7.721	129.0	225.0
12500	0.2020	122.0	446.0	1.0558	43.2882	57.752	7.231	93.5	231.0
15000	0.2424	102.5	412.0	0.9530	40.2067	62.179	7.785	80.0	175.0
17500	0.2828	92.0	342.0	0.8613	42.4322	58.918	7.376	77.5	165.0
20000	0.3232	78.0	266.0	0.8054	42.4876	58.841	7.367	64.5	110.0
22500	0.3636	66.5	228.0	0.6962	40.1977	62.193	7.787	58.0	185.0
25000	0.4040	64.0	200.0	0.6565	42.3516	59.03	7.391	52.75	115.5
27500	0.4444	61.25	190.0	0.6213	42.9602	58.193	7.286	50.75	101.0
30000	0.4848	62.75	211.0	0.6318	44.9016	55.677	6.971	50.25	184.0
32500	0.5253	57.5	194.0	0.6184	43.9215	56.92	7.126	50.25	89.5
35000	0.5657	57.0	177.0	0.5768	42.6805	58.575	7.334	44.0	107.0
37500	0.6061	54.5	168.0	0.5596	44.1546	56.619	7.089	43.5	81.0
40000	0.6465	54.0	159.0	0.5345	42.0172	59.499	7.449	42.75	77.5
42500	0.6869	53.5	169.0	0.5260	41.7231	59.919	7.502	39.5	61.25
45000	0.7273	48.5	152.0	0.4414	40.3349	61.981	7.76	35.25	50.25
47500	0.7677	47.25	142.0	0.4216	41.3204	60.503	7.575	34.5	44.25
50000	0.8081	46.5	137.0	0.4085	43.1383	57.953	7.256	32.25	41.25
52500	0.8485	46.0	141.0	0.4018	42.0641	59.433	7.441	33.0	38.75
55000	0.8889	45.0	138.0	0.3859	40.373	61.923	7.753	31.875	35.75
57500	0.9293	44.75	133.0	0.3810	40.3972	61.885	7.748	31.625	36.0
60000	0.9697	44.75	132.0	0.3782	42.2203	59.213	7.413	31.625	35.5
61875	1.0	44.75	133.0	0.3778	44.5224	56.151	7.03	31.5	35.5
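
To read the final row against the teacher row, it can help to express each student perplexity as a multiple of the teacher's. The sketch below only copies numbers from the table above (teacher eval row and step 61875); the dictionary names are illustrative.

```python
# Final-step student perplexities versus the teacher's, copied from the table.
teacher = {"enwikippl": 36.25, "frwikippl": 77.0, "tinystoriesppl": 11.75, "zhwikippl": 21.375}
student = {"enwikippl": 44.75, "frwikippl": 133.0, "tinystoriesppl": 31.5, "zhwikippl": 35.5}

# A ratio of 1.0 would mean the student matches the teacher on that benchmark.
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

By this measure the student is closest to the teacher on English Wikipedia (about 1.23x) and furthest on TinyStories (about 2.68x).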

Resource Usage Comparison

VRAM Use: 7.7831 GB

Distillation (Teacher -> Student) Architecture Difference:

Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
Total Parameters: 124,439,808 -> 124,439,808
Data Type (dtype): torch.bfloat16 -> torch.bfloat16
Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,697,117 tokens from the wikimedia/wikipedia dataset.

Num Samples: 247,500
Subset: 20231101.en
Split: train
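
The sample count, step count, and token count on this card are mutually consistent, which the arithmetic below illustrates. It is a sketch under the assumption of a single device with no gradient accumulation; the constants are taken from the Hyperparameters section and the line above.

```python
# Values reported on this card.
DATASET_SAMPLE_SIZE = 250_000   # dataset_sample_size
DATASET_TEST_SIZE = 0.01        # dataset_test_size (fraction held out)
TRAIN_BATCH_SIZE = 4            # train_batch_size
TRAIN_TOKENS = 145_697_117      # tokens reported for the train split

# Samples left for training after the held-out fraction is removed.
num_train_samples = round(DATASET_SAMPLE_SIZE * (1 - DATASET_TEST_SIZE))

# Optimizer steps in one epoch (assumes one device, gradient_accumulation_steps=1).
steps_per_epoch = num_train_samples // TRAIN_BATCH_SIZE

# Average sequence length implied by the token count.
avg_tokens_per_sample = TRAIN_TOKENS / num_train_samples

print(num_train_samples, steps_per_epoch, round(avg_tokens_per_sample, 1))
```

The results match Num Samples (247,500) and the last step in the evaluation table (61875), and imply roughly 589 tokens per sample on average.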

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))
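
The objective string combines two weighted terms: a KL-divergence loss on logits (weight 1) and an MSE loss on attention outputs (weight 5). The plain-Python sketch below shows how such a weighted sum could be formed for a single token and a single attention vector. It is not Distily's implementation: the KL direction (teacher relative to student), the mean reduction, the exact meaning of `raw_mse`, and which layers `layer_mapper=layer-2` selects are all assumptions here, and every function name is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's vocabulary distribution (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_values, student_values):
    """Mean squared error between two equally sized feature vectors."""
    return sum((t - s) ** 2 for t, s in zip(teacher_values, student_values)) / len(teacher_values)

def distillation_loss(teacher_logits, student_logits, teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=5.0):
    """Weighted sum of the logits term and the attention term (weights from the card)."""
    return (logits_weight * kl_divergence(teacher_logits, student_logits)
            + attn_weight * mse(teacher_attn, student_attn))

# Hypothetical toy inputs: a 3-word vocabulary and a 2-element attention vector.
print(distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [0.7, 0.3], [0.6, 0.4]))
```

Because the `loss` column in the evaluation table is this kind of combined distillation loss rather than a cross-entropy, it is not expected to equal the logarithm of any perplexity column.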

Hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))
train_embeddings: True
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: None
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: True
student_model_compile: False
dropout: None
teacher_model_name_or_path: gpt2
teacher_load_in_8bit: False
teacher_load_in_4bit: False
teacher_model_compile: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 250000
dataset_test_size: 0.01
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.5
warmup_steps: 0
gradient_checkpointing: True
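
With `lr_scheduler_type: linear`, `warmup_ratio: 0.5`, and `warmup_steps: 0`, the learning rate would ramp up for the first half of training and decay for the second half. The sketch below reimplements the usual linear warmup and decay rule in plain Python; it assumes the warmup length is the ceiling of `total_steps * warmup_ratio` and that the schedule spans the 61875 steps shown in the evaluation table. The function name is hypothetical, and this is not taken from Distily's source.

```python
import math

def linear_warmup_decay_lr(step, total_steps=61_875, warmup_ratio=0.5, peak_lr=1e-4):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from peak_lr down to 0 at the final step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

for step in (0, 15_469, 30_938, 46_406, 61_875):
    print(step, linear_warmup_decay_lr(step))
```

Under these assumptions the peak rate of 0.0001 is reached at step 30938, roughly the midpoint of the run.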

Framework Versions

Distily 0.2.0
Transformers 4.44.2
Pytorch 2.3.0
Datasets 2.21.0