Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
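
A minimal loading sketch using the standard transformers API, assuming the checkpoint is hosted under this card's repo id (distily/distily_miles_projector_experiment):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "distily/distily_miles_projector_experiment"  # this card's repo id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Sanity-check the figures above: ~124M parameters stored in bfloat16.
print(sum(p.numel() for p in model.parameters()))  # expected: 124439808
print(model.dtype)                                 # expected: torch.bfloat16
```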

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples/s | steps/s | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 36.25 | 77.0 | | | | | 11.75 | 21.375 |
| 0 | 0 | 10788957847552.0 | 93458488360960.0 | 23.9652 | 41.1128 | 60.808 | 7.613 | 3539992576.0 | 57174604644352.0 |
| 2500 | 0.0404 | 888.0 | 5536.0 | 3.2958 | 40.0823 | 62.372 | 7.809 | 492.0 | 4576.0 |
| 5000 | 0.0808 | 380.0 | 1448.0 | 2.4808 | 41.6839 | 59.975 | 7.509 | 255.0 | 400.0 |
| 7500 | 0.1212 | 250.0 | 748.0 | 2.1083 | 44.1725 | 56.596 | 7.086 | 197.0 | 233.0 |
| 10000 | 0.1616 | 189.0 | 616.0 | 1.8890 | 43.9453 | 56.889 | 7.122 | 156.0 | 216.0 |
| 12500 | 0.2020 | 140.0 | 488.0 | 1.6027 | 42.1657 | 59.29 | 7.423 | 119.0 | 178.0 |
| 15000 | 0.2424 | 113.5 | 434.0 | 1.4410 | 42.3062 | 59.093 | 7.398 | 94.0 | 183.0 |
| 17500 | 0.2828 | 92.5 | 340.0 | 1.3090 | 42.413 | 58.944 | 7.38 | 76.5 | 165.0 |
| 20000 | 0.3232 | 79.5 | 308.0 | 1.1661 | 40.1951 | 62.197 | 7.787 | 73.0 | 151.0 |
| 22500 | 0.3636 | 68.0 | 229.0 | 0.9997 | 41.1581 | 60.741 | 7.605 | 56.75 | 122.5 |
| 25000 | 0.4040 | 63.25 | 201.0 | 0.9359 | 40.9228 | 61.091 | 7.649 | 50.75 | 99.5 |
| 27500 | 0.4444 | 59.25 | 218.0 | 0.8936 | 40.1195 | 62.314 | 7.802 | 46.25 | 116.5 |
| 30000 | 0.4848 | 59.25 | 204.0 | 0.8841 | 42.297 | 59.106 | 7.4 | 49.75 | 87.0 |
| 32500 | 0.5253 | 57.5 | 184.0 | 0.8730 | 40.8597 | 61.185 | 7.66 | 44.25 | 101.5 |
| 35000 | 0.5657 | 56.0 | 177.0 | 0.8049 | 44.9443 | 55.624 | 6.964 | 39.75 | 62.25 |
| 37500 | 0.6061 | 55.0 | 163.0 | 0.7798 | 44.8966 | 55.684 | 6.972 | 43.5 | 93.5 |
| 40000 | 0.6465 | 52.0 | 166.0 | 0.7611 | 40.5252 | 61.69 | 7.724 | 37.25 | 73.5 |
| 42500 | 0.6869 | 51.5 | 159.0 | 0.7336 | 41.7519 | 59.878 | 7.497 | 38.5 | 70.0 |
| 45000 | 0.7273 | 46.25 | 143.0 | 0.6241 | 40.2456 | 62.119 | 7.777 | 32.25 | 54.5 |
| 47500 | 0.7677 | 45.75 | 136.0 | 0.5998 | 42.1189 | 59.356 | 7.431 | 31.5 | 43.75 |
| 50000 | 0.8081 | 45.25 | 135.0 | 0.5841 | 40.1272 | 62.302 | 7.8 | 31.0 | 43.75 |
| 52500 | 0.8485 | 44.25 | 128.0 | 0.5705 | 41.9206 | 59.637 | 7.466 | 31.25 | 43.25 |
| 55000 | 0.8889 | 43.5 | 125.5 | 0.5532 | 40.1106 | 62.328 | 7.803 | 29.875 | 38.25 |
| 57500 | 0.9293 | 43.5 | 125.5 | 0.5470 | 40.2997 | 62.035 | 7.767 | 29.875 | 38.0 |
| 60000 | 0.9697 | 43.5 | 126.0 | 0.5432 | 39.9729 | 62.542 | 7.83 | 29.625 | 37.5 |
| 61875 | 1.0 | 43.5 | 126.0 | 0.5426 | 41.9287 | 59.625 | 7.465 | 29.625 | 37.5 |
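
The *ppl columns appear to be perplexities measured on held-out text from the named datasets (English, French, and Chinese Wikipedia, plus TinyStories), and loss is the distillation objective on the eval split. Distily computes these internally; the sketch below only illustrates how a single-sequence perplexity of this kind is obtained, and the helper name is hypothetical:

```python
import torch

def sequence_perplexity(model, tokenizer, text):
    # Hypothetical helper: perplexity is exp(mean next-token cross-entropy).
    # Distily's actual eval loop (sample counts, sequence lengths, batching) may differ.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean cross-entropy
    return torch.exp(out.loss).item()
```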

Resource Usage Comparison

  • VRAM Use: 7.7831 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
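
The same data can be pulled with the datasets library; the sketch below uses streaming only to avoid downloading the full English Wikipedia dump, and the "text" column is the one listed under the hyperparameters further down:

```python
from datasets import load_dataset

# 20231101.en subset, train split, as listed above.
ds = load_dataset("wikimedia/wikipedia", "20231101.en",
                  split="train", streaming=True)
example = next(iter(ds))
print(example["text"][:200])  # "text" is the column used for distillation
```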

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
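
In other words, the objective adds a KL-divergence term over the student and teacher logits (weight 1) to a raw MSE term over their attention maps (weight 25.0). A minimal PyTorch sketch of that combination follows; the one-to-one layer pairing is an assumption, since the actual pairing is decided by Distily's layer-2 mapper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attns, teacher_attns, attn_weight=25.0):
    # Logits component (weight 1, loss_fn=kl): KL divergence between the
    # student and teacher next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Attention component (weight 25.0, loss_fn=raw_mse): mean squared error
    # between attention maps. The one-to-one zip below is an assumption;
    # Distily's "layer-2" mapper determines the real teacher/student pairing.
    attn_mse = sum(
        F.mse_loss(s, t) for s, t in zip(student_attns, teacher_attns)
    ) / len(student_attns)
    return kl + attn_weight * attn_mse
```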

Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
  • train_embeddings: True
  • lr_scheduler: torch.optim.lr_scheduler.LambdaLR
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
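
As a rough sketch of how these settings fit together: Adam at the listed betas and epsilon, plus a linear schedule with half the run spent warming up. The get_linear_schedule_with_warmup helper and the stand-in gpt2 model are assumptions about how Distily's LambdaLR is configured; the step count of 61,875 is taken from the evaluation table above.

```python
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the student

num_training_steps = 61_875                        # final step in the eval table
num_warmup_steps = int(0.5 * num_training_steps)   # warmup_ratio: 0.5

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0,
)
# Linear warmup to the peak learning rate, then linear decay to zero.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
```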

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.2
  • Pytorch 2.3.0
  • Datasets 2.21.0