---
base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_miles_projector_experiment
    results: []
---

Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset. A minimal loading sketch follows the architecture summary below.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
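
As a quick start, here is a minimal loading and generation sketch with 🤗 Transformers. The repository id is an assumption inferred from the model-index name above; replace it with the actual checkpoint location.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the model-index name above; adjust to the real checkpoint.
repo_id = "lapp0/distily_miles_projector_experiment"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The history of Wikipedia begins", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```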

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 36.25 | 77.0 | | | | | 11.75 | 21.375 |
| 0 | 0 | 1486058684416.0 | 34084860461056.0 | 20.1302 | 40.0525 | 62.418 | 7.815 | 2281701376.0 | 15874199126016.0 |
| 2500 | 0.0404 | 756.0 | 3440.0 | 2.4552 | 40.0832 | 62.37 | 7.809 | 404.0 | 1560.0 |
| 5000 | 0.0808 | 352.0 | 1288.0 | 1.7734 | 42.1208 | 59.353 | 7.431 | 246.0 | 290.0 |
| 7500 | 0.1212 | 227.0 | 688.0 | 1.4859 | 44.2818 | 56.457 | 7.068 | 177.0 | 214.0 |
| 10000 | 0.1616 | 176.0 | 624.0 | 1.2995 | 40.5384 | 61.67 | 7.721 | 129.0 | 225.0 |
| 12500 | 0.2020 | 122.0 | 446.0 | 1.0558 | 43.2882 | 57.752 | 7.231 | 93.5 | 231.0 |
| 15000 | 0.2424 | 102.5 | 412.0 | 0.9530 | 40.2067 | 62.179 | 7.785 | 80.0 | 175.0 |
| 17500 | 0.2828 | 92.0 | 342.0 | 0.8613 | 42.4322 | 58.918 | 7.376 | 77.5 | 165.0 |
| 20000 | 0.3232 | 78.0 | 266.0 | 0.8054 | 42.4876 | 58.841 | 7.367 | 64.5 | 110.0 |
| 22500 | 0.3636 | 66.5 | 228.0 | 0.6962 | 40.1977 | 62.193 | 7.787 | 58.0 | 185.0 |
| 25000 | 0.4040 | 64.0 | 200.0 | 0.6565 | 42.3516 | 59.03 | 7.391 | 52.75 | 115.5 |
| 27500 | 0.4444 | 61.25 | 190.0 | 0.6213 | 42.9602 | 58.193 | 7.286 | 50.75 | 101.0 |
| 30000 | 0.4848 | 62.75 | 211.0 | 0.6318 | 44.9016 | 55.677 | 6.971 | 50.25 | 184.0 |
| 32500 | 0.5253 | 57.5 | 194.0 | 0.6184 | 43.9215 | 56.92 | 7.126 | 50.25 | 89.5 |
| 35000 | 0.5657 | 57.0 | 177.0 | 0.5768 | 42.6805 | 58.575 | 7.334 | 44.0 | 107.0 |
| 37500 | 0.6061 | 54.5 | 168.0 | 0.5596 | 44.1546 | 56.619 | 7.089 | 43.5 | 81.0 |
| 40000 | 0.6465 | 54.0 | 159.0 | 0.5345 | 42.0172 | 59.499 | 7.449 | 42.75 | 77.5 |
| 42500 | 0.6869 | 53.5 | 169.0 | 0.5260 | 41.7231 | 59.919 | 7.502 | 39.5 | 61.25 |
| 45000 | 0.7273 | 48.5 | 152.0 | 0.4414 | 40.3349 | 61.981 | 7.76 | 35.25 | 50.25 |
| 47500 | 0.7677 | 47.25 | 142.0 | 0.4216 | 41.3204 | 60.503 | 7.575 | 34.5 | 44.25 |
| 50000 | 0.8081 | 46.5 | 137.0 | 0.4085 | 43.1383 | 57.953 | 7.256 | 32.25 | 41.25 |
| 52500 | 0.8485 | 46.0 | 141.0 | 0.4018 | 42.0641 | 59.433 | 7.441 | 33.0 | 38.75 |
| 55000 | 0.8889 | 45.0 | 138.0 | 0.3859 | 40.373 | 61.923 | 7.753 | 31.875 | 35.75 |
| 57500 | 0.9293 | 44.75 | 133.0 | 0.3810 | 40.3972 | 61.885 | 7.748 | 31.625 | 36.0 |
| 60000 | 0.9697 | 44.75 | 132.0 | 0.3782 | 42.2203 | 59.213 | 7.413 | 31.625 | 35.5 |
| 61875 | 1.0 | 44.75 | 133.0 | 0.3778 | 44.5224 | 56.151 | 7.03 | 31.5 | 35.5 |
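
The *ppl columns are perplexities on held-out text from each listed dataset. As a rough illustration of how such a figure is obtained (the exact evaluation windows and token counts used by Distily are not documented here), perplexity is the exponential of the mean next-token cross-entropy:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    # Perplexity = exp(mean next-token cross-entropy); the model shifts labels internally.
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Example usage with the model/tokenizer loaded above:
# print(perplexity(model, tokenizer, "Paris is the capital of France."))
```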

Resource Usage Comparison

  • VRAM Use: 7.7831 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB


Train Dataset

Trained on 145,697,117 tokens from the wikimedia/wikipedia dataset (a loading sketch follows the list below).

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
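
The same subset and split can be pulled with 🤗 Datasets. The sample size below mirrors the dataset_sample_size hyperparameter listed further down; the shuffling and seed are assumptions, not Distily's documented sampling procedure.

```python
from datasets import load_dataset

# English Wikipedia dump used for distillation (subset 20231101.en, train split).
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# The card lists dataset_sample_size=250000; the shuffle/seed here are assumptions.
sample = wiki.shuffle(seed=42).select(range(250_000))
print(sample[0]["text"][:200])
```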

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))
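
In plain PyTorch terms, the objective combines a KL divergence between student and teacher logits (weight 1) with a raw MSE between attention maps (weight 5), where the layer-2 mapper decides which teacher layers each student layer is matched against. Below is a minimal sketch under those assumptions, not Distily's actual implementation: both forward passes must be run with output_attentions=True, and the layer pairing is simplified to one-to-one.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    # Logits component (weight 1): KL divergence between the student's and the
    # teacher's next-token distributions.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(s_logp, t_prob, reduction="batchmean")

    # Attention component (weight 5): raw MSE between attention maps.
    # The `layer-2` mapper's exact pairing is not reproduced; layers are
    # simply paired one-to-one here.
    attn_losses = [
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]
    attn_loss = torch.stack(attn_losses).mean()

    return 1.0 * logits_loss + 5.0 * attn_loss
```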

Hyperparameters

The following hyperparameters were used during training (an optimizer/scheduler sketch follows the list):

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2))
  • train_embeddings: True
  • lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7fd776b0cd90>
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
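
For reference, here is a sketch of the optimizer and learning-rate schedule implied by these settings. The total step count of 61,875 is taken from the last row of the metrics table, and the stand-in model is only a placeholder for the distilled student.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder module standing in for the distilled student model.
model = torch.nn.Linear(768, 768)

total_steps = 61_875                   # one epoch, per the metrics table
warmup_steps = int(0.5 * total_steps)  # lr_scheduler_warmup_ratio: 0.5

optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```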

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.2
  • Pytorch 2.3.0
  • Datasets 2.21.0