library_name: keras
Model description
This model is an implementation of the distillation recipe proposed in DeiT.
See the Keras example on Distilling Vision Transformers.
Full credits to: Sayak Paul
In the original Vision Transformers (ViT) paper (Dosovitskiy et al.), the authors concluded that to perform on par with Convolutional Neural Networks (CNNs), ViTs need to be pre-trained on larger datasets. The larger the better. This is mainly due to the lack of inductive biases in the ViT architecture -- unlike CNNs, they don't have layers that exploit locality.
Many groups have proposed ways to reduce the data requirements of ViT training. One such approach is presented in the Data-efficient image Transformers (DeiT) paper (Touvron et al.). The authors introduced a distillation technique that is specific to transformer-based vision models. DeiT is among the first works to show that it's possible to train ViTs well without using larger datasets.
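As a rough illustration of the idea, the hard-label variant of DeiT distillation can be sketched as follows. The student produces two sets of logits, one from its class token and one from its distillation token. The first is trained against the ground-truth label and the second against the teacher's predicted class, with equal weight. This is a framework-free sketch for a single example; the function names and the sample logits are illustrative assumptions, not code taken from the Keras example.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hard-label distillation loss for one example (illustrative helper)."""
    # The teacher's most likely class acts as a second, "hard" target.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    # Class-token head learns from the true label.
    student_loss = softmax_cross_entropy(cls_logits, label)
    # Distillation-token head learns from the teacher's prediction.
    distill_loss = softmax_cross_entropy(dist_logits, teacher_label)
    return 0.5 * student_loss + 0.5 * distill_loss


# Made-up logits for a 3-class problem, for illustration only.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[0.2, 3.0, 0.1],
    label=0,
)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same computation would be applied per batch with the framework's own loss functions, and the teacher would be kept frozen.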
Intended uses & limitations
The model was trained for demonstration purposes and is not guaranteed to give the best results in production.
For better results, follow the Keras example and tune it to your needs.
Training and evaluation data
The model is trained and evaluated on the TF Flowers dataset.
Training procedure
The training procedure follows the Keras example, with one change:
the batch size is reduced from the original 256 to 16 so that the model fits in the memory of a single V100 GPU.
Training hyperparameters
The following hyperparameters were used during training:
name | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | weight_decay | exclude_from_weight_decay | training_precision |
---|---|---|---|---|---|---|---|---|---|
AdamW | 6.25000029685907e-05 | 0.0 | 0.8999999761581421 | 0.9990000128746033 | 1e-07 | False | 9.999999747378752e-05 | None | float32 |
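
The long decimals in the table look like float32 round-off rather than deliberately chosen values. The sketch below checks this under the assumption that the intended settings were 6.25e-5, 0.9, 0.999 and 1e-4. Those "intended" values are inferred from the table and are not stated in the source. The helper `as_float32` is a hypothetical name used only here.

```python
import struct


def as_float32(x):
    """Round a Python float to the nearest float32, then widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Assumed round values, inferred from the table (not stated in the source).
intended = {
    "learning_rate": 6.25e-5,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "weight_decay": 1e-4,
}

# Values copied verbatim from the table above.
reported = {
    "learning_rate": 6.25000029685907e-05,
    "beta_1": 0.8999999761581421,
    "beta_2": 0.9990000128746033,
    "weight_decay": 9.999999747378752e-05,
}

for name, value in intended.items():
    print(f"{name}: {value} -> float32 {as_float32(value)!r} (table: {reported[name]!r})")
```

If the assumption holds, each float32 rendering matches the corresponding table entry, so the simpler round values can be used when reproducing the optimizer configuration.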