Distilled Data-efficient Image Transformer for Face Mask Detection

A distilled Data-efficient Image Transformer (DeiT) model, pre-trained and fine-tuned on the self-curated custom Face-Mask18K dataset (18k images, 2 classes) at a resolution of 224x224. The DeiT architecture was first introduced in the paper Training data-efficient image transformers & distillation through attention by Touvron et al.

Model description

This model is a distilled Vision Transformer (ViT). In addition to the class token, it uses a distillation token to learn from a teacher model (a CNN) during both pre-training and fine-tuning. The distillation token is learned through backpropagation by interacting with the class ([CLS]) token and the patch tokens in the self-attention layers.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
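The patch sequence described above can be illustrated with a minimal NumPy sketch. It is not the model's actual implementation: the embedding width of 768 and the random weights are assumptions chosen for illustration (the card does not state the model size), the `patchify` helper is hypothetical, and position embeddings are omitted. It shows only how a 224x224 image becomes 196 patch tokens, to which the class and distillation tokens are prepended.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a preprocessed RGB image
patches = patchify(image)           # 14 x 14 = 196 patches, each 16 * 16 * 3 = 768 values

embed_dim = 768                     # assumption: not stated in the model card
W = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
b = np.zeros(embed_dim)
patch_tokens = patches @ W + b      # linear embedding of every patch

# Learned parameters in the real model; zeros here purely as placeholders.
cls_token = np.zeros((1, embed_dim))
dist_token = np.zeros((1, embed_dim))

sequence = np.concatenate([cls_token, dist_token, patch_tokens], axis=0)
print(patches.shape, sequence.shape)
```

Under these assumptions the transformer receives 198 tokens per image: one class token, one distillation token, and 196 patch tokens.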

Training Metrics

epoch                    =          2.0
total_flos               = 2078245655GF
train_loss               =       0.0438
train_runtime            =   1:37:16.87
train_samples_per_second =        9.887
train_steps_per_second   =        0.309

Evaluation Metrics

epoch                   =        2.0
eval_accuracy           =     0.9922
eval_loss               =     0.0271
eval_runtime            = 0:03:17.36
eval_samples_per_second =      18.22
eval_steps_per_second   =       2.28
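
A few quantities the card does not state can be estimated from the reported throughput figures. The sketch below is simple arithmetic on the numbers above, not values reported by the author. The samples-per-step ratios may reflect gradient accumulation or multiple devices rather than a per-device batch size, and the evaluation set size is only an approximation. The `to_seconds` helper is hypothetical.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

# Samples processed per step = samples per second / steps per second.
train_samples_per_step = 9.887 / 0.309
eval_samples_per_step = 18.22 / 2.28

# Approximate evaluation set size = runtime (s) * samples per second.
eval_samples = to_seconds("0:03:17.36") * 18.22

print(round(train_samples_per_step), round(eval_samples_per_step), round(eval_samples))
# prints: 32 8 3596
```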