nielsr HF staff committed on
Commit
0178a03
1 Parent(s): 4e8e4a3

Add model card

Files changed (1)
  1. README.md +115 -0
README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ license: apache-2.0
+ tags:
+ - image-classification
+ - timm
+ datasets:
+ - imagenet
+ ---
+
+ # Distilled Data-efficient Image Transformer (base model)
+
+ Distilled data-efficient Image Transformer (DeiT) model pre-trained at resolution 224x224 and fine-tuned at resolution 384x384 on ImageNet-1k (about 1.28 million training images, 1,000 classes). It was introduced in the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Touvron et al. and first released in [this repository](https://github.com/facebookresearch/deit). The weights, however, were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman.
+
+ Disclaimer: The team releasing DeiT did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ This model is a distilled Vision Transformer (ViT). In addition to the class token, it uses a distillation token to learn from a teacher (a CNN) during both pre-training and fine-tuning. The distillation token is learned through backpropagation, by interacting with the class ([CLS]) and patch tokens through the self-attention layers.
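To make the role of the two tokens concrete, here is a minimal sketch of a distillation objective. It assumes the hard-label variant described in the DeiT paper, in which the class head is trained against the ground-truth label and the distillation head against the teacher's top prediction, with equal weights. The function names and the toy logits are hypothetical, and this is not the training code used for this checkpoint.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class."""
    return -log_softmax(logits)[target_index]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label):
    """Equal-weight average of two cross-entropy terms, one per token head.

    Assumption: the class head is supervised by the ground-truth label and
    the distillation head by the teacher's argmax prediction.
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, true_label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy 3-class example with made-up logits.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[0.2, 3.0, 0.1],
    true_label=0,
)
print(f"hard distillation loss: {loss:.4f}")
```

Because the two heads receive different targets, the distillation token can pick up the teacher's behaviour even on examples where the teacher disagrees with the label.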
+
+ Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
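As a small worked example of the patching step, the sketch below counts how many tokens the encoder sees for a square input. It assumes non-overlapping 16x16 patches, three colour channels, and two extra tokens (class and distillation); the helper names are hypothetical.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def sequence_length(image_size: int, patch_size: int = 16, extra_tokens: int = 2) -> int:
    """Patch tokens plus the class and distillation tokens."""
    return num_patches(image_size, patch_size) + extra_tokens


def patch_vector_size(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch_size * patch_size * channels


print(sequence_length(384))  # fine-tuning resolution
print(sequence_length(224))  # pre-training resolution
print(patch_vector_size())
```

Under these assumptions a 384x384 image yields 576 patches and 578 tokens, against 198 tokens at 224x224, which is one reason the higher resolution costs more compute.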
+
+ ## Intended uses & limitations
+
+ You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=facebook/deit) to look for fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ Since this model is a distilled ViT model, you can plug it into DeiTModel, DeiTForImageClassification or DeiTForImageClassificationWithTeacher. Note that the model expects the data to be prepared using DeiTFeatureExtractor. Here we use AutoFeatureExtractor, which automatically selects the appropriate feature extractor given the model name.
+
+ Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:
+
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoFeatureExtractor, DeiTForImageClassificationWithTeacher
+
+ # Download a sample image from the COCO 2017 validation set
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Load the feature extractor and the model from the hub
+ feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-base-distilled-patch16-384')
+ model = DeiTForImageClassificationWithTeacher.from_pretrained('facebook/deit-base-distilled-patch16-384')
+
+ # Preprocess the image and run a forward pass without tracking gradients
+ inputs = feature_extractor(images=image, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+ logits = outputs.logits
+
+ # The model predicts one of the 1,000 ImageNet classes
+ predicted_class_idx = logits.argmax(-1).item()
+ print("Predicted class:", model.config.id2label[predicted_class_idx])
+ ```
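The snippet above reports only the single best class. The sketch below shows, with plain Python lists, how the two head outputs could be combined and turned into a ranked list. It assumes that DeiTForImageClassificationWithTeacher exposes `logits` as the average of its class-head and distillation-head logits, which is worth checking against the Transformers documentation for your installed version. The label names and logit values are made up.

```python
import math


def average_heads(cls_logits, distillation_logits):
    """Element-wise mean of the two heads' logits (assumed inference rule)."""
    return [(c + d) / 2 for c, d in zip(cls_logits, distillation_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in ranked[:k]]


# Hypothetical 3-class label map and head outputs.
labels = {0: "tabby cat", 1: "remote control", 2: "sofa"}
combined = average_heads([2.0, 1.0, 0.0], [4.0, 1.0, -2.0])
for name, prob in top_k(softmax(combined), labels, k=2):
    print(f"{name}: {prob:.3f}")
```

With a real model, the same ranking can be obtained by applying a softmax to `outputs.logits` and sorting the result.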
+
+ Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax are coming soon.
+
+ ## Training data
+
+ This model was pretrained and fine-tuned with distillation on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of about 1.28 million training images and 1,000 classes.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The exact details of preprocessing of images during training/validation can be found [here](https://github.com/facebookresearch/deit/blob/ab5715372db8c6cad5740714b2216d55aeae052e/datasets.py#L78).
+
+ At inference time, images are resized to the same resolution (438x438), center-cropped to 384x384 and normalized across the RGB channels with the ImageNet mean and standard deviation.
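The arithmetic behind these inference-time numbers can be sketched as follows. Two things here are assumptions rather than statements from this card: the resize target is taken to be 256/224 times the crop size (which reproduces the 438 figure), and the mean and standard deviation are the commonly used ImageNet values. The helper names are hypothetical.

```python
# Commonly used ImageNet channel statistics (assumed, not stated in this card).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_target(crop_size: int) -> int:
    """Resize edge length, assuming a 256/224 ratio relative to the crop."""
    return crop_size * 256 // 224


def center_crop_box(width: int, height: int, crop_size: int):
    """(left, upper, right, lower) box of a centered square crop."""
    left = (width - crop_size) // 2
    upper = (height - crop_size) // 2
    return (left, upper, left + crop_size, upper + crop_size)


def normalize_pixel(rgb):
    """Scale 0-255 channel values to [0, 1], then standardize per channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


size = resize_target(384)
print(size)
print(center_crop_box(size, size, 384))
print(normalize_pixel((255, 255, 255)))
```

For a 384 crop this gives a 438-pixel resize and a crop box of (27, 27, 411, 411), so 27 pixels are discarded on each side. The box uses the same coordinate order as `PIL.Image.crop`.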
+
+ ### Pretraining
+
+ The model was trained on a single 8-GPU node for 3 days. Pre-training resolution is 224. For all hyperparameters (such as batch size and learning rate), we refer to Table 9 of the original paper.
+
+ ## Evaluation results
+
+ | Model | ImageNet top-1 accuracy | ImageNet top-5 accuracy | # params | URL |
+ |-------------------------------------------|-------------------------|-------------------------|----------|------------------------------------------------------------------|
+ | DeiT-tiny | 72.2 | 91.1 | 5M | https://huggingface.co/facebook/deit-tiny-patch16-224 |
+ | DeiT-small | 79.9 | 95.0 | 22M | https://huggingface.co/facebook/deit-small-patch16-224 |
+ | DeiT-base | 81.8 | 95.6 | 86M | https://huggingface.co/facebook/deit-base-patch16-224 |
+ | DeiT-tiny distilled | 74.5 | 91.9 | 6M | https://huggingface.co/facebook/deit-tiny-distilled-patch16-224 |
+ | DeiT-small distilled | 81.2 | 95.4 | 22M | https://huggingface.co/facebook/deit-small-distilled-patch16-224 |
+ | DeiT-base distilled | 83.4 | 96.5 | 87M | https://huggingface.co/facebook/deit-base-distilled-patch16-224 |
+ | DeiT-base 384 | 82.9 | 96.2 | 87M | https://huggingface.co/facebook/deit-base-patch16-384 |
+ | **DeiT-base distilled 384 (1000 epochs)** | **85.2** | **97.2** | **88M** | **https://huggingface.co/facebook/deit-base-distilled-patch16-384** |
+
+ Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Within each family in the table above, the larger models also reach higher accuracy.
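For readers unfamiliar with the two accuracy columns, the sketch below shows how top-k accuracy is usually computed: a prediction counts as correct when the true label is among the k highest-scoring classes. The function name and the toy scores are hypothetical and unrelated to the numbers in the table.

```python
def top_k_accuracy(batch_scores, labels, k):
    """Percentage of samples whose true label is among the k best scores."""
    hits = 0
    for scores, label in zip(batch_scores, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Four made-up samples over three classes.
batch_scores = [
    [0.1, 0.7, 0.2],  # true label 1: ranked first
    [0.5, 0.3, 0.2],  # true label 1: ranked second
    [0.2, 0.3, 0.5],  # true label 0: ranked last
    [0.6, 0.3, 0.1],  # true label 0: ranked first
]
labels = [1, 1, 0, 0]

print(top_k_accuracy(batch_scores, labels, k=1))
print(top_k_accuracy(batch_scores, labels, k=2))
```

On this toy batch the function returns 50.0 for k=1 and 75.0 for k=2. Top-5 accuracy is never lower than top-1, which matches every row of the table.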
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{touvron2021training,
+   title={Training data-efficient image transformers & distillation through attention},
+   author={Hugo Touvron and Matthieu Cord and Matthijs Douze and Francisco Massa and Alexandre Sablayrolles and Hervé Jégou},
+   year={2021},
+   eprint={2012.12877},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
+
+ ```bibtex
+ @misc{wu2020visual,
+   title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
+   author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
+   year={2020},
+   eprint={2006.03677},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{deng2009imagenet,
+   title={ImageNet: A large-scale hierarchical image database},
+   author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
+   booktitle={2009 IEEE conference on computer vision and pattern recognition},
+   pages={248--255},
+   year={2009},
+   organization={IEEE}
+ }
+ ```