keras-io/vit-small-ds

Image Classification

ariG23498 (HF Staff) committed on Jan 14, 2022

Commit e632464 · 1 Parent(s): ece2a31

add: readme

Adding a readme to the model repository

Files changed (1)

README.md +16 -0

README.md ADDED

	@@ -0,0 +1,16 @@

+# Train a Vision Transformer on small datasets
+Author: [Aritra Roy Gosthipaty](https://twitter.com/ariG23498)
+[Keras Blog](https://keras.io/examples/vision/vit_small_ds/) | [Colab Notebook](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/vit_small_ds.ipynb)
+In the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929), the authors note that Vision Transformers (ViTs) are data-hungry: pretraining a ViT on a large dataset such as JFT-300M and then fine-tuning it on a medium-sized dataset such as ImageNet is, according to the authors, the only way to beat state-of-the-art Convolutional Neural Network (CNN) models.
+The self-attention layer of a ViT lacks a locality inductive bias (the notion that image pixels are locally correlated and that their correlation maps are translation-invariant), which is why ViTs need more data. CNNs, on the other hand, look at images through spatial sliding windows, which helps them get better results with smaller datasets.
+In the paper [Vision Transformer for Small-Size Datasets](https://arxiv.org/abs/2112.13492v1), the authors set out to tackle the lack of locality inductive bias in ViTs.
+The main ideas are:
+- Shifted Patch Tokenization
+- Locality Self Attention
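
The two ideas listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the linked Keras example: the function names (`shift_image`, `shifted_patch_tokens`, `locality_self_attention`) are hypothetical, the shifts use zero padding, the temperature is a fixed number rather than a learned parameter, and the layer normalization and linear projection that usually follow tokenization are left out. It assumes Shifted Patch Tokenization concatenates the image with four diagonally shifted copies (by half a patch) before patching, and that Locality Self Attention masks the diagonal of the attention logits and scales them by a temperature.

```python
import numpy as np


def shift_image(image, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, zero-filling the vacated border."""
    h, w, _ = image.shape
    shifted = np.zeros_like(image)
    # Overlapping windows in the source image and in the shifted output.
    src_y = slice(max(-dy, 0), h - max(dy, 0))
    dst_y = slice(max(dy, 0), h - max(-dy, 0))
    src_x = slice(max(-dx, 0), w - max(dx, 0))
    dst_x = slice(max(dx, 0), w - max(-dx, 0))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted


def shifted_patch_tokens(image, patch_size):
    """Concatenate the image with four diagonal shifts, then flatten into patches."""
    half = patch_size // 2
    offsets = [(-half, -half), (-half, half), (half, -half), (half, half)]
    stacked = np.concatenate(
        [image] + [shift_image(image, dy, dx) for dy, dx in offsets], axis=-1
    )
    h, w, c = stacked.shape
    gh, gw = h // patch_size, w // patch_size
    patches = stacked[: gh * patch_size, : gw * patch_size]
    patches = patches.reshape(gh, patch_size, gw, patch_size, c)
    # Group the pixels of each patch together: (num_patches, patch_dim).
    return patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size * patch_size * c)


def locality_self_attention(q, k, v, temperature):
    """Single-head attention with diagonal masking and a temperature-scaled softmax."""
    scores = q @ k.T / temperature
    # Diagonal masking: a token may not attend to itself.
    np.fill_diagonal(scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens = shifted_patch_tokens(image, patch_size=8)

# Hypothetical random projection standing in for a learned embedding layer.
x = tokens @ rng.standard_normal((tokens.shape[1], 64)) * 0.01
out, weights = locality_self_attention(x, x, x, temperature=np.sqrt(64))
```

With a 32x32 RGB image and 8x8 patches, each token covers 5 stacked images of 3 channels, so the sketch yields 16 tokens of length 8 * 8 * 15 = 960. The attention weights have a zero diagonal, so each token's output is built only from the other tokens.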