Vision Transformer Mirror

The Vision Transformer (ViT) [1] is a Transformer encoder pretrained with supervised learning on the large ImageNet-21k dataset at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (16x16 pixels), which are linearly embedded; a [CLS] token is prepended to the sequence for classification tasks, and absolute position embeddings are added before the sequence is fed into the Transformer encoder layers. Note that this checkpoint does not include any fine-tuned heads, as these were zeroed out by the Google researchers; it does, however, include a pretrained pooler that can be used for downstream tasks such as image classification.

Through pretraining, the model learns an internal representation of images that can be used to extract features for downstream tasks: given a labeled image dataset, a standard classifier can be trained by placing a linear layer on top of the pretrained encoder. The linear layer is typically placed on top of the [CLS] token, since the final hidden state of this token can be regarded as a representation of the entire image.
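As a rough illustration of that setup, the sketch below attaches a linear head to the [CLS] representation produced by the encoder. It assumes the mirrored weights match the standard google/vit-base-patch16-224-in21k checkpoint and uses an illustrative number of classes; it is not fine-tuning code shipped with this repository.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumption: the mirror follows the standard ViT base layout (patch 16, 224x224, ImageNet-21k).
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

num_labels = 10  # hypothetical number of classes in your labeled dataset
linear_head = torch.nn.Linear(encoder.config.hidden_size, num_labels)

image = Image.new("RGB", (224, 224))  # placeholder; use a real image in practice
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]  # final hidden state of the [CLS] token
logits = linear_head(cls_embedding)              # per-class scores for a downstream classifier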

Usage

from modelscope import snapshot_download

# Download the model snapshot from ModelScope and return its local cache path
model_dir = snapshot_download('Genius-Society/ViT')
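Once the snapshot has been downloaded, the returned model_dir can be passed to a loader. A minimal sketch, assuming the mirror stores the weights in Hugging Face transformers format:

from transformers import ViTImageProcessor, ViTModel

# Load the image processor and encoder from the local snapshot directory
# (assumes the mirrored files follow the transformers checkpoint layout).
processor = ViTImageProcessor.from_pretrained(model_dir)
encoder = ViTModel.from_pretrained(model_dir)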

Maintenance

git clone [email protected]:Genius-Society/ViT
cd ViT

Reference

[1] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
