---
license: cc-by-4.0
datasets:
- imagenet-1k
metrics:
- accuracy
pipeline_tag: image-classification
language:
- en
tags:
- vision transformer
- simpool
- computer vision
- deep learning
---
# Supervised ViT-S/16 (small-sized Vision Transformer with patch size 16) model

Official ViT-S model trained on ImageNet-1k for 100 epochs, reproduced for the ICCV 2023 [SimPool](https://arxiv.org/abs/2309.06891) paper.

SimPool is a simple attention-based pooling method applied at the end of the network, released in this [repository](https://github.com/billpsomas/simpool/).

Disclaimer: This model card was written by the author of SimPool, [Bill Psomas](http://users.ntua.gr/psomasbill/).
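For intuition only, the core idea of attention-based pooling can be reduced to a few lines: use the average of the patch tokens as a query, score every token against that query, and return the attention-weighted sum of the tokens. The sketch below is a simplified illustration under these assumptions (the function name and the NumPy implementation are mine, not the official PyTorch code in the repository):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simple_attention_pool(tokens):
    """Pool N patch tokens of dimension d into a single d-dim vector.

    Illustrative sketch: the query is the mean (GAP) of the tokens,
    scores are scaled dot products between the query and each token,
    and the pooled vector is the attention-weighted sum of the tokens.
    """
    n, d = tokens.shape
    query = tokens.mean(axis=0)           # (d,)  GAP-based query
    scores = tokens @ query / np.sqrt(d)  # (n,)  scaled dot products
    weights = softmax(scores)             # (n,)  attention over patches
    return weights @ tokens               # (d,)  pooled representation

# Example: 196 patch tokens (14x14 grid for a 224px image with patch size 16),
# embedding dimension 384 as in ViT-S.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 384))
pooled = simple_attention_pool(tokens)
print(pooled.shape)  # (384,)
```

The pooled vector then replaces the usual CLS token or global average pooling as the input to the classification head; see the repository for the exact SimPool formulation.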
## BibTeX entry and citation info
```bibtex
@misc{psomas2023simpool,
  title={Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?},
  author={Bill Psomas and Ioannis Kakogeorgiou and Konstantinos Karantzalos and Yannis Avrithis},
  year={2023},
  eprint={2309.06891},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```