# ViTDet: Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick†, Kaiming He†
In this repository, we provide configs and models in Detectron2 for ViTDet, as well as for MViTv2 and Swin backbones, using our implementation and the settings described in the ViTDet paper.
## Pretrained Models

### COCO

#### Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.314 | 0.079 | 10.9 | 51.6 | 45.9 | 325346929 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.603 | 0.125 | 20.9 | 55.5 | 49.2 | 325599698 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.098 | 0.178 | 31.5 | 56.7 | 50.2 | 329145471 | model |
#### Cascade Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K, sup | 0.389 | 0.077 | 8.7 | 53.9 | 46.2 | 342979038 | model |
| Swin-L | IN21K, sup | 0.508 | 0.097 | 12.6 | 55.0 | 47.2 | 342979186 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.090 | 8.9 | 55.6 | 48.1 | 325820315 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.157 | 19.7 | 55.7 | 48.3 | 325607715 | model |
| MViTv2-H | IN21K, sup | 1.655 | 0.285 | 18.4* | 55.9 | 48.3 | 326187358 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.362 | 0.089 | 12.3 | 54.0 | 46.7 | 325358525 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.643 | 0.142 | 22.3 | 57.6 | 50.0 | 328021305 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.137 | 0.196 | 32.9 | 58.7 | 51.0 | 328730692 | model |
### LVIS

#### Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.317 | 0.085 | 14.4 | 40.2 | 38.2 | 329225748 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.576 | 0.137 | 24.7 | 46.1 | 43.6 | 329211570 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.059 | 0.186 | 35.3 | 49.1 | 46.0 | 332434656 | model |
#### Cascade Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K, sup | 0.368 | 0.090 | 11.5 | 44.0 | 39.6 | 329222304 | model |
| Swin-L | IN21K, sup | 0.486 | 0.105 | 13.8 | 46.0 | 41.4 | 329222724 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.100 | 11.8 | 46.3 | 42.0 | 329477206 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.172 | 21.0 | 49.4 | 44.2 | 329661552 | model |
| MViTv2-H | IN21K, sup | 1.661 | 0.290 | 21.3* | 49.5 | 44.1 | 330445165 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.356 | 0.099 | 15.2 | 43.0 | 38.9 | 329226874 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.629 | 0.150 | 24.9 | 49.2 | 44.5 | 329042206 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.100 | 0.204 | 35.5 | 51.5 | 46.6 | 332552778 | model |
Note: Unlike the system-level comparisons in the paper, these models use a lower resolution (1024 instead of 1280) and standard NMS (instead of soft NMS). As a result, they have slightly lower box and mask AP.
We observed higher variance in LVIS evaluation results compared to COCO. For example, the standard deviations of box AP and mask AP were 0.30% (compared to 0.10% on COCO) when we trained ViTDet, ViT-B five times with varying random seeds.
The above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. *: Activation checkpointing is used.
## Training
All configs can be trained with:
```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```
By default, we train with 64 GPUs and a total batch size of 64.
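For reference, a run on a single 8-GPU machine might look like the sketch below. The config path, GPU count, and `key=value` overrides are illustrative assumptions rather than settings prescribed by this README; lazy configs accept such dotted overrides on the command line.

```
# Illustrative sketch, not a prescribed command: the config path, batch size,
# and output directory are assumptions for a single 8-GPU machine.
../../tools/lazyconfig_train_net.py \
    --config-file configs/COCO/mask_rcnn_vitdet_b_100ep.py \
    --num-gpus 8 \
    dataloader.train.total_batch_size=8 \
    train.output_dir=./output/vitdet_b_coco
```

If the total batch size is reduced from the default 64, the learning-rate schedule in the config would likely need to be scaled down accordingly.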
## Evaluation
Model evaluation can be done similarly:
```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```
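As a concrete (hypothetical) example, evaluating a downloaded COCO ViT-B Mask R-CNN checkpoint on 8 GPUs might look like the following; the config path and checkpoint location are placeholders, not files guaranteed by this README.

```
# Illustrative sketch: the config path and checkpoint location are assumptions.
../../tools/lazyconfig_train_net.py \
    --config-file configs/COCO/mask_rcnn_vitdet_b_100ep.py \
    --eval-only \
    --num-gpus 8 \
    train.init_checkpoint=/path/to/downloaded/model_checkpoint.pkl
```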
## Citing ViTDet
If you use ViTDet, please use the following BibTeX entry.
```BibTeX
@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}
```