MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[arXiv] [BibTeX]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the MViTv2 repo.

Results and Pretrained Models

COCO

| Name | pre-train | Method | epochs | box AP | mask AP | #params | FLOPS | model id | download |
|---|---|---|---|---|---|---|---|---|---|
| MViTV2-T | IN1K | Mask R-CNN | 36 | 48.3 | 43.8 | 44M | 279G | 307611773 | model |
| MViTV2-T | IN1K | Cascade Mask R-CNN | 36 | 52.2 | 45.0 | 76M | 701G | 308344828 | model |
| MViTV2-S | IN1K | Cascade Mask R-CNN | 36 | 53.2 | 46.0 | 87M | 748G | 308344647 | model |
| MViTV2-B | IN1K | Cascade Mask R-CNN | 36 | 54.1 | 46.7 | 103M | 814G | 308109448 | model |
| MViTV2-B | IN21K | Cascade Mask R-CNN | 36 | 54.9 | 47.4 | 103M | 814G | 309003202 | model |
| MViTV2-L | IN21K | Cascade Mask R-CNN | 50 | 55.8 | 48.3 | 270M | 1519G | 308099658 | model |
| MViTV2-H | IN21K | Cascade Mask R-CNN | 36 | 56.1 | 48.5 | 718M | 3084G | 309013744 | model |

Note that the above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. The ImageNet pre-trained model weights are obtained from the MViTv2 repo.

Training

All configs can be trained with:

../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py

By default, we use 64 GPUs with a batch size of 64 for training.
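
When fewer GPUs are available, the batch size usually has to shrink with them. The sketch below is not part of this repository. It only illustrates the arithmetic, starting from the default setup of 64 GPUs and batch size 64 (one image per GPU) and assuming the learning rate is scaled linearly with the batch size, which is a common heuristic rather than something these configs prescribe. The function name `scale_schedule` and the base learning rate used in the example are hypothetical placeholders; take the real value from the config you train.

```python
def scale_schedule(num_gpus, base_lr, base_batch_size=64, base_gpus=64):
    """Return (batch_size, lr) for a smaller or larger GPU count.

    Assumptions (not taken from the MViTv2 configs):
      * the per-GPU batch size stays the same as in the default setup;
      * the learning rate scales linearly with the total batch size.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Default setup: 64 images over 64 GPUs, i.e. 1 image per GPU.
    per_gpu = base_batch_size // base_gpus
    batch_size = per_gpu * num_gpus
    # Linear scaling heuristic: lr is proportional to the total batch size.
    lr = base_lr * batch_size / base_batch_size
    return batch_size, lr


# Hypothetical base learning rate, used only to show the arithmetic.
batch_size, lr = scale_schedule(num_gpus=8, base_lr=1e-4)
print(batch_size, lr)
```

With the placeholder base learning rate of 1e-4, eight GPUs give a batch size of 8 and a learning rate of about 1.25e-5. Results obtained with a changed schedule may differ from the numbers in the table above.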

Evaluation

Model evaluation can be done similarly:

../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
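
The training and evaluation commands differ only in the `--eval-only` flag and the `train.init_checkpoint=` override. As a minimal sketch, a small helper can assemble either command from the same pieces. The helper name `build_command` is hypothetical and the paths are the placeholders used above, not real files.

```python
import shlex

# Script path as used in the commands above (relative to this project directory).
TRAIN_SCRIPT = "../../tools/lazyconfig_train_net.py"


def build_command(config_file, checkpoint=None, eval_only=False):
    """Assemble the argument list for a training or evaluation run."""
    cmd = [TRAIN_SCRIPT, "--config-file", config_file]
    if eval_only:
        cmd.append("--eval-only")
    if checkpoint is not None:
        # key=value override, as in the evaluation command above.
        cmd.append(f"train.init_checkpoint={checkpoint}")
    return cmd


eval_cmd = build_command(
    "configs/path/to/config.py",
    checkpoint="/path/to/model_checkpoint",
    eval_only=True,
)
# Print a copy-pasteable shell line; pass the list to subprocess.run to execute it.
print(shlex.join(eval_cmd))
```

Keeping the command as a list avoids shell quoting problems when a checkpoint path contains spaces.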

Citing MViTv2

If you use MViTv2, please use the following BibTeX entry.

@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}