# MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[[`arXiv`](https://arxiv.org/abs/2112.01526)] [[`BibTeX`](#CitingMViTv2)]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the [MViTv2 repo](https://github.com/facebookresearch/mvit).

## Results and Pretrained Models

### COCO

| Name | pre-train | Method | epochs | box AP | mask AP | #params | FLOPS | model id | download |
| :--- | :---: | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MViTv2-T | IN1K | Mask R-CNN | 36 | 48.3 | 43.8 | 44M | 279G | 307611773 | model |
| MViTv2-T | IN1K | Cascade Mask R-CNN | 36 | 52.2 | 45.0 | 76M | 701G | 308344828 | model |
| MViTv2-S | IN1K | Cascade Mask R-CNN | 36 | 53.2 | 46.0 | 87M | 748G | 308344647 | model |
| MViTv2-B | IN1K | Cascade Mask R-CNN | 36 | 54.1 | 46.7 | 103M | 814G | 308109448 | model |
| MViTv2-B | IN21K | Cascade Mask R-CNN | 36 | 54.9 | 47.4 | 103M | 814G | 309003202 | model |
| MViTv2-L | IN21K | Cascade Mask R-CNN | 50 | 55.8 | 48.3 | 270M | 1519G | 308099658 | model |
| MViTv2-H | IN21K | Cascade Mask R-CNN | 36 | 56.1 | 48.5 | 718M | 3084G | 309013744 | model |

Note that the above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. The ImageNet pre-trained model weights are obtained from the [MViTv2 repo](https://github.com/facebookresearch/mvit).

## Training

All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```

By default, we use 64 GPUs with a total batch size of 64 for training.

## Evaluation

Model evaluation can be done similarly:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```

<a name="CitingMViTv2"></a>
## Citing MViTv2

If you use MViTv2, please use the following BibTeX entry.

```BibTeX
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
```
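
## Notes on config overrides

The evaluation command passes `train.init_checkpoint=/path/to/model_checkpoint` after the config file, i.e. a dotted `key=value` override of a field in the Python config. The sketch below illustrates that idea on a plain nested dict, under stated assumptions: `apply_dotted_overrides` and `images_per_gpu` are hypothetical helpers written only for this note, the trimmed config is made up, and Detectron2's own override handling may parse values and treat missing keys differently. The second helper restates the default training setup: a total batch of 64 images spread over 64 GPUs is one image per GPU.

```python
from copy import deepcopy
from typing import Any


def apply_dotted_overrides(cfg: dict, overrides: list[str]) -> dict:
    """Return a copy of `cfg` with each "a.b.c=value" override applied.

    Hypothetical helper (not Detectron2's implementation): values are kept
    as strings, and every key along the dotted path must already exist.
    """
    out = deepcopy(cfg)
    for item in overrides:
        key, sep, value = item.partition("=")  # split on the first "=" only
        if not sep:
            raise ValueError(f"override must look like key=value: {item!r}")
        *parents, leaf = key.split(".")
        node: Any = out
        for name in parents:
            node = node[name]  # raises KeyError if the path does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        node[leaf] = value
    return out


def images_per_gpu(total_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across GPUs (64 / 64 -> 1 image per GPU)."""
    if total_batch_size % num_gpus:
        raise ValueError("total batch size must be divisible by the GPU count")
    return total_batch_size // num_gpus


# Hypothetical, heavily trimmed config; real MViTv2 configs contain many more fields.
cfg = {"train": {"init_checkpoint": "", "output_dir": "./output"}}
cfg = apply_dotted_overrides(cfg, ["train.init_checkpoint=/path/to/model_checkpoint"])
print(cfg["train"]["init_checkpoint"])  # /path/to/model_checkpoint
print(images_per_gpu(64, 64))  # 1
```

Keeping values as strings and rejecting unknown keys keeps the sketch small and catches typos such as `train.init_checkpiont`; a full config system would also need type conversion for numeric or boolean values.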