# MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[[`arXiv`](https://arxiv.org/abs/2112.01526)] [[`BibTeX`](#CitingMViTv2)]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the [MViTv2 repo](https://github.com/facebookresearch/mvit).

## Results and Pretrained Models

### COCO

Name | pre-train | Method | epochs | box AP | mask AP | #params | FLOPS | model id | download |
---|---|---|---|---|---|---|---|---|---|
MViTV2-T | IN1K | Mask R-CNN | 36 | 48.3 | 43.8 | 44M | 279G | 307611773 | model |
MViTV2-T | IN1K | Cascade Mask R-CNN | 36 | 52.2 | 45.0 | 76M | 701G | 308344828 | model |
MViTV2-S | IN1K | Cascade Mask R-CNN | 36 | 53.2 | 46.0 | 87M | 748G | 308344647 | model |
MViTV2-B | IN1K | Cascade Mask R-CNN | 36 | 54.1 | 46.7 | 103M | 814G | 308109448 | model |
MViTV2-B | IN21K | Cascade Mask R-CNN | 36 | 54.9 | 47.4 | 103M | 814G | 309003202 | model |
MViTV2-L | IN21K | Cascade Mask R-CNN | 50 | 55.8 | 48.3 | 270M | 1519G | 308099658 | model |
MViTV2-H | IN21K | Cascade Mask R-CNN | 36 | 56.1 | 48.5 | 718M | 3084G | 309013744 | model |
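
To compare rows directly, for example the effect of IN21K versus IN1K pre-training at an identical model size, the table can be loaded into structured records. The following is a minimal sketch, assuming the rows are copied by hand into a string with the download column dropped. The `Result` record and the `parse` and `find` helpers are hypothetical illustrations and are not part of Detectron2. The `#params` values are treated as millions and the FLOPS values as G, as written in the table.

```python
from dataclasses import dataclass
from typing import List

# Rows copied from the COCO table above (download column omitted).
TABLE = """\
MViTV2-T | IN1K | Mask R-CNN | 36 | 48.3 | 43.8 | 44M | 279G | 307611773
MViTV2-T | IN1K | Cascade Mask R-CNN | 36 | 52.2 | 45.0 | 76M | 701G | 308344828
MViTV2-S | IN1K | Cascade Mask R-CNN | 36 | 53.2 | 46.0 | 87M | 748G | 308344647
MViTV2-B | IN1K | Cascade Mask R-CNN | 36 | 54.1 | 46.7 | 103M | 814G | 308109448
MViTV2-B | IN21K | Cascade Mask R-CNN | 36 | 54.9 | 47.4 | 103M | 814G | 309003202
MViTV2-L | IN21K | Cascade Mask R-CNN | 50 | 55.8 | 48.3 | 270M | 1519G | 308099658
MViTV2-H | IN21K | Cascade Mask R-CNN | 36 | 56.1 | 48.5 | 718M | 3084G | 309013744
"""


@dataclass(frozen=True)
class Result:
    # Hypothetical record type: one row of the results table.
    name: str
    pretrain: str
    method: str
    epochs: int
    box_ap: float
    mask_ap: float
    params_m: int  # "#params" column, in millions
    gflops: int    # "FLOPS" column, in G
    model_id: str


def parse(table: str) -> List[Result]:
    """Split each pipe-delimited row into a typed Result."""
    rows = []
    for line in table.strip().splitlines():
        f = [cell.strip() for cell in line.split("|")]
        rows.append(Result(
            name=f[0], pretrain=f[1], method=f[2], epochs=int(f[3]),
            box_ap=float(f[4]), mask_ap=float(f[5]),
            params_m=int(f[6].rstrip("M")), gflops=int(f[7].rstrip("G")),
            model_id=f[8],
        ))
    return rows


def find(rows: List[Result], name: str, pretrain: str, method: str) -> Result:
    """Return the first row matching model name, pre-training set and method."""
    return next(r for r in rows if (r.name, r.pretrain, r.method) == (name, pretrain, method))


if __name__ == "__main__":
    rows = parse(TABLE)
    b1k = find(rows, "MViTV2-B", "IN1K", "Cascade Mask R-CNN")
    b21k = find(rows, "MViTV2-B", "IN21K", "Cascade Mask R-CNN")
    # Same architecture and FLOPS; only the pre-training data differs.
    print(f"box AP gain: +{round(b21k.box_ap - b1k.box_ap, 1)}")    # box AP gain: +0.8
    print(f"mask AP gain: +{round(b21k.mask_ap - b1k.mask_ap, 1)}")  # mask AP gain: +0.7
```

Rounding to one decimal matches the table's precision and hides floating-point noise in the subtraction. The same `find` lookup also isolates the detector choice: for MViTV2-T with IN1K pre-training, Cascade Mask R-CNN is +3.9 box AP and +1.2 mask AP over Mask R-CNN, at roughly 2.5x the FLOPS.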