# MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[[`arXiv`](https://arxiv.org/abs/2112.01526)] [[`BibTeX`](#CitingMViTv2)]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the [MViTv2 repo](https://github.com/facebookresearch/mvit).

## Results and Pretrained Models

### COCO

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom">Name</th>
<th valign="bottom">pre-train</th>
<th valign="bottom">Method</th>
<th valign="bottom">epochs</th>
<th valign="bottom">box<br/>AP</th>
<th valign="bottom">mask<br/>AP</th>
<th valign="bottom">#params</th>
<th valign="bottom">FLOPS</th>
<th valign="bottom">model id</th>
<th valign="bottom">download</th>
<!-- TABLE BODY -->
<!-- ROW: mask_rcnn_mvitv2_t_3x -->
 <tr><td align="left"><a href="configs/mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Mask R-CNN</td>
<td align="center">36</td>
<td align="center">48.3</td>
<td align="center">43.8</td>
<td align="center">44M</td>
<td align="center">279G</td>
<td align="center">307611773</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/mask_rcnn_mvitv2_t_3x/f307611773/model_final_1a1c30.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_t_3x -->
 <tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">52.2</td>
<td align="center">45.0</td>
<td align="center">76M</td>
<td align="center">701G</td>
<td align="center">308344828</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_t_3x/f308344828/model_final_c6967a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_s_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_s_3x.py">MViTV2-S</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">53.2</td>
<td align="center">46.0</td>
<td align="center">87M</td>
<td align="center">748G</td>
<td align="center">308344647</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_s_3x/f308344647/model_final_279baf.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_3x.py">MViTV2-B</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.1</td>
<td align="center">46.7</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">308109448</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_3x/f308109448/model_final_421a91.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_in21k_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_in21k_3x.py">MViTV2-B</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.9</td>
<td align="center">47.4</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">309003202</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_in12k_3x/f309003202/model_final_be5168.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep.py">MViTV2-L</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">50</td>
<td align="center">55.8</td>
<td align="center">48.3</td>
<td align="center">270M</td>
<td align="center">1519G</td>
<td align="center">308099658</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_l_in12k_lsj_50ep/f308099658/model_final_c41c5a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x.py">MViTV2-H</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">56.1</td>
<td align="center">48.5</td>
<td align="center">718M</td>
<td align="center">3084G</td>
<td align="center">309013744</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_h_in12k_lsj_3x/f309013744/model_final_30d36b.pkl">model</a></td>
</tr>
</tbody></table>

Note that the above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. The ImageNet pre-trained model weights are obtained from the [MViTv2 repo](https://github.com/facebookresearch/mvit).
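Each `model` link in the table above points to a `.pkl` checkpoint that can be downloaded directly and then passed to the evaluation command below. For example, to fetch the MViTV2-T Mask R-CNN checkpoint (URL copied verbatim from the first table row):

```
wget https://dl.fbaipublicfiles.com/detectron2/MViTv2/mask_rcnn_mvitv2_t_3x/f307611773/model_final_1a1c30.pkl
```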

## Training
All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```
By default, we use 64 GPUs with a total batch size of 64 for training.
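If you have fewer GPUs, the training script also accepts the standard Detectron2 launcher flags and `key=value` config overrides on the command line. A sketch for a single 8-GPU machine; the field name `dataloader.train.total_batch_size` follows the common Detectron2 LazyConfig layout and should be verified against the specific config file:

```
../../tools/lazyconfig_train_net.py --config-file configs/mask_rcnn_mvitv2_t_3x.py \
  --num-gpus 8 dataloader.train.total_batch_size=8
```

If you reduce the total batch size this way, scale the learning rate (`optimizer.lr`) down proportionally from the config's default, per the linear scaling rule.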

## Evaluation
Model evaluation can be done similarly:
```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```
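The trained models can also be loaded outside the training script through Detectron2's LazyConfig API, e.g. for custom inference code. A minimal sketch, assuming a checkpoint downloaded from the table above (paths are placeholders):

```python
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

# Load the Python-based (lazy) config and build the model it describes.
cfg = LazyConfig.load("configs/mask_rcnn_mvitv2_t_3x.py")
model = instantiate(cfg.model)

# Load weights from a downloaded checkpoint; the path is a placeholder.
DetectionCheckpointer(model).load("/path/to/model_checkpoint")
model.eval()
```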

## <a name="CitingMViTv2"></a>Citing MViTv2

If you use MViTv2, please use the following BibTeX entry.

```BibTeX
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
```