Detectron2 Model Zoo and Baselines
Introduction
This file documents a large collection of baselines trained with detectron2 in Sep-Oct, 2019. All numbers were obtained on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink. The speed numbers are periodically updated with latest PyTorch/CUDA/cuDNN versions. You can access these models from code using detectron2.model_zoo APIs.
In addition to these official baseline models, you can find more models in projects/.
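As a rough sketch of the detectron2.model_zoo access mentioned above, the snippet below looks up the config file, the checkpoint URL, and a built model for one zoo entry. The helper names zoo_config_path and load_from_zoo are hypothetical, and the example path "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml" is an assumption about how the config files are laid out. Calling load_from_zoo requires detectron2 to be installed and may download a checkpoint.

```python
def zoo_config_path(task_dir: str, config_name: str) -> str:
    """Join a config directory and a config name into a zoo config path.

    Hypothetical helper: the "<task_dir>/<config_name>.yaml" layout is an
    assumption based on the config files linked from the "Name" column.
    """
    return f"{task_dir}/{config_name}.yaml"


def load_from_zoo(config_path: str, trained: bool = True):
    """Return (local config file, checkpoint URL, model) for a zoo entry.

    detectron2 is imported lazily, so this module can be imported without
    detectron2 installed; calling this function does require it.
    """
    from detectron2 import model_zoo

    cfg_file = model_zoo.get_config_file(config_path)        # path to the config
    weights_url = model_zoo.get_checkpoint_url(config_path)  # URL of the weights
    model = model_zoo.get(config_path, trained=trained)      # built model
    return cfg_file, weights_url, model


# Assumed example entry: Faster R-CNN, R50-FPN, 3x schedule.
EXAMPLE_CONFIG = zoo_config_path("COCO-Detection", "faster_rcnn_R_50_FPN_3x")
```

With detectron2 installed, load_from_zoo(EXAMPLE_CONFIG) would be the call to try first. Passing trained=False skips loading the pretrained weights.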
How to Read the Tables
- The "Name" column contains a link to the config file. Models can be reproduced using tools/train_net.py with the corresponding yaml config file, or tools/lazyconfig_train_net.py for python config files.
- Training speed is averaged across the entire training. We keep updating the speed with latest version of detectron2/pytorch/etc., so they might be different from the metrics file. Training speed for multi-machine jobs is not provided.
- Inference speed is measured by tools/train_net.py --eval-only, or inference_on_dataset(), with batch size 1 in detectron2 directly. Measuring it with custom code may introduce other overhead. Actual deployment in production should in general be faster than the given inference speed due to more optimizations.
- The model id column is provided for ease of reference. To check downloaded file integrity, any model on this page contains its md5 prefix in its file name.
- Training curves and other statistics can be found in metrics for each model.
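The integrity check described in the list above could be sketched as follows. The file-name pattern (a hex run after the last underscore, as in model_final_<hex>.pkl) is an assumption, and the function names are hypothetical. Only the idea of comparing the file's md5 against the prefix embedded in its name comes from this page.

```python
import hashlib
import re
from pathlib import Path


def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_prefix_from_name(file_name: str):
    """Extract the hex prefix from a name such as 'model_final_<hex>.pkl'.

    ASSUMPTION: the prefix is a run of 4 to 32 lowercase hex characters
    between an underscore and the file extension. Returns None otherwise.
    """
    match = re.search(r"_([0-9a-f]{4,32})\.[A-Za-z0-9]+$", file_name)
    return match.group(1) if match else None


def matches_md5_prefix(path: Path) -> bool:
    """Check that the file's md5 digest starts with the prefix in its name."""
    prefix = md5_prefix_from_name(path.name)
    if prefix is None:
        raise ValueError(f"no md5 prefix found in file name: {path.name}")
    return md5_hex(path).startswith(prefix)
```

A prefix match only detects truncated or corrupted downloads. It is not a security guarantee.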
Common Settings for COCO Models
All COCO models were trained on train2017 and evaluated on val2017.
The default settings are not directly comparable with Detectron's standard settings. For example, our default training data augmentation uses scale jittering in addition to horizontal flipping.
To make fair comparisons with Detectron's settings, see Detectron1-Comparisons for accuracy comparison, and benchmarks for speed comparison.
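To make the scale jittering plus horizontal flipping concrete, here is a minimal sketch of how one training sample's augmentation could be drawn. The short-edge choices (640 to 800) and the 1333 cap on the long edge are assumptions about the default COCO training settings, not values stated on this page. The function names are hypothetical.

```python
import random

# ASSUMPTION: believed to match the default COCO training input settings.
SHORT_EDGE_CHOICES = (640, 672, 704, 736, 768, 800)
MAX_LONG_EDGE = 1333


def resized_shape(height, width, short_edge, max_long_edge=MAX_LONG_EDGE):
    """Resize so the short edge equals short_edge, capping the long edge."""
    scale = short_edge / min(height, width)
    new_h, new_w = height * scale, width * scale
    if max(new_h, new_w) > max_long_edge:
        # Shrink both sides so the long edge fits within the cap.
        shrink = max_long_edge / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink
    return int(new_h + 0.5), int(new_w + 0.5)


def sample_augmentation(height, width, rng):
    """Draw a jittered output shape and a horizontal-flip decision."""
    short_edge = rng.choice(SHORT_EDGE_CHOICES)
    flip = rng.random() < 0.5
    return resized_shape(height, width, short_edge), flip
```

For a 480x640 image and a sampled short edge of 800, resized_shape gives (800, 1067). A very wide 500x2000 image is limited by the long-edge cap and becomes (333, 1333).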
For Faster/Mask R-CNN, we provide baselines based on 3 different backbone combinations:
- FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
- C4: Use a ResNet conv4 backbone with conv5 head. The original baseline in the Faster R-CNN paper.
- DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
Most models are trained with the 3x schedule (roughly 37 COCO epochs). Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (roughly 12 COCO epochs) training schedule for comparison when doing quick research iteration.
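The epoch counts quoted for the 1x and 3x schedules can be reproduced with a little arithmetic. This sketch assumes a 1x schedule of 90,000 iterations and 16 images per iteration. Neither number is stated on this page, so treat both as assumptions. The 118,000 images per epoch is the figure this page uses for COCO.

```python
COCO_TRAIN_IMAGES = 118_000  # images per epoch, as quoted on this page

# ASSUMPTIONS (not stated on this page):
ITERATIONS_1X = 90_000       # iterations in a 1x schedule
IMAGES_PER_ITERATION = 16    # total batch size across all GPUs


def schedule_epochs(multiplier: int) -> float:
    """Approximate number of COCO epochs in an Nx schedule."""
    iterations = multiplier * ITERATIONS_1X
    return iterations * IMAGES_PER_ITERATION / COCO_TRAIN_IMAGES


def training_hours(seconds_per_iter: float, multiplier: int) -> float:
    """Estimated wall-clock hours from a table's 'train time (s/iter)'."""
    return seconds_per_iter * multiplier * ITERATIONS_1X / 3600


if __name__ == "__main__":
    for mult in (1, 3):
        print(f"{mult}x schedule: about {schedule_epochs(mult):.1f} epochs")
```

Under these assumptions, schedule_epochs gives about 12.2 epochs for 1x and about 36.6 for 3x. A 3x model listed at 0.209 s/iter would take roughly 15.7 hours of training time.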
ImageNet Pretrained Models
It's common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
- R-50.pkl: converted copy of MSRA's original ResNet-50 model.
- R-101.pkl: converted copy of MSRA's original ResNet-101 model.
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
- R-50.pkl (torchvision): converted copy of torchvision's ResNet-50 model. More details can be found in the conversion script.
Note that the above models have a different format from those provided in Detectron: we do not fuse BatchNorm into an affine layer. Pretrained models in Detectron's format can still be used. For example:
- X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see ResNeXt paper for details on ImageNet-5k).
- R-50-GN.pkl: ResNet-50 with Group Normalization.
- R-101-GN.pkl: ResNet-101 with Group Normalization.
These models require slightly different settings regarding normalization and architecture. See the model zoo configs for reference.
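For readers unfamiliar with the BatchNorm fusing mentioned in the note above, the following sketch shows the usual arithmetic for folding frozen BatchNorm statistics into a per-channel affine transform y = scale * x + bias. It illustrates the general technique with scalar, single-channel values. It is not the conversion script's code, and the function names and the eps default are assumptions.

```python
import math


def batchnorm(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time BatchNorm for one channel."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


def fuse_batchnorm(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold frozen BatchNorm statistics into (scale, bias) for one channel.

    The fused affine layer computes scale * x + bias, which equals
    batchnorm(x, ...) when the running statistics are held fixed.
    """
    scale = gamma / math.sqrt(running_var + eps)
    bias = beta - running_mean * scale
    return scale, bias
```

The fused form stores two numbers per channel instead of four. That is why a checkpoint in one format cannot be loaded as the other without the matching normalization settings.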
License
All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.
COCO Object Detection Baselines
Faster R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
---|---|---|---|---|---|---|---|
R50-C4 | 1x | 0.551 | 0.102 | 4.8 | 35.7 | 137257644 | model | metrics |
R50-DC5 | 1x | 0.380 | 0.068 | 5.0 | 37.3 | 137847829 | model | metrics |
R50-FPN | 1x | 0.210 | 0.038 | 3.0 | 37.9 | 137257794 | model | metrics |
R50-C4 | 3x | 0.543 | 0.104 | 4.8 | 38.4 | 137849393 | model | metrics |
R50-DC5 | 3x | 0.378 | 0.070 | 5.0 | 39.0 | 137849425 | model | metrics |
R50-FPN | 3x | 0.209 | 0.038 | 3.0 | 40.2 | 137849458 | model | metrics |
R101-C4 | 3x | 0.619 | 0.139 | 5.9 | 41.1 | 138204752 | model | metrics |
R101-DC5 | 3x | 0.452 | 0.086 | 6.1 | 40.6 | 138204841 | model | metrics |
R101-FPN | 3x | 0.286 | 0.051 | 4.1 | 42.0 | 137851257 | model | metrics |
X101-FPN | 3x | 0.638 | 0.098 | 6.7 | 43.0 | 139173657 | model | metrics |
RetinaNet:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
---|---|---|---|---|---|---|---|
R50 | 1x | 0.205 | 0.041 | 4.1 | 37.4 | 190397773 | model | metrics |
R50 | 3x | 0.205 | 0.041 | 4.1 | 38.7 | 190397829 | model | metrics |
R101 | 3x | 0.291 | 0.054 | 5.2 | 40.4 | 190397697 | model | metrics |
RPN & Fast R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | prop. AR | model id | download |
---|---|---|---|---|---|---|---|---|
RPN R50-C4 | 1x | 0.130 | 0.034 | 1.5 | | 51.6 | 137258005 | model | metrics |
RPN R50-FPN | 1x | 0.186 | 0.032 | 2.7 | | 58.0 | 137258492 | model | metrics |
Fast R-CNN R50-FPN | 1x | 0.140 | 0.029 | 2.6 | 37.8 | | 137635226 | model | metrics |
COCO Instance Segmentation Baselines with Mask R-CNN
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 | 137259246 | model | metrics |
R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 | 137260150 | model | metrics |
R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model | metrics |
R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 | 137849525 | model | metrics |
R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 | 137849551 | model | metrics |
R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model | metrics |
R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 | 138363239 | model | metrics |
R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 | 138363294 | model | metrics |
R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 | 138205316 | model | metrics |
X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 | 139653917 | model | metrics |
New baselines using Large-Scale Jitter and Longer Training Schedule
The following baselines of COCO Instance Segmentation with Mask R-CNN are generated using a longer training schedule and large-scale jitter as described in Google's Simple Copy-Paste Data Augmentation paper. These models are trained from scratch using random initialization. These baselines exceed the previous Mask R-CNN baselines.
In the following table, one epoch consists of training on 118000 COCO images.
Name | epochs | train time (s/im) | inference time (s/im) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|
R50-FPN | 100 | 0.376 | 0.069 | 44.6 | 40.3 | 42047764 | model | metrics |
R50-FPN | 200 | 0.376 | 0.069 | 46.3 | 41.7 | 42047638 | model | metrics |
R50-FPN | 400 | 0.376 | 0.069 | 47.4 | 42.5 | 42019571 | model | metrics |
R101-FPN | 100 | 0.518 | 0.073 | 46.4 | 41.6 | 42025812 | model | metrics |
R101-FPN | 200 | 0.518 | 0.073 | 48.0 | 43.1 | 42131867 | model | metrics |
R101-FPN | 400 | 0.518 | 0.073 | 48.9 | 43.7 | 42073830 | model | metrics |
regnetx_4gf_dds_FPN | 100 | 0.474 | 0.071 | 46.0 | 41.3 | 42047771 | model | metrics |
regnetx_4gf_dds_FPN | 200 | 0.474 | 0.071 | 48.1 | 43.1 | 42132721 | model | metrics |
regnetx_4gf_dds_FPN | 400 | 0.474 | 0.071 | 48.6 | 43.5 | 42025447 | model | metrics |
regnety_4gf_dds_FPN | 100 | 0.487 | 0.073 | 46.1 | 41.6 | 42047784 | model | metrics |
regnety_4gf_dds_FPN | 200 | 0.487 | 0.072 | 47.8 | 43.0 | 42047642 | model | metrics |
regnety_4gf_dds_FPN | 400 | 0.487 | 0.072 | 48.2 | 43.3 | 42045954 | model | metrics |
COCO Person Keypoint Detection Baselines with Keypoint R-CNN
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | kp. AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.315 | 0.072 | 5.0 | 53.6 | 64.0 | 137261548 | model | metrics |
R50-FPN | 3x | 0.316 | 0.066 | 5.0 | 55.4 | 65.5 | 137849621 | model | metrics |
R101-FPN | 3x | 0.390 | 0.076 | 6.1 | 56.4 | 66.1 | 138363331 | model | metrics |
X101-FPN | 3x | 0.738 | 0.121 | 8.7 | 57.3 | 66.0 | 139686956 | model | metrics |
COCO Panoptic Segmentation Baselines with Panoptic FPN
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
---|---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.304 | 0.053 | 4.8 | 37.6 | 34.7 | 39.4 | 139514544 | model | metrics |
R50-FPN | 3x | 0.302 | 0.053 | 4.8 | 40.0 | 36.5 | 41.5 | 139514569 | model | metrics |
R101-FPN | 3x | 0.392 | 0.066 | 6.0 | 42.4 | 38.5 | 43.0 | 139514519 | model | metrics |
LVIS Instance Segmentation Baselines with Mask R-CNN
Mask R-CNN baselines on the LVIS dataset, v0.5. These baselines are described in Table 3(c) of the LVIS paper.
NOTE: the 1x schedule here has the same number of iterations as the COCO 1x baselines, which is roughly 24 epochs of LVISv0.5 data. The final results of these configs have large variance across different runs.
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN | 1x | 0.292 | 0.107 | 7.1 | 23.6 | 24.4 | 144219072 | model | metrics |
R101-FPN | 1x | 0.371 | 0.114 | 7.8 | 25.6 | 25.9 | 144219035 | model | metrics |
X101-FPN | 1x | 0.712 | 0.151 | 10.2 | 26.7 | 27.1 | 144219108 | model | metrics |
Cityscapes & Pascal VOC Baselines
Simple baselines for
- Mask R-CNN on Cityscapes instance segmentation (initialized from COCO pre-training, then trained on Cityscapes fine annotations only)
- Faster R-CNN on PASCAL VOC object detection (trained on VOC 2007 train+val + VOC 2012 train+val, tested on VOC 2007 using 11-point interpolated AP)
Name | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | box AP50 | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
R50-FPN, Cityscapes | 0.240 | 0.078 | 4.4 | | | 36.5 | 142423278 | model | metrics |
R50-C4, VOC | 0.537 | 0.081 | 4.8 | 51.9 | 80.3 | | 142202221 | model | metrics |
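The VOC row above is scored with 11-point interpolated AP. As a minimal sketch of that metric, the function below averages, over recall thresholds 0.0, 0.1, ..., 1.0, the best precision reached at any recall at or above the threshold. The function name and the input format (parallel lists of recall and precision points for one class) are assumptions for illustration. This is not the evaluator's actual code.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision for one class.

    recalls and precisions are parallel sequences of points on the
    precision/recall curve. A threshold that no recall value reaches
    contributes a precision of 0.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    total = 0.0
    for i in range(11):
        threshold = i / 10
        reachable = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(reachable, default=0.0)
    return total / 11
```

For example, a curve with recalls [0.2, 0.4, 0.6] and precisions [1.0, 0.8, 0.6] scores 5.8 / 11, about 0.527, because the four thresholds above 0.6 contribute nothing.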
Other Settings
Ablations for Deformable Conv and Cascade R-CNN:
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Baseline R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model | metrics |
Deformable Conv | 1x | 0.342 | 0.048 | 3.5 | 41.5 | 37.5 | 138602867 | model | metrics |
Cascade R-CNN | 1x | 0.317 | 0.052 | 4.0 | 42.1 | 36.4 | 138602847 | model | metrics |
Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model | metrics |
Deformable Conv | 3x | 0.349 | 0.047 | 3.5 | 42.7 | 38.5 | 144998336 | model | metrics |
Cascade R-CNN | 3x | 0.328 | 0.053 | 4.0 | 44.3 | 38.5 | 144998488 | model | metrics |
Ablations for normalization methods, and a few models trained from scratch following Rethinking ImageNet Pre-training.
(Note: The baseline uses 2fc head while the others use 4conv1fc head)
Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model | metrics |
GN | 3x | 0.309 | 0.060 | 5.6 | 42.6 | 38.6 | 138602888 | model | metrics |
SyncBN | 3x | 0.345 | 0.053 | 5.5 | 41.9 | 37.8 | 169527823 | model | metrics |
GN (from scratch) | 3x | 0.338 | 0.061 | 7.2 | 39.9 | 36.6 | 138602908 | model | metrics |
GN (from scratch) | 9x | N/A | 0.061 | 7.2 | 43.7 | 39.6 | 183808979 | model | metrics |
SyncBN (from scratch) | 9x | N/A | 0.055 | 7.2 | 43.6 | 39.3 | 184226666 | model | metrics |
A few very large models trained for a long time, for demo purposes. They are trained using multiple machines:
Name | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
---|---|---|---|---|---|---|---|
Panoptic FPN R101 | 0.098 | 11.4 | 47.4 | 41.3 | 46.1 | 139797668 | model | metrics |
Mask R-CNN X152 | 0.234 | 15.1 | 50.2 | 44.0 | | 18131413 | model | metrics |
above + test-time aug. | | | 51.9 | 45.9 | | | |