YOLOv7
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
Abstract
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU. The YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights. Source code is released in this https URL.
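The speed percentages quoted in the abstract can be reproduced from the FPS figures it lists. The sketch below assumes the percentages are relative increases over the baseline FPS and that the accuracy gaps are absolute differences in AP points; the function and variable names are illustrative only. Note that, per the abstract, the YOLOv7-E6 figure is measured on a V100 while the baselines are measured on an A100.

```python
def relative_speedup_percent(fps_new: float, fps_baseline: float) -> float:
    """Percentage by which fps_new exceeds fps_baseline (assumed definition)."""
    return (fps_new - fps_baseline) / fps_baseline * 100.0


# Figures taken from the abstract.
yolov7_e6_fps, yolov7_e6_ap = 56.0, 55.9
swin_l_fps, swin_l_ap = 9.2, 53.9
convnext_xl_fps, convnext_xl_ap = 8.6, 55.2

speedup_vs_swin = relative_speedup_percent(yolov7_e6_fps, swin_l_fps)
speedup_vs_convnext = relative_speedup_percent(yolov7_e6_fps, convnext_xl_fps)

# Accuracy gaps as absolute differences in AP points.
ap_gain_vs_swin = yolov7_e6_ap - swin_l_ap
ap_gain_vs_convnext = yolov7_e6_ap - convnext_xl_ap

print(f"vs SWIN-L:      +{speedup_vs_swin:.0f}% speed, +{ap_gain_vs_swin:.1f} AP")
print(f"vs ConvNeXt-XL: +{speedup_vs_convnext:.0f}% speed, +{ap_gain_vs_convnext:.1f} AP")
```

Under these assumptions the computed values round to the 509% and 551% speed figures and the 2 and 0.7 AP gaps stated in the abstract.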
Results and models
COCO
Backbone | Arch | Size | SyncBN | AMP | Mem (GB) | Box AP | Config | Download |
---|---|---|---|---|---|---|---|---|
YOLOv7-tiny | P5 | 640 | Yes | Yes | 2.7 | 37.5 | config | model / log |
YOLOv7-l | P5 | 640 | Yes | Yes | 10.3 | 50.9 | config | model / log |
YOLOv7-x | P5 | 640 | Yes | Yes | 13.7 | 52.8 | config | model / log |
YOLOv7-w | P6 | 1280 | Yes | Yes | 27.0 | 54.1 | config | model / log |
YOLOv7-e | P6 | 1280 | Yes | Yes | 42.5 | 55.1 | config | model / log |
Note:
In the official YOLOv7 code, the `random_perspective` data augmentation used when training on the COCO object detection task relies on mask annotation information, which leads to higher performance. Object detection should not use mask annotations, so only box annotation information is used in `MMYOLO`. We will use the mask annotation information in the instance segmentation task.
- The performance is unstable and may fluctuate by about 0.3 mAP. The performance shown above is that of the best model.
- If users need the weights of `YOLOv7-e2e`, they can train with the configs we provide or convert the official weights with the converter script.
- `fast` means that `YOLOv5DetDataPreprocessor` and `yolov5_collate` are used for data preprocessing, which is faster for training but less flexible for multitasking. We recommend the fast version of the config if you only care about object detection.
- `SyncBN` means that synchronized batch normalization is used; `AMP` indicates training with mixed precision.
- We use 8x A100 for training, and the single-GPU batch size is 16. This is different from the official code.
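As a small worked example of the training setup described in the notes, the sketch below computes the total number of images processed per optimizer step, assuming plain data-parallel training without gradient accumulation (an assumption; the notes do not say). The function name is hypothetical and is not part of MMYOLO.

```python
def global_batch_size(num_gpus: int, batch_size_per_gpu: int) -> int:
    """Images per optimizer step under data-parallel training,
    assuming no gradient accumulation."""
    return num_gpus * batch_size_per_gpu


# Values from the notes: 8x A100, single-GPU batch size of 16.
total = global_batch_size(num_gpus=8, batch_size_per_gpu=16)
print(total)  # 128
```

When reproducing the results on fewer GPUs, this total is the quantity that changes, which is one reason results may differ from those in the table.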
Citation
@article{wang2022yolov7,
title={{YOLOv7}: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors},
author={Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark},
journal={arXiv preprint arXiv:2207.02696},
year={2022}
}