metadata
tags:
  - image-segmentation
library_name: coreml
license: apache-2.0

DETR-Resnet50 (semantic segmentation) Core ML Models

See the Files tab for converted models.

DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper "End-to-End Object Detection with Transformers" by Carion et al. and first released in this repository.

Model description

The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs to perform object detection: a linear layer for the class labels and an MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100.

The model is trained using a "bipartite matching loss": the predicted classes and bounding boxes of the N = 100 object queries are compared to the ground truth annotations, which are padded to the same length N (so if an image contains only 4 objects, the remaining 96 annotations have "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between the N queries and the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU losses (for the bounding boxes) are used to optimize the parameters of the model.
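The padding and one-to-one matching step can be illustrated with a small, self-contained sketch. This is not the training code: it swaps the Hungarian algorithm for a brute-force search over permutations (only feasible for a handful of queries), uses only an L1 box cost (no generalized IoU term), and the class indices and helper names (`NO_OBJECT`, `pad_targets`, `pair_cost`, `match`) are hypothetical.

```python
from itertools import permutations

NO_OBJECT = 0  # hypothetical class index reserved for "no object"


def l1(box_a, box_b):
    """L1 distance between two boxes given as 4-tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pad_targets(targets, n):
    """Pad (class, box) annotations with "no object" entries up to length n."""
    return targets + [(NO_OBJECT, None)] * (n - len(targets))


def pair_cost(pred, target):
    """Simplified matching cost: -p(class) + L1(box); zero for "no object"."""
    probs, box = pred
    cls, gt_box = target
    if gt_box is None:
        return 0.0
    return -probs[cls] + l1(box, gt_box)


def match(preds, targets):
    """Return, for each query index, the index of its assigned padded target.

    Brute force stands in for the Hungarian algorithm here (assumption:
    the number of queries is tiny).
    """
    n = len(preds)
    padded = pad_targets(targets, n)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(pair_cost(preds[q], padded[t]) for q, t in enumerate(perm)),
    )
    return list(best)


# Three queries, two annotated objects (classes 1 and 2), one padded slot.
preds = [
    ([0.1, 0.1, 0.8], (0.5, 0.5, 0.2, 0.2)),
    ([0.9, 0.05, 0.05], (0.1, 0.1, 0.1, 0.1)),
    ([0.2, 0.7, 0.1], (0.3, 0.7, 0.4, 0.4)),
]
targets = [(1, (0.3, 0.7, 0.4, 0.4)), (2, (0.5, 0.5, 0.2, 0.2))]
print(match(preds, targets))  # [1, 2, 0]
```

Query 1 is assigned to the padded "no object" slot (index 2). For N = 100 a polynomial-time solver is needed; `scipy.optimize.linear_sum_assignment` is one commonly used option.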

model image

Evaluation - Variants

| Variant | Parameters | Size (MB) | Weight Precision | Act. Precision | IoU | Pixel accuracy |
|---|---|---|---|---|---|---|
| facebook/detr-resnet-50-panoptic (PyTorch) | 43M | 172 | Float32 | Float32 | 0.393 | 0.746 |
| DETRResnet50SemanticSegmentationF32 | 43M | 171 | Float32 | Float32 | 0.393 | 0.746 |
| DETRResnet50SemanticSegmentationF16 | 43M | 86 | Float16 | Float16 | 0.395 | 0.746 |
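The reported sizes are roughly what the parameter count and weight precision predict. The following back-of-the-envelope check assumes 1 MB = 10^6 bytes, treats 43M as an exact count (it is a rounded figure), and ignores any non-weight data in the package; `weights_size_mb` is a hypothetical helper.

```python
def weights_size_mb(num_params, bytes_per_param):
    """Approximate weight storage in MB (assuming 1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


print(weights_size_mb(43_000_000, 4))  # Float32: 172.0
print(weights_size_mb(43_000_000, 2))  # Float16: 86.0
```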

IoU and pixel accuracy were measured on 512 images from the COCO dataset. The ground truth labels were extracted from the panoptic segmentation annotations and transformed to semantic segmentation masks. Input images were resized so that the smaller edge equals 448, then center-cropped.
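For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed on a pair of label masks. It is an illustration, not the evaluation script used for the table: the averaging convention (mean over classes that appear in the prediction or the ground truth, for a single image) is an assumption, and the function names are hypothetical.

```python
def _pixel_pairs(pred, truth):
    """Flatten two equally sized 2D label masks into (pred, truth) pairs."""
    return [(p, t) for pred_row, truth_row in zip(pred, truth)
            for p, t in zip(pred_row, truth_row)]


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    pairs = _pixel_pairs(pred, truth)
    return sum(p == t for p, t in pairs) / len(pairs)


def mean_iou(pred, truth):
    """Mean of per-class intersection-over-union (averaging is an assumption)."""
    pairs = _pixel_pairs(pred, truth)
    classes = {label for pair in pairs for label in pair}
    ious = []
    for c in classes:
        intersection = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        ious.append(intersection / union)  # union > 0: c occurs in some pair
    return sum(ious) / len(ious)


pred = [[0, 0, 1],
        [0, 1, 1]]
truth = [[0, 0, 1],
         [0, 0, 1]]
print(pixel_accuracy(pred, truth))  # 5 of 6 pixels agree
print(mean_iou(pred, truth))        # mean of 3/4 (class 0) and 2/3 (class 1)
```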

Inference time

The following results refer to DETRResnet50SemanticSegmentationF16. The compute units for MacBook Pro (M1 Max) were manually selected to "CPU and Neural Engine".

| Device | OS | Inference time (ms) | Dominant compute unit |
|---|---|---|---|
| iPhone 15 Pro Max | 17.5 | 40 | Neural Engine |
| MacBook Pro (M1 Max) | 14.5 | 43 | Neural Engine |
| iPhone 12 Pro Max | 18.0 | 52 | Neural Engine |
| MacBook Pro (M3 Max) | 15.0 | 29 | Neural Engine |
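To get a rough feel for what these latencies mean for live video, they can be converted to an upper bound on frames per second. This sketch assumes one inference at a time and ignores pre- and post-processing, so real throughput in an app will be lower; `frames_per_second` is a hypothetical helper.

```python
def frames_per_second(latency_ms):
    """Upper bound on throughput if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


# Inference times (ms) taken from the table above.
latencies_ms = {
    "iPhone 15 Pro Max": 40,
    "MacBook Pro (M1 Max)": 43,
    "iPhone 12 Pro Max": 52,
    "MacBook Pro (M3 Max)": 29,
}

for device, ms in latencies_ms.items():
    print(f"{device}: up to {frames_per_second(ms):.1f} fps")
```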

Download

Install huggingface-cli

brew install huggingface-cli

To download one of the .mlpackage folders to the models directory:

huggingface-cli download \
  --local-dir models --local-dir-use-symlinks False \
  apple/coreml-detr-semantic-segmentation \
  --include "DETRResnet50SemanticSegmentationF16.mlpackage/*"

To download everything, omit the --include argument. This will retrieve the float32 and float16 variants, as well as quantized versions of the float16 variant.
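If you prefer to script the download, the same result can likely be achieved from Python with the `huggingface_hub` package (`pip install huggingface_hub`). This is a sketch rather than an official recipe: `include_pattern` and `download_variant` are hypothetical helpers, and `allow_patterns` plays the role of the CLI's `--include` argument.

```python
def include_pattern(variant):
    """Glob matching every file inside one .mlpackage folder."""
    return f"{variant}.mlpackage/*"


def download_variant(variant, local_dir="models"):
    """Download a single model variant into local_dir and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="apple/coreml-detr-semantic-segmentation",
        allow_patterns=[include_pattern(variant)],
        local_dir=local_dir,
    )
```

Calling `download_variant("DETRResnet50SemanticSegmentationF16")` should place the float16 package under `models/`. Passing `allow_patterns=None` to `snapshot_download` instead downloads every file in the repository.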

Integrate in Swift apps

The huggingface/coreml-examples repository contains sample Swift code for coreml-detr-semantic-segmentation and other models. See the instructions there to build the demo app, which shows how to use the model in your own Swift apps.