VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
Abstract
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
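The voxel partitioning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `group_points_into_voxels`, its parameters, the x, y, z axis order, and the dictionary output are all assumptions made for illustration. It clips the cloud to a range, assigns each point to an equally spaced voxel, and randomly keeps at most `max_points` points per voxel.

```python
import numpy as np


def group_points_into_voxels(points, voxel_size, range_min, range_max,
                             max_points, seed=0):
    """Group LiDAR points into equally spaced voxels (illustrative sketch).

    points: (N, 4) array with columns x, y, z, reflectance.
    Returns a dict mapping an integer voxel index (ix, iy, iz) to an array
    of at most `max_points` points that fall inside that voxel.
    """
    rng = np.random.default_rng(seed)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    range_min = np.asarray(range_min, dtype=np.float64)
    range_max = np.asarray(range_max, dtype=np.float64)

    # Keep only the points inside the clipped 3D range.
    xyz = points[:, :3]
    inside = np.all((xyz >= range_min) & (xyz < range_max), axis=1)
    points = points[inside]

    # Integer voxel index of every remaining point.
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)

    # Group point rows by voxel index; only non-empty voxels appear.
    unique_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    voxels = {}
    for voxel_id, coord in enumerate(unique_coords):
        idx = np.flatnonzero(inverse == voxel_id)
        # Randomly subsample voxels that hold more than max_points points.
        if idx.size > max_points:
            idx = rng.choice(idx, size=max_points, replace=False)
        voxels[tuple(int(c) for c in coord)] = points[idx]
    return voxels
```

Only non-empty voxels are stored, which reflects the sparsity of LiDAR data: most of the grid contains no points at all.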
Community
- Proposes VoxelNet, which performs feature extraction and object detection in a single stage. The point cloud is divided into 3D voxels, and the points in each voxel are transformed by a voxel feature encoding (VFE) layer. VoxelNet has three parts: a feature learning network, convolutional middle layers, and a region proposal network (RPN). The feature learning network partitions the point cloud uniformly into voxels, groups the points by voxel, and randomly samples a fixed maximum number of points from any voxel that contains more than that.
- VFE layer: each non-empty voxel is a set of 4D points (XYZ and reflectance). Each point is augmented (concatenated) with its offset from the centroid of the points in the voxel, then transformed into feature space by a shared fully connected network (FCN: linear, BN, ReLU). Element-wise max-pooling across the points (keeping the feature dimension) gives a locally aggregated feature, which is concatenated with each point embedding to form the output feature set. After a sequence of VFE layers, a final FCN and max-pool yield a single feature per voxel. The voxel features are represented as a sparse 4D tensor (a C-dimensional voxel embedding on a 3D voxelised grid). The convolutional middle layers (M-dimensional conv, here with M = 3) apply 3D convolution, BN, and ReLU sequentially, which expands the receptive field and adds context.
- RPN: three fully convolutional blocks (each a stack of conv layers with BN and ReLU). The block outputs are upsampled with deconvolutions and concatenated (similar to UNet) into a high-resolution feature map (half the voxel grid size), which is then mapped by convolutions to a probability score map and a regression map (standard RPN procedure). A GT bounding box is parameterised by its XYZ center, LWH size, and yaw about the Z axis; regression targets are defined relative to a matching anchor. The loss combines binary cross-entropy classification terms, normalised separately over positive and negative anchors, with a smooth L1 regression loss on positive anchors. For speed, the implementation keeps a voxel input buffer and a voxel coordinate buffer for the non-empty voxels, which makes lookups into the sparse tensor faster. Trained on the KITTI dataset, with separate networks for car detection and for pedestrian and cyclist detection (different hyperparameters for clipping the point cloud, partitioning the voxel grid, and network design).
- Data augmentation: collision-free perturbations of individual GT bounding boxes (yaw rotation and translation), global scaling (enlarging or shrinking the whole scene), and global yaw rotation of the entire scene. Compared against a hand-crafted (HC) baseline that uses hand-crafted features in place of the VFE layers. Reports better AP than MV (a multi-modal approach) for all three classes; evaluation is also done on the KITTI test set.
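The VFE step in the notes above can be sketched in NumPy as a forward pass for a single voxel. This is a simplified illustration, not the authors' code: batch normalisation and the zero-padding of voxels to a fixed point count are left out, the function names and weight arguments are hypothetical, and offsets are taken from the mean of the points in the voxel.

```python
import numpy as np


def augment_with_centroid_offsets(voxel_points):
    """(T, 4) points [x, y, z, reflectance] -> (T, 7) augmented points.

    Appends each point's offset from the centroid (mean) of the voxel's points.
    """
    centroid = voxel_points[:, :3].mean(axis=0)
    return np.hstack([voxel_points, voxel_points[:, :3] - centroid])


def vfe_layer(point_feats, weight, bias):
    """One VFE step; batch norm is omitted in this sketch.

    point_feats: (T, C_in), weight: (C_in, C_half), bias: (C_half,).
    Returns (T, 2 * C_half): point-wise features concatenated with the
    locally aggregated (max-pooled) feature.
    """
    # Shared linear layer + ReLU applied to every point independently.
    pointwise = np.maximum(point_feats @ weight + bias, 0.0)
    # Element-wise max over the points keeps the feature dimension.
    aggregated = pointwise.max(axis=0, keepdims=True)
    # Tile the aggregated feature and concatenate it to each point feature.
    tiled = np.repeat(aggregated, pointwise.shape[0], axis=0)
    return np.concatenate([pointwise, tiled], axis=1)


def voxel_feature(voxel_points, layers, final_weight, final_bias):
    """Stack VFE layers, then a final linear + ReLU and a max-pool over points.

    layers: list of (weight, bias) pairs, one per VFE layer.
    Returns a single feature vector for the voxel.
    """
    feats = augment_with_centroid_offsets(voxel_points)
    for weight, bias in layers:
        feats = vfe_layer(feats, weight, bias)
    feats = np.maximum(feats @ final_weight + final_bias, 0.0)
    return feats.max(axis=0)
```

Because every point passes through the same weights and the points are combined only by max-pooling, the resulting voxel feature does not depend on the order of the points in the voxel.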
From Apple.
Links: PapersWithCode, Unofficial GitHub