AdaOcc: Adaptive-Resolution Occupancy Prediction (2024)

Chao Chen1  Ruoyu Wang1,2  Yuliang Guo2  Cheng Zhao2  Xinyu Huang2  Chen Feng1  Liu Ren2
1New York University  2 BOSCH Research North America
https://github.com/ai4ce/Bosch-NYU-OccupancyNet/
Corresponding author: Liu Ren (liu.ren@us.bosch.com)

Abstract

Autonomous driving in complex urban scenarios requires 3D perception to be both comprehensive and precise. Traditional 3D perception methods focus on object detection, resulting in sparse representations that lack environmental detail. Recent approaches estimate 3D occupancy around vehicles for a more comprehensive scene representation. However, dense 3D occupancy prediction increases computational demands, challenging the balance between efficiency and resolution. High-resolution occupancy grids offer accuracy but demand substantial computational resources, while low-resolution grids are efficient but lack detail. To address this dilemma, we introduce AdaOcc, a novel adaptive-resolution, multi-modal prediction approach. Our method integrates object-centric 3D reconstruction and holistic occupancy prediction within a single framework, performing highly detailed and precise 3D reconstruction only in regions of interest (ROIs). These highly detailed 3D surfaces are represented as point clouds, so their precision is not constrained by the predefined grid resolution of the occupancy map. We conducted comprehensive experiments on the nuScenes dataset, demonstrating significant improvements over existing methods. In close-range scenarios, we surpass previous baselines by over 13% in IOU and over 40% in Hausdorff distance. In summary, AdaOcc offers a more versatile and effective framework for delivering accurate 3D semantic occupancy prediction across diverse driving scenarios.

[Figure 1]

1 Introduction

Accurate representation of surroundings is vital for autonomous driving decision-making. The required perception granularity varies by task: highways may need sparse but long-range views, while urban areas require dense, close-range detail. Finding a representation that enables safe navigation in all scenarios and adapts to dynamic changes in real time remains a significant challenge.

Various scene representations have emerged in autonomous driving research. The prototypical ones include uniform voxel-based representations [40, 44, 49, 1, 18], bounding box representations [9, 46, 41, 37, 13, 39], implicit representations [34, 32, 51], point-based representations [14, 33, 45], and other forms of representation [25, 28]. While object-centric representations using bounding boxes were traditionally popular, voxel-based representations have become more prevalent in recent years due to the rich information they offer in 3D semantic occupancy maps. Since each voxel contains both occupancy and semantic information, voxel-based representations provide a more comprehensive understanding of the scene. They include additional background descriptions and capture surface shapes with a certain level of granularity. Moreover, voxel-based representations are popular for their seamless integration with navigation and planning frameworks.

[Figure 2]

Despite the flexibility of 3D semantic occupancy maps in adjusting grid sizes, most existing methods produce relatively low-resolution occupancy grids (0.4 m or 0.5 m voxels) [44, 37, 36], limiting their applications to highway driving. In urban driving or parking scenarios, a higher-resolution representation is essential for precise vehicle maneuvering. As seen in Figure 2, the distance measured between the two cars can deviate by as much as 0.6 m when using 0.8 m versus 0.2 m voxels. However, increasing resolution leads to a cubic increase in computational complexity and GPU memory consumption.

To balance memory efficiency and perception accuracy, we propose two strategies:

Non-uniform resolution. In vehicle path planning, close-range elements are considered more important than far-range elements, and objects (e.g., cars, pedestrians) are prioritized over background elements (e.g., roads, sidewalks). Therefore, we propose a non-uniform resolution representation, focusing high-resolution prediction on close-range objects.

Multi-modal 3D representation. This design is motivated by two major drawbacks of the voxel grid representation. First, its accuracy is limited by voxel size, leading to GPU resource overload when aiming for finer granularity. Second, real-world scene sparsity often results in many unoccupied voxels, causing inefficient memory usage. To address these issues, we propose a multi-modal approach to 3D representation, incorporating outputs such as voxel grids, point clouds, and bounding boxes, rather than relying solely on voxel grids for 3D semantic occupancy (as shown on the left of Fig. 1). Among these, point clouds are particularly advantageous due to their high detail and independence from voxel size: they only indicate the presence of occupied elements and do not require increased memory usage, thus overcoming the limitations of voxel grids.

By employing the aforementioned strategies, we introduce AdaOcc, a multi-modal, adaptive-resolution semantic occupancy prediction approach. AdaOcc prioritizes precision in areas critical for autonomous driving decisions, i.e., close-range objects. Rather than uniformly allocating computational resources, AdaOcc employs fine-detailed reconstruction via point clouds for close-range objects while using coarser 3D voxel grid predictions of 3D semantic occupancy as a supplement. We draw inspiration from object-centric methods [42, 19, 41] that incorporate an Object Proposal Network (OPN) to identify the regions of interest (ROIs) in 3D space. Subsequently, we connect a 3D point cloud decoder [47] to generate detailed point clouds within these ROIs. The right panel of Fig. 1 demonstrates that AdaOcc's adaptive-resolution representation strikes a balance of high accuracy and efficiency for autonomous driving tasks.

To enhance the synergy between different modalities, we jointly train a shared backbone (a 2D bird's-eye view (BEV) [19] or 3D feature volume [40]), producing voxel grids, bounding boxes, and point clouds in a unified network architecture. Leveraging the multi-modal representation, our occupancy prediction model can be effectively trained using only coarse ground-truth occupancy data plus raw LiDAR points, while still being evaluated as high-resolution occupancy. AdaOcc is experimentally compared to previous methods on the nuScenes dataset in both close-range and long-range scenarios to demonstrate its effectiveness. Notably, the remarkable improvement in IOU (>13%) and Hausdorff distance (>40%) for close-range evaluations demonstrates our method's superior performance in capturing the details of object reconstruction and achieving precise object position estimation.

In summary, our contributions are as follows:

  1. We propose a multi-modal adaptive-resolution method, offering three output representations with high precision in critical regions while maintaining efficiency for real-time applications.

  2. We develop an effective joint training paradigm that boosts the synergy between the occupancy prediction and object folding branches.

  3. Our approach demonstrates superior accuracy on the nuScenes dataset, particularly excelling in close-range scenarios that require precise maneuvering.

2 Related Works

3D semantic occupancy prediction. 3D semantic occupancy prediction is rapidly evolving, playing a crucial role in achieving the precise perception necessary for safely navigating urban environments. Several pioneering works [1, 18] are designed for single-view image inputs, laying the foundation for dense geometry and semantic inference from a single perspective. In contrast, other methods [10, 44, 49, 26] utilize surround-view images to achieve a comprehensive 360-degree understanding of the environment. Among these, OpenOccupancy [40] provides a benchmark to evaluate occupancy prediction at the finest level, using a 0.2-meter voxel size. Its proposed method, CONet, was the first to practically realize occupancy prediction at a 0.2-meter voxel scale through a cascaded approach.

Since these approaches rely on uniformly sampled voxels, their precision is largely constrained by the total number of voxels a computing unit can afford. By focusing computational resources on target objects, AdaOcc achieves highly precise perception in critical regions while keeping the overall computation cost in check.

3D object detection from surround-view images. The landscape of camera-based surround-view 3D object detection in autonomous driving has seen significant advancements in unified framework design, as demonstrated by [22, 9, 46, 48]. Researchers have concentrated on transforming multiple perspective views into a unified 3D space within a single frame, as demonstrated by studies such as [42, 22, 9, 19, 13, 43]. This process can be categorized into two main approaches: (1) BEV-based methods [9, 19, 46, 8, 17, 13, 41, 37, 16], and (2) sparse query-based methods [42, 22, 20, 2, 43]. In comparison, BEV methods are considered more compatible with other 3D perception tasks demanding dense outputs, such as 3D occupancy prediction, depth estimation, and 3D scene reconstruction.

Inspired by [41, 37], AdaOcc further enhances 3D comprehension by integrating object detection, occupancy prediction, and object surface reconstruction into a unified framework. Our framework not only comprehensively represents the entire scene but also achieves high surface precision within object regions. While [41] employs a similar strategy by performing semantic occupancy prediction within certain ROIs, it still outputs 3D voxels on uniform grids. This approach continues to face the efficiency-precision dilemma in choosing the grid resolution, as seen in other occupancy prediction methods.

2D-3D encoding backbones. Within the realm of 2D-3D encoding backbones, two predominant methodologies emerge: transformer-based backbones [19, 24, 23, 29] and Lift-Splat-Shoot (LSS)-based backbones [30, 40, 18]. Transformer-based backbones typically create a query grid in 3D space, project these grid points onto 2D image planes, and then aggregate the extracted features back to the query grid using a deformable transformer [50]. Conversely, LSS-based backbones incorporate a depth probability prediction module that distributes 2D image features across 3D space according to estimated depth probabilities. Each approach offers distinct advantages. For our experiments, we chose BEVFormer [19] (transformer-based) and CONet [40] (LSS-based) as the baseline networks to underscore the capabilities of AdaOcc.

Multi-resolution 3D representations. Multi-resolution representations have gained considerable traction across various fields in computer graphics and geometric modeling, as evidenced by seminal works [15, 6, 5, 27]. Several approaches [4, 7, 12, 38] adopt a hierarchical method for shape reconstruction, starting with a preliminary low-resolution model that is progressively refined into a high-resolution output. Other methods [21, 35] extend hierarchical structures, such as octrees of implicit functions, to represent the radiance field for neural rendering, yet the granularity of the octree is predetermined by the depth map input. In contrast, both MDIF [3] and FoldingNet [47] offer representations of object shapes with adjustable levels of detail.

Within the field of occupancy prediction, CONet [40] pioneers a coarse-to-fine strategy, refining only the occupied areas of the coarse occupancy map to achieve the first practical 0.2-meter semantic occupancy prediction method. Building solely upon the coarse occupancy map of CONet, our approach significantly improves performance in terms of Hausdorff distance and reduces memory usage, as detailed in Section 4.

3 Methodology

Problem statement. We formulate our task as multi-modal, adaptive-resolution occupancy prediction. The input to the network is a set of surround-view images $\mathcal{I}=\{I_1, I_2, \ldots, I_N\}$, and the outputs of the network are in multiple modalities, including: (1) a 3D semantic occupancy map $M \in \mathbb{R}^{H \times W \times D}$, spanning $X \subset [x_{min}, x_{max}]$, $Y \subset [y_{min}, y_{max}]$, $Z \subset [z_{min}, z_{max}]$; (2) a set of bounding boxes, each represented by a translation $(x_i, y_i, z_i)$, a rotation $(qx_i, qy_i, qz_i, qw_i)$, and a size $(h_i, w_i, d_i)$; and (3) object shapes in point cloud format $\mathbb{R}^{N \times K \times 3}$, where we pick $K = 2500$.

By adaptive resolution, we mean a mixed-resolution occupancy map that combines fine resolution for objects with coarse resolution for everything else. The grid sizes for the occupancy map include 0.2 m, 0.4 m, 0.8 m, and so on. In this work, we define high resolution as a voxel size less than or equal to 0.2 m; otherwise, it is low resolution.
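For concreteness, the three output modalities can be pictured as a simple container; the following is a minimal PyTorch-style sketch in which the field names and tensor layouts are illustrative assumptions rather than a released interface.

```python
from dataclasses import dataclass
import torch

@dataclass
class AdaOccOutput:
    """Illustrative container for the multi-modal prediction (names assumed)."""
    occupancy: torch.Tensor      # (H, W, D) semantic class index per voxel (coarse grid)
    boxes: torch.Tensor          # (N, 10): translation (3) + quaternion (4) + size (3)
    object_points: torch.Tensor  # (N, K, 3) folded surface points, with K = 2500
```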

Architecture overview. Our approach is versatile, capable of integrating with either BEVFormer [19] or CONet [40], as depicted in Figure 3. It processes six surround-view input images $\mathcal{I}$ through a 2D-3D encoder. Specifically, images $\mathcal{I}_t$ captured at time $t$ are processed by a CNN to extract 2D image features. These features are subsequently projected into a 3D feature volume $F$ that facilitates semantic occupancy prediction, object detection, and object surface reconstruction. The BEV feature is considered a specific instance of the 3D feature volume.

[Figure 3]

3.1 Occupancy Decoder

Ego-centric occupancy perception is designed to create semantic occupancy maps of a fixed grid size in surround-view driving scenarios. This module is intended to provide a holistic understanding of the entire area, allowing for the use of a low-resolution occupancy decoder for greater efficiency. The surrounding occupancy labels, denoted by $V_t$, are predicted through:

$V_t = \mathrm{MLP}\left(F_{t-1}, F_t\right),$  (1)

where, using BEVFormer as the framework, the previous and current feature volumes ($F_{t-1}, F_t \in \mathbb{R}^{H \times W \times D \times C_{vox}}$, with $C_{vox}$ the feature dimension) are fed into an MLP-based voxel decoder to obtain the coarse 3D semantic occupancy prediction ($V_t \in \mathbb{R}^{H \times W \times D}$). The CONet variants (using CONet as the backbone) rely only on the current feature ($F_t$) to predict $V_t$ and then apply an additional step of attention-based occupancy refinement on top of the original occupancy prediction.
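As a reference for Eq. (1), a minimal PyTorch-style sketch of the MLP-based voxel decoder is shown below; the hidden width and the number of semantic classes are assumed values, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class VoxelMLPDecoder(nn.Module):
    """Minimal sketch of Eq. (1): an MLP maps concatenated per-voxel features from
    the previous and current feature volumes to semantic occupancy logits."""
    def __init__(self, c_vox: int = 256, num_classes: int = 17):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * c_vox, c_vox), nn.ReLU(inplace=True),
            nn.Linear(c_vox, num_classes),
        )

    def forward(self, f_prev: torch.Tensor, f_curr: torch.Tensor) -> torch.Tensor:
        # f_prev, f_curr: (H, W, D, C_vox); returns per-voxel class logits.
        logits = self.mlp(torch.cat([f_prev, f_curr], dim=-1))
        return logits  # argmax over the last dimension yields V_t
```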

3.2 3D Object Detector

The 3D object detector is designed to generate 3D object bounding boxes that facilitate object-centric shape reconstruction or can be directly used for downstream tasks. The predicted 3D object bounding box, denoted as $\hat{B}$, is defined as follows:

$\hat{B} = \arg\max_{B} P(\text{object} \mid B) \cdot P(B \mid I),$  (2)

where $P(\text{object} \mid B)$ is the probability that an object is present given the bounding box $B$, and $P(B \mid I)$ is the likelihood of the bounding box $B$ given the input image $I$. We follow the setup from DETR to regress 900 bounding boxes and compute their object classification scores. The accuracy of the 3D object detector is critical for detailed surface reconstruction, as it directly influences the quality of the results.
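Purely for illustration, the sketch below shows how the 900 regressed boxes could be filtered by their classification scores before being passed on; the threshold value and the flattened 10-parameter box layout are assumptions of this sketch.

```python
import torch

def select_boxes(box_params: torch.Tensor, class_logits: torch.Tensor,
                 score_thresh: float = 0.3):
    """Keep the query boxes whose classification confidence exceeds a threshold.
    box_params: (900, 10) regressed box parameters; class_logits: (900, num_classes)."""
    scores, labels = class_logits.sigmoid().max(dim=-1)
    keep = scores > score_thresh
    return box_params[keep], labels[keep], scores[keep]
```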

3.3 FoldingNet Decoder

The FoldingNet decoder processes 3D features and predictions, producing finely detailed surfaces for targeted objects. Initially, FoldingNet [47] utilizes PointNet [31] to encode an object's point cloud and then decodes the latent features into another point cloud at an arbitrarily selected resolution. Adapting this process to effectively leverage existing 3D features within our unified framework to directly output highly accurate surface points introduces distinct challenges. Moreover, driving scenarios often involve only partial observations of an object; the variability in partial visibility combined with inaccuracies in the predicted object boxes can further complicate the object-centric decoding process.

Box-Aligned Object Feature Aggregation. Leveraging our object proposal network, we can concentrate on reconstructing the point cloud of each object individually. We establish a regular sampling grid $G = \{p_i\}$, $p_i \in \mathbb{R}^3$, within each provided 3D bounding box. The sampling grid is initially created in the object coordinate frame, aligned with the three dimensions of the bounding box. Then $G$ is transformed into the ego-vehicle coordinate frame via the object pose $T$. We retrieve the set of feature vectors from the 3D feature volume $F$ using $G$ and apply max pooling over all sampled features to obtain an "object feature vector" $\mathbf{c}$. If a sample falls at a floating-point location, cubic interpolation is employed to retrieve the feature. The insight behind the sampling process is that the features within a bounding box encode the local surface shape of the object. The max pooling operation enhances the robustness of $\mathbf{c}$ to errors in bounding box prediction and to issues of partial visibility. The sampling process can be represented by the following equation:

$\mathbf{c} = \text{maxPooling}(\{F(T(p_i))\}), \ p_i \in G.$  (3)

During training, we use ground-truth bounding box poses to transform the 3D sampling grid, while at test time we use the predicted bounding box poses.
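A minimal PyTorch-style sketch of the box-aligned aggregation in Eq. (3) is given below. It uses trilinear `grid_sample` in place of the cubic interpolation mentioned above, and the axis ordering of the feature volume and the grid density `n` are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def box_object_feature(volume: torch.Tensor, T: torch.Tensor, box_size: torch.Tensor,
                       scene_min: torch.Tensor, scene_max: torch.Tensor,
                       n: int = 4) -> torch.Tensor:
    """Sketch of Eq. (3): sample a regular n^3 grid inside a 3D box, map it to the
    ego frame with the 4x4 object pose T, interpolate the feature volume, and
    max-pool into a single object feature vector c.
    volume: (1, C, D, H, W); box_size, scene_min, scene_max: (3,) in ego coordinates."""
    axes = [torch.linspace(-0.5, 0.5, n) * s for s in box_size]          # object frame
    grid_obj = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).view(-1, 3)
    pts = grid_obj @ T[:3, :3].T + T[:3, 3]                              # ego frame
    norm = 2 * (pts - scene_min) / (scene_max - scene_min) - 1           # to [-1, 1]
    grid = norm.view(1, -1, 1, 1, 3)                                     # (1, n^3, 1, 1, 3)
    feats = F.grid_sample(volume, grid, align_corners=True)              # (1, C, n^3, 1, 1)
    return feats.view(volume.shape[1], -1).max(dim=1).values             # c: (C,)
```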

Point cloud decoding. Once the feature encoding the object shape is retrieved, the object surface point cloud $\mathbf{P}$ is decoded as:

$\mathbf{P} = f_{\theta}(\mathbf{c}, \mathbf{g}),$  (4)

where $\mathbf{g}$ is the 2D sampling grid used by the FoldingNet decoder $f_{\theta}$, a multi-layer perceptron (MLP) that decodes the point cloud.
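For illustration, a single-stage sketch of the folding operation in Eq. (4) follows; the original FoldingNet uses two folding stages, and the layer widths here are assumed. A 50 x 50 grid yields the K = 2500 points used in our setup.

```python
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    """Minimal sketch of Eq. (4): concatenate the object code c with each 2D grid
    point in g and fold it into a 3D surface point with an MLP."""
    def __init__(self, code_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.fold = nn.Sequential(
            nn.Linear(code_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),
        )

    def forward(self, c: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # c: (code_dim,), grid: (K, 2) -> folded surface points P: (K, 3)
        x = torch.cat([c.expand(grid.shape[0], -1), grid], dim=-1)
        return self.fold(x)

# Example 2D sampling grid g: a 50 x 50 lattice gives K = 2500 folded points.
g = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 50),
                               torch.linspace(-1, 1, 50), indexing="ij"), dim=-1).view(-1, 2)
```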

3.4 Joint training and losses

In our approach, we address the challenge of object-centric occupancy prediction through the integration of three components. A joint training paradigm effectively enhances the synergy between the different modules. The effectiveness of the joint training is further validated via an ablation study in the supplementary materials.

Our joint training approach integrates a combination of losses from various modules: a semantic occupancy loss, an object detection loss, and a surface reconstruction loss. The semantic occupancy loss ($\mathcal{L}_{\mathrm{sem}}$), which utilizes focal loss, is designed for predicting semantic occupancy within a fixed grid size. For object detection, we employ the object detection loss ($\mathcal{L}_{\mathrm{det}}$), incorporating both a focal loss for classification and an L1 loss for the regression of bounding boxes. This loss function not only selects N valid boxes from a pool of candidates but also accurately estimates the position of each box simultaneously. Furthermore, the surface reconstruction loss ($\mathcal{L}_{\mathrm{surf}}$), using Chamfer loss, is applied to the surface reconstruction of foreground objects to ensure precise alignment between the predicted and actual object point clouds. These three loss functions collectively enhance the efficiency of our adaptive occupancy prediction framework. More detailed descriptions of these loss components can be found in the supplementary materials.
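The overall objective is a weighted combination of the three losses; since the text above does not fix the weights, the short sketch below assumes equal weighting purely for illustration.

```python
import torch

def joint_loss(l_sem: torch.Tensor, l_det: torch.Tensor, l_surf: torch.Tensor,
               w_sem: float = 1.0, w_det: float = 1.0, w_surf: float = 1.0) -> torch.Tensor:
    """Weighted sum of the semantic occupancy, detection, and surface losses."""
    return w_sem * l_sem + w_det * l_det + w_surf * l_surf
```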

4 Experiment

We conducted extensive experiments using the nuScenes dataset, including evaluations of occupancy prediction at both close and full ranges, as well as object detection. Because comparing voxelized ground truth with multi-modal output presents challenges, we converted our detailed object point clouds into an occupancy representation with a grid size of 0.2 m for evaluation purposes. This resolution matches the ground-truth voxel size in CONet and OpenOccupancy [40].
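This conversion from predicted object points to an evaluation grid is a straightforward voxelization; a minimal sketch is shown below, where the scene bounds are passed in explicitly and the function name is illustrative.

```python
import torch

def points_to_occupancy(points: torch.Tensor, scene_min: torch.Tensor,
                        scene_max: torch.Tensor, voxel_size: float = 0.2) -> torch.Tensor:
    """Drop each predicted surface point into a voxel grid of the given size so the
    point cloud output can be compared with voxelized ground truth.
    points: (P, 3) in the ego frame; returns a boolean occupancy grid."""
    dims = ((scene_max - scene_min) / voxel_size).round().long()
    idx = ((points - scene_min) / voxel_size).long()
    inside = ((idx >= 0) & (idx < dims)).all(dim=1)
    occ = torch.zeros(*dims.tolist(), dtype=torch.bool)
    occ[idx[inside, 0], idx[inside, 1], idx[inside, 2]] = True
    return occ
```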

4.1 Dataset

Our experiments use the nuScenes dataset to assess our object-centric occupancy prediction methods. In our experimental configuration, the ground-truth labels cover a bounded range from -50.0 to +50.0 meters in the x-direction, from -50.0 to +50.0 meters in the y-direction, and from -5.0 to 3.0 meters in the z-direction. Furthermore, to evaluate an approach's performance under various voxel resolutions, we partition the space into voxels with granularity settings of 0.2 meters, 0.4 meters, and 0.8 meters. Given the above experimental setup, we evaluate candidate methods within the defined spatial boundaries and at the different voxel resolutions. Note that the ground-truth 3D semantic occupancy comes from [40]. Our setup is similar to the OpenOccupancy benchmark [40], except that we retain very small object ground-truth (GT) boxes in both our training and validation sets.

4.2 Baselines

In this paper, we selected BEVFormer [19] and CONet [40] as state-of-the-art baselines for our evaluation. We aim to enhance these baselines with a multi-resolution representation to demonstrate the robustness and flexibility of our approach in improving occupancy prediction accuracy across various methods. BEVFormer, which utilizes object detection for scene representation, has proven its efficacy in occupancy prediction tasks [37]. We aim to enhance BEVFormer's accuracy through the adoption of a more granular representation method, specifically FoldingNet. Conversely, CONet employs a depth net for initial rough occupancy predictions, subsequently refined using a transformer. Although its refinement process is conceptually similar to our approach, it lacks flexibility and does not efficiently utilize GPU resources, as it refines all occupancy grids uniformly. Our method focuses on refining predictions particularly for nearby objects, optimizing the overall resource expenditure.

[Figure 4]

4.3 Evaluation metric

We assess the performance of our object detection and occupancy prediction approaches separately. For object detection, we follow the exact same procedure as OccNet [37], as shown in Table 4. For occupancy prediction, the overall occupancy is evaluated by Intersection over Union (IOU), and per-class occupancy is evaluated by mean Intersection over Union (mIOU). In addition, we apply the Hausdorff distance [11] for a detailed assessment of the accuracy of object shapes. It evaluates the similarity between the predicted object point cloud and the ground-truth object point cloud by measuring the maximum distance between them after bipartite matching. The Hausdorff distance is expected to more adequately describe the precision of object shapes, not only their positional accuracy. Detailed descriptions of the evaluation metrics are provided in the supplementary materials. For 3D voxel grids, we use the centers of voxels as the points for calculating the Hausdorff distance. In our experiments, we calculate the Hausdorff distance against the ground-truth voxel grid using the output point cloud for AdaOcc, and using the finest voxel grid each baseline can produce for the baseline methods.

Table 1: Close-range occupancy prediction.

Method | Train Grid Size (m) | Hausdorff Distance (m) (↓) | IOU (↑) | Eval. Time (s) | GPU Usage (GB)
BEVFormer | 0.4 | 7.868 | 0.125 | 0.443 | 5.186
BEVFormer | 0.8 | - | 0.122 | 0.272 | 4.166
CONet | 0.2 | 10.816 | 0.243 | 0.383 | 16.310
CONet | 0.4 | - | 0.192 | 0.367 | 9.476
CONet | 0.8 | - | 0.170 | 0.292 | 8.768
AdaOcc_B | 0.4 | 4.099 (+47.9%) | 0.142 (+13.6%) | 0.455 | 5.250
AdaOcc_B | 0.8 | - | 0.140 (+14.75%) | 0.315 | 4.171
AdaOcc_C | 0.2 | 5.967 (+44.8%) | 0.246 (+1.2%) | 1.348 | 18.274
AdaOcc_C | 0.4 | - | 0.197 (+2.6%) | 1.239 | 11.010
AdaOcc_C | 0.8 | - | 0.193 (+13.5%) | 0.770 | 10.314

4.4 Close-range occupancy prediction

This section evaluates close-range occupancy predictions, crucial for narrow path navigation and parking in autonomous driving.

Without loss of generality, we define close range as spanning from -12.8 to +12.8 meters in both the x and y directions, and from -5.0 to 3.0 meters in the z-direction. Depth estimation within 30 meters has been shown to be accurate [39], allowing for precise predictions and enhanced model performance in this range. We train the AdaOcc model with the two backbones from BEVFormer and CONet, named AdaOcc_B and AdaOcc_C, respectively. We also use three grid resolutions (0.2 m, 0.4 m, and 0.8 m) in training and assess all models by upscaling the grid resolution to 0.2 m for Intersection over Union (IOU) calculations. The focus in close-range settings is on obstacle avoidance, and therefore mIOU is not included. Furthermore, IOU results for BEVFormer are only provided at 0.4 m and 0.8 m training grid sizes due to memory constraints on GPUs such as the RTX 3090.

Close-range IOU on BEVFormer. Tab. 1 shows that AdaOcc based on BEVFormer consistently demonstrates an IOU improvement of at least 13% for training grid sizes of 0.4 m and 0.8 m. Additionally, since BEVFormer already includes an object detection framework, incorporating object surface reconstruction does not add significant overhead in either evaluation time or GPU usage. This ensures that using the folding method on BEVFormer provides a lightweight and efficient way to improve close-range IOU.

Close-range IOU on CONet. For CONet, AdaOcc still shows some improvement. However, AdaOcc_C demonstrates significant improvement at a coarse training grid size (0.8 m), whereas its performance is not as good at a fine training grid size (0.2 m). This is because CONet's inherent coarse-to-fine refinement mechanism already provides substantial improvement in occupancy prediction; adding extra object detection and object surface reconstruction at a fine resolution does not significantly enhance the original results. Additionally, CONet is a resource-intensive method, as it refines every coarsely occupied cell. While object surface reconstruction brings some benefits, the extra object detection head and FoldingNet head make the gains of AdaOcc at training grid sizes of 0.4 m and 0.8 m less worthwhile.

Qualitative analysis. Figure 4 demonstrates that classic occupancy prediction methods tend to connect different objects of the same class, which inevitably harms close-range path planning performance. In contrast, the overall reconstruction quality of AdaOcc for each object is remarkably better than the other baselines within a given range. The remedy is to perform object detection and reconstruct each object at close range. More qualitative comparisons are included in the supplementary materials.

Table 2: Full-range occupancy prediction.

Method | Train Grid Size (m) | IOU (↑) | mIOU (↑)
BEVFormer | 0.4 | 0.122 | 0.072
BEVFormer | 0.8 | 0.089 | 0.053
CONet | 0.2 | 0.156 | 0.095
CONet | 0.4 | 0.136 | 0.082
CONet | 0.8 | 0.120 | 0.074
AdaOcc_B | 0.4 | 0.128 (+4.9%) | 0.089 (+23.6%)
AdaOcc_B | 0.8 | 0.093 (+4.4%) | 0.089 (+67.9%)
AdaOcc_C | 0.2 | 0.157 (+0.6%) | 0.093 (-2.1%)
AdaOcc_C | 0.4 | 0.136 (+0.0%) | 0.085 (+3.6%)
AdaOcc_C | 0.8 | 0.122 (+1.6%) | 0.079 (+6.8%)

Analysis of Hausdorff distance. As discussed, AdaOcc based on either baseline achieves the best average Hausdorff distance. However, we have observed that the misinterpretation of bounding box positions and categories, especially for small objects such as humans and bicycles, can significantly impact the accuracy of close-range object occupancy prediction. This explains why we generate the occupancy map by integrating the voxelized object occupancy onto the coarse occupancy map, instead of replacing one with the other. We believe that voxelized object occupancy is a good reference and complement to the coarse occupancy map, especially when the coarse occupancy map misinterprets an occupied cell as unoccupied.

Table 3: Per-object-class IOU.

Method | Overall | Barrier | Bicycle | Bus | Car | Construction | Motorcycle | Pedestrian | Traffic Cone | Trailer | Truck
BEVFormer | 0.053 | 0.014 | 0.035 | 0.077 | 0.095 | 0.050 | 0.067 | 0.074 | 0.037 | 0.027 | 0.061
CONet | 0.074 | 0.020 | 0.073 | 0.118 | 0.126 | 0.057 | 0.067 | 0.097 | 0.054 | 0.056 | 0.072
AdaOcc_B | 0.089 | 0.029 | 0.082 | 0.144 | 0.159 | 0.087 | 0.103 | 0.117 | 0.066 | 0.061 | 0.044
AdaOcc_C | 0.079 | 0.028 | 0.071 | 0.141 | 0.128 | 0.077 | 0.072 | 0.049 | 0.041 | 0.105 | 0.080

Evaluation time for different voxel grid sizes. BEVFormer has the shortest evaluation time because it predicts every occupancy grid equally and coarsely. In contrast, CONet requires the most time for evaluation due to its two-stage process: it first computes a coarse occupancy map similar to BEVFormer's and subsequently refines every occupied grid cell. As CONet itself lacks an object detection pipeline, its cost for object surface reconstruction is higher than BEVFormer's, since it requires both an object detection head and a FoldingNet head for object point cloud reconstruction.

4.5 3D object detection

The 3D detection task coarsely regresses the location of foreground objects via 3D box regression. In this section, we show that the joint training of occupancy prediction, 3D detection, and surface reconstruction can improve detector performance relative to all three models (BEVNet, VoxNet, and OccNet) [37], in terms of mAP, NDS, and other metrics. We built our object detection pipeline from BEVNet and CONet, and its performance is very similar to that of the three baseline methods.

Table 4: 3D object detection results.

Method | mAP↑ | NDS↑ | mAOE↓ | mAVE↓ | mAAE↓ | mATE↓ | mASE↓
BEVFormer | 0.271 | 0.390 | 0.578 | 0.541 | 0.211 | 0.835 | 0.293
VoxNet | 0.277 | 0.387 | 0.586 | 0.614 | 0.203 | 0.828 | 0.285
OccNet | 0.276 | 0.390 | 0.585 | 0.570 | 0.190 | 0.842 | 0.285
AdaOcc_B | 0.273 | 0.391 | 0.577 | 0.574 | 0.222 | 0.808 | 0.295
AdaOcc_C | 0.272 | 0.390 | 0.579 | 0.532 | 0.209 | 0.833 | 0.291

4.6 Full-range occupancy prediction

The adaptive-resolution approach represents a computationally efficient and flexible strategy that strikes a balance between accuracy and efficiency. It generates a full-range adaptive-resolution occupancy map by incorporating close-range voxelized object occupancy onto a full-range coarse occupancy map. Similar to Sec. 4.4, we tested AdaOcc with grid resolutions of 0.2 m, 0.4 m, and 0.8 m.
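A minimal sketch of this fusion step is given below: the coarse map is upsampled to the fine grid (nearest-neighbor upsampling is an assumption of this sketch) and the voxelized object occupancy is written on top of it rather than replacing it.

```python
import torch

def fuse_adaptive_occupancy(coarse: torch.Tensor, fine_obj: torch.Tensor,
                            upscale: int) -> torch.Tensor:
    """coarse: (H, W, D) class labels on the coarse grid; fine_obj: (uH, uW, uD)
    voxelized close-range object labels on the fine grid, assumed aligned to the
    same full-range extent with 0 marking empty cells."""
    fine = coarse.repeat_interleave(upscale, 0) \
                 .repeat_interleave(upscale, 1) \
                 .repeat_interleave(upscale, 2)
    mask = fine_obj > 0
    fine[mask] = fine_obj[mask]   # overlay object occupancy onto the coarse map
    return fine
```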

Full-range IOU and mIOU. Tab. 2 illustrates that AdaOcc based on either baseline outperforms its baseline model at grid resolutions of 0.2 m, 0.4 m, and 0.8 m. However, we observe that, globally, AdaOcc based on BEVFormer demonstrates more significant improvements. This is consistent with the findings in Section 4.4: CONet already performs some refinement at training grid sizes of 0.2 m and 0.4 m, so additional object surface reconstruction does not yield significant improvements.

Per-object-class evaluation in IOU. In our occupancy prediction evaluation, we prioritize object segments over static scenes, particularly for classes such as pedestrians and vehicles, as their movements are unpredictable. Similarly, Table 3 shows that AdaOcc surpasses the other baselines for all 10 object classes.

Examining the mAP and NDS of BEVFormer and AdaOcc_B shown in Tab. 4, we find that AdaOcc slightly outperforms BEVFormer, the foundation upon which AdaOcc is built. This finding demonstrates that the object surface reconstruction task not only enhances the accuracy of occupancy prediction but also enriches the learned features for object detection.

5 Conclusion

In conclusion, our proposed approach offers a multi-modal adaptive-resolution method, providing three output representations with highly precise surfaces in critical regions, while ensuring efficiency for real-time applications. Additionally, we develop an effective joint training paradigm to enhance synergy between the occupancy and folding networks, resulting in improved near-range occupancy prediction performance. Our methods exhibit superior accuracy on the nuScenes dataset, highlighting a focus on detailed surface reconstruction.

Limitation. We observe that the joint training method does not significantly improve the quality of object detection. Further investigation into the interaction between the coarse occupancy prediction and the object surface reconstruction is needed to improve the consistency between the different representations. In addition, the efficiency of the unified framework can be further optimized via more advanced parallelized designs.

References

  • [1]Anh-Quan Cao and Raoul de Charette.Monoscene: Monocular 3d semantic scene completion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
  • [2]Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Chang Huang, and Wenyu Liu.Polar parametrization for vision-based surround-view 3d detection.arXiv preprint arXiv:2206.10965, 2022.
  • [3]Zhang Chen, Yinda Zhang, Kyle Genova, Sean Fanello, Sofien Bouaziz, Christian Häne, Ruofei Du, Cem Keskin, Thomas Funkhouser, and Danhang Tang.Multiresolution deep implicit functions for 3d shape representation.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13087–13096, 2021.
  • [4]Angela Dai, Charles RuizhongtaiQi, and Matthias Nießner.Shape completion using 3d-encoder-predictor cnns and shape synthesis.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5868–5877, 2017.
  • [5]Leila DeFloriani and Paola Magillo.Multiresolution mesh representation: Models and data structures.Tutorials on Multiresolution in Geometric Modelling: Summer School Lecture Notes, pages 363–417, 2002.
  • [6]Igor Guskov, Wim Sweldens, and Peter Schröder.Multiresolution signal processing for meshes.In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 325–334, 1999.
  • [7]Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or.Meshcnn: a network with an edge.ACM Transactions on Graphics (ToG), 38(4):1–12, 2019.
  • [8]Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, and Jing Shao.Fast-bev: Towards real-time on-vehicle bird’s-eye view perception.arXiv preprint arXiv:2301.07870, 2023.
  • [9]Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du.Bevdet: High-performance multi-camera 3d object detection in bird-eye-view.arXiv preprint arXiv:2112.11790, 2021.
  • [10]Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu.Tri-perspective view for vision-based 3d semantic occupancy prediction.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9223–9232, 2023.
  • [11]Alireza Javaheri, Catarina Brites, Fernando Pereira, and João Ascenso.A generalized hausdorff distance based quality metric for point cloud geometry.In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6. IEEE, 2020.
  • [12]Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker.Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1251–1261, 2020.
  • [13]Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, and Yu-Gang Jiang.Polarformer: Multi-camera 3d object detection with polar transformer.In Proceedings of the AAAI Conference on Artificial Intelligence, volume37, pages 1042–1050, 2023.
  • [14]Maik Keller, Damien Lefloch, Martin Lambers, Shahram Izadi, Tim Weyrich, and Andreas Kolb.Real-time 3d reconstruction in dynamic scenes using point-based fusion.In 2013 International Conference on 3D Vision-3DV 2013, pages 1–8. IEEE, 2013.
  • [15]Leif Kobbelt, Swen Campagna, Jens Vorsatz, and Hans-Peter Seidel.Interactive multi-resolution modeling on arbitrary meshes.In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 105–114, 1998.
  • [16]Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren, and Xiaoming Liu.Seabird: Segmentation in bird’s view with dice loss improves monocular 3d detection of large objects.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [17]Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li.Bevdepth: Acquisition of reliable depth for multi-view 3d object detection.In Proceedings of the AAAI Conference on Artificial Intelligence, volume37, pages 1477–1485, 2023.
  • [18]Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, JoseM Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar.Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9087–9098, 2023.
  • [19]Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai.Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers.In European conference on computer vision, pages 1–18. Springer, 2022.
  • [20]Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su.Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022.
  • [21]Lingjie Liu, Jiatao Gu, Kyaw ZawLin, Tat-Seng Chua, and Christian Theobalt.Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
  • [22]Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun.Petr: Position embedding transformation for multi-view 3d object detection.In European Conference on Computer Vision, pages 531–548. Springer, 2022.
  • [23]Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Aqi Gao, Tiancai Wang, and Xiangyu Zhang.Petrv2: A unified framework for 3d perception from multi-camera images.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3262–3272, 2023.
  • [24]Zhipeng Luo, Changqing Zhou, Gongjie Zhang, and Shijian Lu.Detr4d: Direct multi-view 3d object detection with sparse attention.arXiv preprint arXiv:2212.07849, 2022.
  • [25]Hidenobu Matsuki, Riku Murai, PaulHJ Kelly, and AndrewJ Davison.Gaussian splatting slam.arXiv preprint arXiv:2312.06741, 2023.
  • [26]Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou.Occdepth: A depth-aware method for 3d semantic scene completion.arXiv preprint arXiv:2302.13540, 2023.
  • [27]Takashi Michikawa, Takashi Kanai, Masahiro Fujita, and Hiroaki Chiyokura.Multiresolution interpolation meshes.In Proceedings Ninth Pacific Conference on Computer Graphics and Applications. Pacific Graphics 2001, pages 60–69. IEEE, 2001.
  • [28]Tomoyuki Mukasa, Jiu Xu, and Bjorn Stenger.3d scene mesh from cnn depth predictions and sparse monocular slam.In Proceedings of the IEEE international conference on computer vision workshops, pages 921–928, 2017.
  • [29]Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan.Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection.arXiv preprint arXiv:2210.02443, 2022.
  • [30]Jonah Philion and Sanja Fidler.Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
  • [31]CharlesR Qi, Hao Su, Kaichun Mo, and LeonidasJ Guibas.Pointnet: Deep learning on point sets for 3d classification and segmentation.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • [32]Antoni Rosinol, JohnJ Leonard, and Luca Carlone.Nerf-slam: Real-time dense monocular slam with neural radiance fields.In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3437–3444. IEEE, 2023.
  • [33]Thomas Schops, Torsten Sattler, and Marc Pollefeys.Bad slam: Bundle adjusted direct rgb-d slam.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 134–144, 2019.
  • [34]Edgar Sucar, Shikun Liu, Joseph Ortiz, and AndrewJ Davison.imap: Implicit mapping and positioning in real-time.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6229–6238, 2021.
  • [35]Su Sun, Cheng Zhao, Yuliang Guo, Ruoyu Wang, Xinyu Huang, YingjieVictor Chen, and Liu Ren.Behind the veil: Enhanced indoor 3d scene reconstruction with occluded surfaces completion.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [36]Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao.Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving.Advances in Neural Information Processing Systems, 36, 2024.
  • [37]Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, etal.Scene as occupancy.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.
  • [38]Hao Wang, Nadav Schor, Ruizhen Hu, Haibin Huang, Daniel Cohen-Or, and Hui Huang.Global-to-local generative model for 3d shapes.ACM Transactions on Graphics (TOG), 37(6):1–10, 2018.
  • [39]Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang.Exploring object-centric temporal modeling for efficient multi-view 3d object detection.arXiv preprint arXiv:2303.11926, 2023.
  • [40]Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang.Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception.arXiv preprint arXiv:2303.03991, 2023.
  • [41]Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang.Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation.arXiv preprint arXiv:2306.10013, 2023.
  • [42]Yue Wang, VitorCampagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon.Detr3d: 3d object detection from multi-view images via 3d-to-2d queries.In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  • [43]Zitian Wang, Zehao Huang, Jiahui Fu, Naiyan Wang, and Si Liu.Object as query: Equipping any 2d object detector with 3d detection ability.arXiv preprint arXiv:2301.02364, 2023.
  • [44]Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu.Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
  • [45]Thomas Whelan, Stefan Leutenegger, Renato Salas-Moreno, Ben Glocker, and Andrew Davison.Elasticfusion: Dense slam without a pose graph.In Robotics: Science and Systems. Robotics: Science and Systems, 2015.
  • [46]Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and JoseM Alvarez.M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation.arXiv preprint arXiv:2204.05088, 2022.
  • [47]Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian.Foldingnet: Point cloud auto-encoder via deep grid deformation.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 206–215, 2018.
  • [48]Yunpeng Zhang, Wenzhao Zheng, Zheng Zhu, Guan Huang, Jiwen Lu, and Jie Zhou.A simple baseline for multi-camera 3d object detection.In AAAI Conference, volume37, pages 3507–3515, 2023.
  • [49]Yunpeng Zhang, Zheng Zhu, and Dalong Du.Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction.arXiv preprint arXiv:2304.05316, 2023.
  • [50]Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai.Deformable detr: Deformable transformers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020.
  • [51]Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, MartinR Oswald, and Marc Pollefeys.Nice-slam: Neural implicit scalable encoding for slam.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786–12796, 2022.

Appendix

In this supplementary material, we provide more ablation studies and additional visualizations for AdaOcc_B that could not fit in the paper. In particular, we include (1) more details about the losses, (2) more details about the evaluation metrics, (3) an ablation study examining the impact of the number of bounding boxes on occupancy mapping, (4) an ablation study on the number of folded points for each bounding box, and (5) more occupancy visualizations for all baselines.

A Loss Details

Semantic occupancy loss. We apply focal loss as the semantic occupancy loss. Focal loss is a classification loss designed to tackle problems such as class imbalance and hard data samples.

$\mathcal{L}_{\text{sem}}(M)=\sum_{x_{\text{min}}}^{x_{\text{max}}}\sum_{y_{\text{min}}}^{y_{\text{max}}}\sum_{z_{\text{min}}}^{z_{\text{max}}}-\alpha(1-p(x,y,z))^{\beta}\log(p(x,y,z)),$  (5)

where $p(\cdot)$ represents the predicted probability of the correct class, and $\alpha$ and $\beta$ are hyperparameters that balance well-classified and hard examples.
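A per-voxel sketch of Eq. (5) in PyTorch follows; the $\alpha$ and $\beta$ values shown are common defaults and are assumed here, not taken from our training configuration.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, beta: float = 2.0) -> torch.Tensor:
    """Sketch of Eq. (5): logits are (V, C) per-voxel class scores over the grid,
    target is (V,) ground-truth class indices."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of the correct class
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** beta * log_pt).sum()
```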

Object detection loss. We use a loss function similar to that of DETR3D [42] for the object detection task. The object detection loss includes a focal loss and an L1 loss for the classification and regression of the bounding boxes, respectively. The focal loss is similar to Eq. 5, minimizing the discrepancy between the predicted and ground-truth bounding box classes, and the L1 loss minimizes the difference between the $N$ predicted bounding box parameters $B_{\mathrm{pred}}$ and the $N$ corresponding ground-truth bounding box parameters $B_{\mathrm{GT}}$:

$\mathcal{L}_{\text{reg}}=\frac{1}{N}\sum_{i=1}^{N}\left|B_{\mathrm{pred}}-B_{\mathrm{GT}}\right|.$  (6)

Surface reconstruction loss. We use the Chamfer distance [47] as the surface reconstruction loss. It is a geometric distance-based loss function used for measuring the dissimilarity between two point sets. In the context of our work, it quantifies the discrepancy between the reconstructed surface points and the ground-truth points. The Chamfer loss $\mathcal{L}_{\text{surf}}$ is defined as:

$\mathcal{L}_{\text{surf}}=\sum_{\mathbf{x}\in\mathbf{X}}\min_{\mathbf{y}\in\mathbf{Y}}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}+\sum_{\mathbf{y}\in\mathbf{Y}}\min_{\mathbf{x}\in\mathbf{X}}\|\mathbf{y}-\mathbf{x}\|_{2}^{2},$  (7)

where $\mathbf{X}$ represents the reconstructed points and $\mathbf{Y}$ denotes the ground-truth point cloud.
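For reference, Eq. (7) can be computed directly from pairwise distances; a brute-force sketch is shown below (efficient KD-tree or CUDA implementations are typically used in practice).

```python
import torch

def chamfer_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (7): symmetric sum of squared nearest-neighbor distances
    between reconstructed points x (P, 3) and ground-truth points y (Q, 3)."""
    d2 = torch.cdist(x, y) ** 2                 # (P, Q) squared pairwise distances
    return d2.min(dim=1).values.sum() + d2.min(dim=0).values.sum()
```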

B Evaluation metrics details

Intersection over Union (IOU): IOU quantifies the overlap between the predicted and ground truth regions. It computes the ratio of the intersection to the union of these regions, providing an indicator of how well the prediction aligns with the actual data.

Mathematically, IOU is defined as:

$IOU=\frac{\text{Area of Intersection}}{\text{Area of Union}}$  (8)

Mean Intersection over Union (mIOU): mIOU is the mean value of IOU computed across multiple instances or classes. It provides a holistic measure of the method’s accuracy across various categories.

Mathematically, mIOU is defined as:

$mIOU=\frac{1}{N}\sum_{i=1}^{N}IOU_{i},$  (9)

where $N$ is the number of instances or classes and $IOU_{i}$ represents the IOU value for the $i$-th instance or class.
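A minimal sketch of how Eqs. (8) and (9) are computed over voxel grids follows; treating class 0 as "empty" and skipping it in the mean is an assumption of this sketch.

```python
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Binary occupancy IOU between boolean grids of the same shape (Eq. (8))."""
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return inter / union.clamp(min=1)

def miou(pred_cls: torch.Tensor, gt_cls: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Mean of per-class IOUs (Eq. (9)) over the non-empty classes."""
    ious = [iou(pred_cls == c, gt_cls == c) for c in range(1, num_classes)]
    return torch.stack(ious).mean()
```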

However, IoU and mIoU are primarily used to measure the accuracy of object detection or segmentation models, quantifying the overlap between two regions. While they can indicate the disparity between the detected object’s position and the true object’s position, they do not provide detailed information about shape. Therefore, even with a high IoU, it does not guarantee the accuracy of the detected object’s shape.

The Hausdorff distance, on the other hand, offers a deeper metric for assessing the accuracy of object shapes. It evaluates the similarity between the evaluated object point cloud and the ground-truth object point cloud by measuring the maximum of the nearest-neighbor distances between them. This implies that the Hausdorff distance can better describe the precision of object shapes, not just their positional accuracy. Hence, the Hausdorff distance is highly useful in tasks such as shape reconstruction and point cloud matching, where a comprehensive consideration of object shape accuracy is necessary. It is defined as:

$\mathcal{L}_{\text{Hausdorff}}=\max\left\{\max_{\mathbf{x}\in X}\min_{\mathbf{y}\in Y}\|\mathbf{x}-\mathbf{y}\|,\ \max_{\mathbf{y}\in Y}\min_{\mathbf{x}\in X}\|\mathbf{y}-\mathbf{x}\|\right\}$  (10)

where $\mathbf{X}$ represents the reconstructed point cloud and $\mathbf{Y}$ denotes the ground-truth point cloud. A detailed illustration is provided in Fig. I.
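A brute-force sketch of Eq. (10) over two point sets (e.g., folded surface points versus ground-truth points, or voxel centers) is given below.

```python
import torch

def hausdorff(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (10): symmetric Hausdorff distance between x (P, 3) and y (Q, 3)."""
    d = torch.cdist(x, y)                        # (P, Q) pairwise Euclidean distances
    return torch.max(d.min(dim=1).values.max(), d.min(dim=0).values.max())
```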

[Figure I]

C Impact of the number of bounding boxes

We utilize the DETR head introduced in [19] for the object detection task. The number of bounding boxes is a hyperparameter, which we explore in the ablation study presented in Tab. I.

Table I: Impact of the number of bounding boxes.

# of boxes | 10 | 20 | 30 | 40 | 0
IOU | 0.313 | 0.312 | 0.312 | 0.310 | 0.309
mIOU | 0.1590 | 0.1532 | 0.1488 | 0.1475 | 0.1540
Time (s) | 0.1201 | 0.1298 | 0.1352 | 0.1603 | 0.1109

D Study on different methods of Compressing 3D into 2D

We investigate various methods of compressing 3D features into 2D, which is required specifically for the AdaOcc tasks based on the CONet backbone, such as object detection and folding. The size of the 3D features varies with the size of the bounding boxes, so a fixed attention-based weighted aggregation does not work, as the input dimension is not fixed. Therefore, we selected four different aggregation methods: max pooling, average pooling, global mean (taking the average over all dimensions), and global max (taking the max over all dimensions). Note that we do not consider summation to be a viable method, as the varying size of the 3D features would make the summed 2D features heavily dependent on the size of the 3D features, which is undesirable. It turns out that the max pooling layer performs best, similar to the aggregation used in PointNet, as shown in Tab. II.

Table II: Aggregation methods for compressing 3D features into 2D.

Method | Max pooling | Avg pooling | Global-mean | Global-max
IOU | 0.090 | 0.087 | 0.088 | 0.088
mIOU | 0.053 | 0.051 | 0.049 | 0.047

E Study on the number of folded points per box

We employ FoldingNet [47] to reconstruct the surfaces of objects. While the number of folded points is fixed during training, the number of folded points at test time can be adjusted based on resolution requirements. We conduct an ablation study to analyze how the number of folded points at test time affects the IOU and mIOU results, shown in Tab. III.

Table III: Number of folded points per box.

Fold size | 900 | 2500 | 10000 | 40000
IOU | 0.312 | 0.313 | 0.313 | 0.313
mIOU | 0.1530 | 0.1532 | 0.1537 | 0.1541
Time (s) | 0.1414 | 0.14042 | 0.1419 | 0.1444

F More occupancy visualizations

Additional occupancy maps for various scenes are displayed below. AdaOcc demonstrates improved separation of the bounding boxes.

[Supplementary figures: additional occupancy map visualizations]
