Article

Gradient Enhancement Techniques and Motion Consistency Constraints for Moving Object Segmentation in 3D LiDAR Point Clouds

1 School of Electronics, Peking University, Beijing 100871, China
2 School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(2), 195; https://doi.org/10.3390/rs17020195
Submission received: 28 October 2024 / Revised: 26 December 2024 / Accepted: 5 January 2025 / Published: 8 January 2025

Abstract
The ability to segment moving objects from three-dimensional (3D) LiDAR scans is critical to advancing autonomous driving technology, facilitating core tasks like localization, collision avoidance, and path planning. In this paper, we introduce a novel deep neural network designed to enhance the performance of 3D LiDAR point cloud moving object segmentation (MOS) through the integration of image gradient information and the principle of motion consistency. Our method processes sequential range images, employing depth pixel difference convolution (DPDC) to improve the efficacy of dilated convolutions, thus boosting spatial information extraction from range images. Additionally, we incorporate Bayesian filtering to impose posterior constraints on predictions, enhancing the accuracy of motion segmentation. To handle the issue of uneven object scales in range images, we develop a novel edge-aware loss function and use a progressive training strategy to further boost performance. Our method is validated on the SemanticKITTI-based LiDAR MOS benchmark, where it significantly outperforms current state-of-the-art (SOTA) methods, all while working directly on two-dimensional (2D) range images without requiring mapping.

Graphical Abstract

1. Introduction

With the rapid development of autonomous driving technology, accurately distinguishing between static and dynamic objects in the environment is crucial for achieving safe and reliable autonomous navigation [1,2]. This capability not only supports critical tasks like collision avoidance [3] and path planning [4], but also significantly enhances performance in areas such as pose estimation [5], sensor data registration [6], and simultaneous localization and mapping (SLAM) [7]. Consequently, the ability to reliably segment moving objects from sensor data in real time has become an indispensable requirement for various autonomous mobile systems [8]. Among the many sensors used in autonomous driving, three-dimensional (3D) Light Detection and Ranging (LiDAR) has emerged as a core tool for environmental perception due to its precise distance measurements and wide field of view (FOV) [9]. Unlike cameras, LiDAR maintains stable measurement accuracy under challenging lighting conditions [10]. However, LiDAR data present their own set of challenges. For instance, the resolution, distance scale, and edge information in point clouds are often incomplete [11], which complicates fine-grained segmentation. Moreover, segmenting moving objects usually requires continuous data across multiple frames to determine object trajectories. Accurately distinguishing between static and dynamic objects benefits significantly from leveraging motion information across multiple frames, as this allows for more reliable trajectory analysis and enhances segmentation accuracy, which is particularly important in dynamic environments [12].
Recent advances in deep learning-based LiDAR semantic segmentation have led to significant progress, yet most studies focus on recognizing object categories such as vehicles, pedestrians, signs, roads, and buildings [13,14,15,16], without adequately addressing the distinction between real-time dynamic and static objects in the environment (e.g., differentiating between moving and parked vehicles) [17]. Research on moving object segmentation (MOS) from LiDAR point clouds remains relatively limited. Existing studies fall into two main categories: point-based deep neural networks, which directly process raw 3D point cloud data [18,19], and projection-based neural networks, which convert point clouds into two-dimensional range images or other projection formats [20,21]. While point-based networks are effective at extracting features from unordered point clouds, their high computational cost makes them difficult to scale to large scenes. Although voxel-based methods reduce computational complexity, they also suffer from information loss due to voxelization [22]. On the other hand, projection methods are computationally lighter and more suited to real-time applications but often experience reduced segmentation accuracy due to edge blurring caused by back-projection [23].
In 3D LiDAR projection methods, the resulting images differ fundamentally from traditional optical RGB images; their pixels encode depth or intensity information as well as the contours and edge features of objects in 3D space. These characteristics provide a unique advantage for accurately describing geometric forms. However, current research has not sufficiently explored feature extraction techniques tailored to range images, particularly methods for capturing and leveraging gradient information effectively, representing a critical gap. To bridge this gap, it is necessary to fully exploit the spatial information inherent in range images to improve MOS performance. In real-world scenarios, the dynamic behavior of moving objects often exhibits temporal consistency, i.e., the motion states of objects tend to correlate over time. This property is widely utilized in tasks such as localization and navigation for robotic systems and autonomous driving. By assuming motion consistency, model predictions across frames can be made smoother, effectively reducing uncertainties at object boundaries caused by occlusions or blurriness in range images. Furthermore, the loss of edge details, a key factor impacting segmentation accuracy, is particularly pronounced when point cloud data are converted into range images. The inherent sparsity of points and back-projection limitations can exacerbate this issue, making it essential to develop methods that improve boundary precision, thereby preserving edge details when handling objects of varying scales. In addition, while detection methods are fundamental for identifying object-level positions in a scene, segmentation offers pixel-level precision that is essential for tasks such as pose estimation and environmental perception. This work specifically focuses on improving segmentation to provide detailed spatial information, complementing detection approaches.
To address the challenges outlined above, this paper proposes a novel MOS method based on 3D LiDAR point clouds. Unlike traditional methods that perform single-frame predictions, our model improves single-frame segmentation results by leveraging gradient information within range images and integrating the principle of object motion consistency through a Bayesian back-end fusion approach. This enables the effective incorporation and smoothing of multi-frame information while maintaining the simplicity of single-frame prediction frameworks. Inspired by the dual-branch design introduced in MotionSeg3D, we adopt a similar network structure to guide the processing of range images and residual images while focusing on new enhancements. Specifically, we convert 3D point cloud data into two-dimensional range images, which act as a lightweight intermediate representation, significantly reducing computational overhead. Next, we introduce a novel Depth Pixel Differential Convolution (DPDC) method to extract gradient information from range images, thereby enhancing spatial feature representation. This approach enables the effective capturing of object edge details and mitigates the common issue of edge blurring in traditional range image methods. To further enhance segmentation performance, we incorporate Bayesian filtering, which smooths the predicted results by constraining them with prior information. This is particularly beneficial in mitigating inaccuracies due to occlusions, especially when objects are near boundaries or overlap with others. Moreover, to handle the issue of uneven object scales in LiDAR data, we propose an edge-aware loss function and employ a progressive training paradigm to ensure high accuracy across objects of varying sizes. In the experimental section, we validate our approach extensively using the SemanticKITTI dataset. Compared to existing range image-based LiDAR MOS methods, our approach demonstrates improved accuracy. This method focuses on frame-based segmentation, enhanced by leveraging motion information across multiple frames through temporal smoothing, without relying on explicit object tracking. Furthermore, through comprehensive comparisons with various state-of-the-art (SOTA) methods, the proposed method showcases superior performance in handling MOS tasks.
The main contribution of our work can be summarized as follows:
  • A novel DPDC based on image gradient priors is proposed to enhance spatial representation in range images, improving LiDAR segmentation accuracy;
  • Bayesian filtering is introduced for the first time to apply motion constraints, smoothing multi-frame predictions and reducing edge blurring caused by moving object occlusion;
  • A novel edge-aware loss function is designed, incorporating edge information into the loss function for the first time, with a progressive training strategy to improve model performance;
  • In the latest SemanticKITTI-MOS benchmark, the proposed method shows consistent performance improvements over existing techniques and outperforms SOTA methods.

2. Related Work

2.1. Range Image-Based Point Cloud Segmentation

Researchers have proposed various innovative solutions for MOS in 3D LiDAR data, with this paper focusing on MOS methods that rely solely on LiDAR sensors, eliminating the need for mapping. Early studies primarily addressed semantic segmentation in LiDAR data [24,25,26], as semantic segmentation is often considered a foundational step for MOS. However, mainstream semantic segmentation methods can only identify potentially movable objects, such as vehicles or pedestrians, without distinguishing between truly moving objects and stationary ones, such as parked vehicles or static structures. The sparse nature of LiDAR distance measurements and the uneven distribution of point cloud data add to the complexity of accurate MOS.
Early representative work included the approach introduced by Wang et al. [27], which focuses on segmenting potentially movable objects like vehicles, pedestrians, and cyclists in urban environments. Ruchti et al. extended this by predicting the likelihood of potentially movable objects using a learning-based approach [28]. Chen et al.’s Semantic LiDAR SLAM detects and filters moving objects by comparing online observations with a semantic map [29], utilizing semantic segmentation results [24]. Other approaches include direct point cloud processing, such as Yoon et al.’s heuristic algorithms [30] that detect moving objects through LiDAR scan residuals and free space checks, and the method proposed by Shi et al. [31], which predicts moving objects from sequential point clouds. While these methods show promise, they often suffer from high computational costs and complexity, making them impractical for resource-constrained autonomous driving systems.
Recently, range projection-based semantic segmentation methods have shown potential for addressing real-time MOS challenges [32]. However, many of these methods still rely on processing multiple point cloud frames to improve accuracy, which increases computational overhead and resource demands. Deep learning-based end-to-end methods, such as point cloud scene flow, differentiate between moving and stationary objects by estimating motion vectors between consecutive scans [33,34,35]. Nevertheless, these approaches often struggle with slow-moving objects and sensor noise and face challenges when handling large-scale point cloud data in real time. Bounding-box-based methods for motion detection rely heavily on prior information, limiting their generalization capability [36,37,38,39,40].
A noteworthy end-to-end MOS approach, LMNet [20], incorporates residual images and leverages existing semantic segmentation techniques [24,25,26]. By incorporating temporal features through residual images generated with pose priors, LMNet improves segmentation performance. However, the use of residual images increases computational load, and their reliance on sensor state prediction can be problematic, as object motion is inherently a relative positioning task. Therefore, there is room for improvement in methods that avoid over-reliance on residual images. A notable approach that operates directly on range images without requiring map building is MotionSeg3D [21]. It employs a dual-branch structure to process spatial and temporal information from sequential LiDAR scans, combining these with a motion-guided attention module, and refines segmentation results using lightweight sparse convolutions. This coarse-to-fine architecture effectively mitigates edge blurring. Recent advancements in MOS include InsMOS and MF-MOS. InsMOS incorporates 4D sparse convolutions for instance-level segmentation, outperforming existing methods on both the SemanticKITTI and Apollo datasets, demonstrating its robustness across different environments [41]. MF-MOS leverages both residual maps for motion extraction and range images for semantic guidance, achieving state-of-the-art performance on the SemanticKITTI dataset [42].
In summary, despite recent advancements, challenges remain in effectively extracting pixel depth and gradient information from range images while minimizing edge blurring. Significant improvements are still possible in existing LiDAR-based MOS methods that leverage range image data.

2.2. Gradient Enhancement Techniques

Classical convolutional neural networks (CNNs) typically rely on deep stacking to extract high-level semantic features when processing range images [43]. However, these convolutional kernels often struggle to capture fine pixel-level differences, particularly when dealing with range images generated from sparse point clouds. Range images not only represent the structure of objects in three-dimensional space but also exhibit unique depth and gradient characteristics. Inspired by gradient enhancement techniques from traditional image processing tasks, we seek to combine the strengths of classical operators with modern CNNs to enhance performance.
Classical edge detection operators, such as Canny [44], Sobel [45], and Local Binary Patterns (LBP) [46], are effective at capturing edge information in images. However, these methods have inherent limitations: they rely on shallow structures, lack the capacity to represent complex edge features, and function as static operators that cannot adapt to varying scene demands. As a result, they perform poorly in complex tasks. Although CNNs can extract gradient information, doing so typically requires stacking a large number of convolutional layers.
Recently, differential convolution has emerged as a promising method for extracting image gradient information. The concept of differential convolution originates from LBP, where LBP encodes local pixel differences within a patch into decimal values for texture classification. Building on the success of CNNs in computer vision, Xu et al. introduced the Local Binary Convolution (LBC) algorithm, which combines non-linear activation functions to significantly enhance the edge detection capabilities of a network [47]. Building on this, Yu et al. proposed Central Difference Convolution (CDC) [48], which directly encodes pixel differences into fully learnable convolutional kernels, making it more adaptable to various scenes and tasks. Further advancements, such as Cross Central Difference Convolution (CCDC) [49] and Pixel Difference Convolution (PDC) [50], have been developed to capture finer gradient information and enhance network performance.
To the best of our knowledge, this is the first time gradient enhancement has been introduced for range image segmentation. In this work, we employ difference convolution to capture pixel gradient information, thereby improving both segmentation accuracy and edge clarity. By more effectively extracting edge details from range images, we expect to achieve significant improvements in MOS tasks.

3. Method

3.1. Preliminaries

3.1.1. LiDAR Point Cloud Input Representation

For the LiDAR point cloud input representation, we employ a projection-based method that transforms unstructured 3D point clouds into 2D space to generate dense Range View (RV) images [20,21,25]. Specifically, each LiDAR point $p = (x, y, z)$ is mapped to image coordinates $(u, v)$ through a spherical projection $\Pi: \mathbb{R}^3 \to \mathbb{R}^2$, covering a full $360^{\circ}$ FOV. The projection is defined by Equation (1).
$$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} \frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] w \\ \left[1 - \left(\arcsin(z\, r^{-1}) + f_{\mathrm{up}}\right) f^{-1}\right] h \end{pmatrix},$$
where $(u, v)$ are the 2D image coordinates, $w$ and $h$ are the width and height of the projected image, $f$ denotes the vertical FOV, defined as $f = |f_{\mathrm{down}}| + |f_{\mathrm{up}}|$, and $r = \sqrt{x^2 + y^2 + z^2}$ represents the range of each point. $|f_{\mathrm{up}}|$ and $|f_{\mathrm{down}}|$ represent the angular bounds of the vertical field of view (FOV) of the LiDAR sensor, with $|f_{\mathrm{up}}|$ corresponding to the upper bound and $|f_{\mathrm{down}}|$ to the lower bound. During the projection process, not only are the 3D coordinates $(x, y, z)$ considered, but the intensity $i$ and range $r$ of each point are also stored in separate channels, resulting in a $[w \times h \times 5]$ image enriched with geometric information. Figure 1 depicts the visualization results of the 2D projected image. The corresponding car and motorcycle are labeled in the RGB image and point clouds with red and yellow arrows, respectively. Additionally, in the range image, different colors correspond to different pixel-value distributions; in particular, for the depth channel, lighter colors indicate greater depth, i.e., a larger actual distance.
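For concreteness, the following NumPy sketch illustrates this spherical projection. It assumes a Velodyne-style scan stored as an $N \times 4$ array of $(x, y, z, i)$ values and an HDL-64E-like vertical FOV of roughly $+3^{\circ}$ to $-25^{\circ}$; the image size, the FOV bounds, and the RangeNet++-style convention of offsetting the pitch by $|f_{\mathrm{down}}|$ are implementation assumptions rather than values fixed by this paper.

```python
import numpy as np

def project_to_range_image(points, h=64, w=2048,
                           f_up=np.radians(3.0), f_down=np.radians(-25.0)):
    """Spherical projection of an (N, 4) LiDAR scan [x, y, z, intensity] into an
    (h, w, 5) range image with channels [x, y, z, intensity, range] (cf. Eq. 1)."""
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    r = np.sqrt(x**2 + y**2 + z**2)
    f = abs(f_down) + abs(f_up)                           # vertical FOV

    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w        # horizontal coordinate
    # common RangeNet++-style convention: offset the pitch by |f_down|
    v = (1.0 - (np.arcsin(z / r) + abs(f_down)) / f) * h  # vertical coordinate

    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    image = np.zeros((h, w, 5), dtype=np.float32)
    order = np.argsort(r)[::-1]                           # far points first, near ones overwrite
    image[v[order], u[order]] = np.stack(
        [x[order], y[order], z[order], intensity[order], r[order]], axis=-1)
    return image
```

Writing far points first lets nearer returns overwrite them, a common tie-breaking choice when several points fall into the same pixel.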
This projection-based representation method allows the generated range image to effectively capture the spatial structure of the 3D point cloud while being processed by standard 2D convolutional neural networks, thereby significantly reducing the computational burden of directly handling point cloud data. Compared to traditional 3D processing methods, this lightweight representation facilitates efficient feature extraction and analysis by leveraging well-established 2D convolutional architectures. The range image representation has been widely adopted across various tasks, retaining accurate point cloud information and enabling easy adaptation to new network architectures, offering strong generalization potential and applicability. Additionally, we introduce the use of residual images to enhance the feature representation of the input data. The primary goal is to incorporate temporal information into the network. The key steps involve transforming the past LiDAR scans to the current coordinate system using relative poses, which are estimated through a LiDAR-based SLAM system. This ensures that consecutive LiDAR frames are spatially aligned. Next, both the current frame and the transformed past frame are projected into range images, and the residual between the current and past frames is calculated through a normalized process, as shown in Equation (2).
$$d_{k,i}^{l} = \frac{\left| r_i - r_{i}^{k \to l} \right|}{r_i},$$
where $r_i$ represents the range value of point $p_i$ from the current frame at image coordinates $(u_i, v_i)$, and $r_{i}^{k \to l}$ denotes the corresponding range value from the transformed scan at the same image pixel location. While the projection of 3D LiDAR data to 2D facilitates efficient feature extraction and processing, it inherently leads to a reduction in dimensional information. Specifically, the mapping of depth and elevation into a 2D space may obscure some vertical structure and introduce potential errors in object representation, particularly in complex environments. To mitigate these issues, we leverage advanced feature enhancement techniques, such as DPDC and Bayesian filtering, to refine the segmented results and improve temporal consistency. These strategies help to recover some of the lost information and minimize errors due to projection inaccuracies.
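A compact sketch of this residual computation is given below, assuming the past scan has already been transformed into the current coordinate system and projected with the same function as above; the valid-pixel mask and the epsilon guard are implementation assumptions.

```python
import numpy as np

def residual_image(range_cur, range_past_transformed, eps=1e-6):
    """Normalized range residual between the current frame and a past frame that has
    been re-projected into the current view (cf. Eq. 2). Inputs are (h, w) range
    channels; pixels without a valid return are assumed to be 0."""
    valid = (range_cur > 0) & (range_past_transformed > 0)
    residual = np.zeros_like(range_cur)
    residual[valid] = np.abs(range_cur[valid] - range_past_transformed[valid]) \
        / (range_cur[valid] + eps)
    return residual
```

Here, range_past_transformed would be obtained by applying the relative pose to the past scan (see Equation (3) in Section 3.2) and re-projecting it into the current view.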

3.1.2. Convolution Methods and Improvement Strategies

Before presenting the proposed method, we first introduce three convolutional operations designed to enhance the representation of range images: dilated convolution, difference convolution, and meta-kernel convolution. Dilated convolution [25,51,52], which maintains the same number of parameters while expanding the receptive field, has been demonstrated to improve performance in semantic segmentation tasks. Difference convolution [50], on the other hand, is utilized for extracting gradient information by incorporating the strengths of traditional edge detection operators—such as computing differences between adjacent pixels to capture edge features—while retaining the powerful learning capabilities of deep neural networks. Meta-Kernel convolution is specifically tailored for range images, where it assigns convolutional kernel weights based on the spatial distance of each pixel, further optimizing performance through depth-wise and point-wise convolutions from MobileNet to accelerate computation [53].
Figure 2 visualizes the workflows of these three convolutional operations. Green dots represent feature map pixels, while the other colors indicate the neighboring pixels involved in the convolution process. In difference convolution, the black arrow vectors indicate subtraction operations, with the base of the vector representing the pixel being subtracted and the arrowhead pointing to the pixel from which it is subtracted. These convolutional operations are key inspirations and form the foundation for the improvements introduced in this paper. Specifically, we design the DPDC, a novel convolution operation that integrates and extends the ideas from difference convolution and dilated convolution. The DPDC module balances global receptive field extraction and local gradient enhancement, effectively capturing edge details and spatial structures within range images. This design is presented in detail in Section 3.3.

3.2. Network Overview

An overview of our approach is illustrated in Figure 3. Assume the system captures the raw LiDAR point cloud at the current time step, denoted as $S_0$. Through relative pose transformations, the point cloud positions from previous frames are mapped into the coordinate system of the current frame, ensuring that all point cloud frames are aligned within the same coordinate system. The primary goal of this alignment operation is to enable the network to observe the point cloud changes over multiple time steps, thereby capturing spatio-temporal information, i.e., the dynamic changes in the point cloud across both spatial and temporal dimensions. The specific transformation formula is given by Equation (3).
$$S_{j \to 0} = T_{0}^{j} \cdot S_{j},$$
where $T_{0}^{j}$ represents the pose transformation matrix from frame $j$ to the current frame, $S_j$ is the point cloud data from frame $j$, and $S_0$ is the current frame. Through this transformation, the point cloud data from all frames are aligned to the coordinate system of the current frame.
In Figure 3, $F_0$ represents the range image projected from the original LiDAR scan $S_0$, while $TF_N$ denotes the transformed frame, which is generated by projecting the aligned point clouds from the previous $N$ frames. $N$ refers to the number of frames in the motion constraint window. The proposed network architecture adopts a dual-branch structure to connect the range image and residual image. The details of the network modules are illustrated in the figure, where Block II is a residual module incorporating dilated convolutions. The choice of using three Depth Pixel Difference Convolutions (DPDCs) was based on prior experiments, where we tested configurations with 1, 3, and 5 DPDCs. We found that one DPDC resulted in minimal improvement, and the difference between 3 and 5 DPDCs was negligible. Therefore, three DPDCs were selected for a balance between performance and computational efficiency. Further experiments with different configurations will be explored in future work. The parameters $k$ and $d$ represent the kernel size and dilation rate, respectively. The notation Ⓒ refers to concatenation, ⊕ denotes element-wise addition, and ⊗ represents the dot product operation. The DPDC modules in Block III and Block V were designed by us and will be analyzed in detail in the following sections. When the network receives a new input frame $F_0$, it is processed together with the previous $N-1$ frames, generating the output and confidence map from the Bayesian filtering module, as shown in Figure 3. This output is then combined with the predictions of the previous $N-1$ frames and further refined by Bayesian filtering to improve the prediction of the current frame while updating the confidence of the previous $N-1$ frames. Once the prediction for the current frame is completed and a new scan $S_1$ arrives, the network discards the input $TF_{N+1}$ and its corresponding Bayesian confidence, forming a local motion constraint window that allows the network to focus on local temporal information. Finally, the output is produced by a back-projection module, utilizing sparse convolution. In addition, our segmentation framework complements detection methods by refining spatial details and enabling accurate pose estimation, particularly in scenarios where pixel-level precision is critical for interpreting dynamic environments.

3.3. Depth Pixel Difference Convolution

As discussed in Section 3.1.2, to integrate the global feature extraction capabilities of dilated convolutions with the local detail enhancement provided by differential convolutions, we designed a module named DPDC, as shown in Figure 4. The DPDC module consists of five parallel convolution layers, including one dilated convolution and four variations of differential convolutions. Figure 4 (left) provides a detailed illustration of the overall design principle of the DPDC. The DPDC module consists of five parallel convolution branches: Center Difference Convolution (CDC), Radial Difference Convolution (RDC), Straight Difference Convolution (SDC), Diagonal Difference Convolution (DDC), and standard dilated convolution. These branches collaboratively extract gradient information from different spatial directions to enhance feature representation. Figure 4 (right) shows the design details of the proposed SDC and DDC: (1) SDC: SDC focuses on extracting gradients along the horizontal and vertical axes. By calculating pixel differences between neighboring positions (e.g., left-right and top-bottom), SDC emphasizes edge details and structural information in straight-line orientations. This operation effectively encodes variations in the horizontal and vertical directions, which are critical for detecting object boundaries and spatial discontinuities. (2) DDC: DDC extends the concept of gradient extraction to diagonal directions, such as top-left to bottom-right and top-right to bottom-left. By computing differences along diagonal axes, DDC enhances the network’s ability to capture diagonal edges and fine-grained texture features, which are often overlooked by standard convolutions.
The outputs of these five convolution branches are combined through element-wise summation, which aggregates the features captured from different directions and receptive fields to form the final output. Specifically, CDC emphasizes local center differences, RDC captures radial relationships, and SDC and DDC extract gradient features along the straight and diagonal directions, respectively. By combining these outputs, the DPDC module achieves a balance between global spatial structure and local gradient details, enabling the network to effectively capture object boundaries, contours, and subtle variations in range images. The core idea of DPDC is to leverage the dilated convolution to capture global differences across pixel regions while simultaneously computing local information through pixel-wise differences, thereby balancing global structure and local details in the depth image. The DPDC is a key innovation that enhances spatial feature representation by addressing challenges such as edge blurring and noise in LiDAR range images. This module ensures precise gradient extraction, providing a robust foundation for segmentation within each frame.
Within the DPDC module, in addition to employing classic CDC and RDC, we also introduced SDC and DDC. These are designed to integrate traditional local descriptors into the convolution operations. SDC computes pixel gradients along the horizontal or vertical directions by selecting pixel pairs along straight lines, while DDC computes diagonal gradients through differences between diagonal pixel pairs. After training, the learned convolutional kernel weights are equivalently rearranged and directly applied to the unchanged input features. Both SDC and DDC explicitly encode gradient priors into the convolution layers, utilizing gradient information from the image to improve the generalization capability of the model. The output of DPDC is a feature map that integrates multi-directional gradient information and dilated global features. This combined feature map enhances the spatial representation of range images and provides rich edge-aware information, which is critical for improving the accuracy of moving object segmentation, particularly in complex and occluded scenarios.
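To make the branch structure concrete, the following PyTorch sketch implements a reduced DPDC-style block with only the dilated and central-difference branches; the RDC, SDC, and DDC branches would be added analogously with their own pixel pairings. The channel sizes, kernel size, and dilation rate here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Central difference convolution: sum_k w_k * (x_k - x_center), rewritten as a
    vanilla convolution minus the kernel-sum applied to the centre pixel."""
    def __init__(self, c_in, c_out, k=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=dilation * (k // 2),
                              dilation=dilation, bias=False)

    def forward(self, x):
        vanilla = self.conv(x)
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # [c_out, c_in, 1, 1]
        return vanilla - F.conv2d(x, w_sum)                     # subtract centre-pixel term


class DPDCBlock(nn.Module):
    """Simplified DPDC: a dilated branch plus difference branches, summed element-wise.
    Only the CDC branch is spelled out here."""
    def __init__(self, c_in, c_out, dilation=2):
        super().__init__()
        self.dilated = nn.Conv2d(c_in, c_out, 3, padding=dilation,
                                 dilation=dilation, bias=False)
        self.cdc = CentralDifferenceConv2d(c_in, c_out)

    def forward(self, x):
        return self.dilated(x) + self.cdc(x)


# usage on a dummy 5-channel range image tensor
x = torch.randn(1, 5, 64, 2048)
out = DPDCBlock(5, 32)(x)    # -> torch.Size([1, 32, 64, 2048])
```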
However, while the parallel deployment of five convolution layers enhances feature extraction, it also increases the parameter count and inference time of the model. To address this, we exploit the additive property of convolutions during inference to reparameterize these branches into a simplified convolution operation, reducing computational cost during inference. This approach preserves the diversity of convolution operations during training while significantly lowering the computational complexity at inference. Specifically, the multi-branch structure of DPDC is maintained during training, but in the inference phase, these multi-branch convolution kernels are merged into a single convolution kernel through reparameterization techniques to minimize computational overhead. The goal of reparameterization is to aggregate the weights of each convolution branch from the training phase into a unified weight matrix, resulting in a single convolution kernel, as expressed in Equation (4).
$$w_{\mathrm{final}} = \sum_{i} w_i,$$
where $w_i$ represents the convolution weights of each branch during training, and $w_{\mathrm{final}}$ is the consolidated single convolution kernel for inference. In this way, only one convolution operation is required during inference, avoiding the computational burden of multiple branches and significantly improving inference efficiency.
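The identity in Equation (4) follows from the linearity of convolution, as the toy check below illustrates. It assumes that every branch has already been rewritten as a plain convolution with a common kernel size and dilation, which is how difference convolutions are typically folded back into ordinary kernels; the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# five hypothetical per-branch kernels, already expressed as plain 3x3 convolutions
branch_weights = [torch.randn(32, 16, 3, 3) for _ in range(5)]
x = torch.randn(1, 16, 64, 512)

# training-time view: run every branch and sum the outputs
multi_branch = sum(F.conv2d(x, w, padding=1) for w in branch_weights)

# inference-time view (Eq. 4): merge the kernels once, run a single convolution
w_final = torch.stack(branch_weights).sum(dim=0)
single_kernel = F.conv2d(x, w_final, padding=1)

print(torch.allclose(multi_branch, single_kernel, atol=1e-4))   # True up to float error
```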
The key advantage of DPDC lies in its ability to capture both global structures and local details in depth images. Dilated convolutions expand the receptive field, enabling the model to capture global relationships between objects in point cloud projections, while differential convolutions emphasize local variations between pixels, enhancing the detection of object boundaries, contours, and dynamic changes. This makes DPDC highly effective in processing depth images with sparsity and discontinuity, allowing it to recover global features while preserving edge details, providing a robust solution for complex scene analysis. In addition, the reparameterization technique not only reduces the computational cost during inference, but also effectively mitigates any potential redundancy in feature representations by aggregating complementary gradient features from the multi-branch convolutions into a unified kernel.

3.4. Motion Consistency Constraints

Our proposed network takes 2D range images as input and outputs a prediction for each pixel, indicating whether it belongs to a moving or stationary object, along with its associated confidence. To improve temporal consistency, we combine the confidences from previous frames using Bayesian filtering to recursively update the prediction of the current frame. This method effectively integrates information across multiple frames, smoothing noise and enhancing the detection of slow-moving objects. At each time step $t$, for the pixel $p_i$, we estimate its motion state $m_i$ based on previous observations $z_{0:t-1}$, along with the current observation $z_t$. The posterior probability is recursively given by the Bayesian rule in Equation (5).
$$p(m_i \mid z_{0:t}) = \frac{p(z_t \mid m_i)\, p(m_i \mid z_{0:t-1})}{p(z_t \mid z_{0:t-1})},$$
where $p(m_i \mid z_{0:t-1})$ is the prior estimate based on previous frames, $p(z_t \mid m_i)$ is the likelihood of the current observation given the motion state, and $p(z_t \mid z_{0:t-1})$ is the normalization factor ensuring that the posterior sums to 1. This recursive formula allows us to progressively update the motion state probability for each pixel at the current time step by incorporating new observations from multiple frames.
To simplify the recursive computation, the Bayesian filtering update is often expressed in log-odds form. The log-odds is defined as Equation (6).
$$l(m_i \mid z_{0:t}) = \log \frac{p(m_i = 1 \mid z_{0:t})}{p(m_i = 0 \mid z_{0:t})}.$$
The recursive update of the log-odds can then be written as Equation (7).
$$l(m_i \mid z_{0:t}) = l(m_i \mid z_{0:t-1}) + l(m_i \mid z_t) - l(m_i),$$
where $l(m_i \mid z_{0:t-1})$ is the log-odds from the previous time step, $l(m_i \mid z_t)$ is the log-odds derived from the likelihood of the current frame, and $l(m_i)$ is the prior log-odds from the initial state. This log-odds formulation enables the recursive fusion of confidence estimates across frames, which simplifies computations and improves numerical stability. For each pixel $p_i$, the network outputs a confidence score $\xi_{t,i}$, representing the probability that the pixel belongs to a moving object, i.e., $p(m_i = 1 \mid z_t) = \xi_{t,i}$. The corresponding log-odds update can be expressed as Equation (8).
$$l(m_i \mid z_{0:t}) = l(m_i \mid z_{0:t-1}) + \log \frac{\xi_{t,i}}{1 - \xi_{t,i}} - l(m_i).$$
This converts the output probability of the network into the log-odds form needed for Bayesian filtering. The confidence from the current frame can then be combined with the results of previous frames through the recursive update mechanism. After updating, the log-odds can be converted back to a probability for subsequent use in Equation (9).
$$p(m_i = 1 \mid z_{0:t}) = \frac{1}{1 + e^{-l(m_i \mid z_{0:t})}}.$$
By combining information from multiple frames, Bayesian filtering reduces the impact of noise or random errors in individual frames, leading to more stable predictions. For objects that move slowly and are difficult to detect in single frames, the recursive integration of frame-level confidences helps capture motion trends over time, reducing the likelihood of missed detections. In addition, the recursive nature of Bayesian filtering allows predictions to be adjusted iteratively across frames, ensuring greater temporal consistency and reducing the occurrence of false positives or negatives. The specific details are expressed in Algorithm 1. The method leverages temporal smoothing and motion information from multiple frames to enhance segmentation accuracy, ensuring robust performance in dynamic environments. While methods like majority voting or averaging can be applied to integrate multi-frame information, they fail to account for the uncertainty and temporal dependencies in the data. The Bayesian filter, on the other hand, provides a probabilistic framework that dynamically adjusts the contribution of each frame based on its reliability, ensuring more accurate and consistent results. This approach is particularly beneficial in dynamic environments, where object motion and sensor noise can significantly affect the segmentation accuracy.
Algorithm 1: Recursive Bayesian Filtering Process with Network Prediction Output
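Since Algorithm 1 appears as a figure in the original article, the following NumPy sketch reconstructs the recursive log-odds update it describes from Equations (6)-(9); the clipping epsilon and the shape of the confidence maps are implementation assumptions.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def bayes_filter(confidences, p0=0.25):
    """Recursive Bayesian filtering over per-frame confidence maps (Eqs. 7-9).
    confidences : list of (h, w) arrays, xi_t = p(m_i = 1 | z_t) for each frame
    p0          : prior probability that a pixel is moving (value used in Sec. 4.1.1)
    Returns the fused posterior p(m_i = 1 | z_{0:t})."""
    l_prior = logit(np.full_like(confidences[0], p0))   # prior log-odds l(m_i)
    l = l_prior.copy()                                  # initial state
    for xi_t in confidences:
        l = l + logit(xi_t) - l_prior                   # Eq. (7)/(8)
    return 1.0 / (1.0 + np.exp(-l))                     # Eq. (9)
```

In the full pipeline, confidences would be the per-frame network outputs inside the local motion-constraint window of $N$ frames described in Section 3.2.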

3.5. Edge-Aware Loss Function

The goal of segmentation in range images is to categorize moving objects (foreground) and stationary objects (background). However, due to class imbalance and the discontinuity of pixels, it is difficult for traditional region-based loss functions to achieve high-accuracy segmentation in edge regions. Therefore, we designed an edge-aware loss function $L_{ea}$ to improve the edge processing capability of the segmentation task.
Given a range image, let $\Omega$ be the set of all observable pixels. For the formulation of the segmentation task, we define the ground truth labels $g: \Omega \to \{0, 1\}$, where $g(p) = 1$ indicates the foreground (moving object) and $g(p) = 0$ indicates the background (stationary object). The network outputs the predicted probabilities $s_\theta: \Omega \to [0, 1]$, representing the probability that each pixel belongs to the foreground, where $\theta$ denotes the trainable parameters of the neural network. Common segmentation loss functions usually measure the similarity (or overlap) between the probabilistic output of the network and the corresponding ground truth labels by means of a region integral. In the binary classification case, the foreground integral $L_F$ takes the form of Equation (10).
$$L_F = \int_{\Omega} g(p)\, f\!\left(s_\theta(p)\right) dp.$$
The integral form of the background term $L_B$ is expressed as Equation (11).
$$L_B = \int_{\Omega} \left(1 - g(p)\right) f\!\left(1 - s_\theta(p)\right) dp.$$
For instance, the standard binary cross-entropy loss is the sum of these two terms, with $f = -\log(\cdot)$. Similarly, the Generalized Dice Loss (GDL) uses region integrals with $f$ taken as the identity, normalized by the region sizes; for the binary classification problem, it is given by Equation (12).
$$L_{GDL}(\theta) = 1 - 2\, \frac{w_G \int_{\Omega} g(p)\, s_\theta(p)\, dp + w_B \int_{\Omega} \left(1 - g(p)\right)\left(1 - s_\theta(p)\right) dp}{w_G \int_{\Omega} \left[ s_\theta(p) + g(p) \right] dp + w_B \int_{\Omega} \left[ 2 - s_\theta(p) - g(p) \right] dp},$$
where $w_G = 1 / \left( \int_{\Omega} g(p)\, dp \right)^2$ and $w_B = 1 / \left( \int_{\Omega} \left(1 - g(p)\right) dp \right)^2$ were introduced to reduce the correlation between the Dice overlap coefficient and the region size.
Let $S_\theta$ represent the set of pixels corresponding to the predicted region, computed as $S_\theta = \{ p \in \Omega \mid s_\theta(p) \ge \delta \}$, where a pixel whose probability exceeds the threshold $\delta$ is considered a predicted target. In addition, let $G \subseteq \Omega$ be the set of ground truth pixels. The distance between the predicted edge $\partial S_\theta$ and the ground truth edge $\partial G$ is employed to assist the loss function. First, the distance function $d_G(p)$ represents the shortest distance from a pixel $p \in \Omega$ to the ground truth edge $\partial G$, defined as Equation (13).
$$d_G(p) = \min_{q \in \partial G} \left\| p - q \right\|.$$
Similarly, the distance function $d_{S_\theta}(p)$ represents the shortest distance from pixel $p$ to the predicted edge $\partial S_\theta$, which is given by Equation (14).
$$d_{S_\theta}(p) = \min_{q \in \partial S_\theta} \left\| p - q \right\|.$$
Based on these distance functions, the edge loss is defined as the integral of the squared differences between the distances of the predicted and ground truth boundaries, as expressed in Equation (15).
$$L_E(\theta) = \int_{\Omega} \left( d_G(p) - d_{S_\theta}(p) \right)^2 dp.$$
This loss function minimizes the distance between the predicted edge and the ground truth edge, ensuring accurate edge predictions. In practice, calculating the distance from each pixel to the edge is complex. Therefore, we approximate the edge distance using a distance map $D_G(p)$. We define $D_G(p)$ as the shortest distance from each pixel on the predicted edge to the ground truth edge $\partial G$, expressed as Equation (16).
$$D_G(p) = \min_{q \in \partial G} \left\| p - q \right\|, \quad p \in \partial S_\theta.$$
Then, by combining this with the predicted probability distribution $s_\theta(p)$, we write the edge loss in integral form, as given in Equation (17).
$$L_E(\theta) = \int_{\partial S_\theta} D_G(p)\, s_\theta(p)\, dp.$$
This form penalizes deviations of the predicted edge from the ground truth edge using the distance map $D_G(p)$, with the penalty growing as the predicted edge moves farther from the ground truth edge. To compute the distance mapping for the ground truth edge efficiently, let $\phi_G(p)$ represent a level set function that describes the relative distance of each pixel to the ground truth edge. It is defined such that $\phi_G(p) > 0$ when $p$ is inside the foreground region, $\phi_G(p) < 0$ when $p$ is inside the background region, and $\phi_G(p) = 0$ when $p$ lies on the edge $\partial G$. The edge loss can therefore be simplified to a weighted integral over the level set function, which is given by Equation (18).
$$L_E(\theta) = \int_{\partial S_\theta} \phi_G(p)\, s_\theta(p)\, dp.$$
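As a practical note, the level-set map $\phi_G$ can be precomputed from the binary ground truth mask with a Euclidean distance transform. The SciPy sketch below follows the sign convention stated above (positive inside the foreground); it is an illustrative construction, not necessarily the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def level_set_map(gt):
    """Signed distance map phi_G for a binary ground truth mask (1 = moving, 0 = static):
    positive inside the foreground, negative in the background, ~0 along the edge."""
    dist_to_fg = distance_transform_edt(gt == 0)   # distance of background pixels to the foreground
    dist_to_bg = distance_transform_edt(gt == 1)   # distance of foreground pixels to the background
    return dist_to_bg - dist_to_fg
```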
In our work, the edge loss $L_E$ helps the network generate more accurate segmentation results in the edge regions, avoiding the edge blurring issue caused by using only regional loss functions. During training, the edge loss works alongside traditional regional losses, such as weighted cross-entropy loss and Lovász-Softmax loss, to further improve robustness in edge and overall segmentation. The edge-aware loss function $L_{ea}$ includes the weighted cross-entropy loss $L_{wce}$, the Lovász-Softmax loss $L_{ls}$, and the introduced edge loss $L_E$. The expression is given in Equation (19).
$$L_{ea}(\theta) = (1 - \alpha)\left( L_{wce}(\theta) + L_{ls}(\theta) \right) + \alpha L_E(\theta),$$
where $\alpha$ controls the weight of the edge loss. The loss weights in our model were determined through a combination of empirical tuning and experimental validation. Specifically, the weights were not derived from a theoretical framework, but rather through a systematic process of trial and error to find the optimal balance between different loss components (e.g., gradient loss, motion consistency loss, etc.). We performed a series of experiments to assess the impact of various weight combinations on the model’s overall performance, and the final weights were chosen based on the best results observed on validation datasets. This empirical approach ensures that the weights are tailored to the specific task and dataset, optimizing the performance of our model for moving object segmentation. Although the weights are determined experimentally, we believe this approach provides a practical and effective way to balance the different contributions to the loss function. The combined loss function addresses class imbalance while also enhancing edge detection. This improvement not only alleviates the class imbalance issue, but also better captures the contour information in sparse data, making the segmentation results more refined and accurate. $L_{wce}$ is defined in Equation (20).
$$L_{wce}(y, \hat{y}) = -\sum_{i} \alpha_i\, p(y_i) \log\!\left(p(\hat{y}_i)\right), \quad \alpha_i = 1 / f_i,$$
where $y_i$ and $\hat{y}_i$ are the true and predicted labels, respectively, and $f_i$ is the frequency of the $i$-th class. In addition, the Lovász-Softmax loss can be expressed as Equation (21).
$$L_{ls} = \frac{1}{|C|} \sum_{c \in C} \overline{\Delta_{J_c}}\!\left(m(c)\right), \qquad m_i(c) = \begin{cases} 1 - x_i(c) & \text{if } c = y_i(c) \\ x_i(c) & \text{otherwise}, \end{cases}$$
where $|C|$ represents the number of classes, and $\overline{\Delta_{J_c}}$ indicates the Lovász extension of the Jaccard index. The predicted probability for class $c$ at pixel $i$, denoted as $x_i(c)$, lies in the range $[0, 1]$, while the corresponding ground truth label $y_i(c)$ is either $-1$ or $1$.
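Putting the pieces together, the following hedged PyTorch sketch approximates the combined loss of Equation (19). The Lovász-Softmax term is omitted for brevity, the edge term uses the level-set form of Equation (18) with the sign chosen so that confident foreground predictions inside the ground truth region reduce the loss (consistent with the level-set sketch above), and $\phi_G$ is assumed to be precomputed per sample.

```python
import torch
import torch.nn.functional as F

def edge_aware_loss(logits, target, phi_g, alpha, class_weights):
    """Simplified L_ea (Eq. 19) with the Lovász-Softmax term omitted.
    logits        : [B, 2, H, W] raw network outputs (static / moving)
    target        : [B, H, W] integer labels in {0, 1}
    phi_g         : [B, H, W] level-set map of the ground truth edge (see previous sketch)
    alpha         : current edge-loss weight (progressively increased during training)
    class_weights : [2] inverse-frequency weights for the cross-entropy term"""
    l_wce = F.cross_entropy(logits, target, weight=class_weights)
    s_fg = torch.softmax(logits, dim=1)[:, 1]     # foreground probability s_theta(p)
    # negated so that, with phi_g positive inside the foreground, confident foreground
    # predictions inside the ground truth region lower the loss (sign is an assumption)
    l_edge = (-phi_g * s_fg).mean()
    return (1.0 - alpha) * l_wce + alpha * l_edge
```

In practice, the Lovász-Softmax term of Equation (19) would be added alongside $L_{wce}$ with the same $(1 - \alpha)$ weighting.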

4. Experiments

4.1. Experimental Setup

4.1.1. Implementation Details

We implemented the proposed method using PyTorch [54] and trained the model on a single NVIDIA GeForce RTX 4090 SUPRIM X 24G GPU. The size of the range images was set to $64 \times 2048$. During training, we minimized $L_{wce}$ and $L_{ls}$ using stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0001. For the selection of $\alpha$ for $L_E$, we employed a rebalancing strategy, starting with a low initial value of $\alpha > 0$ and gradually increasing it as training progressed. To balance computational efficiency, we set the training to 100 epochs, with the number of frames in the motion window set to $N = 2$ and $\alpha$ starting from 0.02 and increasing by 0.005 per epoch. By doing so, we prioritize the regional loss term initially and progressively amplify the influence of the edge loss term over time. We believe that exploring more sophisticated choices for $\alpha$ holds potential for further research, though it is not the primary focus of this paper. The learning rate was managed using a cosine annealing schedule with a 10-epoch warm-up, followed by three cycles of decay over 100 epochs. Additionally, for the prior probability $p_0$ in Bayesian filtering, we followed the optimal setting from the literature, using $p_0 = 0.25$.
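A minimal sketch of this optimization setup and the progressive $\alpha$ schedule is shown below; the base learning rate, the stand-in network, and the train_one_epoch placeholder are assumptions, and the cosine annealing schedule is omitted for brevity.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(5, 2, 3, padding=1)      # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,   # base lr is an assumption
                            momentum=0.9, weight_decay=1e-4)

def train_one_epoch(model, optimizer, alpha):
    """Placeholder for one pass over the training set using L_ea with the current alpha."""
    pass

alpha = 0.02                               # initial edge-loss weight
for epoch in range(100):
    train_one_epoch(model, optimizer, alpha)
    alpha += 0.005                         # progressively emphasise the edge term
```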

4.1.2. Dataset

For the experiments in this study, we utilized the SemanticKITTI-MOS dataset, following the approach outlined in [20]. This dataset provides comprehensive 3D LiDAR data, with each point labeled into one of two categories: moving or static objects. The dataset comprises 22 sequences, with 10 sequences (specifically Seq 00–07 and Seq 09–10, totaling 19,130 frames) used for training, 1 sequence (Seq 08, totaling 4071 frames) designated for validation, and 11 sequences (Seq 11–21, totaling 20,351 frames) set aside for testing purposes.
To further enrich the training data, augmentation was performed by incorporating additional sequences from the KITTI-Road dataset, as previously applied in related work [21]. In this process, coarse labels were automatically generated and subsequently refined manually, adding 12 extra sequences. This augmentation included 6 sequences for training (Seq 30–34 and Seq 40, contributing an additional 2905 frames) and 6 sequences for validation (Seq 35–39 and Seq 41, totaling 2889 frames). We express our gratitude to the authors for making these raw data available, as they significantly enhance the diversity and robustness of the training process, leading to improved model generalization, especially in scenarios where moving objects are under-represented.

4.1.3. Baseline

To evaluate our method, we conducted comparisons against four widely recognized baseline approaches: (1) LiMoSeg, which processes consecutive LiDAR scans in Bird’s Eye View (BEV) format, excelling in real-time performance (for this method, we relied on the results reported in the original paper); (2) Cylinder3D, a method that segments objects by using cylindrical partitioning and point-wise feature extraction from point clouds (this model was not retrained); (3) LMNet, which combines range images and residual images, with a kNN (k Nearest Neighbors)-based post-processing step to refine object boundaries; and (4) MotionSeg3D, which integrates spatio-temporal information with multiple LiDAR scan representation modalities to boost performance in LiDAR-based MOS. For the last two methods, we retrained the models under our experimental settings to ensure consistency, fairness, and accuracy in the comparison. By retraining these models, we guaranteed that all methods were evaluated under the same conditions, providing a reliable benchmark for performance comparison.

4.1.4. Metrics for Evaluation

To evaluate the performance of MOS, we employed the widely used Jaccard Index, also known as the intersection-over-union (IoU) metric [55], specifically for moving objects. The formula is provided by Equation (22).
$$\mathrm{IoU}_{MOS} = \frac{TP}{TP + FP + FN}.$$
In Equation (22), TP, FP, and FN represent the counts of true positives, false positives, and false negatives, respectively, for the moving object class. In our evaluation, we use the standard IoU metric restricted to the moving object class, denoted $\mathrm{IoU}_{MOS}$: the ratio of true positive moving-object predictions to the sum of true positives, false positives, and false negatives for that class. Unlike the regular IoU, which evaluates overall segmentation performance, $\mathrm{IoU}_{MOS}$ focuses exclusively on the accuracy of moving object detection and segmentation.
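For reference, a small NumPy helper that evaluates Equation (22) from boolean moving-object masks; the epsilon guard is an implementation convenience.

```python
import numpy as np

def iou_mos(pred_moving, gt_moving, eps=1e-9):
    """IoU restricted to the moving class (Eq. 22); inputs are boolean masks of 'moving' points."""
    tp = np.logical_and(pred_moving, gt_moving).sum()
    fp = np.logical_and(pred_moving, ~gt_moving).sum()
    fn = np.logical_and(~pred_moving, gt_moving).sum()
    return tp / (tp + fp + fn + eps)
```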

4.2. Results and Comparisons

The evaluation of our proposed method was conducted on the SemanticKITTI-MOS dataset, with the results presented in Table 1. The table provides a detailed quantitative comparison of our approach with other SOTA methods, including LiMoSeg, Cylinder3D, LMNet, and MotionSeg3D. Specifically, LMNet v1 and LMNet v2 utilize the RangeNet++ and SalsaNext backbones, respectively. Meanwhile, MotionSeg3D v1 refers to the model employing only the 2D segmentation network, while MotionSeg3D v2 incorporates the full 3D segmentation structure. Due to practical constraints, the comparison of full methods was conducted using a two-frame residual input, resulting in slightly lower performance compared to the optimal results reported for each model. All results in Table 1 were obtained using the augmented dataset described in Section 4.1.2, ensuring that all experiments were conducted under fair and consistent conditions.
The data in Table 1 demonstrate the significant performance improvements achieved by our method. The bolded numbers in the table represent the results with the best accuracy. This confirms that the proposed DPDC module, based on image gradient priors, effectively enhances the spatial representation of range images, maintaining or even improving segmentation accuracy with reduced input. Additionally, by applying Bayesian filtering to smooth multi-frame predictions, we further reduce edge blurring caused by moving object occlusions, leading to more precise edge detection. The introduced edge-aware loss function also contributes to performance improvement. On the latest SemanticKITTI-MOS benchmark, our method consistently delivers superior results on both the validation and test sets. To examine the impact of object distance on segmentation performance, we categorized the objects into two distance groups: near (0–30 m) and far (>30 m). Our results show that objects within the 0–30 m range have higher segmentation accuracy, with an average $\mathrm{IoU}_{MOS}$ of 74.3%, compared to 72.9% for objects further than 30 m. This analysis highlights the challenges of segmenting distant objects, which exhibit lower segmentation accuracy due to sparser point clouds and reduced resolution at greater distances.
Furthermore, the visual results are depicted to demonstrate the performance of our method from two key perspectives, providing a more intuitive understanding for readers. First, we compared the performance of different methods across various scenes, as shown in Figure 5. In Figure 5, it is clear that for scenes like Scene 3 and Scene 4, where there are fewer moving vehicles, all methods achieved relatively good segmentation results. However, in more complex scenes, the Cylinder3D method, which relies solely on point cloud data without incorporating temporal information, showed significant shortcomings in detecting moving objects, often resulting in large misclassifications around these objects. In contrast, although LMNet takes both range images and residual images as inputs, it tends to produce blurry predictions around object boundaries, which affects segmentation accuracy. While the performance of MotionSeg3D is the closest to our method, it exhibits more boundary misalignment issues when dealing with distant objects (corresponding to the small objects in the range image). Our method demonstrated better robustness and accuracy in these areas, particularly when handling segmentation tasks in complex scenes, achieving notable improvements.
Secondly, we highlighted the prediction consistency across consecutive frames within the same scene, as illustrated in Figure 6. Taking Scene 1 in Figure 5 as an example, this challenging scene contains many moving vehicles. The vehicle prediction results, highlighted in the blue dashed boxes, clearly show the robustness of our method between frame predictions. By incorporating image gradient information and Bayesian probability methods, we not only enhanced the model’s predictive capabilities, but also significantly reduced errors and uncertainty in temporal predictions. This enables our model to more consistently identify and track objects in dynamic scenes, addressing some of the limitations present in current methods. Our experiments demonstrate that with these enhanced strategies, the overall performance of the model is greatly improved, particularly in tasks involving temporal predictions, where it exhibits superior effectiveness.

4.3. Ablation Study

To better analyze the contribution of each module to the overall performance, we conducted a detailed ablation study. In this experiment, we used the original SemanticKITTI-MOS dataset instead of the augmented one, primarily due to practical constraints regarding experimental efficiency. The specific results are presented in Table 2. The first row corresponds to our full model, which shows a slight decrease in accuracy compared to the results in Table 1, attributed to the reduced training data. The second row presents the results obtained by replacing the DPDC module with vanilla convolutions of comparable parameter size. The third row reflects the performance when Bayesian filtering is omitted. The fourth row indicates the results without the added edge loss function.
The experimental results demonstrate that the inclusion of the DPDC module enhances segmentation accuracy significantly, with an improvement of approximately 3.5% compared to models without this module. Additionally, the integration of Bayesian filtering notably improves the smoothing effects during multi-frame fusion, particularly in scenarios involving the occlusion of moving objects, resulting in a significant enhancement in edge prediction accuracy by 2.8% compared to scenarios without filtering. Furthermore, the edge-aware loss function, combined with an incremental training paradigm, boosts the model’s capability to detect object edges. Our experiments indicate that the edge-aware loss contributes 2.2% to the overall IoU improvement compared to traditional cross-entropy loss, while the incremental training approach facilitates better integration of spatial and temporal information, enhancing the model’s adaptability to complex scenes.

4.4. Runtime and Efficiency

In Table 3, we present a comparison of the inference times for various models on validation sequence 8 (specifically 4071 scans), with about 122 k points per scan. The runtime was evaluated with an Intel Core i7-10700K CPU @ 3.80 GHz × 16 and a single NVIDIA GeForce RTX 4090 GPU/PCIe/SSE2 in an Ubuntu 18.04.6 system (OS type: 64-bit). The results indicate that our proposed method is competitive in terms of runtime compared to other mainstream approaches. Specifically, our method achieves a runtime of 66.37 ms for a single inference. While this is slightly higher than LMNet v1 and v2, it remains comparable to the current SOTA methods, Cylinder3D and MotionSeg3D. In addition, the model size remains compact, with approximately 14.01 M parameters, which is only slightly larger than the MotionSeg3D model (13.61 M parameters). The primary modifications in our work involve changes to the convolution operations and the integration of motion consistency constraints through DPDC and Bayesian filtering. These changes do not significantly increase the overall model size, and the computational overhead is primarily due to the additional processing required by the DPDC layers and temporal smoothing techniques. Our current experiments were conducted using an RTX 4090 GPU, with an average memory usage of 5.6 GB per sample. However, real-time deployment on embedded systems or autonomous vehicles will require optimizations to reduce memory consumption and enhance resource efficiency. Future work will focus on implementing model compression and efficient memory management to ensure that the model can run effectively on low-resource platforms.
It is noteworthy that although the runtime is marginally higher than that of some lightweight methods (such as LMNet v1 and v2), our approach demonstrates greater robustness and stability in segmentation accuracy and edge handling, suggesting a good balance between performance and computational overhead. In particular, in more complex dynamic scenes, the introduction of DPDC and Bayesian filtering markedly improves the handling of blurred object edges; these improvements incur some computational cost but also secure a gain in segmentation accuracy. While the method has been evaluated extensively on benchmark datasets such as SemanticKITTI, its real-world deployment remains an area for future work: the computational resources and real-time integration required by autonomous systems demand further optimization, and deployment of the code to real-world systems is ongoing.
While the method shows promise for real-time processing on high-performance GPUs, further optimization is necessary for deployment on embedded systems with limited computational resources. Future work will therefore focus on reducing model size, memory consumption, and computational cost through model pruning, quantization, and hardware-specific acceleration techniques such as TensorRT, on testing the method on embedded platforms, and on evaluating its real-world performance in the dynamic, complex environments typical of autonomous driving.

5. Discussion

Firstly, while the projection of 3D LiDAR data to 2D data facilitates efficient feature extraction and processing, it inherently leads to a reduction in dimensional information. Specifically, the mapping of depth and elevation into a 2D space may obscure some of the vertical structure and introduce potential errors in object representation, particularly in complex environments. To mitigate these issues, we leverage advanced feature enhancement techniques, such as DPDC and Bayesian filtering, to refine the segmented results and improve temporal consistency. These strategies help to recover some of the lost information and minimize errors due to projection inaccuracies.
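To make the projection step concrete, the sketch below shows the standard spherical projection used to build a range image from a LiDAR scan (in the spirit of RangeNet++ [24]). The 64 × 2048 resolution and the +3°/−25° vertical field of view are assumed values typical of the HDL-64E sensor used in SemanticKITTI, not parameters taken from our implementation.

```python
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR scan into an H x W range image (a sketch with
    assumed sensor parameters). Pixels never hit by a point keep value -1."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)
    yaw = -np.arctan2(y, x)                                   # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-8), -1.0, 1.0))

    # Normalize angles to pixel coordinates.
    u = 0.5 * (yaw / np.pi + 1.0) * W                         # column from azimuth
    v = (1.0 - (pitch + abs(fov_down)) / fov) * H             # row from elevation

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # Write farthest points first so nearer returns overwrite them.
    order = np.argsort(depth)[::-1]
    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v[order], u[order]] = depth[order]
    return range_image
```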
Secondly, while our method has achieved significant progress, several challenges and limitations remain. First, when the number of input residual images is reduced to two frames, model performance degrades. This suggests that although the DPDC module effectively enhances the ability of dilated convolutions to capture spatial information, it cannot fully exploit gradient information to compensate for the reduced temporal information; maintaining segmentation performance with fewer temporal inputs therefore remains a key direction for future research. Second, although the Bayesian filtering technique efficiently smooths multi-frame predictions and mitigates occlusion and edge blurring, it struggles with rapid changes and noise in dynamic scenes, leading to a loss of fine edge detail for fast-moving objects. Furthermore, the parameters of the Bayesian filter are highly task-specific and require manual tuning, which increases implementation complexity; a more adaptive mechanism for automatic parameter selection could reduce this uncertainty and simplify the filtering process. Finally, the edge-aware loss function improves edge accuracy and boundary predictions for objects of varying scales, but its design and learning process still leave room for improvement. Future work should focus on optimizing the learning of this loss and designing better training paradigms to address this limitation more effectively.
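As an illustration of the kind of temporal smoothing discussed above, the sketch below performs a per-pixel recursive Bayesian update of the moving/static belief in log-odds form. It is a minimal sketch of one plausible realization rather than our exact filter, and the decay factor and clamping bounds are assumed values of the sort that currently require the manual tuning noted above.

```python
import numpy as np

def bayes_update(log_odds_prev, p_moving, decay=0.9, l_min=-4.0, l_max=4.0):
    """One recursive Bayesian update of per-pixel moving/static belief in
    log-odds form (a sketch; decay and clamping bounds are illustrative).
    `p_moving` is the network's per-pixel moving probability for the current
    frame, aligned to the same range-image grid as the previous belief."""
    p = np.clip(p_moving, 1e-6, 1.0 - 1e-6)
    measurement = np.log(p / (1.0 - p))           # log-odds of the new evidence
    log_odds = decay * log_odds_prev + measurement
    return np.clip(log_odds, l_min, l_max)        # keeps the belief revisable

def to_probability(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

# Usage over a sequence of per-frame predictions `probs` with shape (T, H, W):
# belief = np.zeros_like(probs[0])
# for p in probs:
#     belief = bayes_update(belief, p)
# moving_mask = to_probability(belief) > 0.5
```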
In addition, while SemanticKITTI has been widely used for evaluating LiDAR-based moving object segmentation, its coverage is limited to a specific set of urban environments, which may not fully represent the diversity of real-world driving conditions. Future work will expand the evaluation to more comprehensive datasets, such as nuScenes, Waymo, and PandaSet, to assess the generalization capability of our method in more varied environments.
In summary, while our model performs excellently on the SemanticKITTI-MOS benchmark, its performance is still affected by the amount of temporal input, the choice of filtering parameters, and the handling of multi-scale objects. Addressing both epistemic and aleatoric uncertainty is crucial for safe autonomous driving: epistemic uncertainty reveals the limitations of the segmentation model itself, while aleatoric uncertainty reflects the noise observed by the sensor during segmentation.

6. Conclusions

In this paper, we proposed a novel 2D range image-based deep neural network for 3D LiDAR point cloud MOS. By introducing the DPDC module, we enhanced the effectiveness of difference convolutions and improved spatial information representation in range images. Following the principle of motion consistency, we employed Bayesian filtering to smooth multi-frame predictions and mitigate occlusion and motion blur, thereby reducing edge blurring. Additionally, we designed an edge-aware loss function and integrated it with a progressive training paradigm to adapt the network to varying object scales. This work presents a frame-based moving object segmentation approach, introducing innovations in convolutional feature extraction and temporal smoothing and offering an efficient alternative to tracking-based methods in dynamic scenarios.
Despite achieving significant performance improvements over current SOTA methods on the SemanticKITTI-MOS benchmark, certain aspects of our method remain to be optimized, such as the efficiency of the DPDC module and of the Bayesian filtering. Its performance may also be limited in more challenging environments, such as those characterized by a greater diversity of moving object types, higher object density, and more pronounced variations in scale and velocity. Future research could focus on integrating effective data augmentation techniques to enhance the robustness and generalization of the model under such extreme conditions.

Author Contributions

Conceptualization, F.T.; Data Curation, B.Z. and J.S.; Funding Acquisition, B.Z.; Methodology, F.T.; Project Administration, B.Z. and J.S.; Resources, B.Z.; Software, F.T.; Supervision, B.Z.; Validation, F.T.; Writing—Original Draft, F.T.; Writing—Review and Editing, B.Z. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Local Special Program Funding (JMRHZX20240115).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors without undue reservation.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their valuable advice and efforts in supporting the publication of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cai, Y.; Li, B.; Zhou, J.; Zhang, H.; Cao, Y. Removing Moving Objects without Registration from 3D LiDAR Data Using Range Flow Coupled with IMU Measurements. Remote Sens. 2023, 15, 3390. [Google Scholar] [CrossRef]
  2. Montañez, O.J.; Suarez, M.J.; Fernandez, E.A. Application of Data Sensor Fusion Using Extended Kalman Filter Algorithm for Identification and Tracking of Moving Targets from LiDAR–Radar Data. Remote Sens. 2023, 15, 3396. [Google Scholar] [CrossRef]
  3. Muzahid, A.J.M.; Kamarulzaman, S.F.B.; Rahman, M.A.; Murad, S.A.; Kamal, M.A.S.; Alenezi, A.H. Multiple vehicle cooperation and collision avoidance in automated vehicles: Survey and an AI-enabled conceptual framework. Sci. Rep. 2023, 13, 603. [Google Scholar] [CrossRef] [PubMed]
  4. Chu, Z.; Wang, F.; Lei, T.; Luo, C. Path Planning Based on Deep Reinforcement Learning for Autonomous Underwater Vehicles Under Ocean Current Disturbance. IEEE Trans. Intell. Veh. 2023, 8, 108–120. [Google Scholar] [CrossRef]
  5. Wang, H.; Tang, F.; Wei, J.; Zhu, B.; Wang, Y.; Zhang, K. Online Semi-supervised Transformer for Resilient Vehicle GNSS/INS Navigation. IEEE Trans. Veh. Technol. 2024, 73, 16295–16311. [Google Scholar] [CrossRef]
  6. Tuna, T.; Nubert, J.; Nava, Y.; Khattak, S.; Hutter, M. X-ICP: Localizability-Aware LiDAR Registration for Robust Localization in Extreme Environments. IEEE Trans. Robot. 2022, 40, 452–471. [Google Scholar] [CrossRef]
  7. Zou, Q.; Sun, Q.; Chen, L.; Nie, B.; Li, Q. A Comparative Analysis of LiDAR SLAM-Based Indoor Navigation for Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6907–6921. [Google Scholar] [CrossRef]
  8. Arora, M.; Wiesmann, L.; Chen, X.; Stachniss, C. Mapping the Static Parts of Dynamic Scenes from 3D LiDAR Point Clouds Exploiting Ground Segmentation. In Proceedings of the 2021 European Conference on Mobile Robots (ECMR), Bonn, Germany, 31 August–3 September 2021; pp. 1–6. [Google Scholar]
  9. Tang, F.; Zhang, S.; Zhu, B.; Sun, J. Outdoor large-scene 3D point cloud reconstruction based on transformer. Front. Phys. 2024, 12, 1474797. [Google Scholar] [CrossRef]
  10. Dewan, A.; Oliveira, G.L.; Burgard, W. Deep semantic classification for 3D LiDAR data. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 3544–3549. [Google Scholar]
  11. Yang, H.; Yezzi, A.J. Decomposing the Tangent of Occluding Boundaries According to Curvatures and Torsions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  12. Schauer, J.; Nüchter, A. The Peopleremover—Removing Dynamic Objects From 3-D Point Cloud Data by Traversing a Voxel Occupancy Grid. IEEE Robot. Autom. Lett. 2018, 3, 1679–1686. [Google Scholar] [CrossRef]
  13. Postica, G.; Romanoni, A.; Matteucci, M. Robust moving objects detection in lidar data exploiting visual cues. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 1093–1098. [Google Scholar]
  14. Tang, F.; Gui, L.; Liu, J.; Chen, K.; Lang, L.; Cheng, Y. Metal target detection method using passive millimeter-wave polarimetric imagery. Opt. Express 2020, 28, 13336–13351. [Google Scholar] [CrossRef]
  15. Wang, Y.; Yu, A.; Cheng, Y.; Qi, J. Matrix diffractive deep neural networks merging polarization into meta-device. Laser Photonics Rev. 2023, 18, 2300903. [Google Scholar] [CrossRef]
  16. Cheng, Y.; Tian, X.; Zhu, D.; Wu, L.; Zhang, L.; Qi, J.; Qiu, J. Regional-Based Object Detection Using Polarization and Fisher Vectors in Passive Millimeter-Wave Imaging. IEEE Trans. Microw. Theory Tech. 2023, 71, 2702–2713. [Google Scholar] [CrossRef]
  17. Zhou, P.; Liu, Y.; Meng, Z. PointSLOT: Real-Time Simultaneous Localization and Object Tracking for Dynamic Environment. IEEE Robot. Autom. Lett. 2023, 8, 2645–2652. [Google Scholar] [CrossRef]
  18. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. arXiv 2020, arXiv:2007.16100. [Google Scholar]
  19. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11105–11114. [Google Scholar]
  20. Chen, X.; Li, S.; Mersch, B.; Wiesmann, L.; Gall, J.; Behley, J.; Stachniss, C. Moving Object Segmentation in 3D LiDAR Data: A Learning-Based Approach Exploiting Sequential Data. IEEE Robot. Autom. Lett. 2021, 6, 6529–6536. [Google Scholar] [CrossRef]
  21. Sun, J.; Dai, Y.; Zhang, X.; Xu, J.; Ai, R.; Gu, W.; Chen, X. Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 11456–11463. [Google Scholar]
  22. Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9934–9943. [Google Scholar]
  23. Zhao, Z.; Gan, S.; Xiao, B.; Wang, X.; Liu, C. Three-Dimensional Reconstruction of Zebra Crossings in Vehicle-Mounted LiDAR Point Clouds. Remote Sens. 2024, 16, 3722. [Google Scholar] [CrossRef]
  24. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220. [Google Scholar] [CrossRef]
  25. Cortinhal, T.; Tzelepis, G.; Aksoy, E.E. SalsaNext: Fast, Uncertainty-Aware Semantic Segmentation of LiDAR Point Clouds. In Proceedings of the International Symposium on Visual Computing, San Diego, CA, USA, 5–7 October 2020; pp. 207–222. [Google Scholar]
  26. Li, S.; Chen, X.; Liu, Y.; Dai, D.; Stachniss, C.; Gall, J. Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform. IEEE Robot. Autom. Lett. 2022, 7, 738–745. [Google Scholar] [CrossRef]
  27. Wang, D.Z.; Posner, I.; Newman, P. What could move? Finding cars, pedestrians and bicyclists in 3D laser data. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St Paul, MN, USA, 14–18 May 2012; pp. 4038–4044. [Google Scholar] [CrossRef]
  28. Ruchti, P.; Burgard, W. Mapping with Dynamic-Object Probabilities Calculated from Single 3D Range Scans. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 6331–6336. [Google Scholar] [CrossRef]
  29. Chen, X.; Milioto, A.; Palazzolo, E.; Giguère, P.; Behley, J.; Stachniss, C. SuMa++: Efficient LiDAR-based Semantic SLAM. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4530–4537. [Google Scholar] [CrossRef]
  30. Thomas, H.; Qi, C.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6410–6419. [Google Scholar]
  31. Shi, H.; Lin, G.; Wang, H.; Hung, T.Y.; Wang, Z. SpSequenceNet: Semantic Segmentation Network on 4D Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4573–4582. [Google Scholar]
  32. Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Li, W.; Ma, Y.; Li, H.; Yang, R.; Lin, D. Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-Based Perception. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6807–6822. [Google Scholar] [CrossRef]
  33. Baur, S.A.; Emmerichs, D.; Moosmann, F.; Pinggera, P.; Ommer, B.; Geiger, A. SLIM: Self-Supervised LiDAR Scene Flow and Motion Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13106–13116. [Google Scholar]
  34. Liu, X.; Qi, C.; Guibas, L.J. FlowNet3D: Learning Scene Flow in 3D Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 529–537. [Google Scholar]
  35. Wu, W.; Wang, Z.; Li, Z.; Liu, W.; Li, F. PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  36. Beltrán, J.; Guindel, C.; Moreno, F.; Cruzado, D.; Garcia, F.; Escalera, A.D.L. Birdnet: A 3D Object Detection Framework from Lidar Information. In Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar]
  37. Lang, A.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  38. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  39. Vaquero, V.; del Pino, I.; Moreno-Noguer, F.; Solà, J.; Sanfeliu, A.; Andrade-Cetto, J. Dual-Branch CNNs for Vehicle Detection and Tracking on Lidar Data. IEEE Trans. Intell. Transp. Syst. (ITS) 2020, 22, 6942–6953. [Google Scholar] [CrossRef]
  40. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  41. Wang, N.; Shi, C.; Guo, R.; Lu, H.; Zheng, Z.; Chen, X. InsMOS: Instance-Aware Moving Object Segmentation in LiDAR Data. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7598–7605. [Google Scholar]
  42. Cheng, J.; Zeng, K.; Huang, Z.; Tang, X.; Wu, J.; Zhang, C.; Chen, X.; Fan, R. MF-MOS: A Motion-Focused Model for Moving Object Segmentation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 12499–12505. [Google Scholar] [CrossRef]
  43. Sun, H.; Luo, Z.; Ren, D.; Du, B.; Chang, L.; Wan, J. Unsupervised multi-branch network with high-frequency enhancement for image dehazing. Pattern Recognit. 2024, 156, 110763. [Google Scholar] [CrossRef]
  44. Li, Y.; Liu, B. Improved edge detection algorithm for canny operator. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 17–19 June 2022; Volume 10, pp. 1–5. [Google Scholar]
  45. Li, X.; Chang, Q.; Li, Y.; Miyazaki, J. Multi-directional Sobel operator kernel on GPUs. arXiv 2023, arXiv:2305.00515. [Google Scholar] [CrossRef]
  46. Ojala, T.; Pietikäinen, M.; Mäenpää, T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  47. Juefei-Xu, F.; Boddeti, V.N.; Savvides, M. Local Binary Convolutional Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4284–4293. [Google Scholar] [CrossRef]
  48. Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching Central Difference Convolutional Networks for Face Anti-Spoofing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5294–5304. [Google Scholar]
  49. Yu, Z.; Qin, Y.; Zhao, H.; Li, X.; Zhao, G. Dual-Cross Central Difference Network for Face Anti-Spoofing. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021. [Google Scholar]
  50. Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel Difference Networks for Efficient Edge Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5097–5107. [Google Scholar]
  51. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  52. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G.W. Understanding Convolution for Semantic Segmentation. arXiv 2017, arXiv:1702.08502. [Google Scholar]
  53. Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2898–2907. [Google Scholar] [CrossRef]
  54. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  55. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  56. Mohapatra, S.; Hodaei, M.; Yogamani, S.K.; Milz, S.; Mäder, P.; Gotzig, H.; Simon, M.; Rashed, H. LiMoSeg: Real-time Bird’s Eye View based LiDAR Motion Segmentation. arXiv 2021, arXiv:2111.04875. [Google Scholar]
Figure 1. Visualization of 2D image.
Figure 2. Details of improved convolution methods.
Figure 3. An overview of our method. The upper part illustrates the overall workflow of the network, while the lower part details the specific implementation of each submodule.
Figure 4. Details of Depth Pixel Difference Convolution (DPDC).
Figure 5. Qualitative comparisons of various methods for LiDAR-MOS in different scenes on the SemanticKITTI-MOS validation set are presented. Blue circles emphasize mispredictions and indistinct boundaries. For optimal viewing, refer to the images in color and zoom in for finer details.
Figure 6. Qualitative comparisons of various methods for LiDAR-MOS between consecutive frames on the SemanticKITTI-MOS validation set are presented. Blue circles emphasize mispredictions and indistinct boundaries. For optimal viewing, refer to the images in color and zoom in for finer details.
Table 1. Evaluation and comparison of moving objects IoU_MOS (%) on the validation set (Seq 08) and the benchmark test set.

Method                    Validation (Seq 08)    Test (Seq 11–21)
LiMoSeg [56]              52.6                   –
Cylinder3D [32]           66.3                   61.2
LMNet, v1 [20]            56.4                   56.9
LMNet, v2 [20]            64.4                   51.3
MotionSeg3D, v1 [21]      66.4                   60.9
MotionSeg3D, v2 [21]      70.2                   62.5
Ours                      73.6                   65.2
Table 2. An ablation study on the components was conducted using the validation set (Seq 08). The symbol "Δ" indicates the decrease relative to the baseline on the validation dataset (Δ_V) and the test dataset (Δ_T).

Models                    Validation    Test    Δ_V    Δ_T    Δ
Full Model *              73.5          64.9    0.0    0.0    0.0
w/o DPDC                  70.2          61.4    3.3    3.7    3.5
w/o Bayesian Filtering    70.8          62.0    2.7    2.9    2.8
w/o L_ea                  71.6          62.4    1.9    2.5    2.2
* Trained with the raw SemanticKITTI-MOS dataset.
Table 3. Time comparison for a single evaluation on validation sequence 8.

Method                    Running Time (ms)
Cylinder3D [32]           78.89
LMNet, v1 [20]            10.79
LMNet, v2 [20]            12.02
MotionSeg3D, v1 [21]      26.50
MotionSeg3D, v2 [21]      74.11
Ours                      66.37
