1. Introduction
Recent advances in computing power, sensor technologies, and machine learning have significantly fueled interest in autonomous unmanned aerial vehicles (UAVs), also known as drones. These systems have become indispensable across a wide range of applications, including robot navigation, autonomous driving, virtual reality (VR), augmented reality (AR), environmental monitoring, delivery services, and disaster response. In such contexts, navigation and positioning are essential to ensuring the UAV’s operational accuracy, safety, and efficiency. Modern UAVs heavily rely on sensor fusion techniques to provide robust state estimation that enables them to operate autonomously, even in complex or dynamic environments. Beyond UAVs, sensor fusion plays a vital role in the Internet of Vehicles (IoV), autonomous robots, and other emerging technologies [1,2].
The field of state estimation in navigation and control systems for autonomous robots has evolved significantly over the years, driven by technological advancements in sensor hardware and computational algorithms. State estimation involves deriving accurate information about a system’s position, velocity, and orientation based on sensor data. While single-sensor solutions have been extensively studied, their limitations have increasingly motivated research into multi-sensor fusion approaches. These approaches leverage the complementary characteristics of diverse sensors to overcome the constraints of individual sensors and enhance the accuracy, robustness, and resilience of state estimation systems [3].
Despite the progress made, achieving robust, accurate, and seamless navigation and positioning solutions remains a major challenge when relying solely on single-sensor systems. For example, the inertial navigation system (INS), which relies on accelerometers and gyroscopes to compute relative positions, is highly accurate only for short durations. Over time, the accumulation of sensor noise and integration errors causes significant drift. Similarly, GPS, while offering absolute positioning data, is effective primarily in open-sky environments and is prone to signal blockage, multipath interference, and degraded performance in urban canyons, dense forests, or indoor environments. These limitations demand the integration of additional sensor types, such as cameras, LiDAR, and IMUs, to ensure robust state estimation with enhanced spatial and temporal coverage.
Visual inertial navigation systems (VINS) [4] have emerged as a cost-effective and practical solution for state estimation in UAVs, combining visual and inertial data to achieve higher accuracy. However, VINS performance in complex environments is often hindered by its susceptibility to changing illumination, low-texture regions, and dynamic obstacles. LiDAR, on the other hand, provides accurate distance measurements and operates independently of lighting conditions. Its growing affordability and precision have made it a popular choice for UAVs. Nonetheless, LiDAR systems face challenges related to sparse data and difficulty in extracting semantic information. Similarly, vision-based approaches using monocular or stereo cameras struggle with initialization, sensitivity to illumination changes, and distance variability. These challenges highlight the need for multi-sensor fusion, where the strengths of different sensors are combined to overcome individual shortcomings.
In recent years, multi-sensor fusion approaches have advanced significantly, enabling UAVs to achieve real-time, high-precision positioning and mapping. For example, integrating GPS with IMU data mitigates inertial navigation drift and improves noise filtering in complex environments. Incorporating LiDAR and visual data further enhances accuracy by providing rich spatial and semantic information. However, traditional sensor fusion methods often rely on static weighting of sensor inputs, which can lead to suboptimal performance in dynamic or degraded scenarios. These limitations have driven research toward adaptive sensor fusion techniques that dynamically adjust sensor contributions based on real-time environmental conditions and sensor reliability [5,6].
Recent advancements in deep learning have introduced a powerful paradigm for adaptive sensor fusion. Deep learning models, such as long short-term memory (LSTM) networks, can effectively learn temporal dependencies in sensor data and adaptively compute fusion weights based on real-time input. This capability allows UAVs to dynamically prioritize reliable sensors and minimize the impact of degraded or faulty sensor data. Such adaptability is particularly valuable in scenarios involving sudden illumination changes, feature-deprived environments, degraded GPS signals, or complete signal loss, where traditional single-sensor systems and static-weight fusion approaches often fail.
This paper presents a novel, deep learning-based adaptive multi-sensor fusion framework for UAV state estimation. The proposed framework integrates stereo cameras, IMU, LiDAR sensors, and GPS-RTK data into a unified system, which is depicted in Figure 1. A long short-term memory (LSTM) model is used to dynamically compute sensor fusion weights in real time, ensuring robust, accurate, and consistent state estimation under diverse conditions. Unlike conventional methods that rely on fixed sensor weights, our approach leverages the real-time adaptability of deep learning to optimize sensor contributions based on environmental and operational factors.
Our approach is validated on an in-house UAV platform equipped with an internally integrated and calibrated sensor suite. The system is evaluated against high-precision RTK ground truth, demonstrating its ability to maintain robust state estimation in both GPS-enabled and GPS-denied scenarios. The algorithm autonomously determines which sensor data are relevant, leveraging stereo-inertial or LiDAR-inertial odometry outputs to ensure global positioning in the absence of GPS.
The major contributions of this research are as follows:
We propose an innovative multi-sensor fusion system integrating a VGA stereo camera, two 3D LiDAR sensors, a nine-degree-of-freedom IMU, and optimized GPS-RTK networking to achieve precise UAV state estimation.
A deep learning-based adaptive weighting mechanism is implemented using LSTM to dynamically adjust sensor contributions, ensuring robust state estimation across diverse and challenging environments.
A commercial UAV equipped with an internally integrated and calibrated sensor platform is used to collect complex datasets, enabling robust evaluation of the proposed method.
Extensive evaluations confirm the efficacy and performance of the stereo-visual-LiDAR fusion framework, demonstrating high efficiency, robustness, consistency, and accuracy in challenging scenarios.
By addressing the limitations of traditional methods and introducing dynamic adaptability through deep learning, this work significantly advances the field of UAV state estimation, paving the way for more reliable autonomous navigation systems.
2. Related Work
In recent decades, many innovative approaches for UAV state estimation have been proposed, leveraging different types of sensors. Among these, vision-based and LiDAR-based methods have gained substantial attention due to their ability to provide rich environmental data for accurate localization and mapping. Researchers have extensively explored the fusion of visual and inertial sensors, given their complementary properties in addressing UAV navigation challenges [7].
For state estimation, sensors such as IMUs are frequently used in fusion designs that can be broadly categorized into loosely coupled and tightly coupled approaches. In loosely coupled systems, sensor outputs are independently processed and subsequently fused, offering simplicity and flexibility when integrating diverse sensors. However, tightly coupled systems have gained increasing preference due to their ability to process raw sensor data directly, such as utilizing raw IMU measurements in pose estimation. This allows for more accurate state estimation, especially in scenarios with high dynamic motion or challenging environmental conditions. Papers [8,9] propose tightly coupled methods that integrate visual and inertial data for efficient and robust state estimation. By exploiting the raw data from IMU and cameras, these methods address issues like drift and improve system robustness compared to loosely coupled alternatives.
2.1. Multi-Sensor Fusion Approaches
Current multi-sensor fusion methods can be broadly classified into filtering-based, optimization-based, and deep learning-based approaches [10].
2.1.1. Filtering-Based Methods
Filtering-based methods, such as the extended Kalman filter (EKF) and unscented Kalman filter (UKF), have been widely adopted for sensor fusion due to their computational efficiency and ability to handle real-time applications. These methods assume Gaussian noise and rely on linearization techniques to model system dynamics. However, their performance deteriorates in the presence of nonlinear models or non-Gaussian noise distributions. Furthermore, their reliance on static sensor weightings can result in suboptimal performance in dynamic and unpredictable environments.
2.1.2. Optimization-Based Methods
Optimization-based approaches address the limitations of filtering methods by formulating the state estimation problem as an optimization task. These methods, such as bundle adjustment (BA) and factor graph optimization (FGO), are well suited for handling nonlinearities and non-Gaussian noise. Although optimization methods are computationally more demanding, they provide higher precision and robustness, making them popular for applications requiring high accuracy, such as simultaneous localization and mapping (SLAM). For example, techniques that combine visual, inertial, and LiDAR data in optimization frameworks have demonstrated significant improvements in state estimation accuracy in diverse scenarios.
2.1.3. Deep Learning-Based Methods
With the rapid advancements in deep learning, researchers have increasingly explored neural network-based algorithms for sensor fusion and state estimation [11]. These methods leverage the ability of neural networks to learn complex, nonlinear relationships directly from data. For instance, networks designed for depth estimation and motion representation from image sequences have shown promise in improving pose estimation accuracy and robustness. Furthermore, neural networks can dynamically adapt sensor fusion weights based on real-time sensor reliability, enabling more robust state estimation in dynamic environments. However, the high computational cost and the need for extensive training data remain significant challenges for deploying deep learning-based methods in real-time UAV applications.
2.2. Sensor-Specific Contributions
2.2.1. Vision-Based SLAM
Vision-based approaches, such as monocular or stereo visual SLAM, utilize cameras to map the environment and estimate the UAV’s pose. These methods offer a cost-effective solution but are highly sensitive to illumination changes, feature-poor environments, and dynamic objects. Moreover, challenges such as scale ambiguity in monocular systems and computational overhead in stereo systems limit their widespread application.
2.2.2. LiDAR-Based SLAM
LiDAR systems generate dense 3D point clouds of the environment, providing high-precision spatial information that is resilient to lighting variations. Compared to vision-based SLAM, LiDAR-based SLAM demonstrates superior performance in feature-poor or dynamic environments [12]. However, LiDAR data are inherently sparse and lack semantic information, necessitating integration with other sensors such as cameras and IMUs for robust state estimation.
2.2.3. Multi-Sensor Fusion for SLAM
Recent studies highlight the importance of integrating complementary sensor types, such as cameras, LiDAR, IMU, and GNSS, to achieve robust and efficient SLAM-based navigation [13,14]. For instance, adding visual, LiDAR, or inertial factors enhances SLAM systems by improving robustness and state estimation accuracy [15,16]. Combining LiDAR and visual data mitigates the limitations of each sensor, while IMUs provide continuous data for motion prediction and noise filtering. The integration of GPS and GNSS further ensures resilience against environmental variability and provides accurate global positioning to address drift in large-scale environments.
2.3. Challenges and Opportunities
While current state estimation techniques show significant promise, challenges such as accumulated drift, sensitivity to environmental factors, and limited adaptability in dynamic scenarios persist. To address these issues, adaptive multi-sensor fusion techniques that dynamically adjust sensor weights based on environmental and operational factors have emerged as a promising solution. For example, learning-based frameworks leverage the adaptability of neural networks to dynamically compute sensor fusion weights, improving resilience and robustness in degraded conditions.
Table 1 summarizes the comparison between LSAF and other state-of-the-art methods.
The proposed LSTM-based self-adaptive fusion (LSAF) dynamic weight adjustment differs from existing methods by integrating LSTM-derived adaptive weights into the MSCKF for real-time UAV state estimation, rather than merely optimizing fusion weights offline. Unlike prior works, our approach employs an attention-based mechanism within the LSTM to dynamically prioritize sensor reliability at each time step, ensuring robustness in SLAM-based pose estimation. Additionally, our hierarchical fusion strategy combines the LSTM, SLAM, and MSCKF, making it more adaptable to real-world UAV applications, especially in GPS-denied and dynamic environments. These innovations differentiate our work from conventional LSTM-based fusion techniques.
This paper builds on this body of work by proposing a novel framework that combines the strengths of optimization-based and deep learning-based approaches. Using long short-term memory (LSTM) networks, our method dynamically computes sensor fusion weights in real time, adapting to environmental conditions and sensor reliability. This framework integrates stereo cameras, LiDAR, IMU, and GPS-RTK data into a unified system, achieving superior performance in both GPS-enabled and GPS-denied scenarios.
3. Methodology
This research aims to achieve robust and accurate UAV state estimation by integrating measurements from multiple sensors, including GPS, stereo cameras, LiDARs, and IMUs, into a unified framework. The proposed system combines a multi-state constraint Kalman filter (MSCKF) [27] with a long short-term memory (LSTM)-based self-adaptive sensor fusion mechanism. This hybrid framework dynamically adjusts sensor fusion weights based on real-time environmental conditions and sensor reliability, ensuring consistent performance in challenging scenarios, such as GPS-degraded environments, rapid motion, and feature-deprived areas.
3.1. Coordinate Systems and Sensor Calibration
To ensure consistency across multi-sensor measurements, the system defines two primary coordinate systems—the world frame (W) and the UAV body frame (B)—as shown in Figure 2, which illustrates the proposed LSAF framework. The body frame is aligned with the IMU frame for simplicity, as the IMU serves as the central reference for state propagation. Local sensors, such as stereo cameras, LiDARs, and IMUs, measure relative motion and require initialization of their reference frames. Initialization is typically performed by setting the UAV’s first pose as the origin. Global sensors, such as GPS, operate in an Earth-centered global coordinate frame and provide absolute positioning measurements. GPS data, expressed as latitude, longitude, and altitude, are converted into local Cartesian coordinates (x, y, z) for consistency with local sensor measurements.
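As an illustration of this conversion, the following minimal sketch (not the paper's implementation) maps WGS84 latitude, longitude, and altitude to a local east-north-up (ENU) Cartesian frame anchored at the first GPS fix; the function names and the choice of an ENU origin are assumptions made for exposition.

```python
# Illustrative sketch: convert GPS geodetic fixes to local Cartesian (ENU)
# coordinates relative to the first fix, so GPS shares a frame with local sensors.
import numpy as np

_A = 6378137.0               # WGS84 semi-major axis [m]
_E2 = 6.69437999014e-3       # WGS84 first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    n = _A / np.sqrt(1.0 - _E2 * np.sin(lat) ** 2)   # prime vertical radius of curvature
    x = (n + alt_m) * np.cos(lat) * np.cos(lon)
    y = (n + alt_m) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - _E2) + alt_m) * np.sin(lat)
    return np.array([x, y, z])

def geodetic_to_enu(lat_deg, lon_deg, alt_m, ref_lat_deg, ref_lon_deg, ref_alt_m):
    """Return (east, north, up) of a fix relative to a reference origin."""
    d = geodetic_to_ecef(lat_deg, lon_deg, alt_m) - geodetic_to_ecef(
        ref_lat_deg, ref_lon_deg, ref_alt_m)
    lat0, lon0 = np.radians(ref_lat_deg), np.radians(ref_lon_deg)
    rot = np.array([
        [-np.sin(lon0),                 np.cos(lon0),                0.0],
        [-np.sin(lat0) * np.cos(lon0), -np.sin(lat0) * np.sin(lon0), np.cos(lat0)],
        [ np.cos(lat0) * np.cos(lon0),  np.cos(lat0) * np.sin(lon0), np.sin(lat0)],
    ])
    return rot @ d
```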
Offline calibration of all sensors is performed to reduce measurement biases, align coordinate frames, and ensure accurate fusion of data. This calibration accounts for sensor-specific offsets, such as biases in IMU accelerometers and gyroscopes, misalignment of LiDAR and camera frames, and GPS inaccuracies due to multipath effects or environmental interference. The calibration process ensures that measurements from all sensors are consistent and directly comparable within the fusion framework.
3.2. State Representation and Propagation
The UAV’s motion is modeled using a six-degree-of-freedom (6-DOF) representation, including position, velocity, orientation, and sensor biases. The state vector $\mathbf{x}$ is defined as follows:

$$\mathbf{x} = \begin{bmatrix} \mathbf{p}^{T} & \mathbf{v}^{T} & \mathbf{q}^{T} & \mathbf{b}_{a}^{T} & \mathbf{b}_{g}^{T} \end{bmatrix}^{T},$$

where $\mathbf{p}$ is the position of the UAV, $\mathbf{v}$ is the velocity, $\mathbf{q}$ is the orientation represented as a quaternion, $\mathbf{b}_{a}$ is the accelerometer bias, and $\mathbf{b}_{g}$ is the gyroscope bias. The state is propagated forward in time using IMU measurements of linear acceleration ($\mathbf{a}_{m}$) and angular velocity ($\boldsymbol{\omega}_{m}$) as follows:

$$\dot{\mathbf{p}} = \mathbf{v}, \qquad
\dot{\mathbf{v}} = R(\mathbf{q})\,(\mathbf{a}_{m} - \mathbf{b}_{a} - \mathbf{n}_{a}) + \mathbf{g}, \qquad
\dot{\mathbf{q}} = \tfrac{1}{2}\,\mathbf{q} \otimes \begin{bmatrix} 0 \\ \boldsymbol{\omega}_{m} - \mathbf{b}_{g} - \mathbf{n}_{g} \end{bmatrix}, \qquad
\dot{\mathbf{b}}_{a} = \mathbf{n}_{b_{a}}, \qquad
\dot{\mathbf{b}}_{g} = \mathbf{n}_{b_{g}}.$$

Here, $R(\mathbf{q})$ represents the rotation matrix derived from the quaternion $\mathbf{q}$, and $\mathbf{g}$ is the gravity vector. The terms $\mathbf{n}_{a}$, $\mathbf{n}_{g}$, $\mathbf{n}_{b_{a}}$, and $\mathbf{n}_{b_{g}}$ denote process noise, modeled as zero-mean Gaussian distributions.
3.3. Measurement Models for Multi-Sensor Integration
Each sensor provides measurements that are incorporated into the fusion framework through dedicated measurement models. These models relate sensor observations to the UAV’s state, ensuring accurate integration. The key measurement models are described below.
1. IMU measurements provide linear acceleration and angular velocity. These are modeled as follows:

$$\mathbf{a}_{m} = \mathbf{a} + \mathbf{b}_{a} + \mathbf{n}_{a}, \qquad \boldsymbol{\omega}_{m} = \boldsymbol{\omega} + \mathbf{b}_{g} + \mathbf{n}_{g}.$$

2. GPS provides absolute position measurements in the global frame:

$$\mathbf{z}_{GPS} = \mathbf{p} + \mathbf{n}_{GPS},$$

where $\mathbf{n}_{GPS}$ denotes measurement noise.

3. LiDAR generates 3D point clouds, providing precise spatial measurements. A point $\mathbf{p}_{f}$ observed in the world frame is expressed in the LiDAR frame as

$$\mathbf{z}_{LiDAR} = R_{IL}\,R(\mathbf{q})^{T}(\mathbf{p}_{f} - \mathbf{p}_{I}) + \mathbf{n}_{LiDAR},$$

where $R_{IL}$ is the rotation matrix from the IMU to the LiDAR frame, and $\mathbf{p}_{I}$ is the IMU’s position.

4. The stereo camera provides 2D projections of 3D feature points:

$$\mathbf{z}_{cam} = \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} f_{x}\,X_{c}/Z_{c} + c_{x} \\ f_{y}\,Y_{c}/Z_{c} + c_{y} \end{bmatrix} + \mathbf{n}_{cam},$$

where $(X_{c}, Y_{c}, Z_{c})$ are the 3D coordinates of a feature in the camera frame, and $(u, v)$ are the corresponding pixel coordinates.
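The following hedged sketch expresses these measurement models as plain functions; the camera intrinsics and the extrinsic rotations are placeholders rather than calibrated values from the paper.

```python
# Illustrative measurement functions matching the models above.
import numpy as np

def h_gps(p):
    """GPS observes the UAV position directly (plus noise)."""
    return p

def h_stereo(point_cam, fx=458.0, fy=458.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3-D feature expressed in the camera frame."""
    X, Y, Z = point_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def h_lidar(point_world, p_imu, R_wi, R_il):
    """Map a world point into the LiDAR frame via the IMU pose and the IMU-to-LiDAR rotation."""
    return R_il @ (R_wi.T @ (point_world - p_imu))
```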
3.4. Self-Adaptive Fusion with LSTM
Accurate state estimation for autonomous UAVs in dynamic and uncertain environments remains a critical challenge. Traditional sensor fusion methods such as the multi-state constraint Kalman filter (MSCKF) assume a fixed measurement noise covariance (R), which limits their ability to adapt to varying sensor reliability. To address this limitation, this work introduces a long short-term memory (LSTM)-based self-adaptive fusion framework, which dynamically adjusts the measurement noise covariance for each sensor based on real-time reliability assessments. By leveraging temporal dependencies in sensor data, the proposed approach improves robustness to environmental variations, sensor degradation, and measurement inconsistencies.
The LSTM model takes as input key features indicative of sensor reliability: GPS signal strength, visual feature density, LiDAR point cloud density, and IMU noise levels. These features are processed over time to generate adaptive fusion weights, which are used to modify the sensor measurement models dynamically. The LSTM network is trained offline on a dataset comprising diverse environmental conditions, including urban landscapes, forested areas, and GPS-denied spaces, with ground truth obtained from GPS-RTK and high-accuracy SLAM systems. The ability of the LSTM to learn and generalize from these varied conditions enables it to adjust sensor fusion parameters optimally in real time, improving the overall accuracy and robustness of UAV state estimation.
Table 2 summarizes the LSTM-based self-adaptive multi-sensor fusion (LSAF) framework, which enhances UAV state estimation by dynamically weighting multi-sensor inputs. The framework integrates data from GPS, IMU, stereo cameras, and LiDAR, leveraging an LSTM model to extract temporal dependencies and compute adaptive sensor reliability scores. These weights dynamically adjust sensor contributions to SLAM-based pose estimation and MSCKF-based state correction, improving accuracy and robustness. The model architecture comprises two LSTM layers followed by a time-distributed dense layer, trained using the mean squared error (MSE) loss function and optimized via the Adam optimizer over 1000 epochs. Unlike traditional fusion techniques, the LSTM updates sensor weights at each time step, allowing for real-time adaptation to environmental variations. By assigning higher weightage to more reliable sensors, the system ensures precise state estimation, particularly in GPS-denied environments, high-speed maneuvers, and featureless conditions, ultimately enhancing UAV navigation and autonomous flight performance. Algorithm 1 presents the training-phase steps of the proposed LSAF process.
Algorithm 1 Proposed LSTM-based self-adaptive multi-sensor fusion (LSAF) training phase
1: Input: multi-sensor measurements; ground truth values G; LSTM model parameters; initial state estimate and covariance; measurement and process noise covariances; convergence threshold
2: Output: trained LSTM parameters and the final state estimate
3: Step 1 (Initialization): initialize the LSTM model parameters; set the noise covariances; define the training parameters (number of epochs N, learning rate)
4: Step 2 (Training phase, for each epoch)
5: for epoch = 1 to N do
6:     Encode sensor data: compute the hidden states using the LSTM
7:     Compute adaptive weights: apply the attention mechanism to the hidden states
8:     Update noise covariance: adjust the sensor uncertainty using the adaptive weights
9:     Predict next state: propagate the state using the motion model
10:    Compute Kalman gain: optimize the state estimation
11:    State and covariance update: apply the Kalman filter correction
12:    Compute loss: evaluate the mean squared error (MSE) against the ground truth G
13:    Update model parameters: adjust the LSTM weights using the Adam optimizer
14: end for
15: Step 3: Output the final state estimation
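As a rough sketch of one Algorithm 1 training step, the snippet below assumes the Kalman-filter roll-out is written with differentiable TensorFlow operations so that the MSE loss on the fused states can back-propagate into the LSTM; the function names, tensor shapes, and the differentiability assumption are illustrative rather than taken from the paper.

```python
# Hypothetical single training step mirroring Algorithm 1.
import tensorflow as tf

def lsaf_train_step(lstm_model, optimizer, sensor_seq, ground_truth, kalman_rollout):
    """sensor_seq: (batch, T, n_features) reliability features;
    ground_truth: (batch, T, state_dim) RTK/SLAM reference states;
    kalman_rollout: differentiable function running predict/update with adapted noise."""
    with tf.GradientTape() as tape:
        fusion_weights = lstm_model(sensor_seq, training=True)     # adaptive weights per time step
        fused_states = kalman_rollout(sensor_seq, fusion_weights)  # states from the weighted filter
        loss = tf.reduce_mean(tf.square(fused_states - ground_truth))  # MSE against ground truth
    grads = tape.gradient(loss, lstm_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, lstm_model.trainable_variables))  # Adam step
    return loss
```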
3.5. Proposed LSTM-Based Multi-Sensor Fusion Architecture
The proposed LSTM-based multi-sensor fusion framework is designed to effectively integrate long-term temporal dependencies into sensor data, enabling robust and adaptive fusion. The architecture, illustrated in Figure 3, consists of two sequential LSTM layers followed by a time-distributed dense layer, ensuring optimal processing of time-series sensor inputs.
The proposed architecture is designed to efficiently process sequential multi-sensor data for adaptive state estimation in UAV applications. At the core of this framework is the multi-sensor input layer, which aggregates data from various sources, including inertial measurement units (IMU), LiDAR, GPS, and stereo cameras. This structured representation ensures that the model can effectively capture variations in sensor reliability over time, providing a robust foundation for subsequent processing. By concatenating information from different sensor modalities, the input layer creates a time-series feature space that allows the network to analyze both spatial and temporal correlations in sensor data.
The first LSTM layer, comprising 128 units, plays a crucial role in capturing long-term dependencies in sensor reliability. Since real-world sensor data exhibit complex temporal dynamics, this layer enables the model to recognize patterns related to sensor degradation, noise fluctuations, and environmental interference. By leveraging its ability to retain past information through memory cells, the LSTM network ensures that historical context is incorporated into the state estimation process, allowing for more informed predictions. This is particularly valuable in scenarios where certain sensors intermittently provide unreliable measurements due to external disturbances or occlusions.
Following this, the second LSTM layer, consisting of 64 units, is responsible for refining the temporal features extracted by the first layer. This secondary processing stage reduces the dimensionality of the extracted feature set while preserving the most relevant sequential information. By compressing high-dimensional sensor data into a more compact representation, the network becomes more efficient in distinguishing meaningful trends from noise. The stacking of LSTM layers further enhances the model’s ability to discern complex dependencies between different sensor modalities, leading to improved estimation accuracy. To maintain temporal consistency in the output, the architecture incorporates a time-distributed dense layer. Unlike conventional fully connected layers, which process entire input sequences at once, this layer applies dense transformations independently to each time step. This ensures that the predicted UAV states remain aligned with the corresponding sensor measurements, preserving the sequential integrity of the data. The time-distributed nature of this layer allows the model to generate real-time predictions without disrupting the temporal structure of the input.
The final output layer provides the estimated UAV state by incorporating adaptive fusion weights derived from past sensor behavior. These weights are dynamically adjusted based on the learned temporal dependencies, allowing the system to prioritize the most reliable sensors under varying operational conditions. The model continuously refines its predictions by leveraging historical patterns of sensor accuracy, leading to more robust and adaptive state estimation. This approach is particularly beneficial in GPS-denied environments, highly dynamic conditions, and scenarios where individual sensors experience intermittent failures. Through this structured design, the architecture effectively integrates sequential information to enhance UAV navigation and state estimation accuracy in challenging environments.

The proposed model is optimized using the mean squared error (MSE) loss function, which is well suited for regression tasks as it minimizes the squared differences between predicted and actual values. This approach ensures that larger errors are penalized more heavily, leading to more precise predictions. For optimization, the Adam optimizer is utilized due to its adaptive learning rate and ability to efficiently handle complex datasets. Adam’s combination of momentum and adaptive gradient-based optimization contributes to faster and more stable convergence. To evaluate the model’s predictive performance, the mean absolute error (MAE) metric is employed, as it provides a straightforward measure of the average prediction error magnitude. The training process spans 1000 epochs with a batch size of 32, ensuring effective learning without overfitting. Additionally, techniques such as early stopping or validation loss monitoring can be incorporated to enhance model robustness and prevent unnecessary overtraining.
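A minimal Keras sketch of the architecture described above (two stacked LSTM layers of 128 and 64 units, a time-distributed dense output, MSE loss, Adam optimizer, and MAE metric) is given below; the sequence length, feature count, output dimension, and softmax normalization of the weights are assumptions, not values reported in the paper.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, TimeDistributed, Dense

SEQ_LEN = 50      # time steps per training window (assumed)
N_FEATURES = 8    # per-step sensor-reliability features (assumed)
N_OUTPUTS = 4     # adaptive fusion weights, one per sensor modality (assumed)

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),  # long-term dependencies
    LSTM(64, return_sequences=True),                                      # refined temporal features
    TimeDistributed(Dense(N_OUTPUTS, activation="softmax")),              # per-step fusion weights (assumed normalization)
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()

# Training as described (1000 epochs, batch size 32), given prepared arrays
# X of shape (samples, SEQ_LEN, N_FEATURES) and y of shape (samples, SEQ_LEN, N_OUTPUTS):
# history = model.fit(X, y, validation_split=0.2, epochs=1000, batch_size=32)
```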
3.6. LSTM Cell Mechanism for Self-Adaptive Fusion
To achieve real-time adaptation of the sensor fusion weights, the LSTM cell operates at each time step to adjust the measurement noise covariance matrix dynamically. Figure 4 illustrates the internal mechanism of the LSTM cell, detailing its role in self-adaptive sensor fusion.
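One plausible realization of this mechanism, assuming an inverse-scaling rule in which a low reliability weight inflates the corresponding sensor's covariance, is sketched below; the nominal covariance values and the flooring constant are illustrative, not taken from the paper.

```python
import numpy as np

R_NOMINAL = {
    "gps":    np.diag([0.5, 0.5, 1.0]) ** 2,      # nominal GPS position noise [m^2]
    "lidar":  np.diag([0.05, 0.05, 0.05]) ** 2,   # nominal LiDAR point noise [m^2]
    "stereo": np.diag([1.5, 1.5]) ** 2,           # nominal reprojection noise [px^2]
}

def adapt_covariances(weights, eps=1e-3):
    """weights: dict of per-sensor reliability scores in (0, 1] from the LSTM for the
    current time step. A low weight inflates that sensor's covariance, down-weighting it."""
    return {name: R / max(weights.get(name, eps), eps) for name, R in R_NOMINAL.items()}

# Example: under a bridge the LSTM reports weak GPS reliability for this time step.
R_t = adapt_covariances({"gps": 0.1, "lidar": 0.9, "stereo": 0.6})
```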
3.6.1. LSTM Training and Validation Loss
The training and validation loss curves, as shown in Figure 5, display a steady and consistent decrease over the course of 1000 epochs. This behavior signifies the model’s effective learning of temporal patterns from the multi-sensor dataset. The training loss starts with a high initial value, reflecting the model’s early attempts to understand the complexities of the dataset. Over successive epochs, the loss steadily declines as the LSTM architecture refines its understanding of the data.
The minimal gap between the training and validation loss curves demonstrates effective generalization, indicating that the model avoids overfitting to the training data. This alignment underscores the robustness of the chosen hyperparameters, including the learning rate, batch size, and architecture depth, in achieving optimal learning performance. The observed convergence validates the model’s suitability for capturing sequential patterns in multi-sensor data, making it highly reliable for downstream applications.
3.6.2. Mean Absolute Error (MAE) Analysis
The MAE curves for training and validation, depicted in Figure 6, reveal a consistent decline over 1000 epochs, highlighting the model’s ability to minimize prediction errors. The MAE metric evaluates the absolute difference between the predicted and actual values, making it an effective measure for assessing prediction accuracy.
The training and validation MAE curves are closely aligned, indicating that the model generalizes well to unseen data without significant overfitting. The steady convergence of these curves suggests that the proposed LSTM-based framework is highly effective in learning the temporal dependencies in the multi-sensor dataset. This highlights the model’s ability to accurately predict sequential data, even in the presence of noise and variability in the sensor measurements.
3.6.3. Validation of the Proposed Framework
The results validate the efficacy of the proposed LSTM-based self-adaptive multi-sensor fusion (LSAF) framework. The combination of temporal pattern learning via the LSTM and its ability to minimize both loss and MAE ensures a comprehensive solution for dynamic system modeling. The ability of the LSAF framework to generalize to unseen data while maintaining precise predictions makes it highly reliable for complex applications such as autonomous navigation and simultaneous localization and mapping (SLAM).
The convergence of training and validation metrics highlights the robustness and adaptability of the system. These attributes make the proposed pipeline a reliable approach for handling real-world multi-sensor fusion challenges in dynamic environments.
3.7. Fusion Framework Using MSCKF
The proposed LSTM-based self-adaptive multi-sensor fusion (LSAF) framework is designed for real-time UAV state estimation by dynamically integrating data from GPS, IMU, stereo cameras, and LiDAR. As illustrated in Figure 7, the system employs multiple onboard sensors, including two Livox MID-360 LiDARs, a DJI front stereo camera, a DJI IMU, and a GPS-RTK system, ensuring a comprehensive perception of the environment. The IMU provides high-frequency motion tracking, while GPS-RTK offers precise global positioning. The LiDARs generate dense 3D environmental maps, and the stereo camera enhances spatial perception, particularly in visually rich environments. To efficiently process these multimodal sensor data, an LSTM network extracts temporal dependencies and evaluates sensor reliability. The attention-based mechanism within the LSTM model computes adaptive fusion weights, dynamically adjusting the measurement noise covariance to prioritize the most reliable sensors in real time. The weighted multi-sensor measurements are then passed to a SLAM-based pose estimation module, which fuses all available sensors (stereo cameras, LiDAR, IMU, and GPS-RTK), ensuring robust localization. The proposed algorithm builds on and enhances VINS SLAM [4]. The LSTM-derived reliability scores influence SLAM by assigning higher weightage to more reliable sensors, thereby enhancing pose estimation accuracy. When all sensors are available, SLAM produces an optimal UAV state estimate.

Following SLAM-based pose estimation, the multi-state constraint Kalman filter (MSCKF) is employed to propagate the UAV state and refine it based on the LSTM-adapted fusion weights, ensuring consistency in state propagation and correction. This adaptive approach mitigates the effects of sensor degradation, noise, and environmental uncertainties, improving the UAV’s robustness in GPS-denied areas and high-speed motion conditions. The state propagation step follows the motion model:

$$\hat{\mathbf{x}}_{k|k-1} = f(\hat{\mathbf{x}}_{k-1|k-1}, \mathbf{u}_{k}), \qquad P_{k|k-1} = F_{k}\,P_{k-1|k-1}\,F_{k}^{T} + Q_{k}.$$

Once new sensor measurements are available, the Kalman gain is computed to optimally integrate observations:

$$K_{k} = P_{k|k-1} H_{k}^{T}\left(H_{k} P_{k|k-1} H_{k}^{T} + R_{k}\right)^{-1},$$

where $R_{k}$ is the LSTM-adapted measurement noise covariance.
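The sketch below mirrors these equations with a linearized measurement model and the LSTM-adapted covariance; a full MSCKF additionally maintains a sliding window of camera poses and nullspace-projected visual constraints, which are omitted here for brevity, and the matrix names are illustrative.

```python
import numpy as np

def predict(x, P, F, Q):
    """Propagate the state and covariance with the (linearized) motion model F."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R_adapted):
    """Correct the state with a measurement z whose noise covariance was
    rescaled by the LSTM reliability weight for that sensor."""
    S = H @ P @ H.T + R_adapted                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x_new = x + K @ (z - H @ x)                  # state correction
    P_new = (np.eye(P.shape[0]) - K @ H) @ P     # covariance correction
    return x_new, P_new
```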
By incorporating LSTM-based sensor reliability assessments into the MSCKF, the framework dynamically adapts to changing sensor conditions, enhancing robustness in GPS-denied environments and complex dynamic flight scenarios. The complete algorithmic workflow is detailed in Algorithm 2, outlining sensor preprocessing, adaptive sensor fusion, SLAM-based pose estimation, and MSCKF-based state correction.
Algorithm 2 LSTM-based self-adaptive multi-sensor fusion (LSAF) algorithm
Input: multi-sensor inputs (GPS-RTK, IMU, stereo camera, LiDAR); pre-trained LSTM model for adaptive fusion; previous UAV state estimate (position, velocity, orientation); measurement and process noise covariances
Step 1: Sensor data preprocessing.
Step 2: Adaptive sensor fusion: compute the dynamic sensor reliability weights; adjust the measurement noise covariance dynamically; compute the fused sensor measurement.
Step 3: UAV state prediction: predict the initial UAV state using IMU measurements; propagate the state covariance.
Step 4: LSTM-guided multi-sensor SLAM-based pose estimation: fuse all available sensors (stereo camera, LiDAR, IMU, and GPS-RTK) for SLAM-based pose estimation; assign higher weightage to more reliable sensors based on the LSTM reliability scores; compute the weighted SLAM pose estimate; ensure global consistency using GPS-RTK when available; if all sensors degrade, rely on IMU-based odometry.
Step 5: State correction using the MSCKF.
Algorithm 2 thus enables real-time adaptation to sensor reliability, significantly improving UAV navigation accuracy and making the framework well suited for challenging flight conditions, including environments with limited GPS visibility, rapid motion dynamics, and feature-deprived landscapes.
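As one plausible realization of the weighted SLAM pose estimation in Step 4, the sketch below combines per-sensor position estimates as a reliability-weighted average and orientations via a weighted quaternion mean (the eigenvector method); this is illustrative and not the paper's exact fusion rule.

```python
import numpy as np

def fuse_poses(positions, quaternions, weights):
    """positions: (n, 3) per-sensor position estimates; quaternions: (n, 4)
    scalar-first unit quaternions; weights: (n,) LSTM reliability scores."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize the reliability weights
    p_fused = w @ np.asarray(positions)               # weighted mean of positions
    # Weighted quaternion mean via the dominant eigenvector of the weighted outer products.
    Q = np.asarray(quaternions)
    M = sum(wi * np.outer(qi, qi) for wi, qi in zip(w, Q))
    eigvals, eigvecs = np.linalg.eigh(M)
    q_fused = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
    return p_fused, q_fused / np.linalg.norm(q_fused)
```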
3.8. Advantages of the Proposed Framework
The proposed framework combines the strengths of traditional filtering methods with modern deep learning techniques, enabling robust UAV state estimation in real time. The LSTM-based self-adaptive fusion mechanism allows the system to dynamically prioritize sensor contributions based on their reliability, improving robustness in challenging environments. The integration of the MSCKF ensures computational efficiency and consistency, making the system suitable for real-time UAV operations in diverse scenarios.
3.9. Experimental Setup and Dataset
The experiments were carried out in an open-field outdoor environment, as shown in Figure 8. The dataset was collected in a wide open lawn area with minimal features, such as sparse distant trees and limited structural elements. The environment presented significant challenges for single-sensor SLAM approaches due to the lack of features and bright, sunny conditions that degraded stereo- and LiDAR-based odometry. The UAV platform was handheld during data collection to simulate various motion patterns, and the dataset included asynchronous measurements from all sensors. Sensor data were fused using event-based updates, where the state was propagated to the timestamp of each measurement. Calibration parameters, such as camera-LiDAR-IMU extrinsics, were estimated offline and incorporated into the extended state vector for accurate fusion.
The offline calibration of the proposed system consists of three key components: estimation of the stereo camera’s intrinsic and extrinsic parameters, determination of the IMU-camera extrinsic offset, and calibration of the LiDAR-IMU transformation. To estimate both intrinsic and extrinsic parameters of the stereo camera, we employ the well-established Kalibr calibration toolbox [29], ensuring precise alignment between the camera and IMU. For 3D LiDAR-IMU calibration, we utilize the state-of-the-art LI-Init toolbox [30], which provides a robust real-time initialization framework for LiDAR-inertial systems, compensating for temporal offsets and extrinsic misalignments. To evaluate the robustness of the proposed approach under diverse conditions, we collected multiple datasets across three scenarios, using both handheld and UAV-mounted configurations. The datasets, referred to as the UL Outdoor Car Parking Dataset, the UL Outdoor Handheld Dataset, and the UL Car Bridge Dataset, were recorded at the University of Limerick campus within the CRIS Lab research group.
Figure 8 illustrates the experimental environments, while Table 3 details the UAV hardware and sensor specifications. To address the challenge of asynchronous sensor data, we employ first-order linear interpolation to estimate the IMU pose at each sensor’s measurement time, mitigating time bias without significantly increasing computational overhead. Instead of direct event-based updates, this method ensures that sensor data are aligned with a consistent reference frame, preventing oversampling of high-frequency IMU data or undersampling of low-frequency GPS and LiDAR measurements. Additionally, ROS-based timestamp synchronization of the DJI-OSDK and Livox LiDAR nodes further minimizes timing inconsistencies, enhancing fusion accuracy and reducing drift in state estimation.
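The first-order interpolation step can be sketched as follows, assuming positions are interpolated linearly and orientations by spherical linear interpolation (slerp) between the two IMU poses bracketing a sensor timestamp; the function names and quaternion convention are illustrative.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (scalar-first)."""
    dot = np.dot(q0, q1)
    if dot < 0.0:                       # take the short path on the quaternion sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                    # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_imu_pose(t_query, t0, p0, q0, t1, p1, q1):
    """First-order interpolation of the IMU pose at a sensor measurement time."""
    alpha = (t_query - t0) / (t1 - t0)
    return p0 + alpha * (p1 - p0), slerp(q0, q1, alpha)
```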
The proposed method was evaluated without loop closure mode to assess its consistency and robustness. Performance metrics, including the absolute pose error (APE) and the root mean square error (RMSE), were calculated to quantify the accuracy of the estimated trajectory [31]. The comparison focused on the ability to mitigate cumulative errors and maintain robust state estimation across large-scale environments. The details of the hardware used during the experiments are listed in Table 3.
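For reference, a minimal sketch of the translational APE and its RMSE is shown below, assuming the estimated and ground-truth trajectories are already time-associated and aligned as (N, 3) arrays; trajectory-evaluation tools such as evo automate the association and alignment steps.

```python
import numpy as np

def ape_rmse(traj_est, traj_gt):
    """Return the per-pose translational APE and its RMSE."""
    errors = np.linalg.norm(np.asarray(traj_est) - np.asarray(traj_gt), axis=1)
    return errors, float(np.sqrt(np.mean(errors ** 2)))

# Example with a dummy two-pose trajectory:
errs, rmse = ape_rmse([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0]],
                      [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```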
4. Results and Comparison
The evaluation of the proposed LSTM-based self-adaptive multi-sensor fusion system was conducted on a collected dataset using a UAV equipped with state-of-the-art sensors, including two Livox Mid360 LiDARs (facing forward and downward), front-facing stereo cameras, an IMU, and GPS-RTK. The hardware configuration is summarized in Table 3. The UAV configuration and experimental setup are shown in Figure 8. These sensors provide complementary modalities that are dynamically fused using the proposed system, leveraging the LSTM-based approach to adaptively weigh sensor contributions based on their reliability and environmental conditions.
4.1. UL Outdoor Car Parking Dataset
The car parking dataset provides a complex testing environment with open air spaces, tree shadows, and dynamic illumination changes, as depicted in Figure 8. This experiment assessed the LSAF approach in a large-scale outdoor setting without loop closure to verify the robustness of the proposed methodology. The UAV operated in a vast, open lawn with minimal tree coverage and bright sunlight conditions that significantly challenge stereo-based odometry systems. These scenarios lead to frequent failures in vision-only or LiDAR-only methods.
During the UAV’s navigation over the parking area, most of the LiDAR-detected features were confined to the ground, resulting in degraded motion estimation. The trajectory plots, when compared to ground truth RTK data, showed that FASTLIO2 [17] suffered from substantial errors due to LiDAR degradation. Additionally, VINS-Fusion (stereo-inertial) [4] performed poorly, exhibiting the highest position drifts, while VINS-Fusion (stereo-IMU-GPS) [4] showed noticeable drifts under these conditions. Sparse LiDAR features in the dataset further impacted LiDAR-based methods like FASTLIO2. However, the proposed LSAF system, leveraging stereo, IMU, LiDAR, and GPS with a pre-trained LSTM-based deep learning model, provided enhanced UAV state estimation and consistently smoother trajectories in this challenging environment.
Figure 9 displays the trajectories obtained using different methods, while Figure 10 highlights the box plots showing the overall APE of each strategy. Table 4 provides the RMSE values for each method, and Figure 11 and Figure 12 present the absolute position errors (x, y, z) and orientation errors (roll, pitch, and yaw) for the UL outdoor car parking dataset.
4.2. UL Outdoor Handheld Dataset
In this experiment, we employed a custom-designed UAV sensor suite to evaluate the capabilities of our proposed framework. The RTK position was used as ground truth, leveraging the high-quality GPS signal recorded throughout the experiment. Data collection was performed using a handheld UAV method while navigating an outdoor environment. This setup presented challenges such as image degradation, structureless surroundings, dynamic targets, and unstable feature conditions that are particularly difficult for vision-based and LiDAR-based methods.
To validate the consistency of the proposed LSAF framework, the experiment was conducted without loop closure. The handheld mode eliminated the noise typically introduced during flight missions, providing a clean dataset to assess the proposed LSAF under challenging conditions. State estimation was performed using LSAF across various sensor combinations and compared with state-of-the-art (SOTA) methods such as VINS-Fusion [4] and FASTLIO2 [17].
Figure 13 displays the trajectories obtained using different methods, while Figure 14 highlights the box plots showing the overall APE of each of the strategies. Table 5 provides the RMSE values for each method, and Figure 15 and Figure 16 present the absolute position errors (x, y, z) and orientation errors (roll, pitch, and yaw) for the handheld UAV dataset.
The results demonstrate that significant position drifts occurred in the stereo IMU-only scenario. However, accuracy improved considerably when LiDAR, GPS, or their combination was integrated. VINS-Fusion exhibited growing errors due to accumulated drift, whereas LSAF maintained a smooth trajectory consistent with the ground truth. Unlike VINS-Fusion and FASTLIO2, which failed to align precisely with the reference data, LSAF achieved superior performance by leveraging the LSTM-based self-adaptive multi-sensor fusion (LSAF) framework and MSCKF fusion.
The system was compared with state-of-the-art algorithms, including VINS-Fusion [4] and FASTLIO2 [17]. VINS-Fusion integrates visual inertial odometry with/without GPS data, while FASTLIO2 employs LiDAR inertial odometry combined with global optimization, including loop closure. In comparison, the proposed method utilizes an LSTM-based adaptive weighting mechanism to enhance robustness against sensor degradation and environmental variability, ensuring accurate and reliable state estimation in dynamic conditions.
Quantitative Analysis
Table 4 summarizes the results of the accuracy evaluation for the proposed method and the benchmark algorithms on the UL outdoor car parking dataset. The proposed system, which incorporates adaptive fusion based on LSTM, outperformed both VINS-Fusion and FASTLIO2 in terms of maximum, mean, and RMSE metrics. Specifically, the proposed method achieved the lowest RMSE values of 0.328436 (with LSAF) and 0.385019 (without LSAF), compared with FASTLIO2 (0.385019), VINS-Fusion (S+I) (9.45291), and VINS-Fusion (S+I+G) (9.438278). The maximum error for the proposed system was 0.889442, which was notably lower than that of the other methods, indicating better robustness to outliers and sensor degradation.
Table 5 presents the accuracy evaluation results for the proposed method and benchmark algorithms on the UL outdoor handheld dataset, highlighting the superior performance of the proposed system incorporating LSTM-based self-adaptive fusion (LSAF). The proposed method achieved the lowest RMSE of 0.598172, significantly outperforming benchmark methods such as FASTLIO2 (6.830505), VINS-Fusion (S+I+G) (6.846302), and VINS-Fusion (S+I) (7.18024). Additionally, the proposed system demonstrated a maximum error of 2.982927, which is substantially lower compared to the other methods, reflecting its robustness to outliers and sensor degradation.
The mean and median errors for the proposed method were also the lowest, at 0.525667 and 0.450802, respectively, showcasing its consistent accuracy. In contrast, methods like VINS-Fusion and FASTLIO2 exhibited significantly higher errors due to their limitations in handling dynamic environments and sensor noise. The results further emphasize the advantages of incorporating LSTM-based adaptive fusion for enhanced performance in challenging real-world scenarios. The performance gap between the proposed system with and without LSAF also highlights the critical role of adaptive fusion in reducing positional errors and ensuring reliable state estimation.
4.3. UL Car Bridge Dataset
The experimental evaluation was conducted at the University of Limerick’s Car Bridge, where the UAV was flown both above and beneath the bridge to assess the performance of the proposed LSAF framework under varying environmental conditions, as shown in Figure 17. Ground truth reference data for LSAF was obtained using an RTK system, while the UAV was manually operated by a trained pilot. To validate the global consistency of LSAF, the experiment was performed without loop closure. The test environment presented significant challenges, including rapid illumination changes, agile UAV maneuvers, and GPS-degraded conditions, as illustrated in the trajectory (Figure 17). These conditions are particularly demanding for visual-inertial odometry (VIO) and LiDAR odometry (LO) methods, where the proposed LSAF fusion approach demonstrated superior performance compared to state-of-the-art techniques. Throughout the experiment, the UAV maintained a strong RTK signal lock, ensuring fixed position accuracy for most of the flight. However, while navigating under the bridge, the number of visible GPS satellites temporarily dropped to 11, causing intermittent signal degradation in this challenging environment.
This experiment evaluates the global consistency of the proposed LSAF framework in a challenging environment with unstable and noisy GPS signals, particularly under a bridge, where localization accuracy and trajectory smoothness are significantly affected. The results demonstrate that LSAF effectively mitigates single sensor drift, maintaining global consistency and ensuring smooth local trajectory estimation despite degraded GPS conditions.
Figure 18 and Figure 19 illustrate the absolute position errors in the x, y, and z coordinates and the roll, pitch, and yaw angles, comparing multiple methods on the UAV car bridge dataset, while Table 6 presents the corresponding RMSE values for each approach. Figure 20 shows a box plot of the overall relative pose error (RPE) for five different strategies, demonstrating that LSAF outperforms other state-of-the-art (SOTA) methods.
4.4. Qualitative Analysis and Trajectory Comparison
The trajectory plots illustrated in Figure 9, Figure 13, and Figure 17 compare the estimated trajectories of the proposed method, FASTLIO2, and VINS-Fusion. The proposed method demonstrates superior consistency and alignment with the ground truth provided by RTK, particularly in regions with sparse LiDAR and stereo features. In contrast, FASTLIO2 exhibits significant drift in regions with degraded LiDAR feature density, while VINS-Fusion suffers from cumulative errors due to visual degradation under high illumination.
The absolute pose error (APE) plots for each axis in Figure 11, Figure 15, and Figure 18 further highlight the advantages of the proposed system. The box plots in Figure 10, Figure 14, and Figure 20 compare the overall APE distributions for all methods, showing that the proposed method achieves the smallest error spread and highest accuracy across the dataset. The LSTM-based adaptive fusion mechanism effectively mitigates sensor-specific errors by dynamically adjusting sensor contributions in real time. For instance, in regions where LiDAR features are sparse, the LSTM assigns higher weights to IMU, GPS, or stereo camera data, thereby maintaining accurate state estimation.
4.5. Analysis of LSTM-Based Adaptive Fusion
The inclusion of the LSTM-based adaptive fusion mechanism introduces several advantages over traditional fixed-weight fusion approaches. The dynamic weighting process enables the system to adapt to environmental changes and sensor degradation. For example, in bright outdoor conditions, the LSTM down-weights stereo camera data when visual feature density is low, prioritizing LiDAR and IMU measurements instead. Similarly, in areas with sparse LiDAR features, GPS data are weighted more heavily to mitigate drift.
Figure 9, Figure 13, and Figure 17 demonstrate the proposed method’s ability to maintain trajectory accuracy despite varying sensor reliability. This is further supported by the quantitative metrics in Table 4, Table 5, and Table 6, which show that the proposed system consistently outperforms the benchmark methods in all scenarios. The adaptive nature of the LSTM allows the system to handle asynchronous and noisy sensor measurements more effectively than traditional Kalman filter-based fusion approaches.
4.6. System Robustness and Computational Efficiency
The proposed system is designed to be robust to individual sensor failures, ensuring continuous operation in challenging scenarios. For instance, temporary GPS signal loss or degraded LiDAR performance does not significantly impact the overall state estimation due to the self-adaptive fusion mechanism. This resilience is critical for real-world UAV applications, where sensor reliability can vary due to environmental factors.
The computational efficiency of the proposed system was validated on an Ubuntu Linux laptop equipped with an Intel Core(TM) i7-10750H CPU (3.70 GHz) and 32GB of memory. The implementation utilized C++ and ROS, ensuring real-time performance with minimal latency. The inclusion of the LSTM, while adding computational complexity, was optimized using hardware acceleration, ensuring that the system operates within real-time constraints.
6. Conclusions and Future Work
This study introduces a novel LSTM-based self-adaptive multi-sensor fusion framework aimed at improving UAV state estimation accuracy and robustness. The proposed approach dynamically adjusts sensor fusion weights in real time, leveraging an LSTM network to account for varying environmental conditions and sensor reliability. By integrating measurements from GPS, LiDAR, stereo cameras, and IMU, the system effectively addresses challenges posed by GPS-degraded environments, sparse feature areas, and high motion dynamics. The framework was validated on real-world datasets collected using a UAV platform in challenging outdoor environments. Experimental results demonstrate that the proposed fusion framework outperforms state-of-the-art methods such as VINS-Fusion (S+I), VINS-Fusion (S+I+G) and FASTLIO2 (L+I), as well as approaches without LSAF fusion, achieving superior trajectory accuracy and consistency. The incorporation of the LSTM-based adaptive weighting mechanism significantly enhances the system’s ability to handle sensor degradation and environmental variability. In scenarios where traditional fixed-weight fusion methods struggle, such as in bright, sunny conditions with degraded stereo or sparse LiDAR features, the LSTM dynamically prioritizes the most reliable sensors, ensuring robust and accurate state estimation. This adaptability is a key innovation that bridges the gap between traditional filtering techniques and modern deep learning approaches for UAV navigation.
Despite the system’s demonstrated success, there remain opportunities for further enhancement. Future work could focus on extending the proposed framework to broader and more diverse datasets, particularly in GPS-denied environments such as dense urban canyons, forested areas, or indoor spaces. Incorporating additional sensor modalities, such as radar or thermal cameras, could further enhance robustness in low-visibility conditions or adverse weather. Moreover, improving the LSTM model by incorporating uncertainty estimation techniques, such as Bayesian neural networks, could provide better confidence measures for the adaptive weighting process. Another promising direction lies in the optimization of computational efficiency to enable deployment on smaller, resource-constrained UAV platforms. Techniques such as model pruning, quantization, or the use of edge AI hardware could be explored to reduce the computational overhead of the LSTM while maintaining real-time performance. Additionally, investigating the integration of reinforcement learning into the fusion framework could enable the system to autonomously adapt to new environments during operation without the need for extensive retraining.
In conclusion, the proposed LSTM-based self-adaptive fusion framework represents a significant advancement in UAV state estimation, combining the strengths of traditional Kalman filtering with the flexibility of modern deep learning. The demonstrated robustness and adaptability of the system position it as a valuable contribution to the field of autonomous UAV navigation, with the potential for further enhancements and applications in diverse operational scenarios.