Article

Stereo Event-Based Visual–Inertial Odometry

by Kunfeng Wang, Kaichun Zhao, Wenshuai Lu and Zheng You *
Department of Precision Instrument, Tsinghua University, Beijing 100080, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(3), 887; https://doi.org/10.3390/s25030887
Submission received: 30 December 2024 / Revised: 27 January 2025 / Accepted: 30 January 2025 / Published: 31 January 2025
(This article belongs to the Section Intelligent Sensors)

Abstract:
Event-based cameras are a new type of vision sensor in which pixels operate independently and respond asynchronously to changes in brightness with microsecond resolution, instead of providing standard intensity frames. Compared with traditional cameras, event-based cameras offer low latency, no motion blur, and high dynamic range (HDR), which makes it possible for robots to deal with some challenging scenes. We propose a visual–inertial odometry for stereo event-based cameras based on an Error-State Kalman Filter (ESKF). The vision module updates the pose by relying on the edge alignment of a semi-dense 3D map to a 2D image, while the IMU module updates the pose using median integration. We evaluate our method on public datasets with general 6-DoF motion (three-axis translation and three-axis rotation), comparing the results against the ground truth and against other methods, and demonstrate the effectiveness of our approach.

1. Introduction

Simultaneous localization and mapping (SLAM) has important applications in many emerging technologies such as robotics, intelligent transportation, and augmented/virtual reality (AR/VR). There are already many works on SLAM based on traditional cameras [1,2]. However, traditional cameras can struggle in challenging situations, such as capturing high-speed motions or scenes with a high dynamic range.
Event-based cameras (or event cameras) are bio-inspired vision sensors that work very differently from traditional cameras: instead of intensity frames, they report pixel-wise intensity changes asynchronously at the time they occur, called “events” [3,4], where each event consists of its spatio-temporal coordinates and the sign of the brightness change (encoded, e.g., as 0 or 1). Event cameras come in different sensor types, such as the Dynamic Vision Sensor (DVS) [3], DAVIS, or ATIS. They do not output an intensity image at a fixed rate but a stream of asynchronous events at microsecond resolution. Event cameras offer numerous advantages over traditional cameras, including microsecond latency, low power consumption, and high dynamic range (e.g., 140 dB). With microsecond resolution, event cameras can operate during high-speed motion, which would produce blurry images from traditional cameras.
The main challenge is how to make event cameras applicable to SLAM and other vision tasks [4]. Because the output of event cameras differs from that of traditional cameras, vision algorithms designed for traditional cameras cannot be applied directly to event cameras. Thus, we need new methods to process the data from these novel cameras. There are already some works for event cameras, such as feature tracking [5,6], 3D orientation estimation [7,8,9,10], and simultaneous localization and mapping (SLAM) [11,12,13,14,15,16].
Due to the limitations in accuracy and robustness when using a single sensor for localization, an IMU is typically used as a complementary sensor when employing cameras for positioning. Current sensor fusion methods can be broadly classified into two types: optimization-based and filtering-based. Generally, optimization-based methods require higher computational power, while filtering-based methods demand less computational effort but may offer slightly lower accuracy. Given that several optimization-based fusion approaches have already been proposed, we introduce a filtering-based fusion scheme, aiming to provide a new solution for stereo event camera–IMU fusion. In this paper, we propose a stereo visual–inertial odometry (VIO) for event cameras with arbitrary 6-DoF motion. As shown in Figure 1, the upper left corner displays an image captured by a traditional camera, which does not contribute to our algorithm. The image on the right illustrates the running performance of our algorithm, including the local point cloud and trajectory (shown as a black line). Our pipeline has three parts: the vision module, IMU integration, and the ESKF. For the vision module, we follow the strategy of ESVO [16]; IMU data are propagated with median integration, and the fusion strategy is the ESKF. Our work can be summarized as follows:
  • A novel visual–inertial odometry is presented for stereo event cameras based on ESKF.
  • Our method relies solely on visual information from event cameras and does not incorporate traditional cameras. Some approaches use images from traditional cameras, and in certain challenging scenarios, the failure of the traditional camera can lead to system failure, negatively impacting the performance of the event camera.
  • A quantitative evaluation of our pipeline against other methods on public event camera datasets demonstrates the effectiveness of our system.

2. Related Work

In the past few years, many works have been proposed to use event cameras for ego-motion estimation. Here, we review some of the literature [4].

2.1. Event-Based Depth Estimation

(a) Monocular: Depth estimation using a single event camera has been explored in studies such as [11,17,18]. This problem differs significantly from frame-based depth estimation because of the challenge of matching events over time. These methods produce a semi-dense 3D point cloud (referred to as a 3D edge map) of the scene, leveraging the camera’s motion information over time. Thus, they are not designed for instantaneous depth estimation but rather concentrate on depth estimation for SLAM/VO applications.
(b) Stereo: Many studies on depth estimation with event cameras utilize events over very short time periods from two or more rigidly attached, synchronized event cameras whose events are timestamped with a common clock. Following the classic depth estimation approach, the process first matches events across image planes and then triangulates the location of the 3D points [19]. The challenge is to find correspondences between events at different moments. Events are matched either (i) using traditional stereo vision matching metrics (e.g., normalized cross-correlation) on event frames [20,21] or time-surfaces [22,23], or (ii) by exploiting the temporal correlations of events between different sensors [22,24,25]. Most of these methods are suited to simple scenes with few objects, which facilitates finding correspondences.

2.2. Event-Based 3-DOF Estimation

The 3-DOF estimation includes rotation [7,8,9,10] and planar motion [26].
Cook et al. [7] proposed an algorithm that relies on an event camera to estimate ego-motion, image intensity, and optical flow with an interacting network; however, it was only applied to pure rotational motion. The filter-based approach in [8] used probabilistic filters in parallel to track the event camera’s three-degree-of-freedom rotation while creating images of natural scenes. Rotational motion estimation was also implemented in [9], where camera pose estimation was achieved by minimizing the photometric error at the event locations with a probabilistic edge map. A motion compensation optimization framework was introduced in [10] to estimate the angular velocity of the camera rather than its absolute rotation. All of the above works are limited to rotation estimation and do not handle translation.
Weikersdorfer et al. [26] proposed a 2D SLAM system based on an event camera, which was limited to planar motion and scenes with rich textures. This method employed a particle filter for pose estimation based on the reprojection error of events relative to the local map.

2.3. Event-Based VO

Solutions for 3D SLAM with arbitrary 6-DOF motion in general scenes, using pure event cameras, have been proposed [11,12,13,14,15,16].
(a) Monocular: The approach in [11] extended [8] by using three probabilistic filters to estimate the pose, the depths of events, and intensity images. However, it consumed a lot of computing power and required a GPU to run in real time. In contrast, the semi-dense approach in [12] demonstrated that intensity image reconstruction is unnecessary for depth and pose estimation. This method performed 3D reconstruction by space sweeping [17] and edge-map alignment for pose estimation; the final SLAM system could run in real time on a CPU. The formulation in [13] was underpinned by a principled joint optimization problem involving non-parametric Gaussian Process motion modeling and incremental maximum a posteriori inference. However, its computational power consumption was also high, making it difficult to achieve real-time performance. Chamorro et al. [14] explored a new event-based line-SLAM approach following a parallel tracking and mapping design. Wang et al. [15] proposed a solution to this problem by performing contrast maximization in 3D: the 3D locations of the rays cast for each event were smoothly varied based on a continuous-time motion parametrization, and the optimal results were determined by maximizing the contrast in a volumetric ray density field.
(b) Stereo: The approach proposed in [16] addressed the solution for stereo visual odometry based on event cameras, achieving real-time performance with a standard CPU. It reconstructed a semi-dense 3D map following two steps: (i) computing depth estimates of events and (ii) fusing these depth estimates to obtain a more accurate semi-dense depth map [23].

2.4. Event-Based VIO

To improve the accuracy and robustness of visual odometry, combining it with other sensors, such as an IMU, is a common option.
Zhu et al. [27] tracked features using [5] and combined them with IMU data by way of a Kalman filter. Rebecq et al. [28] proposed to synthesize motion-compensated event images from spatio-temporal windows of events and then detect and track features. The feature tracks were fused with inertial data by means of keyframe-based nonlinear optimization to recover the camera trajectory and a sparse map of 3D landmarks. The same laboratory subsequently extended this work to integrate the event camera with a traditional camera and an IMU [29]. As opposed to the above-mentioned feature-based methods, the work in [30] optimized a combined objective with inertial and event-reprojection error terms over a segment of the camera trajectory, in the style of visual–inertial bundle adjustment. Gentil et al. [31] introduced an optimization-based framework using lines. Liu et al. [32] fused the events’ depth, time-surface images, and pre-integrated IMU data to estimate the camera motion within a sliding-window nonlinear optimization framework. There are also some works that combine images, events, and IMU data, such as [29,33]. However, we think that the advantage of event cameras lies in their ability to handle high-speed motion or high-dynamic-range scenes. While integrating event cameras with traditional cameras may offer benefits, in certain challenging scenarios the failure of the traditional camera can cause the entire system to fail, which can ultimately limit the performance of the event camera.

3. Visual–Inertial Pipeline

This section presents the details of our stereo event-based visual–inertial odometry. We start with an overview of the entire pipeline. Then, we introduce how to process event data so that their information can be used efficiently. After that, we give the details of the vision module and the IMU integration module. Finally, we present how to fuse information from different sensors with the ESKF.

3.1. Framework Overview

A flowchart of our proposed pipeline detailing all steps is illustrated in Figure 2. We propose a novel method for fusing event camera data with IMU data to achieve localization based on the ESKF. Our algorithm utilizes three parallel threads: the IMU integration thread, the mapping thread, and the tracking thread. We can obtain the current pose from both the IMU and the stereo event cameras, and then fuse these two poses using the ESKF to obtain the final fused pose. In the ESKF framework, the state variables require both predictive and observational information. The IMU provides the predictive information, while the visual module provides the observational information. Through the IMU integration thread, we obtain the current pose (pose by IMU), which serves as the predictive or prior information. The tracking thread in the visual module also provides the current pose (pose by events), which we use as the observation. When constructing the local point cloud map, we fuse the pose obtained from the IMU with the pose from the tracking thread using the ESKF; the resulting fused pose is the final pose output of the system.
Next, we provide a brief overview of the content covered in each subsection. First, the event preprocessing module converts event streams into time-surface images, which are used by the vision module (Section 3.2). Secondly, we explain how the visual module obtains poses using data from the stereo event cameras. After initialization is completed, the mapping thread of the vision module uses the time-surfaces, the events, and the current (fused) pose at this moment to update a local map (semi-dense point cloud). The tracking thread of the vision module estimates the pose of the left event camera by aligning the local point cloud with the time-surface image (Section 3.3). Finally, we introduce the IMU integration process and the implementation of the ESKF (Section 3.4).
Initialization: The vision module provides a coarse initial map by a stereo method (a modified SGM method [34]). The IMU module assumes that the initial state of the system is static [35,36], using the first 1–2 s (depending on dataset) of the IMU data to estimate the gravity direction and the biases of the gyroscope and accelerometer.
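To make this initialization step concrete, the following is a minimal sketch (not the authors' implementation) of how the gravity direction and the gyroscope/accelerometer biases might be estimated from a short static window of IMU samples; the function name static_imu_init and the synthetic data are illustrative assumptions.

```python
import numpy as np

def static_imu_init(accel, gyro, gravity_mag=9.81):
    """Estimate gravity direction and IMU biases from a static window.

    accel, gyro: (N, 3) arrays of accelerometer [m/s^2] and gyroscope [rad/s]
    samples recorded while the sensor is at rest. Simplified sketch: the gyro
    bias is the mean angular rate, the gravity direction is the normalized mean
    specific force, and the accelerometer bias is the residual after removing
    nominal gravity.
    """
    gyro_bias = gyro.mean(axis=0)                     # at rest, any rate reading is bias
    mean_acc = accel.mean(axis=0)                     # at rest, accel measures gravity + bias
    gravity_dir = mean_acc / np.linalg.norm(mean_acc)
    accel_bias = mean_acc - gravity_dir * gravity_mag
    return gravity_dir, accel_bias, gyro_bias

# Synthetic 2 s static segment sampled at 200 Hz (illustrative only)
rng = np.random.default_rng(0)
acc = np.array([0.05, -0.02, 9.81]) + 0.02 * rng.standard_normal((400, 3))
gyr = np.array([0.001, -0.003, 0.002]) + 0.001 * rng.standard_normal((400, 3))
g_dir, b_a, b_w = static_imu_init(acc, gyr)
```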

3.2. Event Representation (Time-Surface)

The output of the event camera is a stream of independent events. Each event can be described as $e_k = (u_k, v_k, t_k, p_k)$, which includes the pixel coordinate $(u_k, v_k)^T$ at which the intensity change exceeded the preset threshold, the timestamp $t_k$, and the polarity $p_k \in \{-1, 1\}$ (or $\{0, 1\}$).
Generally, we do not process events asynchronously at the very high rate they occur, but use an alternative representation called the time-surface (shown in Figure 3). It is a 2D image, and each pixel stores a value of time, which represents the timestamp of the last event at this pixel [37,38]. This converts events into an image whose ‘intensity’ represents the motion history at that coordinate, with larger values indicating more recent motion history. The value at each pixel $x = (u, v)^T$ is defined by
$$\Gamma(x, t) \doteq \exp\left(-\frac{t - t_{\mathrm{last}}(x)}{\eta}\right) \qquad (1)$$
where $t$ is an arbitrary time ($t > t_{\mathrm{last}}(x)$), and $t_{\mathrm{last}}(x)$ is the timestamp of the last event occurring at $x$. $\eta$ denotes the constant decay rate parameter, which is usually small. The time-surface visualizes the history of moving brightness change at each pixel location, which usually presents the edges of the scenes. The pixel values in a time-surface are rescaled from [0, 1] to the range [0, 255] for convenient visualization and processing.
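For illustration, the sketch below (not the authors' code) builds the time-surface of Eq. (1) from a list of events and rescales it to [0, 255]; the function name, the 30 ms decay rate, and the tiny event list are assumptions made for the example.

```python
import numpy as np

def time_surface(events, t, height, width, eta=30e-3):
    """Time-surface of Eq. (1) at query time t.

    events: iterable of (u, v, timestamp, polarity) with timestamps <= t (seconds).
    eta: exponential decay rate in seconds (a small constant).
    Returns an 8-bit image in which bright pixels mark recent events.
    """
    t_last = np.full((height, width), -np.inf)        # timestamp of the last event per pixel
    for u, v, ts, _ in events:                        # later events overwrite earlier ones
        t_last[v, u] = max(t_last[v, u], ts)
    surface = np.exp(-(t - t_last) / eta)             # in (0, 1]; pixels with no event decay to 0
    return (255.0 * surface).astype(np.uint8)         # rescale to [0, 255] for visualization

# Minimal usage: three events on a 4 x 5 sensor, queried at t = 0.10 s
img = time_surface([(1, 2, 0.08, 1), (3, 0, 0.02, 0), (1, 2, 0.095, 1)],
                   t=0.10, height=4, width=5)
```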

3.3. Vision Module

This module is mainly divided into two parts: mapping (Section 3.3.1) and tracking (Section 3.3.2). The mapping and tracking threads operate simultaneously and are interdependent. The tracking thread provides the pose estimation at the current time (pose by events), while the mapping thread reconstructs a local point cloud map using the pose, event data, and time-surface image at that moment. This 3D point cloud is then used by the tracking thread for subsequent pose estimation.

3.3.1. The 3D Point Cloud Reconstruction

The key to 3D point cloud reconstruction is estimating the depth of each event in the camera frame and then combining the event’s coordinate with the depth to generate the corresponding point cloud. Assume that we have time-surfaces $\Gamma_{\mathrm{left}}(\cdot, t)$ and $\Gamma_{\mathrm{right}}(\cdot, t)$ at time $t$ from the left and right event cameras, respectively, and that the left event camera follows a camera trajectory $\mathcal{T}_{t-\delta t:t}$. The inverse depth $\rho^*$ of an event $e_{t-\epsilon} = (x, t-\epsilon)$, $\epsilon \in [0, \delta t]$, on the left event camera image plane can be estimated by the following function:
$$\rho^* = \arg\min_{\rho} C\left(x, \rho, \Gamma_{\mathrm{left}}(\cdot, t), \Gamma_{\mathrm{right}}(\cdot, t), \mathcal{T}_{t-\delta t:t}\right) \qquad (2)$$
$$C \doteq \sum_{x_{1,i} \in W_1,\, x_{2,i} \in W_2} r_i^2(\rho) \qquad (3)$$
The residual
$$r_i(\rho) \doteq \Gamma_{\mathrm{left}}(x_{1,i}, t) - \Gamma_{\mathrm{right}}(x_{2,i}, t) \qquad (4)$$
denotes the temporal difference between two corresponding pixels $x_{1,i}$ and $x_{2,i}$ inside the neighborhoods (i.e., patches) $W_1$ and $W_2$, where $x_1$ and $x_2$ are the centers of $W_1$ and $W_2$, respectively. The points $x_1$ and $x_2$ are given by
$$x_1 = \pi\left({}^{c_t}T_{c_{t-\epsilon}} \cdot \pi^{-1}(x, \rho_k)\right), \quad x_2 = \pi\left({}^{\mathrm{right}}T_{\mathrm{left}} \cdot {}^{c_t}T_{c_{t-\epsilon}} \cdot \pi^{-1}(x, \rho_k)\right) \qquad (5)$$
The function $\pi$ projects a 3D point in space onto the camera’s imaging plane, and $\pi^{-1}$ back-projects a pixel into 3D space with inverse depth $\rho$. ${}^{\mathrm{right}}T_{\mathrm{left}}$ is the transformation matrix from the left event camera to the right event camera. Through the above formulation, we can estimate the depth of each event point. Once the depth of each event point is known, we fuse all inverse depth estimates to obtain a more accurate semi-dense 3D map of the current moment, which is later used for tracking.
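As an illustration of Equations (2)–(5), the sketch below (a simplification, not the authors' implementation) evaluates the patch-based temporal residual for a single inverse-depth hypothesis on a static stereo pair, i.e., the short-time camera trajectory term ${}^{c_t}T_{c_{t-\epsilon}}$ is taken as the identity; all names (depth_residual, T_rl, etc.) are assumptions for the example.

```python
import numpy as np

def depth_residual(x, inv_depth, ts_left, ts_right, K, T_rl, patch=2):
    """Sum of squared temporal residuals (Eqs. (3)-(4)) for one inverse-depth
    hypothesis at pixel x of the left time-surface.

    x: integer pixel (u, v) on the left camera; inv_depth: hypothesis rho [1/m];
    K: shared 3x3 intrinsic matrix; T_rl: 4x4 right-from-left extrinsics;
    patch: half-size of the neighborhoods W1 and W2.
    """
    u, v = x
    p_left = np.linalg.inv(K) @ np.array([u, v, 1.0]) / inv_depth   # pi^{-1}: back-project
    p_right = (T_rl @ np.append(p_left, 1.0))[:3]                   # move into the right camera
    uvw = K @ p_right                                               # pi: project
    u2, v2 = int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))

    res = 0.0
    for dv in range(-patch, patch + 1):                             # compare patches W1 and W2
        for du in range(-patch, patch + 1):
            r = float(ts_left[v + dv, u + du]) - float(ts_right[v2 + dv, u2 + du])
            res += r * r
    return res

# rho* of Eq. (2) is then the hypothesis with the smallest residual, e.g.
# rho_star = min(candidates, key=lambda rho: depth_residual(x, rho, L, R, K, T_rl))
```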

3.3.2. Pose Estimation by Events

In this section, we explain how to obtain the current pose using the visual method. Time-surfaces contain the edge information of the scene (as shown in Figure 4) and record the history of the camera’s motion; larger values correspond to events that are closer to the current time. Since a grayscale value of 255 represents white and 0 represents black, to construct a minimization problem we use the time-surface negative rather than the time-surface, defined by
$$\bar{\Gamma}(x, t) = 1 - \Gamma(x, t) \qquad (6)$$
$\bar{\Gamma}(x, t)$ is also rescaled to the range [0, 255]. The tracking problem can be formulated as follows. Let $\mathcal{S}^{F_{\mathrm{ref}}} = \{x_i\}$ be a set of pixel locations with valid inverse depth $\rho_i$ in the reference frame $F_{\mathrm{ref}}$. Assuming the time-surface negative at time $k$, $\bar{\Gamma}_{\mathrm{left}}(\cdot, k)$, is known, we need to find the pose $T$ that makes the warped semi-dense map $T(\mathcal{S}^{F_{\mathrm{ref}}})$ align well with $\bar{\Gamma}_{\mathrm{left}}(\cdot, k)$. In an ideal scenario, the 3D points in space should be projected onto pixels with a grayscale value of 0 in the time-surface negative image. This problem can be defined as
$$\psi^* = \arg\min_{\psi} \sum_{x \in \mathcal{S}^{F_{\mathrm{ref}}}} \left(\bar{\Gamma}_{\mathrm{left}}\left(W(x, \rho; \psi), k\right)\right)^2 \qquad (7)$$
where the warping function
$$W(x, \rho; \psi) \doteq \pi_{\mathrm{left}}\left(T\left(\pi_{\mathrm{ref}}^{-1}(x, \rho), G(\psi)\right)\right) \qquad (8)$$
transfers points from $F_{\mathrm{ref}}$ to the current frame. $\psi$ is a $6 \times 1$ vector representing rotation and translation, and the function $G(\psi)$ provides the matrix form of $\psi$. The function $\pi_{\mathrm{ref}}^{-1}(\cdot)$ back-projects a pixel $x$ into space with the inverse depth $\rho$, while $\pi_{\mathrm{left}}(\cdot)$ projects a space point onto the image plane of the left event camera. $T(\cdot)$ transforms a 3D point with motion $G(\psi)$ from $F_{\mathrm{ref}}$ to the left frame $F_k$. Based on the above formulation, we can obtain the pose $\psi$ by the visual method.
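The following sketch (an illustrative simplification, not the authors' solver) evaluates the objective of Eq. (7) for a candidate pose ψ, with the rotational part of G(ψ) obtained from an axis-angle vector; in practice this cost would be minimized with Gauss–Newton or a generic optimizer rather than evaluated in a Python loop, and all names here are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def tracking_cost(psi, points_ref, inv_depths, ts_neg_left, K):
    """Objective of Eq. (7): warp the semi-dense reference map with the candidate
    pose psi = (rx, ry, rz, tx, ty, tz) and sum the squared values of the
    time-surface negative at the warped pixels (0 means perfect edge alignment).
    """
    R = Rotation.from_rotvec(psi[:3]).as_matrix()       # rotation part of G(psi)
    t = np.asarray(psi[3:])
    K_inv = np.linalg.inv(K)
    cost = 0.0
    for (u, v), rho in zip(points_ref, inv_depths):
        P_ref = K_inv @ np.array([u, v, 1.0]) / rho     # pi_ref^{-1}
        P_cur = R @ P_ref + t                           # T(., G(psi))
        uvw = K @ P_cur                                 # pi_left
        u2, v2 = int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))
        if 0 <= v2 < ts_neg_left.shape[0] and 0 <= u2 < ts_neg_left.shape[1]:
            cost += float(ts_neg_left[v2, u2]) ** 2
    return cost

# psi* of Eq. (7) can then be found with a generic minimizer, e.g.
# scipy.optimize.minimize(tracking_cost, np.zeros(6), args=(pts, rhos, ts_neg, K))
```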

3.4. ESKF Description

3.4.1. Structure of the ESKF State Vector

The goal of the proposed filter is to track the pose of the IMU frame { I } (generally considered as body frame) with respect to a global frame of reference { G } . An overview of the algorithm is given in Algorithm 1. The IMU data are processed for propagating the ESKF state and covariance. Then, each time an observation is received, the state vector is updated. The IMU state is a 15 × 1 vector which is defined as
$$x_{IMU} = \left(p^T, v^T, q^T, b_a^T, b_\omega^T\right)^T \qquad (9)$$
where the quaternion $q$ represents the rotation from the frame $\{I\}$ to the frame $\{G\}$. The vectors $v \in \mathbb{R}^3$ and $p \in \mathbb{R}^3$ represent the velocity and position of the body frame $\{I\}$ in the global frame $\{G\}$. The vectors $b_a \in \mathbb{R}^3$ and $b_\omega \in \mathbb{R}^3$ are the biases of the linear acceleration and angular velocity from the IMU, which are modeled as random walk processes driven by the white Gaussian noise vectors $n_{b_a}$ and $n_{b_\omega}$, respectively. Following (9), the IMU error-state is defined as
$$\delta x_{IMU} = \left(\delta p^T, \delta v^T, \delta q^T, \delta b_a^T, \delta b_\omega^T\right)^T \qquad (10)$$
The standard error definition is used for the position, velocity, and biases (e.g., $\tilde{p} = p + \delta p$, where $\tilde{p}$ is the real value of the position and $p$ is the ideal value). For the quaternion, a different error definition is employed, given by the relation $\tilde{q} = q \otimes \delta q$, where the symbol $\otimes$ denotes quaternion multiplication. The error quaternion is
$$\delta q = \begin{bmatrix} 1 \\ \frac{\delta\theta}{2} \end{bmatrix} \qquad (11)$$
where $\delta\theta$ represents a small-angle rotation.
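For reference, here is a small sketch (illustrative, not the authors' code) of the 15-dimensional error-state layout, the small-angle error quaternion of Eq. (11), and the composition $\tilde{q} = q \otimes \delta q$; the slice names and the (w, x, y, z) ordering are assumptions.

```python
import numpy as np

# Index layout of the 15x1 error state: (dp, dv, dtheta, db_a, db_w)
DP, DV, DTH = slice(0, 3), slice(3, 6), slice(6, 9)
DBA, DBW = slice(9, 12), slice(12, 15)

def quat_mult(q, r):
    """Hamilton product q (x) r, quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def error_quaternion(dtheta):
    """Small-angle error quaternion of Eq. (11): dq ~ (1, dtheta/2), normalized."""
    dq = np.concatenate(([1.0], 0.5 * np.asarray(dtheta)))
    return dq / np.linalg.norm(dq)

# The true attitude is recovered by composing the nominal quaternion with the error:
# q_true = quat_mult(q_nominal, error_quaternion(delta_x[DTH]))
```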
Algorithm 1 Framework of SEVIO
Input: The event data from two event-based cameras; the acceleration and angular velocity from the IMU.
Output: The pose of the body frame $\{I\}$ with respect to the global frame $\{G\}$.
1: Initialize: a modified SGM method (vision); estimation of the IMU biases and the gravity direction (ESKF).
2: IMU propagation;
3: if no observation then
4:     Posterior equals prior (Equation (20))
5: else
6:     Update posterior pose (Equation (21))
7:     Clear the error-state (Equation (22))
8: end if

3.4.2. Process Model

In this section, we explain how the error state $\delta x$ is propagated using IMU data. The continuous dynamics of the IMU error-state are
$$\dot{\delta p} = \delta v$$
$$\dot{\delta v} = -R_t\left[a_t - b_{a_t}\right]_\times \delta\theta + R_t\left(n_a - \delta b_a\right)$$
$$\dot{\delta\theta} = -\left[\omega_t - b_{\omega_t}\right]_\times \delta\theta + n_\omega - \delta b_\omega$$
$$\dot{\delta b}_a = n_{b_a}, \quad \dot{\delta b}_\omega = n_{b_\omega} \qquad (12)$$
In these expressions, $a_t$ and $\omega_t$ are the acceleration and angular velocity from the IMU measurements, $[\cdot]_\times$ denotes the skew-symmetric matrix of a vector, $R_t$ is the rotation matrix from frame $\{I\}$ to frame $\{G\}$, and $n_a$ and $n_\omega$ are the zero-mean, white Gaussian noise processes modeling the measurement noise. The linearized continuous-time model for the IMU error-state is
$$\dot{\delta x} = F_t\,\delta x + B_t\, n \qquad (13)$$
where $n = (n_a^T, n_\omega^T, n_{b_a}^T, n_{b_\omega}^T)^T$. The vectors $n_{b_a}$ and $n_{b_\omega}$ are the random walk rates of the accelerometer and gyroscope measurement biases. $F_t$ and $B_t$ are shown in Appendix A.1.
To deal with the discrete-time measurements from the IMU, we apply the median integral to propagate the estimated IMU state:
$$q_{wb_k} = q_{wb_{k-1}} \otimes \begin{bmatrix} \cos\frac{\phi}{2} \\ \frac{\boldsymbol{\phi}}{\phi}\sin\frac{\phi}{2} \end{bmatrix}$$
$$v_k = v_{k-1} + \left(\frac{R_{wb_k}\, a_k + R_{wb_{k-1}}\, a_{k-1}}{2} - g\right)(t_k - t_{k-1})$$
$$p_k = p_{k-1} + \frac{v_k + v_{k-1}}{2}(t_k - t_{k-1}) \qquad (14)$$
where $\boldsymbol{\phi} = \frac{\omega_{k-1} + \omega_k}{2}(t_k - t_{k-1})$ and $\phi = \|\boldsymbol{\phi}\|$.
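A minimal sketch of one median-integration step in the spirit of Eq. (14) is given below (not the authors' implementation); the orientation is kept as a world-from-body rotation matrix, and the axis-angle exponential used here is equivalent to the quaternion increment $\left[\cos\frac{\phi}{2}, \frac{\boldsymbol{\phi}}{\phi}\sin\frac{\phi}{2}\right]$ of Eq. (14). The bias handling and the function name are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def propagate_imu(R_wb, v, p, gyr_prev, gyr_cur, acc_prev, acc_cur, dt,
                  b_w, b_a, g=np.array([0.0, 0.0, 9.81])):
    """One midpoint ("median") integration step in the spirit of Eq. (14).

    R_wb: world-from-body rotation matrix; gyr_*/acc_* are raw IMU samples at
    t_{k-1} and t_k; the gyroscope/accelerometer biases are subtracted before
    averaging, and gravity g is removed in the world frame.
    """
    phi = 0.5 * ((gyr_prev - b_w) + (gyr_cur - b_w)) * dt   # averaged rotation vector
    R_new = R_wb @ Rotation.from_rotvec(phi).as_matrix()    # right-multiplied body increment

    a_prev_w = R_wb @ (acc_prev - b_a)                      # accelerations in the world frame
    a_cur_w = R_new @ (acc_cur - b_a)
    v_new = v + (0.5 * (a_prev_w + a_cur_w) - g) * dt       # Eq. (14), velocity line
    p_new = p + 0.5 * (v + v_new) * dt                      # Eq. (14), position line
    return R_new, v_new, p_new
```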
To discretize (13),
$$\delta x_k = F_{k-1}\,\delta x_{k-1} + B_{k-1}\, n_k \qquad (15)$$
where F k 1 and B k 1 are shown in Appendix A.2.

3.4.3. Measurement Model

We use the pose obtained through the visual method as the observation. Now we introduce the measurement model employed for updating the state estimates. Generally, the observation equation is written as
$$y = G_t\,\delta x + C_t\, n_R \qquad (16)$$
In this expression, $G_t$ is the measurement Jacobian matrix, and $n_R$ is the observation noise:
$$n_R = \left(n_{\delta\bar{p}_x}, n_{\delta\bar{p}_y}, n_{\delta\bar{p}_z}, n_{\delta\bar{\theta}_x}, n_{\delta\bar{\theta}_y}, n_{\delta\bar{\theta}_z}\right)^T \qquad (17)$$
The noise term $C_t n_R$ is zero-mean, white, and uncorrelated with the state, as required for the ESKF framework to be applied. Under this condition, the observation is
$$y = \left[\delta\bar{p}^T, \delta\bar{\theta}^T\right]^T \qquad (18)$$
According to the previous formulas, we obtain the equations of discrete ESKF, which are shown in Appendix A.3.

3.4.4. Fuse Poses from IMU and Vision

In the preceding sections, we present the process model and measurement model. Now we introduce in detail the updating phase of the ESKF. We will fuse the pose obtained from the IMU with the pose obtained from vision to generate the fused pose, which serves as the output pose of our system. The whole process is summarized in Algorithm 1.
Before we start the ESKF process, we need to know the initial state of the system. In our implementation, we assume that the initial state of the system is static and take the average value of the IMU output over a period of time to estimate the gravity direction and the biases of the gyroscope and accelerometer. Then, we obtain the rotation from the IMU frame to the world frame. Next, we initialize the filter parameters, including the state vector $\delta\hat{x}_0$, variance $\hat{P}_0$, process noise $Q$, and observation noise $R$. Their specific forms are shown in Appendix A.4.
Then, we integrate the IMU output according to (14) to perform the filter prior estimate, executing the first two of the five steps of the ESKF:
$$\delta\check{x}_k = F_{k-1}\,\delta\hat{x}_{k-1} + B_{k-1}\, n_k$$
$$\check{P}_k = F_{k-1}\,\hat{P}_{k-1}\, F_{k-1}^T + B_{k-1}\, Q_k\, B_{k-1}^T \qquad (19)$$
When there is no observation, the posterior equals the prior:
$$\delta\hat{x}_k = \delta\check{x}_k, \quad \hat{P}_k = \check{P}_k, \quad \hat{x}_k = \check{x}_k \qquad (20)$$
When the mapping thread of the vision module is running, we consider that an observation has been received and execute the last three of the five steps of the ESKF:
$$K_k = \check{P}_k\, G_k^T\left(G_k\,\check{P}_k\, G_k^T + C_k\, R_k\, C_k^T\right)^{-1}$$
$$\hat{P}_k = \left(I - K_k G_k\right)\check{P}_k$$
$$\delta\hat{x}_k = \delta\check{x}_k + K_k\left(y_k - G_k\,\delta\check{x}_k\right) \qquad (21)$$
After that, we update the posterior pose and clear the state vector $\delta\hat{x}_k$:
$$\hat{p}_k = \check{p}_k + \delta\hat{p}_k, \quad \hat{v}_k = \check{v}_k + \delta\hat{v}_k, \quad \hat{R}_k = \check{R}_k\left(I + \left[\delta\hat{\theta}_k\right]_\times\right), \quad \hat{b}_{a_k} = \check{b}_{a_k} + \delta\hat{b}_{a_k}, \quad \hat{b}_{\omega_k} = \check{b}_{\omega_k} + \delta\hat{b}_{\omega_k} \qquad (22)$$
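To summarize the filter arithmetic of Equations (19)–(22), here is a compact sketch (illustrative only, with $C_k$ taken as the identity and the measurement Jacobian built for the 15-dimensional state layout used above); the function and variable names are assumptions.

```python
import numpy as np

def eskf_predict(dx, P, F, B, Q):
    """Prediction step of Eq. (19): the process noise is zero-mean, so only the
    covariance picks up the B Q B^T term."""
    return F @ dx, F @ P @ F.T + B @ Q @ B.T

def eskf_update(dx_pred, P_pred, y, G, R):
    """Update step of Eq. (21) with C_k = I: fuse the pose observation from the
    vision module into the error state and covariance."""
    S = G @ P_pred @ G.T + R                       # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)            # Kalman gain
    dx = dx_pred + K @ (y - G @ dx_pred)
    P = (np.eye(P_pred.shape[0]) - K @ G) @ P_pred
    return dx, P

# Measurement Jacobian for a (dp, dtheta) observation of the 15-dim error state
# ordered as (dp, dv, dtheta, db_a, db_w):
G = np.zeros((6, 15))
G[0:3, 0:3] = np.eye(3)     # position-error block
G[3:6, 6:9] = np.eye(3)     # attitude-error block
# After eskf_update, the error estimate is injected into the nominal state as in
# Eq. (22) (e.g., p_hat = p_check + dx[0:3]) and the error state is reset to zero.
```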

4. Experiments

In this section, we present the datasets used and evaluate the proposed stereo event-based VIO system. The results show that SEVIO produces more accurate trajectories than ESVO and U-SLAM. We then illustrate the real-time performance of our pipeline on event data of different resolutions.

4.1. Datasets Used

We use public datasets to evaluate our stereo VIO system [39,40]. The datasets from [39] were collected using a UAV, while the datasets from [40] were gathered using a mobile vehicle. Our method assumes that the undistorted coordinates have been obtained. The parameters of the camera and IMU, along with their extrinsic parameters, are provided in public datasets. Different datasets use different event cameras, and their parameters are listed in Table 1.

4.2. Accuracy Evaluation

We provide an example to present the performance of our algorithm in Figure 5. On the left is the scene of the dataset, while on the right is the partial output of the algorithm, including the local point cloud and trajectory. To demonstrate the performance of the stereo VIO system, we save the trajectories output by our algorithm and compare them with the ground truth provided by the dataset. We use a general approach to evaluate our pose estimates: relative pose error (RPE) and absolute pose error (APE). We compare our method with two approaches: one is ESVO [16], which demonstrates the improvement of our approach over pure vision, and the other is U-SLAM (event camera+IMU) [28,29], which highlights the advantages of our approach compared to other event camera-based VIO systems.
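For reference, the following is a minimal, translation-only sketch of how APE and RPE RMSE values of the kind reported in Table 2 can be computed from time-associated, aligned trajectories (the full metrics operate on relative SE(3) transforms, and dedicated tools such as the evo package are typically used); the function names are assumptions.

```python
import numpy as np

def ape_rmse(est_xyz, gt_xyz):
    """Absolute pose error (translation RMSE) between an aligned, time-associated
    estimate and ground truth, both given as (N, 3) position arrays."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def rpe_rmse(est_xyz, gt_xyz, delta=1):
    """Relative pose error (translation-only RMSE) over a fixed index offset:
    compares estimated and ground-truth displacement vectors."""
    d_est = est_xyz[delta:] - est_xyz[:-delta]
    d_gt = gt_xyz[delta:] - gt_xyz[:-delta]
    err = np.linalg.norm(d_est - d_gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```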
The evaluation results for each sequence are listed in Table 2. They show that trajectory accuracy is improved on most sequences. Some qualitative results are shown in Figure 6. Our method is less accurate than ESVO on indoor_flying1_edit. In an ideal situation, visual initialization should occur immediately after IMU initialization; however, due to the lack of texture in the scene, the conditions for visual initialization may not be met. In this sequence, once IMU initialization is completed the drone begins to move, but the lack of scene information prevents visual initialization; by the time visual initialization is eventually completed, the IMU biases differ from those estimated during IMU initialization. In addition, better vision and filter parameters may further improve the accuracy of the results. Furthermore, it is evident that our method outperforms U-SLAM in both accuracy and robustness.
One of the advantages of event cameras is that they can work in HDR scenes: event cameras have a higher dynamic range (about 140 dB) than traditional cameras (about 60 dB). We test this capability using the hdr_normal_edit sequence, which is collected in a dark room with a light source that increases the brightness range in the scene. This can cause overexposure in some areas of traditional camera images, while other areas may be too dark to capture scene details. In this challenging scene, the performance of the event camera is not affected.

4.3. Real-Time Performance

We test the real-time performance of the algorithm on a standard CPU (Intel Xeon Platinum 8375C CPU @ 2.90 GHz). Our system can work in real time with event data from low-resolution cameras (e.g., 346 × 260). For higher-resolution cameras (e.g., 640 × 480), we need to reduce the playback speed of the data (e.g., to 0.2×) to maintain performance. This is because, compared to traditional cameras of the same resolution over the same time period, event cameras deliver data with microsecond-level latency; although they only record the edge information of the scene, they still generate a large volume of data, leading to significant computational power consumption. Selecting an appropriate subset of the event data to participate in the computation may be a solution, as illustrated below.
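As a toy example of such event selection (purely illustrative; the paper does not prescribe a specific filter), even uniform temporal decimation bounds the per-frame workload:

```python
def decimate_events(events, keep_every=4):
    """Keep every Nth event to bound the computation on high-resolution cameras.
    Smarter filters (spatial binning, refractory periods, lifetime filtering)
    could be substituted; this only illustrates reducing the event rate."""
    return events[::keep_every]
```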

5. Conclusions

In this paper, we presented a stereo event-based visual–inertial pipeline using an ESKF framework. Pose estimation is essential for robot control, and due to their hardware advantages, event cameras have the potential to handle challenging scenes more effectively than traditional cameras. Compared with the visual odometry for stereo event cameras (ESVO), our method improves accuracy by approximately 30%. Compared to the event camera-based visual–inertial fusion method U-SLAM, our approach also demonstrates superior accuracy and robustness. We also demonstrate the performance of event cameras in high-dynamic-range environments. Multi-sensor fusion is the future direction for pose estimation, and event cameras can serve as a valuable supplement to traditional cameras in the field of robotics. In future work, since event cameras generate a larger volume of data than traditional cameras, further research is needed on how to filter event camera data to reduce computational complexity.

Author Contributions

Conceptualization, K.W.; Methodology, K.W.; Software, K.W.; Validation, K.W.; Formal analysis, K.W.; Investigation, K.W.; Resources, K.W.; Data curation, K.W.; Writing—original draft, K.W.; Writing—review and editing, K.W., K.Z., W.L. and Z.Y.; Visualization, K.W.; Supervision, K.Z., W.L. and Z.Y.; Project administration, Z.Y.; Funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) under Grant No. U21A6003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article.

Acknowledgments

The authors would like to express their gratitude to the National Natural Science Foundation of China (NSFC) for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

The matrices $F_t$ and $B_t$ in (13) are
$$F_t = \begin{bmatrix} 0_{3\times3} & I_3 & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & -R_t\left[a_t - b_{a_t}\right]_\times & -R_t & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & -\left[\omega_t - b_{\omega_t}\right]_\times & 0_{3\times3} & -I_3 \\ 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \end{bmatrix}$$
$$B_t = \begin{bmatrix} 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ R_t & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & I_3 & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & I_3 & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & I_3 \end{bmatrix}$$

Appendix A.2

The matrices $F_{k-1}$ and $B_{k-1}$ in (15) are
$$F_{k-1} = I_{15} + F_t T$$
$$B_{k-1} = \begin{bmatrix} 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ R_{k-1} T & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & I_3 T & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & I_3 T & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & I_3 T \end{bmatrix}$$
T is the period of the ESKF.

Appendix A.3

The discrete ESKF equations are
$$\delta\check{x}_k = F_{k-1}\,\delta\hat{x}_{k-1} + B_{k-1}\, n_k$$
$$\check{P}_k = F_{k-1}\,\hat{P}_{k-1}\, F_{k-1}^T + B_{k-1}\, Q_k\, B_{k-1}^T$$
$$K_k = \check{P}_k\, G_k^T\left(G_k\,\check{P}_k\, G_k^T + C_k\, R_k\, C_k^T\right)^{-1}$$
$$\hat{P}_k = \left(I - K_k G_k\right)\check{P}_k$$
$$\delta\hat{x}_k = \delta\check{x}_k + K_k\left(y_k - G_k\,\delta\check{x}_k\right)$$

Appendix A.4

Variance, process noise, and observation noise (each block is $3 \times 3$):
Variance:
$$\hat{P}_0 = \mathrm{diag}\left(P_{\delta p}, P_{\delta v}, P_{\delta\theta}, P_{\delta b_a}, P_{\delta b_\omega}\right)$$
Process noise and observation noise:
$$Q = \mathrm{diag}\left(Q_a, Q_\omega, Q_{b_a}, Q_{b_\omega}\right), \quad R = \mathrm{diag}\left(R_{\delta p}, R_{\delta\theta}\right)$$

References

  1. Fuentes-Pacheco, J.; Ruiz-Ascencio, J.; Rendón-Mancha, J.M. Visual simultaneous localization and mapping: A survey. Artif. Intell. Rev. 2015, 43, 55–81. [Google Scholar] [CrossRef]
  2. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  3. Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128×128 120 dB 15μs Latency Asynchronous Temporal Contrast Vision Sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
  4. Gallego, G.; Delbruck, T.; Orchard, G.M.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 154–180. [Google Scholar] [CrossRef]
  5. Zhu, A.Z.; Atanasov, N.; Daniilidis, K. Event-based feature tracking with probabilistic data association. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4465–4470. [Google Scholar] [CrossRef]
  6. Gehrig, D.; Rebecq, H.; Gallego, G.; Scaramuzza, D. EKLT: Asynchronous Photometric Feature Tracking Using Events and Frames. Int. J. Comput. Vis. 2020, 128, 601–618. [Google Scholar] [CrossRef]
  7. Cook, M.; Gugelmann, L.; Jug, F.; Krautz, C.; Steger, A. Interacting maps for fast visual interpretation. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 770–776. [Google Scholar] [CrossRef]
  8. Kim, H.; Handa, A.; Benosman, R.; Ieng, S.-H.; Davison, A.J. Simultaneous mosaicing and tracking with an event camera. In Proceedings of the British Machine Vision Conference (BMVC), 2014. [Google Scholar]
  9. Reinbacher, C.; Munda, G.; Pock, T. Real-time panoramic tracking for event cameras. In Proceedings of the 2017 IEEE International Conference on Computational Photography (ICCP), Stanford, CA, USA, 12–14 May 2017; pp. 1–9. [Google Scholar] [CrossRef]
  10. Gallego, G.; Scaramuzza, D. Accurate Angular Velocity Estimation With an Event Camera. IEEE Robot. Autom. Lett. 2017, 2, 632–639. [Google Scholar] [CrossRef]
  11. Kim, H.; Leutenegger, S.; Davison, A.J. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; ECCV 2016. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9910. [Google Scholar] [CrossRef]
  12. Rebecq, H.; Horstschaefer, T.; Gallego, G.; Scaramuzza, D. EVO: A Geometric Approach to Event-Based 6-DOF Parallel Tracking and Mapping in Real Time. IEEE Robot. Autom. Lett. 2017, 2, 593–600. [Google Scholar] [CrossRef]
  13. Liu, D.; Parra, A.; Latif, Y.; Chen, B.; Chin, T.-J.; Reid, I. Asynchronous Optimisation for Event-based Visual Odometry. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 9432–9438. [Google Scholar] [CrossRef]
  14. Chamorro, W.; Solà, J.; Andrade-Cetto, J. Event-Based Line SLAM in Real-Time. IEEE Robot. Autom. Lett. 2022, 7, 8146–8153. [Google Scholar] [CrossRef]
  15. Wang, Y.; Yang, J.; Peng, X.; Wu, P.; Gao, L.; Huang, K.; Chen, J.; Kneip, L. Visual Odometry with an Event Camera Using Continuous Ray Warping and Volumetric Contrast Maximization. Sensors 2022, 22, 5687. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Gallego, G.; Shen, S. Event-Based Stereo Visual Odometry. IEEE Trans. Robot. 2021, 37, 1433–1450. [Google Scholar] [CrossRef]
  17. Rebecq, H.; Gallego, G.; Mueggler, E.; Scaramuzza, D. EMVS: Event-Based Multi-View Stereo 3D Reconstruction with an Event Camera in Real-Time. Int. J. Comput. Vis. 2018, 126, 1394–1414. [Google Scholar] [CrossRef]
  18. Gallego, G.; Rebecq, H.; Scaramuzza, D. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  19. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  20. Kogler, J.; Sulzbachner, C.; Humenberger, M.; Eibensteiner, F. Address-event based stereo vision with bio-inspired silicon retina imagers. In Advances in Theory and Applications of Stereo Vision; InTechOpen: Rijeka, Croatia, 2011; pp. 165–188. [Google Scholar]
  21. Schraml, S.; Belbachir, A.N.; Milosevic, N.; Schon, P. Dynamic stereo vision system for real-time tracking. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems, Paris, France, 30 May–2 June 2010; pp. 1409–1412. [Google Scholar] [CrossRef]
  22. Ieng, S.-H.; Carneiro, J.; Osswald, M.; Benosman, R. Neuromorphic Event-Based Generalized Time-Based Stereovision. Front. Neurosci. 2018, 12, 442. [Google Scholar] [CrossRef] [PubMed]
  23. Zhou, Y.; Gallego, G.; Rebecq, H.; Kneip, L.; Li, H.; Scaramuzza, D. Semi-dense 3D reconstruction with a stereo event camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  24. Kogler, J.; Humenberger, M.; Sulzbachner, C. Event-Based Stereo Matching Approaches for Frameless Address Event Stereo Data. In Advances in Visual Computing; Bebis, G., Boyle, R., Parvin, B., Eds.; ISVC 2011 Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6938. [Google Scholar] [CrossRef]
  25. Lee, J.; Delbruck, T.; Park, P.K.J.; Pfeiffer, M.; Shin, C.-W.; Ryu, H.; Kang, B.C. Live demonstration: Gesture-based remote control using stereo pair of dynamic vision sensors. In Proceedings of the 2012 IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Republic of Korea, 20–23 May 2012; pp. 741–745. [Google Scholar] [CrossRef]
  26. Weikersdorfer, D.; Hoffmann, R.; Conradt, J. Simultaneous Localization and Mapping for Event-Based Vision Systems. In Computer Vision Systems; Chen, M., Leibe, B., Neumann, B., Eds.; ICVS 2013. Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7963. [Google Scholar] [CrossRef]
  27. Zhu, Z.; Alex, N.A.; Daniilidis, K. Event-based visual inertial odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Rebecq, H.; Horstschaefer, T.; Scaramuzza, D. Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. In Proceedings of the British Machine Vision Conference, London, UK, 4–7 September 2017; pp. 1–8. [Google Scholar]
  29. Vidal, A.R.; Rebecq, H.; Horstschaefer, T.; Scaramuzza, D. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 2018, 3, 994–1001. [Google Scholar] [CrossRef]
  30. Mueggler, E.; Gallego, G.; Rebecq, H.; Scaramuzza, D. Continuous-Time Visual-Inertial Odometry for Event Cameras. IEEE Trans. Robot. 2018, 34, 1425–1440. [Google Scholar] [CrossRef]
  31. Le Gentil, C.; Tschopp, F.; Alzugaray, I.; Vidal-Calleja, T.; Siegwart, R.; Nieto, J. IDOL: A Framework for IMU-DVS Odometry using Lines. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5863–5870. [Google Scholar] [CrossRef]
  32. Liu, Z.; Shi, D.; Li, R.; Yang, S. ESVIO: Event-Based Stereo Visual-Inertial Odometry. Sensors 2023, 23, 1998. [Google Scholar] [CrossRef] [PubMed]
  33. Elamin, A.; El-Rabbany, A.; Jacob, S. Event-Based Visual/Inertial Odometry for UAV Indoor Navigation. Sensors 2025, 25, 61. [Google Scholar] [CrossRef] [PubMed]
  34. Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  35. Sun, K.; Mohta, K.; Pfrommer, B.; Watterson, M.; Liu, S.; Mulgaonkar, Y.; Taylor, C.J.; Kumar, V. Robust Stereo Visual Inertial Odometry for Fast Autonomous Flight. IEEE Robot. Autom. Lett. 2018, 3, 965–972. [Google Scholar] [CrossRef]
  36. Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. OpenVINS: A Research Platform for Visual-Inertial Estimation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4666–4672. [Google Scholar] [CrossRef]
  37. Delbruck, T. Frame-free dynamic digital vision. In Proceedings of the International Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, Tokyo, Japan, 6–7 March 2008; Volume 1. [Google Scholar]
  38. Lagorce, X.; Orchard, G.; Galluppi, F.; Shi, B.E.; Benosman, R.B. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1346–1359. [Google Scholar] [CrossRef]
  39. Zhu, A.Z.; Thakur, D.; Ozaslan, T.; Pfrommer, B.; Kumar, V.; Daniilidis, K. The Multivehicle Stereo Event Camera Dataset: An Event Camera Dataset for 3D Perception. IEEE Robot. Autom. Lett. 2018, 3, 2032–2039. [Google Scholar] [CrossRef]
  40. Gao, L.; Liang, Y.; Yang, J.; Wu, S.; Wang, C.; Chen, J.; Kneip, L. VECtor: A Versatile Event-Centric Benchmark for Multi-Sensor SLAM. IEEE Robot. Autom. Lett. 2022, 7, 8217–8224. [Google Scholar] [CrossRef]
Figure 1. (Top left): scene. (Bottom left): inverse depth map at time t, and different colors represent different depths. (Right): global map and pose estimation.
Figure 2. Overview of our proposed stereo event-based visual–inertial odometry.
Figure 3. Time-surface. (Left): output of an event camera, and different colors represent different times. (Right): time-surface map. Figure adapted from [16].
Figure 4. Time-surface and its included historical information.
Figure 5. Algorithm performance. The left image shows the experimental scene, while the right image displays the local point clouds and trajectories.
Figure 6. The first column shows images from a traditional camera. The second column is the time-surface. The third column is the inverse depth map. The last column is the warping depth map overlaid on the time-surface negative.
Table 1. Parameters of the event cameras used in the datasets.
Dataset   Camera           IMU       Resolution (pixel)   Baseline (cm)
MVSEC     DAVIS 346        MPU6150   346 × 260            10.0
VECtor    Prophesee Gen3   MTi-30    640 × 480            17.0
Table 2. Absolute pose error and relative pose error (RMSE).
Sequence               ESVO              U-SLAM            SEVIO (Ours)
                       APE (m)  RPE (m)  APE (m)  RPE (m)  APE (m)  RPE (m)
indoor_flying1_edit    0.190    0.014    -        -        0.299    0.011
indoor_flying3_edit    0.342    0.027    0.473    0.024    0.266    0.010
school_dolly_edit      0.990    0.077    1.260    0.082    0.703    0.075
school_scooter_edit    2.666    0.233    -        -        1.291    0.195
units_dolly_edit       0.714    0.096    0.628    0.089    0.514    0.084
units_scooter_edit     0.652    0.083    -        -        0.461    0.072
sofa_normal_edit       0.368    0.031    -        -        0.286    0.024
desk_normal_edit       0.416    0.034    0.752    0.048    0.325    0.027
hdr_normal_edit        0.157    0.012    -        -        0.126    0.009
