
1 Introduction

Event cameras are optical sensors that asynchronously output events whenever brightness variations occur at the pixel level. The major advantages of this type of neuromorphic sensor are its low power consumption, low data rate, high temporal resolution, and high dynamic range [8]. On the other hand, despite their higher power consumption and often lower dynamic range, traditional cameras record local appearance information, such as textures, and the majority of computer vision algorithms are designed to work on this kind of data. Being able to apply existing algorithms to the output of event cameras could therefore foster the adoption of event-based sensors.

In this paper, aiming to combine the advantages of traditional and event cameras, we investigate the use of a deep learning-based method to interpolate intensity frames acquired by a low-rate camera with the support of the intermediate event data. Specifically, we exploit a fully-convolutional encoder-decoder architecture to predict intensity frames, relying on an initial or a periodic set of key-frames and a series of event frames, i.e. frames that collect the information captured by event cameras in a certain time interval.

Focusing on the automotive scenario, we employ a novel event-based dataset called DDD17 [4] (see Fig. 1) and evaluate the feasibility of the proposed method with a wide set of pixel-level metrics. Quantitative and qualitative comparisons with a recent competitor [26] show the superior quality of the images synthesized by the proposed model.

Summarizing, our contributions are twofold:

  • We propose a fully-convolutional encoder-decoder architecture that combines traditional images and event data (as event frames) to interpolate consecutive intensity frames;

  • We evaluate the effectiveness of the proposed approach on a public automotive dataset, assessing the ability to generate reasonable images and providing a fair comparison with a state-of-the-art approach.

Fig. 1.

Samples from the DDD17 dataset. The first row contains the grayscale intensity images, while the second row contains the corresponding event frames.

2 Related Work

In recent years, event-based vision has gained popularity in the computer vision community. Indeed, many novel algorithms have been proposed to deal with event-based data, produced by Dynamic Vision Sensors [11] (DVSs), such as visual odometry [29], SLAM [17], optical flow estimation [8], and monocular [21] or stereo [1, 28] depth estimation.

Event cameras have also been exploited for ego-motion estimation [7, 14], real-time feature detection and tracking [15, 20], and robot control in predator/prey scenarios [16]. Furthermore, it has been shown that event data can be employed to solve many classification tasks, such as the classification of characters [19], gestures [13], and faces [10]. Recently, an optimization-based algorithm that simultaneously estimates the optical flow and the brightness intensity was proposed in [3], while [18, 23] presented a manifold regularization method that reconstructs intensity frames from event data.

Lately, Scheerlinck et al.  [26] proposed a complementary filter that combines image frames and events to estimate the scene intensity. The filter asynchronously updates the intensity estimation whenever new events or intensity frames are received. If the grayscale frames are missing, the estimation can be produced using events only.

This method is recent (at the time of writing) and outperforms previous works; thus, we select it as the baseline reference to evaluate our approach (see Sect. 4).

Fig. 2.

Overview of the proposed method. The input of the encoder-decoder architecture is the stack of an intensity frame and an event frame, while the output is the predicted intensity frame. During inference, the output of each step is used as the input intensity image of the following step.

3 Proposed Method

In this section, we first formally define the concept of event frame. Then, we present the investigated task from both a mathematical and an implementation point of view.

3.1 Event Frames

Following the notation of [14], the j-th event \(e_{j}\) provided by an event camera can be expressed as:

$$\begin{aligned} e_{j} = (x_{j}, y_{j}, t_{j}, p_{j}) \end{aligned}$$
(1)

where \(x_{j}\), \(y_{j}\), and \(t_{j}\) are the spatio-temporal coordinates of a brightness change and \(p_{j} \in \{-1, +1\}\) is the polarity of the brightness change (i.e. positive or negative variation).

An event frame can be defined as the pixel-wise integration of the events that occurred in a time interval \([t, t+\tau ]\):

$$\begin{aligned} \varPsi _{\tau }(t) = \sum _{e_j \in [t,t+\tau ]} p_j \end{aligned}$$
(2)

where \(e_j \in [t,t+\tau ]\) is shorthand for \(\left\{ e_j \, | \, t_j \in [t,t+\tau ] \right\} \) and the sum is computed independently at each pixel location \((x_j, y_j)\). In practice, an event frame can be stored as a grayscale image that summarizes the events captured in a particular time interval; information is lost when the number of events accumulated at a pixel exceeds the number of available gray levels.
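For illustration, a minimal NumPy sketch of this accumulation is reported below. The event layout (a structured array with fields x, y, t, p) and the mapping of the signed sums to gray levels are our own assumptions, chosen only to make the quantization step concrete.

```python
import numpy as np

def make_event_frame(events, t, tau, width, height, levels=256):
    """Accumulate event polarities over [t, t+tau] into a frame (Eq. 2).

    `events` is assumed to be a NumPy structured array with fields
    'x', 'y', 't', 'p', where p is +1 or -1 (an illustrative layout,
    not the DDD17 format). The signed sums are clipped and mapped to
    `levels` gray levels, which is where the loss of information
    mentioned above can occur.
    """
    mask = (events['t'] >= t) & (events['t'] <= t + tau)
    frame = np.zeros((height, width), dtype=np.float64)
    # Pixel-wise sum of polarities; np.add.at handles repeated pixel indices.
    np.add.at(frame, (events['y'][mask], events['x'][mask]), events['p'][mask])

    # Map the signed counts to a grayscale image in [0, 1]: zero events -> mid gray.
    half = (levels - 1) / 2.0
    return (np.clip(frame, -half, half) + half) / (levels - 1)
```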

3.2 Intensity Frame Estimation

We propose a method that corresponds to a learned parametric function F defined as:

$$\begin{aligned} F: \mathbb {R}^{2 \times w \times h} \rightarrow \mathbb {R}^{w \times h} \end{aligned}$$
(3)

that takes as input an intensity image \(I(t) \in \mathbb {R}^{w\times h}\) recorded at time t and an event frame \(\varPsi _{\tau }(t) \in \mathbb {R}^{w\times h}\) (which summarizes the pixel-level brightness variations in the time interval \([t, t+\tau ]\)) in order to estimate the intensity image \(\hat{I}(t+\tau ) \in \mathbb {R}^{w \times h}\) at time \(t+\tau \). Here, w and h denote the width and the height of both the event frames and the intensity images.

Formally, the synthesized image \(\hat{I}(t+\tau )\) can be defined as:

$$\begin{aligned} {\hat{I}(t+\tau )} = F \,(I(t), \varPsi _{\tau }(t), \theta ) \end{aligned}$$
(4)

where \(\theta \) corresponds to the parameters of the function F.

3.3 Architecture

In practice, the parametric function F corresponds to an encoder-decoder architecture that predicts the intensity frame \(\hat{I}(t+\tau )\) from the concatenation of an intensity frame I(t) and an event frame \(\varPsi _{\tau }(t)\), as represented in Fig. 2. In particular, the model is a fully-convolutional deep neural network with skip connections between layers i and \(n-i\), with n corresponding to the total number of layers.

As in the U-Net architecture [24], the number of skip-connected layers in each of the encoder and the decoder is set to \(n=4\), with 128, 256, 512, and 512 \(3 \times 3\) kernels in the encoder layers and 256, 128, 64, and 64 \(3 \times 3\) kernels in the decoder layers.

These skip-connected layers are preceded by two convolutional layers with 64 feature maps and followed by a convolutional layer with 1 feature map that projects the internal network representation to the final intensity estimation.
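A possible PyTorch sketch of this architecture is reported below. Downsampling by stride-2 convolutions, batch normalization, ReLU activations, bilinear upsampling, and the final sigmoid are our own assumptions: the description above only fixes the channel widths and the \(3 \times 3\) kernel size.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # 3x3 convolution + batch norm + ReLU (normalization and activation are
    # assumptions; the text only specifies kernel sizes and feature maps).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EventFrameUNet(nn.Module):
    """Sketch of the fully-convolutional encoder-decoder of Sect. 3.3.

    Input: 2 x h x w (intensity frame stacked with an event frame).
    Output: 1 x h x w (predicted intensity frame).
    """
    def __init__(self):
        super().__init__()
        # Two initial convolutional layers with 64 feature maps.
        self.head = nn.Sequential(conv_block(2, 64), conv_block(64, 64))
        # Encoder: 128, 256, 512, 512 feature maps (stride-2 downsampling assumed).
        self.enc = nn.ModuleList([
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
            conv_block(256, 512, stride=2),
            conv_block(512, 512, stride=2),
        ])
        # Decoder: 256, 128, 64, 64 feature maps; each block also receives the
        # concatenated skip connection from the mirrored encoder level.
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec = nn.ModuleList([
            conv_block(512 + 512, 256),
            conv_block(256 + 256, 128),
            conv_block(128 + 128, 64),
            conv_block(64 + 64, 64),
        ])
        # Final projection to a single intensity channel.
        self.out = nn.Conv2d(64, 1, kernel_size=3, padding=1)

    def forward(self, intensity, event_frame):
        x = self.head(torch.cat([intensity, event_frame], dim=1))
        skips = []
        for enc in self.enc:
            skips.append(x)
            x = enc(x)
        for dec, skip in zip(self.dec, reversed(skips)):
            x = dec(torch.cat([self.up(x), skip], dim=1))
        return torch.sigmoid(self.out(x))  # intensities in [0, 1] (assumption)
```

Under these assumptions, the prediction of Eq. (4) corresponds to a single call `model(I, psi)`, where both inputs are tensors of shape \(1 \times h \times w\).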

3.4 Training Procedure

The network is trained in a supervised manner using the Mean Squared Error (MSE) loss as objective function:

$$\begin{aligned} MSE = \frac{1}{N} \sum _{i=1}^{N} (y_i - \hat{y}_i)^2 \end{aligned}$$
(5)

where \(y_i\) and \(\hat{y}_i\) are the i-th pixels of the ground truth and the generated image, respectively, both of size \(N = w \cdot h\) pixels.

We optimize the network using the Adam optimizer [9] with learning rate \(2 \cdot 10^{-4}\), \(\beta _1 = 0.5\), \(\beta _2 = 0.999\) and a mini-batch size of 8.

During the training phase, two consecutive intensity frames (one as input, one as ground truth for the output) and the intermediate event frame (as input) are employed. During the testing phase, instead, in order to obtain a sequence of synthesized frames, the model iteratively receives its previously generated image as the intensity input, and a new key-frame every \(\lambda \) iterations. A sketch of both procedures is reported below.
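The following sketch summarizes the training step and the test-time rollout, reusing the EventFrameUNet class sketched above; the batch layout and tensor shapes are our own assumptions.

```python
import torch

model = EventFrameUNet()                    # encoder-decoder sketched in Sect. 3.3
criterion = torch.nn.MSELoss()              # Eq. (5)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

def training_step(prev_frame, event_frame, next_frame):
    """One supervised step: two consecutive intensity frames and the
    intermediate event frame, each assumed of shape (8, 1, 192, 256)."""
    prediction = model(prev_frame, event_frame)
    loss = criterion(prediction, next_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def synthesize_sequence(key_frame, event_frames, lam=6):
    """Test-time rollout: each prediction is fed back as the next intensity
    input; after `lam` iterations a new key-frame is expected."""
    model.eval()
    current, outputs = key_frame, []
    for event_frame in event_frames[:lam]:
        current = model(current, event_frame)
        outputs.append(current)
    return outputs
```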

4 Experimental Evaluation

In this section, we first present the dataset employed to train and evaluate the proposed method. Then, we describe the procedure adopted to evaluate the quality of the estimated intensity frames. Finally, we present and analyze the experimental results.

Table 1. Pixel-wise metrics (lower is better) computed on the synthesized frames of DDD17.

4.1 DDD17: End-to-end DAVIS Driving Dataset

Recently, Binas et al. [4] presented DDD17: End-to-end DAVIS Driving Dataset, the first open dataset of annotated DAVIS driving recordings. The dataset contains more than 12 h of recordings captured with a DAVIS sensor [5] (some sample images are shown in Fig. 1). Each recording includes both event data and grayscale frames along with vehicle information (e.g. vehicle speed, throttle, brake, steering angle). Recordings were captured in cities and on highways, in dry and wet weather conditions, during the day, in the evening, and at night.

However, the quality of the gray-level images is low, the spatial resolution is limited to \(346 \times 260\) pixels, and the framerate is variable (it depends on the brightness of the scene).

In our experiments, similarly to [14], we use only the recordings acquired during the day. In contrast to Maqueda et al. [14], however, we create the train, validation, and test sets using different recordings.

4.2 Metrics

Inspired by [6], we employ a variety of metrics to assess the quality of the generated images, being aware that evaluating synthesized images is in general a difficult and still open problem [25].

Table 2. Starting from the left, we report the percentage of pixels under three different thresholds, the Peak Signal-to-Noise Ratio (PSNR), and the Structural Similarity (SSIM) indexes, computed on the synthesized frames of DDD17. Higher is better.

In particular, we use distances (\(L_1\) and \(L_2\)), differences (absolute and squared relative difference), the root mean squared error (in the linear, logarithmic, and scale-invariant versions), and the percentage of pixels under a certain error threshold. Furthermore, with respect to [6], we introduce two additional metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity index (SSIM) [27], which respectively evaluate the reconstruction noise level (on a logarithmic scale) and the perceived image quality.

From a mathematical perspective, the PSNR is defined as:

$$\begin{aligned} \text {PSNR} = 10 \cdot \log _{10} \Big (\frac{m^2 \cdot |I|}{\sum _{y \in I} (\, y - \hat{y} \,)^2}\Big ) \end{aligned}$$
(6)

where I is the ground truth image, \(\hat{I}\) is the synthesized image, |I| is the number of pixels, and m is the maximum possible value of I and \(\hat{I}\). \(\hat{y} \in \hat{I}\) denotes the pixel of the generated image at the same location as \(y \in I\). In our experiments, \(m=1\).
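A direct NumPy transcription of Eq. (6) could read as follows (the helper name is ours):

```python
import numpy as np

def psnr(ground_truth, generated, m=1.0):
    """Peak Signal-to-Noise Ratio as in Eq. (6); ground_truth.size is |I|
    and m is the maximum possible intensity (m = 1 in our experiments)."""
    squared_error = np.sum((ground_truth - generated) ** 2)
    return 10.0 * np.log10((m ** 2) * ground_truth.size / squared_error)
```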

The SSIM is defined as:

$$\begin{aligned} \text {SSIM}(p,q) = \frac{(2\mu _p\mu _q + c_1)(2\sigma _{pq} + c_2)}{(\mu _p^2 + \mu _q^2 + c_1)(\sigma _p^2 + \sigma _q^2 + c_2)} \end{aligned}$$
(7)

Given two windows \(p \subset I\) and \(q \subset \hat{I}\) of equal size \(11 \times 11\), \(\mu _{p}\) and \(\mu _{q}\) are the means of p and q, \(\sigma _{p}^2\) and \(\sigma _{q}^2\) are their variances, and \(\sigma _{pq}\) is their covariance.

\(c_{1}\) and \(c_{2}\) are defined as \(c_1 = (0.01 \cdot L)^2\) and \(c_2 = (0.03 \cdot L)^2\) where L is the dynamic range (i.e. the difference between the maximum and the minimum theoretical value) of I and \(\hat{I}\). In our experiments \(L=1\).
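In practice, the SSIM can be computed with an off-the-shelf implementation; a minimal sketch based on scikit-image is reported below, where the \(11 \times 11\) window matches the definition above and the library defaults K1 = 0.01 and K2 = 0.03 correspond to \(c_1\) and \(c_2\) (the choice of a uniform rather than Gaussian window is our assumption).

```python
from skimage.metrics import structural_similarity

def ssim_score(ground_truth, generated, dynamic_range=1.0):
    """Mean SSIM over 11x11 windows (Eq. 7); data_range is the dynamic
    range L of the images (L = 1 in our experiments)."""
    return structural_similarity(ground_truth, generated,
                                 win_size=11, data_range=dynamic_range)
```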

Fig. 3.

Samples of synthesized frames produced by our method (last column) and by the method of Scheerlinck et al. [26] (second column); the first column contains the ground truth images. As shown, the proposed method produces fewer artifacts, in the form of black or white spots, maintains a good level of detail, and preserves the overall structure and appearance of the original scene.

4.3 Experimental Results

We analyze the quality of the intensity estimations produced by our approach and by the method presented in [26] employing the pixel-wise metrics reported in Sect. 4.2.

In the experiments, we empirically set the number of consecutive synthesized frames (i.e. the sequence length) to \(\lambda = 6\). It is worth noting that, within a sequence, the input intensity frame of the proposed method is the intensity estimation of the previous step except for the initial key-frame. We adapt the input images of DDD17 to match the architecture requirements: the input data is resized to a spatial resolution of \(256 \times 192\).

Quantitative results are reported in Tables 1 and 2. As can be seen, the proposed method outperforms the competitor by a clear margin on every metric. In particular, PSNR and SSIM confirm the fidelity of the representation and the good level of perceived similarity between the generated and the ground truth images, respectively. Indeed, compared to the output of [26], frames synthesized by our method contain fewer artifacts and shadows, and the overall structure of the scene is better preserved.

Visual examples, which are reported in Fig. 3, highlight the ability of the proposed network to correctly handle the input event frames.

Finally, we investigate the performance of a traditional vision-based detection algorithm tested on the generated images. We adopt the well-known object detection network Yolo-v3 [22], pre-trained on the COCO dataset [12], to assess the ability of the proposed method to preserve the appearance of objects that are significant in the automotive context, such as pedestrians, trucks, cars, and stop signs.

Since ground truth object annotations are not available in the dataset, we first run the object detector on the real images contained in DDD17, obtaining a set of pseudo ground truth annotations. Then, we run Yolo-v3 on the generated images and compare the resulting detections with these annotations.

Results are expressed in terms of Intersection-over-Union (IoU) [2], which is defined as follows:

$$\begin{aligned} IoU(A, B) = \frac{\text {Area of Overlap}}{\text {Area of Union}} = \frac{|A \cap B |}{|A \cup B|} = \frac{|A \cap B |}{|A| + |B| - |A \cap B|} \end{aligned}$$
(8)

where A and B are the bounding boxes found in the original and the generated frames, respectively. A detection is valid if:

$$\begin{aligned} IoU(A, B) > \tau , \, \, \tau =0.5 \end{aligned}$$
(9)
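A minimal sketch of the criterion in Eqs. (8) and (9) for axis-aligned boxes is reported below; the (x1, y1, x2, y2) box layout is our assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2), Eq. (8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_valid_detection(box_a, box_b, threshold=0.5):
    """Validity criterion of Eq. (9)."""
    return iou(box_a, box_b) > threshold
```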

A weighted object detection score is also employed: each class contributes to the final average according to a weight computed as the ratio between its number of occurrences and the total number of objects in the test sequences.

We obtained a mean Intersection-over-Union of 0.863 (the maximum reachable value is 1) with \(61\%\) of valid object detections. We believe that these results are remarkably promising because they show that the generated frames are semantically similar to the real ones. Therefore, the proposed method can be an effective way to apply traditional vision algorithms to the output of event cameras.

5 Conclusion

In this work, we have presented a deep learning-based method that performs intensity estimation given an initial or periodic collection of intensity key-frames and a group of events.

The model relies on a fully convolutional encoder-decoder architecture that learns to combine intensity and event frames to produce updated intensity estimations. The experimental evaluation shows that the proposed method can be effectively employed for the intensity estimation task and that it is a valid alternative to current state-of-the-art methods.

As future work, we plan to test the framework on additional datasets as well as to take into account the long-term temporal evolution of the scene.