
1 Introduction

Event cameras are optical sensors that asynchronously output events whenever brightness variations occur at the pixel level. The major advantages of this type of neuromorphic sensor are its low power consumption, low data rate, high temporal resolution, and high dynamic range [8]. On the other hand, despite their higher power consumption and often lower dynamic range, traditional cameras record local appearance information, such as textures, and the majority of computer vision algorithms are designed to work on this kind of data. Being able to apply existing algorithms to the output of event cameras could therefore foster the adoption of event-based sensors.

In this paper, aiming to combine the advantages of traditional and event cameras, we investigate the use of a deep learning-based method to interpolate intensity frames acquired by a low-rate camera with the support of the intermediate event data. Specifically, we exploit a fully-convolutional encoder-decoder architecture to predict intensity frames, relying on an initial or a periodic set of key-frames and a series of event frames, i.e. frames that collect the information captured by event cameras in a certain time interval.

Focusing on the automotive scenario, we employ a novel event-based dataset called DDD17 [4] (see Fig. 1) and evaluate the feasibility of the proposed method with a wide set of pixel-level metrics. Quantitative and qualitative comparisons with a recent competitor [26] show the superior quality of the images synthesized by the proposed model.

Summarizing, our contributions are twofold:

  • We propose a fully-convolutional encoder-decoder architecture that combines traditional images and event data (as event frames) to interpolate consecutive intensity frames;

  • We evaluate the effectiveness of the proposed approach on a public automotive dataset, assessing the ability to generate reasonable images and providing a fair comparison with a state-of-the-art approach.

Fig. 1.

Samples from the DDD17 dataset. The first row contains the grayscale intensity images, while the second row contains the corresponding event frames.

2 Related Work

In recent years, event-based vision has gained popularity in the computer vision community. Indeed, many novel algorithms have been proposed to deal with event-based data, produced by Dynamic Vision Sensors [11] (DVSs), such as visual odometry [29], SLAM [17], optical flow estimation [8], and monocular [21] or stereo [1, 28] depth estimation.

Event cameras have also been exploited for ego-motion estimation [7, 14], real-time feature detection and tracking [15, 20], and robot control in predator/prey scenarios [16]. Furthermore, it has been shown that event data can be employed to solve many classification tasks, such as the classification of characters [19], gestures [13], and faces [10]. Recently, an optimization-based algorithm that simultaneously estimates the optical flow and the brightness intensity was proposed in [3], while [18, 23] presented a manifold regularization method that reconstructs intensity frames from event data.

Lately, Scheerlinck et al.  [26] proposed a complementary filter that combines image frames and events to estimate the scene intensity. The filter asynchronously updates the intensity estimation whenever new events or intensity frames are received. If the grayscale frames are missing, the estimation can be produced using events only.

This method is recent (at the time of writing) and outperforms previous works; thus, we select it as the baseline reference to evaluate our approach (see Sect. 4).

Fig. 2.

Overview of the proposed method. The input of the encoder-decoder architecture is the stack of an intensity frame and an event frame, while the output is the predicted intensity frame. During inference, the output of each step is used as the input intensity image of the following step.

3 Proposed Method

In this section, we first formally define the concept of event frame. Then, we present the investigated task from both a mathematical and an implementation point of view.

3.1 Event Frames

Following the notation of [14], the j-th event \(e_{j}\) provided by an event camera can be expressed as:

$$\begin{aligned} e_{j} = (x_{j}, y_{j}, t_{j}, p_{j}) \end{aligned}$$
(1)

where \(x_{j}\), \(y_{j}\), and \(t_{j}\) are the spatio-temporal coordinates of a brightness change and \(p_{j} \in \{-1, +1\}\) is the polarity of the brightness change (i.e. positive or negative variation).

An event frame can be defined as the pixel-wise integration of the events that occurred in a time interval \([t, t+\tau ]\):

$$\begin{aligned} \varPsi _{\tau }(t) = \sum _{e_j \in [t,t+\tau ]} p_j \end{aligned}$$
(2)

where \(e_j \in [t,t+\tau ]\) is shorthand for \(\left\{ e_j \, | \, t_j \in [t,t+\tau ] \right\} \) and the sum is computed independently at each pixel location \((x_j, y_j)\). In practice, an event frame can be stored as a grayscale image that summarizes the events captured in a particular time interval; information is lost when the number of events accumulated at a pixel exceeds the number of available gray levels.
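For illustration, a minimal NumPy sketch of this accumulation is reported below. The event layout (a structured array with fields x, y, t, p) and the mapping of the signed sums to gray levels are our own assumptions, chosen only to make the quantization step concrete.

```python
import numpy as np

def make_event_frame(events, t, tau, width, height, levels=256):
    """Accumulate event polarities over [t, t+tau] into a frame (Eq. 2).

    `events` is assumed to be a NumPy structured array with fields
    'x', 'y', 't', 'p', where p is +1 or -1 (an illustrative layout,
    not the DDD17 format). The signed sums are clipped and mapped to
    `levels` gray levels, which is where the loss of information
    mentioned above can occur.
    """
    mask = (events['t'] >= t) & (events['t'] <= t + tau)
    frame = np.zeros((height, width), dtype=np.float64)
    # Pixel-wise sum of polarities; np.add.at handles repeated pixel indices.
    np.add.at(frame, (events['y'][mask], events['x'][mask]), events['p'][mask])

    # Map the signed counts to a grayscale image in [0, 1]: zero events -> mid gray.
    half = (levels - 1) / 2.0
    return (np.clip(frame, -half, half) + half) / (levels - 1)
```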

3.2 Intensity Frame Estimation

We propose a method that corresponds to a learned parametric function F defined as:

$$\begin{aligned} F: \mathbb {R}^{2 \times w \times h} \rightarrow \mathbb {R}^{w \times h} \end{aligned}$$
(3)

that takes as input an intensity image \(I(t) \in \mathbb {R}^{w\times h}\) recorded at time t and an event frame \(\varPsi _{\tau }(t) \in \mathbb {R}^{w\times h}\) (which summarizes the pixel-level brightness variations in the time interval \([t, t+\tau ]\)) in order to estimate the intensity image \(\hat{I}(t+\tau ) \in \mathbb {R}^{w \times h}\) at time \(t+\tau \). Here, w and h denote the width and the height of both the event frames and the intensity images.

Formally, the synthesized image \(\hat{I}(t+\tau )\) can be defined as:

$$\begin{aligned} {\hat{I}(t+\tau )} = F \,(I(t), \varPsi _{\tau }(t), \theta ) \end{aligned}$$
(4)

where \(\theta \) corresponds to the parameters of the function F.

3.3 Architecture

In practice, the parametric function F corresponds to an encoder-decoder architecture that predicts the intensity frame \(\hat{I}(t+\tau )\) from the concatenation of an intensity frame I(t) and an event frame \(\varPsi _{\tau }(t)\), as represented in Fig. 2. In particular, the model is a fully-convolutional deep neural network with skip connections between layers i and \(n-i\), with n corresponding to the total number of layers.

As in the U-Net architecture [24], the number of skip-connected layers in each of the encoder and the decoder is set to \(n=4\), with 128, 256, 512, and 512 \(3 \times 3\) kernels in the encoder layers and 256, 128, 64, and 64 \(3 \times 3\) kernels in the decoder layers.

These skip-connected layers are preceded by two convolutional layers with 64 feature maps and followed by a convolutional layer with 1 feature map that projects the internal network representation to the final intensity estimation.
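A possible PyTorch sketch of this architecture is reported below. Downsampling by stride-2 convolutions, batch normalization, ReLU activations, bilinear upsampling, and the final sigmoid are our own assumptions: the description above only fixes the channel widths and the \(3 \times 3\) kernel size.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # 3x3 convolution + batch norm + ReLU (normalization and activation are
    # assumptions; the text only specifies kernel sizes and feature maps).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EventFrameUNet(nn.Module):
    """Sketch of the fully-convolutional encoder-decoder of Sect. 3.3.

    Input: 2 x h x w (intensity frame stacked with an event frame).
    Output: 1 x h x w (predicted intensity frame).
    """
    def __init__(self):
        super().__init__()
        # Two initial convolutional layers with 64 feature maps.
        self.head = nn.Sequential(conv_block(2, 64), conv_block(64, 64))
        # Encoder: 128, 256, 512, 512 feature maps (stride-2 downsampling assumed).
        self.enc = nn.ModuleList([
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
            conv_block(256, 512, stride=2),
            conv_block(512, 512, stride=2),
        ])
        # Decoder: 256, 128, 64, 64 feature maps; each block also receives the
        # concatenated skip connection from the mirrored encoder level.
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec = nn.ModuleList([
            conv_block(512 + 512, 256),
            conv_block(256 + 256, 128),
            conv_block(128 + 128, 64),
            conv_block(64 + 64, 64),
        ])
        # Final projection to a single intensity channel.
        self.out = nn.Conv2d(64, 1, kernel_size=3, padding=1)

    def forward(self, intensity, event_frame):
        x = self.head(torch.cat([intensity, event_frame], dim=1))
        skips = []
        for enc in self.enc:
            skips.append(x)
            x = enc(x)
        for dec, skip in zip(self.dec, reversed(skips)):
            x = dec(torch.cat([self.up(x), skip], dim=1))
        return torch.sigmoid(self.out(x))  # intensities in [0, 1] (assumption)
```

Under these assumptions, the prediction of Eq. (4) corresponds to a single call `model(I, psi)`, where both inputs are tensors of shape \(1 \times h \times w\).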

3.4 Training Procedure

The network is trained in a supervised manner using the Mean Squared Error (MSE) loss as objective function:

$$\begin{aligned} MSE = \frac{1}{N} \sum _{i=1}^{N} (y_i - \hat{y}_i)^2 \end{aligned}$$
(5)

where \(y_i\) and \(\hat{y}_i\) are the i-th pixels of the ground truth and the generated image, respectively, both of size \(N = w \cdot h\) pixels.

We optimize the network using the Adam optimizer [9] with learning rate \(2 \cdot 10^{-4}\), \(\beta _1 = 0.5\), \(\beta _2 = 0.999\) and a mini-batch size of 8.

During the training phase, two consecutive intensity frames (one as input, one as ground truth for the output) and the intermediate event frame (as input) are employed. During the testing phase, instead, in order to obtain a sequence of synthesized frames, the model iteratively receives its previously generated image as the intensity input, and a new key-frame every \(\lambda \) iterations. A sketch of both procedures is reported below.
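The following sketch summarizes the training step and the test-time rollout, reusing the EventFrameUNet class sketched above; the batch layout and tensor shapes are our own assumptions.

```python
import torch

model = EventFrameUNet()                    # encoder-decoder sketched in Sect. 3.3
criterion = torch.nn.MSELoss()              # Eq. (5)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

def training_step(prev_frame, event_frame, next_frame):
    """One supervised step: two consecutive intensity frames and the
    intermediate event frame, each assumed of shape (8, 1, 192, 256)."""
    prediction = model(prev_frame, event_frame)
    loss = criterion(prediction, next_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def synthesize_sequence(key_frame, event_frames, lam=6):
    """Test-time rollout: each prediction is fed back as the next intensity
    input; after `lam` iterations a new key-frame is expected."""
    model.eval()
    current, outputs = key_frame, []
    for event_frame in event_frames[:lam]:
        current = model(current, event_frame)
        outputs.append(current)
    return outputs
```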

4 Experimental Evaluation

In this section, we first present the dataset employed to train and evaluate the proposed method. Then, we describe the procedure adopted to evaluate the quality of the estimated intensity frames. Finally, we present and analyze the experimental results.

Table 1. Pixel-wise metrics (lower is better) computed on the synthesized frames of DDD17.

4.1 DDD17: End-to-end DAVIS Driving Dataset

Recently, Binas et al. [4] presented DDD17: End-to-end DAVIS Driving Dataset, the first open dataset of annotated DAVIS driving recordings. The dataset contains more than 12 h of recordings captured with a DAVIS sensor [5] (some sample images are shown in Fig. 1). Each recording includes both event data and grayscale frames along with vehicle information (e.g. vehicle speed, throttle, brake, steering angle). Recordings were captured in cities and on highways, in dry and wet weather conditions, during the day, in the evening, and at night.

However, the quality of the gray-level images is low, the spatial resolution is limited to \(346 \times 260\) pixels, and the framerate is variable (it depends on the brightness of the scene).

In our experiments, similarly to [14], we use only the recordings acquired during the day. In contrast to Maqueda et al. [14], however, we create the train, validation, and test sets using different recordings.

4.2 Metrics

Inspired by [6], we employ a variety of metrics to assess the quality of the generated images, being aware that evaluating synthesized images is in general a difficult and still open problem [25].

Table 2. Starting from the left, we report the percentage of pixels under three different thresholds, the Peak Signal-to-Noise Ratio (PSNR), and the Structural Similarity (SSIM) indexes, computed on the synthesized frames of DDD17. Higher is better.

In particular, we use distances (\(L_1\) and \(L_2\)), differences (absolute and squared relative difference), the root mean squared error (in the linear, logarithmic, and scale-invariant versions), and the percentage of pixels under a certain error threshold. Furthermore, with respect to [6], we introduce two additional metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity index (SSIM) [27], which respectively evaluate the reconstruction noise level (on a logarithmic scale) and the perceived image quality.

From a mathematical perspective, the PSNR is defined as:

$$\begin{aligned} \text {PSNR} = 10 \cdot \log _{10} \Big (\frac{m^2 \cdot |I|}{\sum _{y \in I} (\, y - \hat{y} \,)^2}\Big ) \end{aligned}$$
(6)

where I is the ground truth image, \(\hat{I}\) is the synthesized image, |I| is the number of pixels, and m is the maximum possible value of I and \(\hat{I}\). \(\hat{y} \in \hat{I}\) denotes the pixel of the generated image at the same location as \(y \in I\). In our experiments, \(m=1\).
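A direct NumPy transcription of Eq. (6) could read as follows (the helper name is ours):

```python
import numpy as np

def psnr(ground_truth, generated, m=1.0):
    """Peak Signal-to-Noise Ratio as in Eq. (6); ground_truth.size is |I|
    and m is the maximum possible intensity (m = 1 in our experiments)."""
    squared_error = np.sum((ground_truth - generated) ** 2)
    return 10.0 * np.log10((m ** 2) * ground_truth.size / squared_error)
```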

The SSIM is defined as:

$$\begin{aligned} \text {SSIM}(p,q) = \frac{(2\mu _p\mu _q + c_1)(2\sigma _{pq} + c_2)}{(\mu _p^2 + \mu _q^2 + c_1)(\sigma _p^2 + \sigma _q^2 + c_2)} \end{aligned}$$
(7)

Given two windows \(p \subset I\) and \(q \subset \hat{I}\) of equal size \(11 \times 11\), \(\mu _{p}\) and \(\mu _{q}\) are the means of p and q, \(\sigma _{p}^2\) and \(\sigma _{q}^2\) are their variances, and \(\sigma _{pq}\) is their covariance.

\(c_{1}\) and \(c_{2}\) are defined as \(c_1 = (0.01 \cdot L)^2\) and \(c_2 = (0.03 \cdot L)^2\) where L is the dynamic range (i.e. the difference between the maximum and the minimum theoretical value) of I and \(\hat{I}\). In our experiments \(L=1\).
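In practice, the SSIM can be computed with an off-the-shelf implementation; a minimal sketch based on scikit-image is reported below, where the \(11 \times 11\) window matches the definition above and the library defaults K1 = 0.01 and K2 = 0.03 correspond to \(c_1\) and \(c_2\) (the choice of a uniform rather than Gaussian window is our assumption).

```python
from skimage.metrics import structural_similarity

def ssim_score(ground_truth, generated, dynamic_range=1.0):
    """Mean SSIM over 11x11 windows (Eq. 7); data_range is the dynamic
    range L of the images (L = 1 in our experiments)."""
    return structural_similarity(ground_truth, generated,
                                 win_size=11, data_range=dynamic_range)
```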

Fig. 3.

Samples of synthesized frames produced by our method (last column) and by the method of Scheerlinck et al. [26] (second column); the first column contains the ground truth images. As shown, the proposed method produces fewer artifacts, in the form of black or white spots, maintains a good level of detail, and preserves the overall structure and appearance of the original scene.

4.3 Experimental Results

We analyze the quality of the intensity estimations produced by our approach and by the method presented in [26] employing the pixel-wise metrics reported in Sect. 4.2.

In the experiments, we empirically set the number of consecutive synthesized frames (i.e. the sequence length) to \(\lambda = 6\). It is worth noting that, within a sequence, the input intensity frame of the proposed method is the intensity estimation of the previous step except for the initial key-frame. We adapt the input images of DDD17 to match the architecture requirements: the input data is resized to a spatial resolution of \(256 \times 192\).

Quantitative results are reported in Tables 1 and 2. As can be seen, the proposed method outperforms the competitor by a clear margin on every metric. In particular, PSNR and SSIM confirm the fidelity of the representation and the good level of perceived similarity between the generated and the ground truth images, respectively. Indeed, compared to the output of [26], frames synthesized by our method contain fewer artifacts and shadows, and the overall structure of the scene is better preserved.

Visual examples, which are reported in Fig. 3, highlight the ability of the proposed network to correctly handle the input event frames.

Finally, we investigate the performance of a traditional vision-based detection algorithm tested on the generated images. We adopt the well-known object detection network Yolo-v3 [22], pre-trained on the COCO dataset [12], to assess the ability of the proposed method to preserve the appearance of objects that are significant in the automotive context, such as pedestrians, trucks, cars, and stop signs.

Since ground truth object annotations are not available in the dataset, we first run the object detector on the real images contained in DDD17, obtaining a set of pseudo ground truth annotations. Then, we run Yolo-v3 on the generated images and compare the resulting detections with these annotations.

Results are expressed in terms of Intersection-over-Union (IoU) [2], which is defined as follows:

$$\begin{aligned} IoU(A, B) = \frac{\text {Area of Overlap}}{\text {Area of Union}} = \frac{|A \cap B |}{|A \cup B|} = \frac{|A \cap B |}{|A| + |B| - |A \cap B|} \end{aligned}$$
(8)

where A and B are the bounding boxes found in the original and the generated frames, respectively. A detection is valid if:

$$\begin{aligned} IoU(A, B) > \tau , \, \, \tau =0.5 \end{aligned}$$
(9)
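A minimal sketch of the criterion in Eqs. (8) and (9) for axis-aligned boxes is reported below; the (x1, y1, x2, y2) box layout is our assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2), Eq. (8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_valid_detection(box_a, box_b, threshold=0.5):
    """Validity criterion of Eq. (9)."""
    return iou(box_a, box_b) > threshold
```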

A weighted object detection score is also employed: each class contributes to the final average according to a weight computed as the ratio between its number of occurrences and the total number of objects in the test sequences.

We obtained a mean Intersection-over-Union of 0.863 (the maximum reachable value is 1) with \(61\%\) of valid object detections. We believe that these results are remarkably promising because they show that the generated frames are semantically similar to the real ones. Therefore, the proposed method can be an effective way to apply traditional vision algorithms to the output of event cameras.

5 Conclusion

In this work, we have presented a deep learning-based method that performs intensity estimation given an initial or periodic collection of intensity key-frames and a group of events.

The model relies on a fully convolutional encoder-decoder architecture that learns to combine intensity and event frames to produce updated intensity estimations. The experimental evaluation shows that the proposed method can be effectively employed for the intensity estimation task and that it is a valid alternative to current state-of-the-art methods.

As future work, we plan to test the framework on additional datasets as well as to take into account the long-term temporal evolution of the scene.