Abstract
Retrieving accurate semantic information in challenging high dynamic range (HDR) and high-speed conditions remains an open challenge for image-based algorithms due to severe image degradations. Event cameras promise to address these challenges since they feature a much higher dynamic range and are resilient to motion blur. Nonetheless, semantic segmentation with event cameras is still in its infancy, chiefly due to the lack of high-quality, labeled datasets. In this work, we introduce ESS (Event-based Semantic Segmentation), which tackles this problem by directly transferring the semantic segmentation task from existing labeled image datasets to unlabeled events via unsupervised domain adaptation (UDA). Compared to existing UDA methods, our approach aligns recurrent, motion-invariant event embeddings with image embeddings. For this reason, our method neither requires video data nor per-pixel alignment between images and events and, crucially, does not need to hallucinate motion from still images. Additionally, we introduce DSEC-Semantic, the first large-scale event-based dataset with fine-grained labels. We show that using image labels alone, ESS outperforms existing UDA approaches, and when combined with event labels, it even outperforms state-of-the-art supervised approaches on both DDD17 and DSEC-Semantic. Finally, ESS is general-purpose, which unlocks the vast amount of existing labeled image datasets and paves the way for exciting research directions in fields previously inaccessible to event cameras.
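The core idea of the abstract, aligning event embeddings with image embeddings so that a segmentation head trained on images transfers to events, can be illustrated with a simple per-pixel cosine objective. This is a minimal sketch, not the paper's actual loss or architecture; the function name, shapes, and the cosine formulation are assumptions for illustration only:

```python
import numpy as np

def alignment_loss(event_emb, image_emb, eps=1e-8):
    """Illustrative embedding-alignment objective (hypothetical helper):
    pull per-pixel event embeddings toward the corresponding image
    embeddings via cosine distance. Inputs are (C, H, W) feature maps
    produced by the event and image encoders."""
    e = event_emb.reshape(event_emb.shape[0], -1)   # (C, H*W)
    f = image_emb.reshape(image_emb.shape[0], -1)   # (C, H*W)
    # cosine similarity per pixel, computed along the channel axis
    num = (e * f).sum(axis=0)
    den = np.linalg.norm(e, axis=0) * np.linalg.norm(f, axis=0) + eps
    cos = num / den
    # loss approaches 0 as the two embedding spaces align
    return float((1.0 - cos).mean())

# Identical feature maps yield a (near-)zero loss; disjoint channel
# activations yield the maximum cosine distance.
x = np.random.rand(16, 8, 8)
print(alignment_loss(x, x))
```

In the actual method, minimizing such an alignment term lets the image branch's labels supervise the event branch without paired, per-pixel-registered image/event data.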
Z. Sun and N. Messikommer contributed equally.
Notes
1. For clarity, we omit the subscript i in the future.
Acknowledgment
This work was supported by the National Centre of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation (SNSF) and the European Research Council (ERC) under grant agreement No. 864042 (AGILEFLIGHT).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sun, Z., Messikommer, N., Gehrig, D., Scaramuzza, D. (2022). ESS: Learning Event-Based Semantic Segmentation from Still Images. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13694. Springer, Cham. https://doi.org/10.1007/978-3-031-19830-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19829-8
Online ISBN: 978-3-031-19830-4
eBook Packages: Computer Science (R0)