Abstract
This paper introduces a self-supervised learning framework for pre-training neural networks on event camera data for dense prediction tasks. Our approach trains on event data alone.
Directly transferring techniques from dense RGB pre-training to event camera data yields subpar performance. We attribute this to the spatial sparsity inherent in an event image (converted from the raw event stream), where many pixels carry no information. To mitigate this sparsity, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features.
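To make the context-mining step concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it mines contexts from one view's patch features with a few k-means iterations (a stand-in for the paper's similarity-mining step), pools both augmented views' patches by those shared contexts, and applies an InfoNCE-style context-to-context loss. All names and hyperparameters (`mine_contexts`, `pool_contexts`, `num_contexts`, the temperature) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mine_contexts(patch_feats, num_contexts=8, iters=10):
    # L2-normalize patch embeddings so dot products are cosine similarities.
    feats = F.normalize(patch_feats, dim=-1)                        # (N, D)
    # Initialize centroids from random patches, then refine with k-means.
    centroids = feats[torch.randperm(feats.size(0))[:num_contexts]].clone()
    for _ in range(iters):
        assign = (feats @ centroids.t()).argmax(dim=-1)             # (N,)
        for k in range(num_contexts):
            members = feats[assign == k]
            if members.numel() > 0:
                centroids[k] = F.normalize(members.mean(dim=0), dim=-1)
    return centroids                                                # (K, D)

def pool_contexts(patch_feats, centroids):
    # Average the patches assigned to each context; fall back to the
    # centroid itself if a context attracts no patches in this view.
    feats = F.normalize(patch_feats, dim=-1)
    assign = (feats @ centroids.t()).argmax(dim=-1)
    pooled = torch.stack([
        feats[assign == k].mean(dim=0) if (assign == k).any() else centroids[k]
        for k in range(centroids.size(0))
    ])
    return F.normalize(pooled, dim=-1)                              # (K, D)

def context_to_context_loss(patch_a, patch_b, temperature=0.1):
    # Mine contexts on view A, then pool both views with the SAME centroids,
    # so context k in view A and context k in view B form a positive pair.
    centroids = mine_contexts(patch_a)
    ctx_a = pool_contexts(patch_a, centroids)
    ctx_b = pool_contexts(patch_b, centroids)
    logits = ctx_a @ ctx_b.t() / temperature                        # (K, K)
    targets = torch.arange(ctx_a.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: 196 patch embeddings (a 14x14 ViT grid), 256-D, per view.
loss = context_to_context_loss(torch.randn(196, 256), torch.randn(196, 256))
```

Sharing one set of mined centroids across both views keeps the pooled contexts index-aligned, so the diagonal of the context similarity matrix can serve as the positive pairs.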
To train our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning results on downstream dense prediction tasks demonstrate the superiority of our method over state-of-the-art approaches.
Acknowledgements
Liyuan Pan’s work was supported in part by the Beijing Institute of Technology Research Fund Program for Young Scholars, the BIT Special-Zone, and the National Natural Science Foundation of China under Grant 62302045.
About this paper
Cite this paper
Yang, Y., Pan, L., Liu, L. (2025). Event Camera Data Dense Pre-training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_17
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7