Abstract
This paper introduces a self-supervised learning framework for pre-training neural networks on event camera data for dense prediction tasks. Our approach trains on event data alone.
Directly transferring techniques from dense RGB pre-training to event camera data yields subpar performance. We attribute this to the spatial sparsity inherent in an event image (converted from the raw event stream), where many pixels carry no information. To mitigate this sparsity, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features.
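To make the context-mining step concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it mines contexts from one view's patch features with a few k-means iterations (a stand-in for the paper's similarity-mining step), pools both augmented views' patches by those shared contexts, and applies an InfoNCE-style context-to-context loss. All names and hyperparameters (`mine_contexts`, `pool_contexts`, `num_contexts`, the temperature) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mine_contexts(patch_feats, num_contexts=8, iters=10):
    # L2-normalize patch embeddings so dot products are cosine similarities.
    feats = F.normalize(patch_feats, dim=-1)                        # (N, D)
    # Initialize centroids from random patches, then refine with k-means.
    centroids = feats[torch.randperm(feats.size(0))[:num_contexts]].clone()
    for _ in range(iters):
        assign = (feats @ centroids.t()).argmax(dim=-1)             # (N,)
        for k in range(num_contexts):
            members = feats[assign == k]
            if members.numel() > 0:
                centroids[k] = F.normalize(members.mean(dim=0), dim=-1)
    return centroids                                                # (K, D)

def pool_contexts(patch_feats, centroids):
    # Average the patches assigned to each context; fall back to the
    # centroid itself if a context attracts no patches in this view.
    feats = F.normalize(patch_feats, dim=-1)
    assign = (feats @ centroids.t()).argmax(dim=-1)
    pooled = torch.stack([
        feats[assign == k].mean(dim=0) if (assign == k).any() else centroids[k]
        for k in range(centroids.size(0))
    ])
    return F.normalize(pooled, dim=-1)                              # (K, D)

def context_to_context_loss(patch_a, patch_b, temperature=0.1):
    # Mine contexts on view A, then pool both views with the SAME centroids,
    # so context k in view A and context k in view B form a positive pair.
    centroids = mine_contexts(patch_a)
    ctx_a = pool_contexts(patch_a, centroids)
    ctx_b = pool_contexts(patch_b, centroids)
    logits = ctx_a @ ctx_b.t() / temperature                        # (K, K)
    targets = torch.arange(ctx_a.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: 196 patch embeddings (a 14x14 ViT grid), 256-D, per view.
loss = context_to_context_loss(torch.randn(196, 256), torch.randn(196, 256))
```

Sharing one set of mined centroids across both views keeps the pooled contexts index-aligned, so the diagonal of the context similarity matrix can serve as the positive pairs.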
To train our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning results on downstream dense prediction tasks demonstrate the superiority of our method over state-of-the-art approaches.
Acknowledgements
Liyuan Pan’s work was supported in part by the Beijing Institute of Technology Research Fund Program for Young Scholars, the BIT Special-Zone, and the National Natural Science Foundation of China under Grant 62302045.
About this paper
Cite this paper
Yang, Y., Pan, L., Liu, L. (2025). Event Camera Data Dense Pre-training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_17
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7