
Event Camera Data Dense Pre-training

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15101)

Included in the following conference series:

  • European Conference on Computer Vision

Abstract

This paper introduces a self-supervised learning framework for pre-training neural networks tailored to dense prediction tasks on event camera data. Our approach trains on event data alone.
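
This page does not detail the authors' input representation; as background, a common way to make a raw event stream network-ready is to accumulate per-pixel polarity counts into an event image. The sketch below is a minimal illustration under that assumption (the function name and two-channel layout are ours, not the authors'):

```python
import numpy as np

def events_to_image(x, y, p, height, width):
    """Accumulate a raw event stream into a 2-channel event image:
    channel 0 counts positive-polarity events, channel 1 negative ones.
    x, y: integer pixel coordinates; p: polarity values in {+1, -1}."""
    img = np.zeros((2, height, width), dtype=np.float32)
    pos = p > 0
    np.add.at(img[0], (y[pos], x[pos]), 1.0)    # positive events
    np.add.at(img[1], (y[~pos], x[~pos]), 1.0)  # negative events
    return img  # most pixels remain zero: the sparsity discussed next
```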

Transferring successful recipes from dense RGB pre-training directly to event camera data yields subpar performance. We attribute this to the spatial sparsity inherent in an event image (converted from event data), where many pixels carry no information. To mitigate this sparsity, we encode an event image into event patch features, automatically mine contextual similarity relationships among the patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features; a sketch of this idea follows.
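
The full grouping and loss formulation is in the paper itself; the sketch below only illustrates the general idea under assumptions of ours: L2-normalised patch features from two augmented views, contexts found by a small k-means-style clustering, and an InfoNCE-style context-to-context objective. Every name and hyperparameter here is hypothetical.

```python
import torch
import torch.nn.functional as F

def context_to_context_loss(feats_a, feats_b, num_contexts=8, iters=10, tau=0.1):
    """Sketch: cluster view A's patch features into contexts, pool view B's
    patches under the same assignments, and pull matched contexts together.
    feats_a, feats_b: (N_patches, D) L2-normalised features of two views."""
    with torch.no_grad():  # cluster without tracking gradients
        centroids = feats_a[torch.randperm(feats_a.size(0))[:num_contexts]].clone()
        for _ in range(iters):
            assign = (feats_a @ centroids.T).argmax(dim=1)  # nearest centroid
            for k in range(num_contexts):
                mask = assign == k
                if mask.any():
                    centroids[k] = F.normalize(feats_a[mask].mean(0), dim=0)
        assign = (feats_a @ centroids.T).argmax(dim=1)
    ctx_a, ctx_b = [], []
    for k in range(num_contexts):  # pool each context in both views
        mask = assign == k
        if mask.any():
            ctx_a.append(F.normalize(feats_a[mask].mean(0), dim=0))
            ctx_b.append(F.normalize(feats_b[mask].mean(0), dim=0))
    ctx_a, ctx_b = torch.stack(ctx_a), torch.stack(ctx_b)
    logits = ctx_a @ ctx_b.T / tau  # matched contexts are the positives
    return F.cross_entropy(logits, torch.arange(ctx_a.size(0), device=logits.device))
```

Pooling patches into contexts before comparing views is what lets such a loss sidestep the many empty patches of a sparse event image.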

To train our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks demonstrates the superiority of our method over state-of-the-art approaches.
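
The dataset pipeline is described only in the full paper; purely as an illustration of how synthetic events can be derived from rendered video (in the spirit of common event simulators), one can threshold log-intensity changes between consecutive frames. This toy model is our assumption, not the authors' pipeline:

```python
import numpy as np

def simulate_events(frame_prev, frame_next, threshold=0.2, eps=1e-6):
    """Toy event simulator: emit +/- events wherever the log-intensity
    change between two consecutive frames crosses the contrast threshold.
    frame_*: (H, W) arrays of linear intensity in [0, 1]."""
    dlog = np.log(frame_next + eps) - np.log(frame_prev + eps)
    n = np.floor(np.abs(dlog) / threshold).astype(int)  # events per pixel
    ys, xs = np.nonzero(n)
    counts = n[ys, xs]
    polarity = np.sign(dlog[ys, xs]).astype(int)
    # flatten to one (x, y, polarity) tuple per emitted event
    return np.repeat(xs, counts), np.repeat(ys, counts), np.repeat(polarity, counts)
```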

Notes

  1. https://dsec.ifi.uzh.ch/uzh/dsec-flow-optical-flow-benchmark/.

Acknowledgements

Liyuan Pan's work was supported in part by the Beijing Institute of Technology Research Fund Program for Young Scholars, the BIT Special-Zone, and the National Natural Science Foundation of China under Grant 62302045.

Author information

Corresponding author

Correspondence to Liyuan Pan.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Y., Pan, L., Liu, L. (2025). Event Camera Data Dense Pre-training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72775-7_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72774-0

  • Online ISBN: 978-3-031-72775-7

  • eBook Packages: Computer Science, Computer Science (R0)
