Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Temporal video alignment aims to synchronize key events, such as object interactions or action phase transitions, across two videos. Such methods can benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach identifies well-alignable videos from a large collection of clips and temporally synchronizes them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator, to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design that improves the alignment performance of several off-the-shelf feature representations; and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on three datasets, including the large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets.
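To make the retrieval-then-alignment idea concrete, here is a minimal illustrative sketch, not the paper's implementation: candidate clips are ranked by a nearest-neighbour cycle-consistency score over frame-level features (one simple way to realize a cycle-consistency criterion), and the top match is aligned to the query with classic dynamic time warping. All function names, the feature dimensions, and the random features are hypothetical stand-ins.

import numpy as np

def frame_similarity(q, c):
    # Cosine similarity between query frames q (Tq, D) and candidate frames c (Tc, D).
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    cn = c / np.linalg.norm(c, axis=1, keepdims=True)
    return qn @ cn.T

def cycle_consistency_score(q, c):
    # Fraction of query frames whose nearest candidate frame maps back to them.
    sim = frame_similarity(q, c)
    fwd = sim.argmax(axis=1)   # query frame -> nearest candidate frame
    bwd = sim.argmax(axis=0)   # candidate frame -> nearest query frame
    return float(np.mean(bwd[fwd] == np.arange(len(q))))

def dtw_align(q, c):
    # Classic dynamic time warping over the negated similarity matrix;
    # returns the warping path as (query_index, candidate_index) pairs.
    cost = -frame_similarity(q, c)
    Tq, Tc = cost.shape
    acc = np.full((Tq + 1, Tc + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Tc + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    path, i, j = [], Tq, Tc
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Usage: rank a gallery of candidate feature sequences, then align the top hit.
rng = np.random.default_rng(0)
query = rng.normal(size=(40, 128))                                   # (frames, feature dim)
gallery = [rng.normal(size=(rng.integers(30, 60), 128)) for _ in range(100)]
scores = [cycle_consistency_score(query, c) for c in gallery]
best = int(np.argmax(scores))
alignment = dtw_align(query, gallery[best])

In practice the random arrays above would be replaced by per-frame embeddings from a video backbone, and the cycle-consistency ranking stands in for the paper's DRAQ re-ranking stage.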

I. R. Dave: the majority of this work was done during an internship at Adobe Research, USA.


Author information

Corresponding author

Correspondence to Ishan Rajendrakumar Dave.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 14277 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dave, I.R., Heilbron, F.C., Shah, M., Jenni, S. (2025). Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_21

  • DOI: https://doi.org/10.1007/978-3-031-73242-3_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73241-6

  • Online ISBN: 978-3-031-73242-3

  • eBook Packages: Computer Science, Computer Science (R0)
