Abstract
Temporal video alignment aims to synchronize key events, such as object interactions or action phase transitions, across two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, which significantly limits their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach identifies well-alignable videos from a large collection of clips and temporally synchronizes them to the query. To achieve this, we make three key contributions: (1) we introduce DRAQ, a video-alignability indicator that identifies and re-ranks the best alignable video among a set of candidates; (2) we propose an effective and generalizable frame-level video feature design that improves the alignment performance of several off-the-shelf feature representations; and (3) we propose a novel benchmark and evaluation protocol for AVR based on cycle-consistency metrics. Our experiments on three datasets, including the large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets.
I. R. Dave: the majority of this work was done during an internship at Adobe Research, USA.
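The abstract describes a pipeline of retrieving candidate videos, re-ranking them by alignability, and evaluating alignments with cycle-consistency metrics. As an informal illustration only, the Python sketch below scores how alignable a candidate is to a query by running plain dynamic time warping (DTW) on per-frame embeddings in both directions and measuring the round-trip index drift. This is not the paper's actual DRAQ indicator or feature design; the helper names (`dtw_path`, `cycle_consistency_error`) and the synthetic frame embeddings are assumptions made for the demo.

```python
# Hypothetical sketch of cycle-consistency-based alignability scoring.
# Frame embeddings stand in for features from any off-the-shelf encoder;
# the scoring rule is illustrative, not the paper's DRAQ method.
import numpy as np


def dtw_path(a: np.ndarray, b: np.ndarray):
    """Classic O(n*m) DTW over L2 distances between frame embeddings;
    returns the warping path as (index-in-a, index-in-b) pairs."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def cycle_consistency_error(a: np.ndarray, b: np.ndarray) -> float:
    """Map each frame of `a` to `b` via DTW and back again; the mean
    index drift after the round trip serves as a (hypothetical)
    alignability score, where lower means better alignable."""
    fwd = dict(dtw_path(a, b))  # a-index -> matched b-index (last match wins)
    bwd = dict(dtw_path(b, a))  # b-index -> matched a-index
    drifts = [abs(bwd[fwd[i]] - i) for i in fwd]
    return float(np.mean(drifts))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(40, 128)).cumsum(axis=0)  # toy query "video"
    # Same events, different timing: a retimed copy of the query.
    retimed = query[np.linspace(0, 39, 55).round().astype(int)]
    unrelated = rng.normal(size=(55, 128)).cumsum(axis=0)
    print("retimed copy  :", cycle_consistency_error(query, retimed))
    print("unrelated clip:", cycle_consistency_error(query, unrelated))
```

In this toy setting, the retimed copy of the query yields a much smaller round-trip drift than the unrelated clip, mirroring how a retrieval system could re-rank candidate videos by alignability before synchronizing the best one to the query.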
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dave, I.R., Heilbron, F.C., Shah, M., Jenni, S. (2025). Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)