Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings

Published: 16 August 2024

Abstract

Action segmentation consists of temporally partitioning a video and labeling each resulting interval with a specific action. In this work, we propose a novel action segmentation method that requires no initial video analysis and no annotated data. Our method extracts features from videos using several pre-trained deep-learning models, both spatiotemporal and self-supervised. The features are then transformed with a positional encoder, and finally a clustering algorithm is applied, where each resulting cluster is assumed to correspond to a single, distinguishable action. For self-supervised features we explored DINO, and for spatiotemporal features we investigated I3D and SlowFast. We also compared two clustering algorithms (FINCH and KMeans) and examined how the length of the video snippets that generate the feature vectors affects segmentation quality. Experiments show that our method produces competitive results on the Breakfast and INRIA Instructional Videos benchmarks. Our best result combines self-supervised DINO features, positional encoding, and FINCH clustering.
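
As an illustration of the pipeline described above, the following sketch clusters positionally encoded snippet features into action segments. It is a minimal sketch, not the authors' implementation: it assumes per-snippet feature vectors (e.g., from a pre-trained DINO, I3D, or SlowFast backbone) are already available as a NumPy array, adds a standard sinusoidal positional encoding in the style of Vaswani et al., and uses scikit-learn's KMeans for the clustering step; FINCH and the feature extractors themselves are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans


def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (assumes an even dim)."""
    positions = np.arange(num_positions)[:, None]                        # (T, 1)
    div_terms = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (D/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe


def segment_video(snippet_features: np.ndarray, num_actions: int) -> np.ndarray:
    """Cluster positionally encoded snippet features into presumed actions.

    snippet_features: (T, D) array with one feature vector per video snippet,
    e.g., produced by a pre-trained DINO, I3D, or SlowFast model.
    Returns one cluster label per snippet; contiguous runs of the same label
    form the temporal segments.
    """
    num_snippets, dim = snippet_features.shape
    # Inject temporal position into otherwise order-agnostic features,
    # so the clustering can separate repeated appearances of similar frames.
    encoded = snippet_features + sinusoidal_positional_encoding(num_snippets, dim)
    return KMeans(n_clusters=num_actions, n_init=10, random_state=0).fit_predict(encoded)


# Toy usage: 300 snippets with hypothetical 64-dim features, 4 presumed actions.
features = np.random.randn(300, 64).astype(np.float32)
labels = segment_video(features, num_actions=4)
print(labels[:20])
```

Whether the positional encoding is added to or concatenated with the feature vectors is a design choice; the additive form above is an assumption, and the paper's actual composition details are in the full text.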

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 9
September 2024, 780 pages
EISSN: 1551-6865
DOI: 10.1145/3613681
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2024
Online AM: 24 February 2024
Accepted: 09 February 2024
Revised: 06 December 2023
Received: 22 April 2023
Published in TOMM Volume 20, Issue 9

Author Tags

  1. neural networks
  2. video understanding
  3. action segmentation
  4. clustering

Qualifiers

  • Research-article

Funding Sources

  • Air Force Office of Scientific Research
