DOI: 10.1145/3470482.3479632
research-article

A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings

Published: 05 November 2021

Abstract

A crucial task in video understanding is the recognition and temporal localisation of the different actions or events that occur across scenes. Addressing it requires action segmentation, which consists of temporally segmenting a video by labeling each frame with a specific action. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method extracts spatio-temporal features from 0.5 s samples of each video using a pre-trained deep network. The features are then transformed using a positional encoder, and finally a clustering algorithm is applied, with the silhouette score used to find the optimal number of clusters; each cluster presumably corresponds to a single, distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos benchmark datasets.
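
The pipeline described in the abstract (per-sample features, positional encoding, clustering, silhouette-based choice of the number of clusters) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the encoder or the clustering algorithm, so the sketch assumes a sinusoidal positional encoding added to each feature vector and a plain k-means with a deterministic, evenly spaced initialisation. Feature extraction with a pre-trained network is left out; toy vectors stand in for the features of each 0.5 s sample. All function names are hypothetical.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding of a sample index (an assumed encoder choice)."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_position(features):
    """Add the encoding of each sample's index to its feature vector."""
    dim = len(features[0])
    return [[f + p for f, p in zip(vec, positional_encoding(t, dim))]
            for t, vec in enumerate(features)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Plain k-means; centres start evenly spaced along the timeline."""
    n = len(points)
    centers = [list(points[(2 * c + 1) * n // (2 * k)]) for c in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        mean_to = {}
        for c in clusters:
            d = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            mean_to[c] = sum(d) / len(d) if d else None
        a = mean_to[labels[i]]
        if a is None:
            continue  # a singleton cluster contributes 0
        b = min(v for c, v in mean_to.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def segment(features, k_values=range(2, 6)):
    """Cluster encoded samples; keep the labeling with the best silhouette."""
    encoded = add_position(features)
    candidates = [kmeans(encoded, k) for k in k_values]
    return max(candidates, key=lambda labels: silhouette(encoded, labels))


# Toy input: 15 samples in 3 well-separated groups of 5 (hypothetical data).
toy = [[v] * 4 for v in (0.0, 100.0, 200.0) for _ in range(5)]
labels = segment(toy)
```

Here `labels` holds one cluster index per sample, which maps back to time by multiplying the sample index by the sample length. The candidate range `k_values` is an arbitrary choice for this toy input; a real video would need a range suited to the expected number of actions.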

      Published In

      WebMedia '21: Proceedings of the Brazilian Symposium on Multimedia and the Web
      November 2021
      271 pages
      ISBN: 9781450386098
      DOI: 10.1145/3470482

      Sponsors

      In-Cooperation

      • SBC: Brazilian Computer Society
      • CNPq: Conselho Nacional de Desenvolvimento Cientifico e Tecn
      • CAPES: Brazilian Higher Education Funding Council

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Action recognition
      2. Action segmentation
      3. I3D
      4. Positional encoding

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WebMedia '21
      WebMedia '21: Brazilian Symposium on Multimedia and the Web
      November 5 - 12, 2021
      Belo Horizonte, Minas Gerais, Brazil

      Acceptance Rates

      WebMedia '21 Paper Acceptance Rate 24 of 75 submissions, 32%;
      Overall Acceptance Rate 270 of 873 submissions, 31%
