
Motion Embedded Image: A combination of spatial and temporal features for action recognition

Published: 01 December 2022

Abstract

Demand for human activity recognition from videos has grown rapidly in many real-life applications, e.g., video surveillance, entertainment, healthcare, and care facilities for children and the elderly. Moreover, the explosion of short-form videos on social networking platforms such as TikTok, Facebook, and YouTube has drawn even more attention to this problem. In this paper, we focus on human activity recognition in general short videos. Compared with still images, clips provide both spatial and temporal information, and the challenge is to capture the complementary cues: appearance from still frames and motion between frames. Our contribution is two-fold. First, we study an approach that uses motion-embedded images in a variation of the two-stream ConvNet architecture: one stream is a motion stream that captures and recognizes motion from batches of frames embedded into single images; the other is a standard image classification ConvNet fed still frames to classify static appearance and recover the spatial information missing from the first stream. Second, we build a new dataset of Southeast Asian sports short videos, SEAGS-V1, consisting of both standard videos with no effects and non-standard videos with effects, a modern factor that all currently available benchmark datasets lack. Our model is trained and evaluated with different backbone architectures on two benchmarks: UCF-101 and SEAGS-V1. The results show that our model achieves competitive performance compared with previous attempts to apply deep nets to human activity recognition in short-form videos.
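
As a rough illustration of the architecture the abstract describes, here is a minimal PyTorch sketch. Everything in it is an assumption made for illustration: the function motion_embedded_image, the class TwoStreamNet, the ResNet-18 backbones, and the score-averaging fusion are hypothetical stand-ins, not the paper's exact embedding or fusion scheme.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


def motion_embedded_image(frames: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive frames (T, 3, H, W) into one 3-channel image by
    averaging absolute frame-to-frame differences, one plausible way to embed
    motion into a still-image format (hypothetical, not the paper's scheme)."""
    diffs = (frames[1:] - frames[:-1]).abs()  # (T-1, 3, H, W)
    return diffs.mean(dim=0)                  # (3, H, W)


class TwoStreamNet(nn.Module):
    """Motion stream sees the motion-embedded image; spatial stream sees a
    still frame; class scores from both streams are fused by averaging."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.motion_stream = resnet18(num_classes=num_classes)
        self.spatial_stream = resnet18(num_classes=num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, 3, H, W) -> one motion image and one still frame per clip
        motion = torch.stack([motion_embedded_image(c) for c in clips])
        still = clips[:, clips.shape[1] // 2]  # middle frame of each clip
        return (self.motion_stream(motion) + self.spatial_stream(still)) / 2


clips = torch.randn(2, 8, 3, 112, 112)         # 2 clips of 8 RGB frames
logits = TwoStreamNet(num_classes=101)(clips)  # (2, 101), e.g. UCF-101
```

Averaging class scores is only one plausible late-fusion choice; the paper's variation of the two-stream design may embed frames and fuse streams differently.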

Published In

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology
December 2022, 474 pages
ISBN: 9781450397254
DOI: 10.1145/3568562

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. action recognition
  2. motion embedded image
  3. sports dataset
  4. two-stream network

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Honors Program, University of Science, Vietnam National University - Ho Chi Minh City

Conference

SoICT 2022

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
