
Motion Embedded Image: A combination of spatial and temporal features for action recognition

Published: 01 December 2022

Abstract

Demand for human activity recognition from videos has grown rapidly in many real-life applications, e.g., video surveillance, entertainment, healthcare, and care facilities for children and the elderly. Moreover, the explosion of short-form videos on social networking platforms such as TikTok, Facebook, and YouTube has drawn even more attention to this problem. In this paper, we focus on human activity recognition in general short videos. Compared with still images, clips provide both spatial and temporal information, and the challenge is to capture the complementary cues: appearance from still frames and motion between frames. Our contribution is two-fold. First, we study an approach that uses motion-embedded images in a variation of the two-stream ConvNet architecture: one stream is a motion stream that captures and recognizes motion from batches of frames embedded into single images; the other is a standard image classification ConvNet fed still frames to classify static appearance and recover the spatial information missing from the first stream. Second, we build a new dataset of Southeast Asian sports short videos, SEAGS-V1, consisting of both standard videos with no effects and non-standard videos with effects, a modern factor that all currently available benchmark datasets lack. Our model is trained and evaluated with different backbone architectures on two benchmarks: UCF-101 and SEAGS-V1. The results show that our model achieves competitive performance compared with previous attempts to apply deep nets to human activity recognition in short-form videos.
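
As a rough illustration of the architecture the abstract describes, here is a minimal PyTorch sketch. Everything in it is an assumption made for illustration: the function motion_embedded_image, the class TwoStreamNet, the ResNet-18 backbones, and the score-averaging fusion are hypothetical stand-ins, not the paper's exact embedding or fusion scheme.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


def motion_embedded_image(frames: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive frames (T, 3, H, W) into one 3-channel image by
    averaging absolute frame-to-frame differences, one plausible way to embed
    motion into a still-image format (hypothetical, not the paper's scheme)."""
    diffs = (frames[1:] - frames[:-1]).abs()  # (T-1, 3, H, W)
    return diffs.mean(dim=0)                  # (3, H, W)


class TwoStreamNet(nn.Module):
    """Motion stream sees the motion-embedded image; spatial stream sees a
    still frame; class scores from both streams are fused by averaging."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.motion_stream = resnet18(num_classes=num_classes)
        self.spatial_stream = resnet18(num_classes=num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, 3, H, W) -> one motion image and one still frame per clip
        motion = torch.stack([motion_embedded_image(c) for c in clips])
        still = clips[:, clips.shape[1] // 2]  # middle frame of each clip
        return (self.motion_stream(motion) + self.spatial_stream(still)) / 2


clips = torch.randn(2, 8, 3, 112, 112)         # 2 clips of 8 RGB frames
logits = TwoStreamNet(num_classes=101)(clips)  # (2, 101), e.g. UCF-101
```

Averaging class scores is only one plausible late-fusion choice; the paper's variation of the two-stream design may embed frames and fuse streams differently.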

Published In

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology
December 2022, 474 pages
ISBN: 9781450397254
DOI: 10.1145/3568562

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. action recognition
  2. motion embedded image
  3. sports dataset
  4. two-stream network

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Honors Program, University of Science, Vietnam National University - Ho Chi Minh City

Conference

SoICT 2022

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
