DOI: 10.1145/3581783.3612035

View while Moving: Efficient Video Recognition in Long-untrimmed Videos

Published: 27 October 2023

Abstract

Recent adaptive methods for efficient video recognition mostly follow a two-stage "preview-then-recognition" paradigm and have achieved great success on multiple video benchmarks. However, this paradigm requires two passes over the raw frames during inference, from coarse-grained to fine-grained, which cannot be parallelized; moreover, the spatiotemporal features captured in the first stage cannot be reused in the second because their granularities differ, which hampers efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel "View while Moving" paradigm for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, ours accesses each raw frame only once: coarse-grained sampling and fine-grained recognition are combined into a single, unified stage of spatiotemporal modeling, yielding strong performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism that efficiently captures and reasons about unit-level and video-level temporal semantics in long-untrimmed videos. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, yielding new efficiency-accuracy trade-offs for video spatiotemporal modeling.
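
The one-pass idea can be made concrete with a short sketch. The PyTorch code below is a minimal illustration of the paradigm as described in the abstract, not the authors' implementation: every raw frame is encoded exactly once, a unit-level recurrence summarizes short semantic units, and a video-level recurrence reasons over the sequence of unit summaries to produce the prediction. The module names, the toy backbone, and the fixed unit length are all hypothetical assumptions.

    import torch
    import torch.nn as nn

    # Minimal sketch of one-pass ("view while moving") inference.
    # Hypothetical illustration only: the tiny backbone, the GRUs, and
    # the fixed unit length stand in for the paper's actual components.
    class ViewWhileMoving(nn.Module):
        def __init__(self, feat_dim=512, num_classes=200):
            super().__init__()
            # Toy per-frame encoder; each raw frame passes through it once.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            # Unit-level temporal semantics within each short semantic unit.
            self.unit_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
            # Video-level reasoning over the sequence of unit summaries.
            self.video_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, frames, unit_len=8):
            # frames: (T, 3, H, W); each frame is visited exactly once,
            # so coarse sampling and fine recognition share one pass.
            feats = self.frame_encoder(frames)               # (T, D)
            T, D = feats.shape
            usable = T - T % unit_len
            units = feats[:usable].reshape(-1, unit_len, D)  # (U, L, D)
            _, h_unit = self.unit_rnn(units)                 # (1, U, D)
            unit_summaries = h_unit[0].unsqueeze(0)          # (1, U, D)
            _, h_video = self.video_rnn(unit_summaries)      # (1, 1, D)
            return self.classifier(h_video[-1])              # (1, C)

    # Usage: a 64-frame clip, processed in a single pass over the frames.
    model = ViewWhileMoving()
    logits = model(torch.randn(64, 3, 112, 112))  # shape (1, 200)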



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. efficient video recognition
    2. long-untrimmed video

    Qualifiers

    • Research-article


    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

• Downloads (Last 12 months): 207
• Downloads (Last 6 weeks): 10
Reflects downloads up to 11 Dec 2024


    Cited By

• GAT-Based Bi-CARU with Adaptive Feature-Based Transformation for Video Summarisation. Technologies 12(8), 126 (2024). DOI: 10.3390/technologies12080126
• Semantic Fusion Based Graph Network for Video Scene Detection. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10651314
• DTA: Deformable Temporal Attention for Video Recognition. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650436
• AdaViPro: Region-Based Adaptive Visual Prompt For Large-Scale Models Adapting. 2024 IEEE International Conference on Image Processing (ICIP), 1316-1322. DOI: 10.1109/ICIP51287.2024.10647632
• Efficiently adapting large pre-trained models for real-time violence recognition in smart city surveillance. Journal of Real-Time Image Processing 21(4) (2024). DOI: 10.1007/s11554-024-01486-w
• LongVLM: Efficient Long Video Understanding via Large Language Models. Computer Vision – ECCV 2024, 453-470. DOI: 10.1007/978-3-031-73414-4_26
