DOI: 10.1145/3581783.3612035

View while Moving: Efficient Video Recognition in Long-untrimmed Videos

Published: 27 October 2023

Abstract

Recent adaptive methods for efficient video recognition mostly follow a two-stage "preview-then-recognition" paradigm and have achieved great success on multiple video benchmarks. However, this paradigm requires two passes over the raw frames during inference, from coarse-grained to fine-grained, which cannot be parallelized; moreover, the spatiotemporal features captured in the first stage cannot be reused in the second because their granularities differ, which hampers efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel "View while Moving" paradigm for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, ours accesses each raw frame only once: coarse-grained sampling and fine-grained recognition are combined into a single, unified stage of spatiotemporal modeling, yielding strong performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism that efficiently captures and reasons about unit-level and video-level temporal semantics in long-untrimmed videos. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, yielding new efficiency-accuracy trade-offs for video spatiotemporal modeling.
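
The one-pass idea can be made concrete with a short sketch. The PyTorch code below is a minimal illustration of the paradigm as described in the abstract, not the authors' implementation: every raw frame is encoded exactly once, a unit-level recurrence summarizes short semantic units, and a video-level recurrence reasons over the sequence of unit summaries to produce the prediction. The module names, the toy backbone, and the fixed unit length are all hypothetical assumptions.

    import torch
    import torch.nn as nn

    # Minimal sketch of one-pass ("view while moving") inference.
    # Hypothetical illustration only: the tiny backbone, the GRUs, and
    # the fixed unit length stand in for the paper's actual components.
    class ViewWhileMoving(nn.Module):
        def __init__(self, feat_dim=512, num_classes=200):
            super().__init__()
            # Toy per-frame encoder; each raw frame passes through it once.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            # Unit-level temporal semantics within each short semantic unit.
            self.unit_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
            # Video-level reasoning over the sequence of unit summaries.
            self.video_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, frames, unit_len=8):
            # frames: (T, 3, H, W); each frame is visited exactly once,
            # so coarse sampling and fine recognition share one pass.
            feats = self.frame_encoder(frames)               # (T, D)
            T, D = feats.shape
            usable = T - T % unit_len
            units = feats[:usable].reshape(-1, unit_len, D)  # (U, L, D)
            _, h_unit = self.unit_rnn(units)                 # (1, U, D)
            unit_summaries = h_unit[0].unsqueeze(0)          # (1, U, D)
            _, h_video = self.video_rnn(unit_summaries)      # (1, 1, D)
            return self.classifier(h_video[-1])              # (1, C)

    # Usage: a 64-frame clip, processed in a single pass over the frames.
    model = ViewWhileMoving()
    logits = model(torch.randn(64, 3, 112, 112))  # shape (1, 200)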



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. efficient video recognition
    2. long-untrimmed video

    Qualifiers

    • Research-article


    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

• Downloads (Last 12 months): 207
• Downloads (Last 6 weeks): 10
Reflects downloads up to 11 Dec 2024


    Cited By

• GAT-Based Bi-CARU with Adaptive Feature-Based Transformation for Video Summarisation. Technologies 12(8), 126 (2024). DOI: 10.3390/technologies12080126
• Semantic Fusion Based Graph Network for Video Scene Detection. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10651314
• DTA: Deformable Temporal Attention for Video Recognition. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650436
• AdaViPro: Region-Based Adaptive Visual Prompt For Large-Scale Models Adapting. 2024 IEEE International Conference on Image Processing (ICIP), 1316-1322. DOI: 10.1109/ICIP51287.2024.10647632
• Efficiently adapting large pre-trained models for real-time violence recognition in smart city surveillance. Journal of Real-Time Image Processing 21(4) (2024). DOI: 10.1007/s11554-024-01486-w
• LongVLM: Efficient Long Video Understanding via Large Language Models. Computer Vision – ECCV 2024, 453-470. DOI: 10.1007/978-3-031-73414-4_26
