DOI: 10.1145/3552458.3556444
Research Article

Cross-modal Token Selection for Video Understanding

Published: 10 October 2022

Abstract

Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities, such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing computational cost. The key to our approach is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under comparable experimental conditions while reducing computation by 30 percent. Extensive ablation studies demonstrate the benefits of our method over naive approaches.
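The full paper is not reproduced on this page, but the abstract's description of the Token-Selector suggests a learned scoring-and-selection step applied to each modality's token sequence before cross-modal fusion. Below is a minimal PyTorch sketch of that idea; the module name, the linear scoring head, and the top-k selection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Hypothetical sketch of a token-selection step: score each token
    with a small learned head and keep only the top-k highest-scoring
    tokens to pass on to a cross-modal encoder."""

    def __init__(self, dim: int, num_keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token relevance score
        self.num_keep = num_keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. video patch or audio spectrogram tokens
        scores = self.score(tokens).squeeze(-1)                  # (batch, num_tokens)
        keep = scores.topk(self.num_keep, dim=1).indices         # indices of kept tokens
        keep = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, keep)                            # (batch, num_keep, dim)

# Example: condense 1568 video tokens to 128 before cross-modal attention.
video_tokens = torch.randn(2, 1568, 768)
selected = TokenSelector(dim=768, num_keep=128)(video_tokens)
print(selected.shape)  # torch.Size([2, 128, 768])
```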

Supplementary Material

MP4 File (HCMA22-hcma26p.mp4)
Presentation video introducing the work: background, motivation, methodology, and experiments.




Published In

cover image ACM Conferences
HCMA '22: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis
October 2022
106 pages
ISBN: 9781450394925
DOI: 10.1145/3552458
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. action recognition
  2. human-centric
  3. multi-modal
  4. transformer

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

HCMA '22 Paper Acceptance Rate: 12 of 21 submissions, 57%
Overall Acceptance Rate: 12 of 21 submissions, 57%

