DOI: 10.1145/3552458.3556444
Research Article

Cross-modal Token Selection for Video Understanding

Published: 10 October 2022

Abstract

Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities, such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing computational cost. The key to our approach is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under comparable experimental conditions while reducing computation by 30 percent. Extensive ablation studies demonstrate the benefits of our method over naive approaches.
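The full paper is not reproduced on this page, but the abstract's description of the Token-Selector suggests a learned scoring-and-selection step applied to each modality's token sequence before cross-modal fusion. Below is a minimal PyTorch sketch of that idea; the module name, the linear scoring head, and the top-k selection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Hypothetical sketch of a token-selection step: score each token
    with a small learned head and keep only the top-k highest-scoring
    tokens to pass on to a cross-modal encoder."""

    def __init__(self, dim: int, num_keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token relevance score
        self.num_keep = num_keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. video patch or audio spectrogram tokens
        scores = self.score(tokens).squeeze(-1)                  # (batch, num_tokens)
        keep = scores.topk(self.num_keep, dim=1).indices         # indices of kept tokens
        keep = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, keep)                            # (batch, num_keep, dim)

# Example: condense 1568 video tokens to 128 before cross-modal attention.
video_tokens = torch.randn(2, 1568, 768)
selected = TokenSelector(dim=768, num_keep=128)(video_tokens)
print(selected.shape)  # torch.Size([2, 128, 768])
```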

Supplementary Material

MP4 File (HCMA22-hcma26p.mp4)
Presentation video introducing the work: background, motivation, methodology, and experiments.




Published In

cover image ACM Conferences
HCMA '22: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis
October 2022
106 pages
ISBN: 9781450394925
DOI: 10.1145/3552458
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. action recognition
  2. human-centric
  3. multi-modal
  4. transformer

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

HCMA '22 Paper Acceptance Rate: 12 of 21 submissions, 57%
Overall Acceptance Rate: 12 of 21 submissions, 57%

