Research article · DOI: 10.1145/3474085.3475327

Multimodal Global Relation Knowledge Distillation for Egocentric Action Anticipation

Published: 17 October 2021

Abstract

In this paper, we consider the task of action anticipation in egocentric videos. Previous methods ignore explicit modeling of the global context relations among past and future actions, which is difficult because the future part of the video is unobserved. To solve this problem, we propose a Multimodal Global Relation Knowledge Distillation (MGRKD) framework that distills knowledge learned from full videos to improve action anticipation on partially observed videos. MGRKD adopts a teacher-student learning strategy in which both the teacher and the student model have three branches of global relation graph networks (GRGN) that explore the pairwise relations between past and future actions based on three kinds of features (i.e., RGB, motion, or object). The teacher model has an architecture similar to that of the student model, except that the teacher uses the true feature of the future video snippet to build the graph in the GRGN, whereas the student uses a progressive GRU to predict an initial node feature for the future snippet. Through the teacher-student learning strategy, the discriminative features and the relation knowledge of past and future actions learned by the teacher model can be distilled into the student model. Experiments on two egocentric video datasets, EPIC-Kitchens and EGTEA Gaze+, show that the proposed framework achieves state-of-the-art performance.
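The teacher-student design described in the abstract can be illustrated with a minimal sketch. This toy example is not the authors' implementation: the layer sizes, the similarity-softmax adjacency, the single GRU step per snippet, and the MSE distillation loss are all illustrative assumptions made here to show the overall data flow (teacher graphs use the true future feature; the student substitutes a GRU-predicted one, then matches the teacher's node embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(x, w):
    """One graph-convolution step over a fully connected snippet graph.
    The adjacency is built from pairwise feature similarity (row-softmax)."""
    sim = x @ x.T                                  # pairwise relation scores
    a = np.exp(sim - sim.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)              # row-normalized adjacency
    return np.tanh(a @ x @ w)                      # aggregate, then transform

def gru_step(h, x, wz, wr, wh):
    """Minimal GRU cell used to roll the past features forward into a
    predicted feature for the unobserved future snippet."""
    z = 1 / (1 + np.exp(-(x @ wz[0] + h @ wz[1])))  # update gate
    r = 1 / (1 + np.exp(-(x @ wr[0] + h @ wr[1])))  # reset gate
    hh = np.tanh(x @ wh[0] + (r * h) @ wh[1])       # candidate state
    return (1 - z) * h + z * hh

d = 8                                    # toy feature dimension
past = rng.standard_normal((4, d))       # 4 observed past snippets
future_true = rng.standard_normal((1, d))

w = rng.standard_normal((d, d)) * 0.1
wz, wr, wh = [(rng.standard_normal((d, d)) * 0.1,
               rng.standard_normal((d, d)) * 0.1) for _ in range(3)]

# Teacher: builds the relation graph over past nodes plus the TRUE future node.
teacher_nodes = gcn_layer(np.vstack([past, future_true]), w)

# Student: rolls a GRU over the past to predict the future node feature,
# then builds the same kind of graph with the predicted node.
h = np.zeros((1, d))
for snippet in past:
    h = gru_step(h, snippet[None, :], wz, wr, wh)
student_nodes = gcn_layer(np.vstack([past, h]), w)

# Distillation loss: pull the student's node embeddings toward the teacher's.
distill_loss = float(np.mean((student_nodes - teacher_nodes) ** 2))
print(f"distillation loss: {distill_loss:.4f}")
```

In the full method this would run once per modality (RGB, motion, object), with the teacher frozen after training on full videos and the distillation term added to the student's anticipation loss.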

Supplementary Material

MP4 File (10.1145-3474085.3475327.mp4)
Video presentation




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. egocentric action anticipation
  2. graph network
  3. knowledge distillation


Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-21. DOI: 10.1145/3633333. Online publication date: 11-Jan-2024
  • (2024) Uncertainty-Boosted Robust Video Activity Anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 7775-7792. DOI: 10.1109/TPAMI.2024.3393730. Online publication date: Dec-2024
  • (2024) VS-TransGRU: A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation. IEEE Transactions on Circuits and Systems for Video Technology 34(11), 11605-11618. DOI: 10.1109/TCSVT.2024.3425598. Online publication date: Nov-2024
  • (2024) A multivariate Markov chain model for interpretable dense action anticipation. Neurocomputing 574:C. DOI: 10.1016/j.neucom.2024.127285. Online publication date: 17-Apr-2024
  • (2024) Large Language Model for Action Anticipation. Artificial Neural Networks and Machine Learning – ICANN 2024, 207-222. DOI: 10.1007/978-3-031-72338-4_15. Online publication date: 17-Sep-2024
  • (2024) Egocentric Action Prediction via Knowledge Distillation and Subject-Action Relevance. Computer Vision and Image Processing, 565-573. DOI: 10.1007/978-3-031-58181-6_48. Online publication date: 3-Jul-2024
  • (2023) Slowfast Diversity-aware Prototype Learning for Egocentric Action Recognition. Proceedings of the 31st ACM International Conference on Multimedia, 7549-7558. DOI: 10.1145/3581783.3612144. Online publication date: 26-Oct-2023
