Research article · DOI: 10.1145/3474085.3475327

Multimodal Global Relation Knowledge Distillation for Egocentric Action Anticipation

Published: 17 October 2021

Abstract

In this paper, we consider the task of action anticipation in egocentric videos. Previous methods ignore explicit modeling of the global context relations among past and future actions, which is difficult because the future part of the video is unobserved. To solve this problem, we propose a Multimodal Global Relation Knowledge Distillation (MGRKD) framework that distills knowledge learned from full videos to improve action anticipation on partially observed videos. MGRKD adopts a teacher-student learning strategy in which both the teacher and the student model have three branches of global relation graph networks (GRGN) that explore the pairwise relations between past and future actions based on three kinds of features (i.e., RGB, motion, or object). The teacher model has an architecture similar to that of the student model, except that the teacher uses the true feature of the future video snippet to build the graph in the GRGN, whereas the student uses a progressive GRU to predict an initial node feature for the future snippet. Through the teacher-student learning strategy, the discriminative features and the relation knowledge of past and future actions learned by the teacher model can be distilled into the student model. Experiments on two egocentric video datasets, EPIC-Kitchens and EGTEA Gaze+, show that the proposed framework achieves state-of-the-art performance.
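The teacher-student design described in the abstract can be illustrated with a minimal sketch. This toy example is not the authors' implementation: the layer sizes, the similarity-softmax adjacency, the single GRU step per snippet, and the MSE distillation loss are all illustrative assumptions made here to show the overall data flow (teacher graphs use the true future feature; the student substitutes a GRU-predicted one, then matches the teacher's node embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(x, w):
    """One graph-convolution step over a fully connected snippet graph.
    The adjacency is built from pairwise feature similarity (row-softmax)."""
    sim = x @ x.T                                  # pairwise relation scores
    a = np.exp(sim - sim.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)              # row-normalized adjacency
    return np.tanh(a @ x @ w)                      # aggregate, then transform

def gru_step(h, x, wz, wr, wh):
    """Minimal GRU cell used to roll the past features forward into a
    predicted feature for the unobserved future snippet."""
    z = 1 / (1 + np.exp(-(x @ wz[0] + h @ wz[1])))  # update gate
    r = 1 / (1 + np.exp(-(x @ wr[0] + h @ wr[1])))  # reset gate
    hh = np.tanh(x @ wh[0] + (r * h) @ wh[1])       # candidate state
    return (1 - z) * h + z * hh

d = 8                                    # toy feature dimension
past = rng.standard_normal((4, d))       # 4 observed past snippets
future_true = rng.standard_normal((1, d))

w = rng.standard_normal((d, d)) * 0.1
wz, wr, wh = [(rng.standard_normal((d, d)) * 0.1,
               rng.standard_normal((d, d)) * 0.1) for _ in range(3)]

# Teacher: builds the relation graph over past nodes plus the TRUE future node.
teacher_nodes = gcn_layer(np.vstack([past, future_true]), w)

# Student: rolls a GRU over the past to predict the future node feature,
# then builds the same kind of graph with the predicted node.
h = np.zeros((1, d))
for snippet in past:
    h = gru_step(h, snippet[None, :], wz, wr, wh)
student_nodes = gcn_layer(np.vstack([past, h]), w)

# Distillation loss: pull the student's node embeddings toward the teacher's.
distill_loss = float(np.mean((student_nodes - teacher_nodes) ** 2))
print(f"distillation loss: {distill_loss:.4f}")
```

In the full method this would run once per modality (RGB, motion, object), with the teacher frozen after training on full videos and the distillation term added to the student's anticipation loss.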

Supplementary Material

MP4 File (10.1145-3474085.3475327.mp4)
Video presentation




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. egocentric action anticipation
  2. graph network
  3. knowledge distillation


Funding Sources

  • National Natural Science Foundation of China
  • National Key Research and Development Program of China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-21. DOI: 10.1145/3633333. Online publication date: 11-Jan-2024
  • (2024) Uncertainty-Boosted Robust Video Activity Anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 7775-7792. DOI: 10.1109/TPAMI.2024.3393730. Online publication date: Dec-2024
  • (2024) VS-TransGRU: A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation. IEEE Transactions on Circuits and Systems for Video Technology 34(11), 11605-11618. DOI: 10.1109/TCSVT.2024.3425598. Online publication date: Nov-2024
  • (2024) A multivariate Markov chain model for interpretable dense action anticipation. Neurocomputing 574:C. DOI: 10.1016/j.neucom.2024.127285. Online publication date: 17-Apr-2024
  • (2024) Large Language Model for Action Anticipation. Artificial Neural Networks and Machine Learning – ICANN 2024, 207-222. DOI: 10.1007/978-3-031-72338-4_15. Online publication date: 17-Sep-2024
  • (2024) Egocentric Action Prediction via Knowledge Distillation and Subject-Action Relevance. Computer Vision and Image Processing, 565-573. DOI: 10.1007/978-3-031-58181-6_48. Online publication date: 3-Jul-2024
  • (2023) Slowfast Diversity-aware Prototype Learning for Egocentric Action Recognition. Proceedings of the 31st ACM International Conference on Multimedia, 7549-7558. DOI: 10.1145/3581783.3612144. Online publication date: 26-Oct-2023
