Abstract
Semi-supervised temporal action detection requires only a small number of labeled samples from the dataset and exploits the remaining unlabeled samples during training, substantially reducing the time and manpower needed to annotate large-scale temporal action detection datasets. However, previous semi-supervised temporal action detection methods performed action localization and classification sequentially, so erroneous localization predictions easily corrupt the subsequent classification predictions, causing an error propagation problem. To overcome error propagation, we propose a dual-stream mutual information contraction and re-discrimination network (DmrNet), which restructures the traditional two-step strategy of temporal action detection into a four-step parallel strategy. First, classification prediction (step one) and localization prediction (step two) are designed as a parallel structure, preventing error propagation from localization to classification. Then, in the third step, the dual-stream mutual information contraction part maps the dual-stream features into a new vector space to preserve the cross-correlation between classification and action localization. Finally, in the fourth step, the classification re-discrimination part captures the consistency information of the dual-stream structure to enhance the internal representation. Using only 10% of the annotation data, DmrNet improves average accuracy over existing methods by 10.7% on ActivityNet v1.3 and 5.2% on THUMOS14. The experimental results show that the proposed DmrNet not only achieves strong detection performance under semi-supervised learning but also matches state-of-the-art methods under fully supervised learning.
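The parallel design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, shapes, and weights below are hypothetical, and the contraction term is approximated here by a Barlow-Twins-style cross-correlation objective (pushing the cross-correlation matrix of the two projected streams toward the identity), used only as an analogy for keeping the classification and localization streams correlated yet non-redundant.

```python
import numpy as np

rng = np.random.default_rng(0)

def parallel_heads(features, w_cls, w_loc):
    """Hypothetical parallel heads: classification and localization are
    predicted independently from shared features, so a localization
    error cannot propagate into the classification branch."""
    cls_logits = features @ w_cls   # (T, num_classes)
    loc_offsets = features @ w_loc  # (T, 2) start/end offsets
    return cls_logits, loc_offsets

def cross_correlation_loss(z_cls, z_loc, off_weight=0.005, eps=1e-8):
    """Illustrative 'contraction' term: after projecting both streams
    into a shared space, drive their cross-correlation matrix toward
    the identity (Barlow-Twins-style objective, an analogy only)."""
    z1 = (z_cls - z_cls.mean(0)) / (z_cls.std(0) + eps)
    z2 = (z_loc - z_loc.mean(0)) / (z_loc.std(0) + eps)
    c = (z1.T @ z2) / len(z1)                      # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()      # diagonal -> 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # off-diagonal -> 0
    return on_diag + off_weight * off_diag

# Toy forward pass with random features and weights.
T, D, C = 100, 64, 20
feats = rng.standard_normal((T, D))
w_cls = rng.standard_normal((D, C))
w_loc = rng.standard_normal((D, 2))
cls_logits, loc_offsets = parallel_heads(feats, w_cls, w_loc)

# Projected dual-stream embeddings (hypothetical 16-d shared space).
z_a = rng.standard_normal((T, 16))
z_b = z_a + 0.1 * rng.standard_normal((T, 16))
loss = cross_correlation_loss(z_a, z_b)
```

The key design point the sketch captures is structural: both heads read the same features but neither head's output feeds the other, which is what blocks localization errors from reaching the classifier.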
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62001413), the General Program of the National Natural Science Foundation of China (No. 61771420), and the Natural Science Foundation of Hebei Province of China (No. F2020203064).
Author information
Authors and Affiliations
Contributions
Conceptualization, Qiming Zhang and Zhengping Hu; methodology, Qiming Zhang; formal analysis and investigation, Qiming Zhang; writing — original draft preparation, Qiming Zhang and Zhengping Hu; writing — review and editing, Qiming Zhang and Zhengping Hu; funding acquisition, Zhengping Hu; resources, Yulu Wang and Hehao Zhang; and supervision, Shuai Bi and Jirui Di.
Corresponding author
Ethics declarations
Ethics Approval
This article does not contain any studies that used human participants or animals.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Q., Hu, Z., Wang, Y. et al. DmrNet: Dual-stream Mutual Information Contraction and Re-discrimination Network for Semi-supervised Temporal Action Detection. Cogn Comput 17, 15 (2025). https://doi.org/10.1007/s12559-024-10374-1