Abstract
Video action segmentation is a crucial task for evaluating how well models understand human activities. Previous works on this task focus mainly on capturing complex temporal structures while overlooking both the feature ambiguity among similar actions and bias in the training sets, so they easily confuse certain actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to address this issue. First, we design a discriminative enhancement module (DEM) trained with adaptive margin-guided discriminative feature learning, which adjusts the margin adaptively to increase feature distinguishability among similar actions; its multi-stage reasoning and adaptive feature fusion structures provide further structural advantages for separating similar actions. Second, we propose an equalizing influence module (EIM) that overcomes the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy- and context-driven refinement module (ECRM) further alleviates the unbalanced influence of training samples by fusing and refining the inferences of the DEM and EIM; it exploits phased predictions, including context and energy cues, to assimilate untrustworthy segments, which substantially reduces over-segmentation. Extensive experiments demonstrate the effectiveness of each proposed technique, verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and show that our approach achieves significant improvements and state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.
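To make the adaptive-margin idea concrete, the following PyTorch sketch shows one way such a loss could be realized: the margin applied to each ground-truth class grows with the similarity of its nearest other class prototype, so easily confused actions are pushed further apart in feature space. The class `AdaptiveMarginLoss`, its margin rule, and all hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMarginLoss(nn.Module):
    """Toy additive-margin softmax whose per-class margin grows with the
    similarity of the closest other class prototype, so that confusable
    actions are pushed further apart. The margin rule and hyper-parameters
    are illustrative assumptions, not the paper's exact formulation."""

    def __init__(self, feat_dim: int, num_classes: int,
                 scale: float = 30.0, base_margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.base_margin = base_margin

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized frame features and class prototypes.
        w = F.normalize(self.weight, dim=1)          # (C, D)
        x = F.normalize(feats, dim=1)                # (N, D)
        cos = x @ w.t()                              # (N, C)

        # Adaptive margin: classes whose nearest other prototype is very
        # similar receive a larger margin (range [base, 2 * base]).
        with torch.no_grad():
            proto_sim = w @ w.t()                    # (C, C)
            proto_sim.fill_diagonal_(-1.0)           # ignore self-similarity
            nearest = proto_sim.max(dim=1).values.clamp(min=0.0)
            margin = self.base_margin * (1.0 + nearest)  # (C,)

        # Subtract the margin from the ground-truth logit only, then apply
        # scaled cross-entropy as in additive-margin softmax.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
        logits = self.scale * (cos - one_hot * margin[labels].unsqueeze(1))
        return F.cross_entropy(logits, labels)
```

For frame-level features `feats` of shape (N, D) and integer `labels` of shape (N,), `AdaptiveMarginLoss(D, C)(feats, labels)` returns a scalar training loss; the margin is recomputed from the current prototypes at every step, so it adapts as the classes separate.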
Data Availability
The datasets used in this paper can be downloaded from the 50Salads repository (https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/), the GTEA repository (https://cbs.ic.gatech.edu/fpv/), and the Breakfast repository (https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/). The data generated during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is supported by the Beijing Natural Science Foundation (Nos. 4222037 and L181010) and the National Natural Science Foundation of China (No. 61972035).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, Z., Li, K. Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement. Machine Vision and Applications 35, 21 (2024). https://doi.org/10.1007/s00138-023-01505-z