
Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement

  • Original Paper
  • Published in: Machine Vision and Applications

Abstract

Video action segmentation is a crucial task for evaluating the ability to understand human activities. Previous works on this task mainly focus on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the bias of training sets, so they easily confuse some actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve this issue. First, we design a discriminative enhancement module (DEM) trained with adaptive margin-guided discriminative feature learning, which adjusts the margin adaptively to increase the feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that overcomes the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy- and context-driven refinement module (ECRM) further alleviates the impact of the unbalanced influence of training samples by fusing and refining the inferences of the DEM and EIM: it uses phased predictions, including context and energy clues, to assimilate untrustworthy segments, greatly alleviating over-segmentation. Extensive experiments show the effectiveness of each proposed technique, verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and show that our approach achieves significant improvements and state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.
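Two of the abstract's ideas lend themselves to short sketches. First, a minimal PyTorch sketch of adaptive margin-guided discriminative learning in the spirit of the DEM: cosine logits against class centers, with the target class's margin enlarged when its center lies close to another class's center. The adaptation rule and every name here (`AdaptiveMarginLoss`, `base_margin`, `scale`) are illustrative assumptions; the abstract does not give the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMarginLoss(nn.Module):
    """Additive-margin softmax whose margin grows for easily confused classes.

    A sketch under assumed design choices, not the paper's exact loss.
    """

    def __init__(self, feat_dim, num_classes, scale=30.0, base_margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale
        self.base_margin = base_margin

    def forward(self, feats, labels):
        # Cosine logits between L2-normalized frame features and class centers.
        w = F.normalize(self.weight, dim=1)
        cos = F.normalize(feats, dim=1) @ w.t()              # (N, C)

        # Assumed adaptation rule: enlarge the target class's margin in
        # proportion to its similarity with its nearest other class, so
        # look-alike actions are pushed further apart in feature space.
        with torch.no_grad():
            sim = w @ w.t()
            sim.fill_diagonal_(-1.0)
            hardest = sim.max(dim=1).values.clamp(min=0.0)   # (C,)
        m = self.base_margin * (1.0 + hardest[labels])       # (N,)

        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.scale * torch.where(target, cos - m.unsqueeze(1), cos)
        return F.cross_entropy(logits, labels)
```

Second, the ECRM's energy clue can be illustrated with the standard energy score E(x) = -T · logsumexp(logits / T) from energy-based out-of-distribution detection. The single-pass refinement below (assimilate a segment whose mean energy exceeds a threshold into its lower-energy neighbor) is an assumed stand-in for the paper's context- and energy-driven rule, not its actual procedure.

```python
import torch

def refine_segments(frame_logits, threshold, temperature=1.0):
    """Relabel high-energy (untrustworthy) segments from neighboring context.

    frame_logits: (T, C) per-frame class scores; returns refined labels (T,).
    """
    preds = frame_logits.argmax(dim=1)
    energy = -temperature * torch.logsumexp(frame_logits / temperature, dim=1)

    # Split the frame-wise prediction into constant-label segments.
    segments, start = [], 0
    for t in range(1, len(preds) + 1):
        if t == len(preds) or preds[t] != preds[start]:
            segments.append((start, t))
            start = t

    # Single pass: a segment with untrustworthy (high) mean energy adopts
    # the label of whichever neighbor has lower mean energy.
    refined = preds.clone()
    for i, (s, e) in enumerate(segments):
        if energy[s:e].mean() > threshold:
            neighbors = [seg for seg in (segments[i - 1] if i > 0 else None,
                                         segments[i + 1] if i + 1 < len(segments) else None)
                         if seg is not None]
            if neighbors:
                ns, ne = min(neighbors,
                             key=lambda seg: energy[seg[0]:seg[1]].mean().item())
                refined[s:e] = preds[ns]
    return refined
```

Absorbing a high-energy segment into a neighbor shrinks over-segmentation (and thus lifts the edit and F1 scores) without retraining, which is consistent with the abstract's claim that the refinement assimilates untrustworthy segments.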


Data Availability

The datasets used in this paper can be downloaded from the 50Salads repository (https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/), the GTEA repository (https://cbs.ic.gatech.edu/fpv/), and the Breakfast repository (https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/). The data generated during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the Beijing Natural Science Foundation (Nos. 4222037 and L181010) and the National Natural Science Foundation of China (No. 61972035).

Author information


Corresponding author

Correspondence to Kan Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, Z., Li, K. Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement. Machine Vision and Applications 35, 21 (2024). https://doi.org/10.1007/s00138-023-01505-z

