
Involving Distinguished Temporal Graph Convolutional Networks for Skeleton-Based Temporal Action Segmentation

Published: 12 June 2023

Abstract

For RGB-based temporal action segmentation (TAS), methods that capture frame-level features have achieved remarkable performance. Motion-centered TAS, however, remains challenging because existing methods neglect the spatial features of skeleton joints. In addition, inaccurate action boundaries caused by frames with similar motion break the integrity of action segments. To alleviate these issues, we propose an end-to-end Involving Distinguished Temporal Graph Convolutional Network (IDT-GCN). First, we construct an enhanced spatial graph structure that adaptively captures both similar and differential dependencies between joints within a single topology by learning two independent correlation-modeling functions. Then, the proposed Involving Distinguished Graph Convolution (ID-GC) models the spatial correlations of different actions in a video by applying multiple enhanced topologies to the corresponding channels. Furthermore, we design a generic temporal action regression network, termed Temporal Segment Regression (TSR), to extract segment-encoding features and action-boundary representations by modeling action sequences. Combining these components with label-smoothing modules yields a powerful spatial-temporal graph convolutional network (IDT-GCN) for fine-grained TAS, which notably outperforms state-of-the-art methods on the MCFS-22 and MCFS-130 datasets. Adding TSR to TCN-based baselines achieves performance competitive with state-of-the-art transformer-based methods on the RGB-based Breakfast and 50Salads datasets. Further experiments on the action recognition task verify the superiority of the enhanced spatial graph structure over previous graph convolutional networks.
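The key mechanism the abstract describes, two independently learned correlation functions that refine a shared skeleton topology on a per-channel-group basis, can be sketched compactly. The snippet below is a minimal, hypothetical PyTorch illustration of that idea; the module name IDGraphConv, the embedding size, and the way the similar and differential terms are fused are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IDGraphConv(nn.Module):
    """Hypothetical sketch of an 'involving distinguished' graph convolution:
    two independent correlation functions refine a shared skeleton topology
    for each channel group, as the abstract describes at a high level."""

    def __init__(self, in_ch, out_ch, num_joints, groups=8, embed=16):
        super().__init__()
        assert in_ch % groups == 0, "channels must split evenly into groups"
        self.groups, self.embed = groups, embed
        # Placeholder base topology; in practice this would be the dataset's
        # joint-connectivity adjacency matrix, one copy per channel group.
        self.A = nn.Parameter(torch.eye(num_joints).repeat(groups, 1, 1))
        # Two independent correlation-modeling functions, capturing
        # "similar" and "differential" dependencies between joints.
        self.phi = nn.Conv2d(in_ch, embed * groups, kernel_size=1)
        self.psi = nn.Conv2d(in_ch, embed * groups, kernel_size=1)
        self.out = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):  # x: (N, C, T, V) = batch, channels, frames, joints
        N, C, T, V = x.shape
        p = self.phi(x).mean(dim=2).view(N, self.groups, self.embed, V)
        q = self.psi(x).mean(dim=2).view(N, self.groups, self.embed, V)
        # Similar dependencies as an inner product, differential dependencies
        # as pairwise feature differences; fused into one refined topology.
        sim = torch.einsum('ngev,ngew->ngvw', p, q)
        diff = (p.unsqueeze(-1) - q.unsqueeze(-2)).mean(dim=2)
        A = self.A.unsqueeze(0) + torch.tanh(sim + diff)  # (N, groups, V, V)
        xg = x.view(N, self.groups, C // self.groups, T, V)
        y = torch.einsum('ngctv,ngvw->ngctw', xg, A).reshape(N, C, T, V)
        return self.out(y)

# Usage: a (2, 64, 100, 25) skeleton sequence maps to (2, 128, 100, 25).
# x = torch.randn(2, 64, 100, 25)
# y = IDGraphConv(64, 128, num_joints=25)(x)
```

Each channel group receives its own refined adjacency, echoing the channel-wise topologies used by ID-GC; stacking such layers with a temporal model would give the spatial half of a network like IDT-GCN.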


Cited By

  • MIGA-Net: Multi-View Image Information Learning Based on Graph Attention Network for SAR Target Recognition, IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, Part 1, pp. 10779–10792, 2024. DOI: 10.1109/TCSVT.2024.3418979. Online publication date: 24 Jun. 2024.
  • Positive and Negative Set Designs in Contrastive Feature Learning for Temporal Action Segmentation, IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, Part 1, pp. 11156–11168, 2024. DOI: 10.1109/TCSVT.2024.3417392. Online publication date: 21 Jun. 2024.


Information

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 1, Jan. 2024, 659 pages

Publisher

IEEE Press

Publication History

Published: 12 June 2023

Qualifiers

• Research-article

