Article

Research on Multi-Scale Spatio-Temporal Graph Convolutional Human Behavior Recognition Method Incorporating Multi-Granularity Features

1 College of Intelligent Transportation, Chongqing Vocational College of Public Transportation, Chongqing 402260, China
2 College of Electrical and Electronic Engineering, Chongqing University of Technology, Chongqing 400054, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(23), 7595; https://doi.org/10.3390/s24237595
Submission received: 15 October 2024 / Revised: 21 November 2024 / Accepted: 25 November 2024 / Published: 28 November 2024
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems)

Abstract

To address the problem that existing human skeleton behavior recognition methods are insensitive to local human movements and recognize similar behaviors inaccurately, a multi-scale spatio-temporal graph convolution method incorporating multi-granularity features is proposed for human behavior recognition. Firstly, a skeleton fine-grained partitioning strategy is proposed, which initializes the skeleton data into data streams of different granularities. An adaptive cross-scale feature fusion layer is designed using a normalized Gaussian function to perform feature fusion among different granularities, guiding the model to focus on discriminative feature representations among similar behaviors through fine-grained features. Secondly, a sparse multi-scale adjacency matrix is introduced to solve the bias weighting problem that arises in multi-scale spatial domain modeling under multi-granularity conditions. Finally, an end-to-end graph convolutional neural network is constructed to improve the feature expression ability of spatio-temporal receptive field information and enhance the robustness of recognition between similar behaviors. The feasibility of the proposed algorithm was verified on the public behavior recognition dataset MSR Action 3D, achieving an accuracy of 95.67%, which is superior to existing behavior recognition methods.

1. Introduction

Behavior recognition is an extremely important topic in computer vision; it has long attracted considerable attention and has broad application prospects in intelligent surveillance, smart transportation, human–computer interaction, and other areas [1,2,3]. At present, human skeleton behavior recognition based on deep learning is mainly divided into three categories. The first uses convolutional neural networks [4,5,6] to model skeleton data as pseudo-images, extracting highly abstract skeletal structural features. The second uses recurrent neural networks [7,8,9,10] to model skeleton data as sequences of coordinate vectors, capturing the dynamic correlations between consecutive frames of skeletal data to predict behavior categories. The last is the graph convolutional network (GCN), which represents the human skeleton sequence as a spatio-temporal topological graph. By utilizing graph convolution, it effectively extracts the global features of the skeleton's spatial structure, thereby enabling better modeling of the spatio-temporal characteristics of human skeleton information. Therefore, graph convolution-based human skeleton behavior recognition methods have become a research hotspot in recent years.
Yan et al. [11] introduced spatial–temporal graph convolutional networks (ST-GCNs), which for the first time utilized graph convolutional networks to model human skeleton data and achieved good recognition results in the process of action recognition. Shi et al. [12] proposed adaptive graph convolution, which calculates the similarity between joints based on input skeleton data of different action classes to adaptively measure the degree of correlation between joints. Li et al. [13] proposed a motion structure that emphasizes the dependency relationship between non-adjacent joints in space through action-linking modules and structural linking modules. References [14,15,16] proposed a multi-scale spatial graph convolutional network to capture feature information between joints in a wider space, using high-order polynomials of the adjacency matrix to aggregate features between remote joints. However, these methods have bias weighting issues in the process of spatial domain modeling, which means that in the process of modeling spatial position relationships using high-order adjacency matrices, joints far from the target joint make little contribution to recognition, and the final recognition result will be dominated by joints from local body parts. Meanwhile, due to the presence of information from different modalities and spatio-temporal scales within the skeleton, all of which are crucial for behavior recognition, many works have attempted to explore and utilize this information. Shi et al. [12] added the inter-frame difference between the bone flow and keypoint flow in 2s-AGCN as information for keypoint motion flow and bone motion flow. The Shift-GCN network proposed by Cheng et al. [17] performs more processing on the original data, extracting frame differences as dynamic information between keypoints and bones based on keypoint coordinates and bone vectors, and these four different forms of data are used as inputs to jointly predict category features.
Li et al. [18] performed higher-order transformations on the original skeleton data and employed a multi-stream network to fuse high-order information such as joint and bone information at the decision level, further enhancing the model’s performance. However, these methods did not take into account spatial granularity characteristics during human behavior. From the perspective of human kinematics, the recognition of certain actions relies on the characteristics between distant joints, while the identification of other similar actions is more dependent on subtle movement differences between local joints.
To address the aforementioned issues, a multi-scale spatio-temporal graph convolutional method integrating multi-granularity characteristics was proposed for human action recognition. The input skeleton data were initialized into data streams of different granularities to guide the network in learning the differences between similar actions. Additionally, a cross-scale fusion module was constructed for feature fusion among different granularities. By adopting a method of constructing multi-scale adjacency matrices, the adjacency matrices at different spatial scales were subtracted from each other to build sparse adjacency matrices, thereby solving the problem of biased weighting in the process of multi-scale spatial modeling. An end-to-end multi-scale graph convolutional network integrating multi-granularity characteristics was then constructed, and the feasibility of the proposed algorithm was validated on the MSR Action 3D dataset, which is publicly available for action recognition.

2. Skeleton-Based Behavior Recognition Based on Graph Convolution Networks

This section begins by introducing the fundamental principles of ST-GCNs for skeleton-based behavior recognition. It then analyzes the bias weighting issue in graph convolutional networks, whereby the extraction of spatial features from skeletons tends to prioritize adjacent joints, making it difficult to capture dependencies between distant joints, and finally proposes a solution to this problem.

2.1. Spatio-Temporal Graph Convolutional Network

A GCN is widely used in the modeling of human skeleton data. In this method, the human skeleton is generally represented as a spatio-temporal graph $G = (V, E)$ with $N$ joints and $T$ frames, where $V$ represents the joints of the skeleton and $E$ represents the edges connecting the human joints. The skeleton coordinates of human actions can be expressed as $X \in \mathbb{R}^{C \times T \times N}$, where $C$ is the number of channels, $T$ is the number of frames in the video, and $N$ is the number of joints in the human skeleton. The GCN-based model mainly consists of two parts: spatial graph convolution and temporal convolution.
In the spatial dimension, the graph convolution operation extracts the features of any joint $v_{ti}$ in the skeleton graph as

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}\big(p(v_{tj}, v_{ti})\big) \cdot w(v_{tj}, v_{ti}) \quad (1)$$

where $f_{in}$ and $f_{out}$ represent the input and output features, respectively; $B(v_{ti}) = \{ v_{tj} \mid r(v_{tj}, v_{ti}) \le R \}$ represents the set of neighboring joints of $v_{ti}$, with $R$ controlling the range of neighboring joints selected; $Z_{ti}(v_{tj}) = \lvert \{ v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \} \rvert$ is the normalization term; and $w$ is the weighting function of neighboring joints.
The graph convolution operation in the temporal domain can be extended from that in the spatial domain by introducing a parameter $\Gamma$ as the size of the temporal convolution kernel, which controls the temporal range of the neighbor set. With the temporal dimension introduced, the neighbor set across both the spatial and temporal dimensions can be expressed as

$$B(v_{ti}) = \big\{ v_{aj} \;\big|\; d(v_{tj}, v_{ti}) \le K, \; |a - t| \le \lfloor \Gamma/2 \rfloor \big\} \quad (2)$$

The corresponding label mapping for its neighboring joints is

$$l_{ST}(v_{aj}) = l_{ti}(v_{tj}) + \big(a - t + \lfloor \Gamma/2 \rfloor\big) \times K \quad (3)$$

where $l_{ti}(v_{tj})$ represents the label mapping at $v_{ti}$ in the single-frame case.
Therefore, for the skeleton input defined by feature $X$ and graph structure $A$, the output of the network after one layer of graph convolution can be represented as

$$f_{out} = \sigma\big( D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}} f_{in} W \big) \quad (4)$$

In this formula, $\tilde{A} = A + I$ represents the skeleton graph structure of the human body, in which the connections between joints are encoded by the $N \times N$ adjacency matrix $A$ and the identity matrix $I$; $D$ is the joint degree matrix, a diagonal matrix whose diagonal elements indicate the number of edges connected to each joint; $D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}}$ represents the normalized skeleton structure; $W$ represents the learnable weight matrix of the network; and $\sigma(\cdot)$ denotes the activation function.
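As a concrete illustration of Formula (4), the sketch below implements one spatial graph-convolution layer with symmetric normalization, followed by a temporal convolution of kernel size Γ, in the spirit of an ST-GCN block. This is a minimal sketch for illustration only: the tensor layout (batch, channels, frames, joints), the layer names, and the use of a 2D convolution for the temporal step are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalGCNLayer(nn.Module):
    """One ST-GCN-style layer: f_out = sigma(D^{-1/2} (A + I) D^{-1/2} f_in W),
    followed by a temporal convolution over Gamma frames (Formula (4) plus a TCN)."""

    def __init__(self, in_channels, out_channels, A, gamma=9):
        super().__init__()
        A_tilde = A + torch.eye(A.size(0))                  # A~ = A + I (add self-loops)
        deg = A_tilde.sum(dim=1)                            # joint degrees (diagonal of D)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("A_norm", D_inv_sqrt @ A_tilde @ D_inv_sqrt)
        self.W = nn.Conv2d(in_channels, out_channels, kernel_size=1)           # learnable W
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(gamma, 1), padding=(gamma // 2, 0))  # temporal kernel Gamma
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, C, T, N) -- channels, frames, joints
        x = self.W(x)                                       # per-joint linear transform
        x = torch.einsum("nctv,vw->nctw", x, self.A_norm)   # aggregate features over neighbors
        return self.relu(self.tcn(self.relu(x)))            # temporal convolution over frames

# usage with a hypothetical 20-joint skeleton graph
A = torch.zeros(20, 20)        # fill with the skeleton's edges in practice
layer = SpatialTemporalGCNLayer(3, 64, A)
out = layer(torch.randn(8, 3, 30, 20))   # 8 clips, 3 channels, 30 frames, 20 joints
print(out.shape)               # torch.Size([8, 64, 30, 20])
```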

2.2. Analysis of Bias Weighting Problem Methods

Existing methods use high-order polynomials of the adjacency matrix to aggregate multi-scale spatial structural information at different moments. Based on Formula (4), the iteration rule for high-order matrices is as follows:

$$f_{out} = \sigma\Big( \sum_{k=0}^{K} D_{(k)}^{-\frac{1}{2}} \tilde{A}^{k} D_{(k)}^{-\frac{1}{2}} f_{in} W_{(k)} \Big) \quad (5)$$

where $K$ is the highest power of the adjacency matrix and $\tilde{A}^{k}$ represents the $k$-th power of $\tilde{A}$.
In a graph convolutional network, the entries of the $K$-order adjacency matrix count the paths of length $K$ between two joints.
As can be seen from the above equation, because cycles exist between joints, there are more paths to joints that are closer to the current joint (at a distance less than $K$) than to joints that are exactly $K$ steps away. This causes the network to assign greater weight to closer joints during iteration. Therefore, when conducting multi-scale modeling in the spatial domain, the aggregated features are dominated by the motion information of local body parts, making it difficult for the network to effectively capture the dependencies between distant joints. We refer to this phenomenon as the bias weighting issue. Since some human behaviors involve coordinated movements between distant joints, the bias weighting phenomenon is clearly detrimental to the recognition of such behaviors.
To address the bias weighting issue mentioned above, reference [19] proposes a multi-scale adjacency matrix, and the construction of the adjacency matrix is redefined as follows:

$$[\tilde{A}_{(k)}]_{i,j} = \begin{cases} 1 & \text{if } d(v_i, v_j) = k, \\ 1 & \text{if } i = j, \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

where $d(v_i, v_j)$ gives the shortest distance between two joints $v_i$ and $v_j$. By setting different values of $k$, we can obtain adjacency matrices of different scales. The $k$-order adjacency matrix can also be calculated as

$$\tilde{A}_{(k)} = I + \vartheta\big(\tilde{A}^{k} \ge 1\big) - \vartheta\big(\tilde{A}^{k-1} \ge 1\big) \quad (7)$$

where $\vartheta(\cdot)$ assigns the value 1 to all matrix entries greater than or equal to 1. Replacing $\tilde{A}^{k}$ in Equation (5) with $\tilde{A}_{(k)}$, we obtain

$$f_{out} = \sigma\Big( \sum_{k=0}^{K} D_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} D_{(k)}^{-\frac{1}{2}} f_{in} W_{(k)} \Big) \quad (8)$$

where $D_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} D_{(k)}^{-\frac{1}{2}}$ represents the normalized $k$-order adjacency matrix. In this paper, we adopt this approach of subtracting the ($k{-}1$)-order matrix from the $k$-order adjacency matrix to eliminate the bias weighting problem of the original modeling approach, which allows the model to better capture the relationships needed for action categories that depend more on distant joints.
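For clarity, the following NumPy sketch shows one way to construct the sparse multi-scale adjacency matrices of Formulas (6) and (7) and the normalization used in Formula (8); ϑ(·) is implemented as a simple threshold. The function names and the binary input adjacency matrix are illustrative assumptions, not code from the paper.

```python
import numpy as np

def k_hop_adjacency(A, K):
    """Build A~_(k) for k = 1..K, where A~_(k)[i, j] = 1 iff the shortest
    distance between joints i and j equals k (plus self-loops), i.e.
    A~_(k) = I + theta(A~^k >= 1) - theta(A~^{k-1} >= 1)  (Formula (7))."""
    N = A.shape[0]
    I = np.eye(N)
    A_tilde = A + I
    mats = []
    prev = (np.linalg.matrix_power(A_tilde, 0) >= 1).astype(float)      # 0-hop reachability (= I)
    for k in range(1, K + 1):
        curr = (np.linalg.matrix_power(A_tilde, k) >= 1).astype(float)  # <= k-hop reachability
        A_k = np.clip(I + curr - prev, 0.0, 1.0)   # keep only joints at exactly distance k
        mats.append(A_k)
        prev = curr
    return mats

def normalize(A_k):
    """Symmetric normalization D_(k)^{-1/2} A~_(k) D_(k)^{-1/2} from Formula (8)."""
    deg = A_k.sum(axis=1)
    deg[deg == 0] = 1.0                      # guard against isolated rows
    D_inv_sqrt = np.diag(deg ** -0.5)
    return D_inv_sqrt @ A_k @ D_inv_sqrt
```

Because each A~_(k) keeps only the joints at exactly distance k, distant joints are no longer outweighed by the many short walks that dominate the plain k-th power of the adjacency matrix.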
As shown in Figure 1, (a–c) represent the topological diagrams of the first-order, second-order, and third-order adjacency matrices used to connect human skeletal joints in a multi-scale spatial model. As the order of the adjacency matrix increases, joints closer to the current joint are assigned greater weights (the darker the color, the greater the weight assigned to the joint). In particular, when new joints are introduced to refine human skeletal features, the distance between two original joints may increase because of the newly added joints, so the weights the two joints assign to each other are further reduced. Figure 1d–f represent the topological graphs after constructing the multi-scale adjacency matrix; the adjacency matrix is now reasonably sparsified, enabling the model to assign equal weights to distant joints and better capture the relationships between joints that are farther apart.

3. Improved Graph Convolutional Human Behavior Recognition Algorithm

This section begins by introducing the proposed behavior recognition approach based on a multi-scale spatio-temporal graph convolutional network (MS-TGCN) that incorporates multi-granularity features, along with an overview of the network’s overall architecture. Next, we present a multi-granularity skeleton segmentation strategy tailored for recognizing similar behaviors. Finally, a cross-scale feature fusion layer (CSFL) is designed to integrate multiple skeleton features of different granularities.

3.1. The Multi-Scale Spatio-Temporal Graph Convolution Network Incorporating Multi-Granularity Features

In order to fully consider the granularity features of human behavior and leverage their advantages in different behavior recognition processes, this paper proposes a multi-scale spatio-temporal graph convolutional network incorporating multi-granularity features for human behavior recognition; the network model framework is shown in Figure 2. Firstly, joint information of the human body is initialized into data streams of different granularity sizes; considering some behaviors within the dataset are highly similar, it is necessary to refine the joint data to capture more subtle semantic information between behaviors. Secondly, the refined data is fed into the MS-TGCN block to extract its spatio-temporal features. Then, the obtained output features are fed into the CSFL to blend coarse and fine-grained features, to capture the differences in features between similar behaviors. Finally, the fused features are fed into the MS-TGCN layer to further extract their spatio-temporal features and obtain the classification results.
The framework of the multi-scale spatio-temporal convolution network is shown in Figure 2b. Firstly, the normalized multi-granularity data stream is fed into the MS-TGCN, and the topological relationships between joints are reconstructed in the spatial domain using the multi-scale adjacency matrix mentioned above. By setting different K values, spatial feature fusion is performed on joints at different distances. Secondly, the data are input into two multi-scale time convolutional layers with different step sizes to capture broader temporal contextual features. Finally, the residual module is used to connect the input and output. The MS-TCN and MS-GCN modules utilized in this context correspond to the respective modules mentioned in reference [19].

3.2. Fine-Grained Skeleton Construction Strategy

Due to the high degree of overlap between similar behaviors in the spatio-temporal domain, traditional graph convolutional models have difficulty capturing the semantic information that truly distinguishes categories and learning accurate representations. In order to accurately depict fine-grained human behavior, this paper introduces a multi-granularity feature-learning method, initializing the human skeleton graph into different granularity levels (a fine-grained skeleton refers to a human skeleton graph composed of more, finer joints), as shown in Figure 3. This involves expanding the connections in coarse-grained graphs into tighter connections in fine-grained graphs, enabling fine-grained graphs to represent refined semantic information.
A single joint in the fine-grained skeleton graph is supplemented by averaging multiple adjacent joints in the coarse-grained skeleton graph using a two-dimensional average pooling method. The overall representation of the fine-grained skeleton graph is then obtained through concatenation operations. The formula for multi-granularity initialization is expressed as
$$V_c^{k} = \mathrm{pooling}\big(V_f^{1} + V_f^{2} + \cdots + V_f^{h}\big), \quad k \le h \quad (9)$$

$$Graph_{new} = \mathrm{concat}\big(V_c^{1}, V_c^{2}, \ldots, V_c^{k}\big) \quad (10)$$
where $V_c^{k}$ represents the joint information of the $k$ supplemented joints in the fine-grained skeleton graph, $V_f^{h}$ represents the joint information of the $h$ adjacent joints in the coarse-grained skeleton graph, and $Graph_{new}$ represents the physical skeleton of the fine-grained graph.
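A minimal NumPy sketch of this multi-granularity initialization (Formulas (9) and (10)) is given below, assuming each inserted joint is the average of a small group of adjacent coarse-grained joints; the joint-index groups used here are placeholders and do not correspond to the exact insertion scheme of Figure 3.

```python
import numpy as np

def refine_skeleton(x, insert_groups):
    """Insert fine-grained joints by average pooling adjacent coarse-grained joints.
    x: (C, T, N) coordinates of the coarse skeleton.
    insert_groups: list of index lists; each new joint is the mean of the coarse
    joints in one group (Formula (9)); the result is concatenated (Formula (10))."""
    new_joints = [x[:, :, g].mean(axis=2, keepdims=True) for g in insert_groups]
    return np.concatenate([x] + new_joints, axis=2)

# example: refine a 20-joint MSR Action 3D skeleton to 23 joints by inserting joints
# near the waist, a wrist, and a calf (the index groups below are illustrative only)
coarse = np.random.randn(3, 40, 20)                    # 3 channels, 40 frames, 20 joints
fine = refine_skeleton(coarse, [[1, 2], [9, 11], [13, 15]])
print(fine.shape)                                      # (3, 40, 23)
```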

3.3. Cross-Scale Feature Fusion

To achieve feature fusion between coarse and fine granularities, fine-grained features are used to guide the original-granularity features to learn discriminative feature expressions between similar behaviors. Inspired by reference [12], this paper proposes an adaptive cross-scale feature fusion module, as shown in Figure 4.
Specifically, a normalized Gaussian function is embedded in the network to calculate the feature-mapping relationship between the two scales and generate a cross-scale feature fusion matrix. The specific operation is as follows:

$$f(v_i, v_j) = \frac{\exp\big(\psi^{T}(v_i)\,\theta(v_j)\big)}{\sum_{j=1}^{N} \exp\big(\psi^{T}(v_i)\,\theta(v_j)\big)} \quad (11)$$

where $\psi(v_i) = W_\psi v_i$ and $\theta(v_j) = W_\theta v_j$ represent embedding operations, while $W_\psi$ and $W_\theta$ are the corresponding weight parameters.
Taking the 20-joint coarse-grained skeleton information and the 25-joint fine-grained skeleton information as examples, the feature dimensions of the two input granularity features $f_1$ and $f_2$ are $C \times T \times V_{20}$ and $C \times T \times V_{25}$, respectively, where $C$ represents the number of channels of the embedded Gaussian function. A $1 \times 1$ convolution is applied to each of the two granularity features, the results are reshaped, and matrix multiplication is performed. Finally, an adaptive transformation matrix is obtained through a softmax classifier as follows:

$$A_{f_1, f_2} = \mathrm{softmax}\big(f_1^{T} W_\psi^{T} W_\theta f_2\big) \in [0, 1] \quad (12)$$

This adaptive transformation matrix can dynamically adjust the mapping relationship between features of different granularities, and the fused feature $\tilde{X}_{out}$ after scale fusion can be represented as

$$\tilde{X}_{out} = \lambda\, GCN\big(A_{f_1, f_2}, f_2\big) + f_1 \quad (13)$$
G C N ( A f 1 , f 2 , f 2 ) represents the fused features obtained through the graph convolution operation using the transformation matrix A f 1 , f 2 on a 25-joint scale. Studies have shown that the output feature maps from the shallow layers of the network can improve the quality of semantic segmentation and capture finer details [20,21]. This is because the deep feature maps of the graph convolution network often focus on high-level semantic information, while the local detail information of various skeleton parts usually exists in the shallow features; as the network goes deeper, these local details are gradually destroyed or even completely lost. Therefore, we choose to perform cross-scale feature fusion after a multi-scale graph convolution of the data, and introduce a hyperparameter λ in the fusion process to adjust the fusion ratio reasonably.
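The following PyTorch sketch shows one possible realization of the cross-scale fusion layer described by Formulas (11)–(13): 1 × 1 convolutions act as the embeddings ψ and θ, their reshaped outputs are multiplied and passed through a softmax to obtain the adaptive transformation matrix, and the mapped fine-grained features are blended into the coarse stream with ratio λ. The embedding dimension, layer names, and the use of a 1 × 1 convolution as the graph-convolution weight are our assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusionLayer(nn.Module):
    """Fuse coarse features f1 (V1 joints) with fine features f2 (V2 joints) via
    A_{f1,f2} = softmax(f1^T W_psi^T W_theta f2) (Formula (12)) and
    X_out = lambda * GCN(A_{f1,f2}, f2) + f1 (Formula (13))."""

    def __init__(self, channels, embed_channels=64, lam=0.1):
        super().__init__()
        self.psi = nn.Conv2d(channels, embed_channels, kernel_size=1)    # W_psi
        self.theta = nn.Conv2d(channels, embed_channels, kernel_size=1)  # W_theta
        self.gcn_w = nn.Conv2d(channels, channels, kernel_size=1)        # graph-conv weight
        self.lam = lam                                                   # fusion ratio lambda

    def forward(self, f1, f2):
        # f1: (B, C, T, V1) coarse stream; f2: (B, C, T, V2) fine stream
        B, C, T, V1 = f1.shape
        V2 = f2.shape[-1]
        e1 = self.psi(f1).permute(0, 2, 3, 1).reshape(B * T, V1, -1)     # (BT, V1, Ce)
        e2 = self.theta(f2).permute(0, 2, 1, 3).reshape(B * T, -1, V2)   # (BT, Ce, V2)
        A = F.softmax(torch.bmm(e1, e2), dim=-1)                         # (BT, V1, V2), Formula (12)
        f2_flat = f2.permute(0, 2, 1, 3).reshape(B * T, C, V2)           # (BT, C, V2)
        mapped = torch.bmm(f2_flat, A.transpose(1, 2))                   # project fine joints onto V1
        mapped = mapped.reshape(B, T, C, V1).permute(0, 2, 1, 3)         # back to (B, C, T, V1)
        return self.lam * self.gcn_w(mapped) + f1                        # Formula (13)

# usage with 20-joint coarse and 25-joint fine feature maps
csfl = CrossScaleFusionLayer(channels=96, lam=0.1)
out = csfl(torch.randn(4, 96, 30, 20), torch.randn(4, 96, 30, 25))
print(out.shape)   # torch.Size([4, 96, 30, 20])
```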

4. Experiment and Result Analysis

4.1. Experimental Dataset

The MSR Action 3D dataset contains the 3D coordinates of 20 human skeleton joints collected with a Kinect v1 depth camera (Microsoft, USA). It consists of 10 subjects each performing 20 actions, with each action repeated 2 to 3 times. The number of frames in each action sequence ranges from 10 to 100, giving a total of 567 action sequence samples. Because this dataset contains highly similar actions, it serves as an excellent benchmark for validating the effectiveness of the algorithm proposed in this paper. A cross-subject validation protocol is used to test the performance of the model, where subjects 1, 3, 5, 7, and 9 are used for training and subjects 2, 4, 6, 8, and 10 are used for testing.
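A small helper illustrating this cross-subject protocol is given below, assuming each loaded sample carries its subject ID; the sample representation is a placeholder, not the paper's data loader.

```python
# Cross-subject split for MSR Action 3D: odd-numbered subjects for training,
# even-numbered subjects for testing (subject IDs are assumed to be known per sample).
TRAIN_SUBJECTS = {1, 3, 5, 7, 9}
TEST_SUBJECTS = {2, 4, 6, 8, 10}

def split_samples(samples):
    """samples: iterable of (skeleton_array, label, subject_id) tuples."""
    train = [s for s in samples if s[2] in TRAIN_SUBJECTS]
    test = [s for s in samples if s[2] in TEST_SUBJECTS]
    return train, test
```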

4.2. Experimental Environment and Settings

This experiment is implemented with the multi-scale spatio-temporal graph convolutional network incorporating multi-granularity features shown in Figure 2. The backbone is a stack of three multi-scale spatio-temporal graph convolutional layers with input/output channels of (3, 96), (96, 192), and (192, 384). In Figure 2, "Initialization" denotes fine-grained data initialization, CSFL the cross-scale fusion layer, GAP the global average pooling layer, and FC the fully connected layer. The network is trained with a batch size of 64 for 150 epochs. The initial learning rate is set to 0.1 and is reduced to one tenth at epochs 80 and 120. The dropout rate is set to 0.25, and the weight decay is set to 0.0001. These training parameters follow those of the comparison algorithms to facilitate comparisons of action recognition accuracy, and they also align with standard conventions for neural network training.
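These settings can be reproduced with a standard stochastic-gradient-descent setup; the sketch below shows one such configuration with a step-wise learning-rate schedule. The choice of SGD, its momentum value, and the `model`/`train_loader` objects are assumptions for illustration; the paper does not specify the optimizer.

```python
import torch
import torch.nn.functional as F

# `model` (the network of Figure 2) and `train_loader` (MSR Action 3D, batch size 64)
# are assumed to be defined elsewhere; only the training schedule itself is shown here.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# reduce the learning rate to one tenth at epochs 80 and 120; train for 150 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 120], gamma=0.1)

for epoch in range(150):
    for skeletons, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```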

4.3. Experimental Results and Analysis

4.3.1. Comparative Experiment Using Unbiased Weighting Method

To verify the effectiveness of the proposed multi-scale adjacency matrix method, this paper designs an experiment to compare the performance differences of the model before and after the introduction of this method. The experiment uses a stacked three-layer MS-TGCN network, where MS-TGCN-D represents multi-scale spatio-temporal graph convolution after applying the multi-scale adjacency matrix method, and the maximum value of the adjacency matrix for spatial positional relationships in the MSR Action 3D dataset is set to K = 10.
As shown in Table 1, when only using the MS-TGCN network, the accuracy of behavior recognition roughly shows a decreasing trend with a continuous increase in K value, which well proves the bias weighting problem caused by using high-order adjacency matrices. When using the MS-TGCN-D network, the introduction of the multi-scale adjacency matrix method brings a 2.76% improvement to the network at K = 6. For other values of K, it can also bring improvements ranging from 0.19% to 0.79%, thus verifying the effectiveness of introducing a multi-scale adjacency matrix. However, when K = 8 and K = 10, the accuracy of the network decreases by 0.72% and 0.39%, respectively; this is due to the highly similar characteristics of the action categories in the dataset, and the distant joints contribute little to the recognition performance of the network. If too large a K value is used, the network’s ability to capture the features of distant joints will increase, which does not align with the correlation between joints in most actions, thereby leading to a decrease in recognition accuracy. Therefore, in the subsequent experiments involving multi-granularity feature fusion, the value of K should not be too large; in this paper, K = 6 is selected for verification in the following experiments.

4.3.2. Comparative Experiments on Fusing Multi-Granularity Features

The fine-grained features of human joint information can fully represent refined semantic features during human movement. The 20-joint skeleton data in the MSR Action 3D dataset are refined into 23 and 25 joints using the method proposed in Section 3.2, as shown in Figure 5. Then, comparative experiments are conducted on the skeleton data of different fine-grained levels.
The accuracy of behavior recognition using different granularities is shown in Table 2, and the accuracy for each behavior is shown in Table 3. As can be seen from Table 2, using fine-grained skeletons with either 23 or 25 joints alone cannot improve the overall behavior recognition accuracy; instead, it decreases. This is because the value of K, the order of the adjacency matrix, was selected based on 20 joints. Comparing the results in Table 3, it can be seen that fine-grained data can effectively distinguish some similar behaviors (such as drawing a fork, drawing a circle, and drawing a tick); the accuracy of punching from the side increased from 86.7% to 100%, and the accuracy improvement for the bending action is the highest, reaching 19.5%. This is because the inserted joints are at the waist, wrists, and calves; inserting the wrist joints allows the model to capture the differences in motion features between drawing a fork, drawing a circle, and drawing a tick, while inserting the waist joints helps the model capture the feature expression during the bending process. However, the recognition accuracy of this model decreased for some other behaviors (such as high waving, overhand serve, and hammering). This is because these actions rely heavily on the movement state of the entire arm, and the inserted joints make it easier for the network to capture the movement differences at the front end of the arm, reducing its ability to capture the motion state of the arm near the torso.
The CSFL proposed in this paper blends features of different granularities, allowing the network to fully integrate fine-grained features on the basis of its original performance, thereby improving network performance. The cross-scale feature fusion experiment adopts three settings: fusing 20 joints with 23 joints and 25 joints, respectively, as well as fusing all three simultaneously. In these experiments, the CSFL is integrated into the backbone network (MS-TGCN-D), where different fusion ratio parameters can balance the influence between coarse-grained and fine-grained features. To verify the impact of fusing different granularity data under different fusion ratio parameters on network performance, this paper conducted comparative experiments on the values of λ and the fusion methods. The experimental results are shown in Table 4. According to the experimental results, when the network fuses three types of granularity data and the fusion ratio parameter is set to 0.1, the recognition accuracy reaches 95.67%, which is 0.79% higher than the accuracy without multi-granularity fusion. At this point, the network performance is optimal. This fully demonstrates the effectiveness of a neural network that fuses multi-granularity features.
With a network recognition accuracy of 95.67%, the confusion matrix of the MSR Action 3D dataset is shown in Figure 6. As shown in the figure, the multi-scale spatio-temporal graph convolutional network that incorporates multi-granularity features can improve the accuracy of some similar behaviors relative to the original network, such as drawing a fork, drawing a circle, picking up and throwing, bending down, and swinging a tennis racket; in particular, the recognition accuracy of the bending action reaches 100%, a significant improvement over the original network. However, the accuracy for some behaviors, such as hammering, hand catching, and raising the hand high, has decreased. This is related to the positions of the joints inserted at different levels of granularity. Different actions require fine expression from joints in different parts of the body. The multi-granularity approach proposed in this paper mainly adds joints at the wrists, waist, and lower legs, which is why the discrimination accuracy for actions such as drawing a fork, drawing a circle, picking up and throwing, bending over, and swinging a tennis racket has improved. However, for actions like punching, grabbing, and high waving that primarily involve finger and arm movements, our method did not insert additional joints in these areas, making it difficult to finely express and distinguish these types of actions. Instead, the insertion of joints in other parts of the body diluted the feature weights of the fingers and arms, ultimately decreasing the recognition accuracy for these types of actions.

4.3.3. Comparison Experiment with Other Models

In order to better verify the improvement of the model in behavior recognition performance, this paper compared and analyzed its recognition accuracy with existing behavior recognition methods on the MSR Action 3D dataset. The comparison results are shown in Table 5.
The multi-scale spatio-temporal graph convolutional network proposed in this paper, which integrates multi-granularity features, achieves a behavior recognition accuracy of 95.67% on the MSR Action 3D dataset, and its experimental results are superior to most existing behavior recognition methods. Compared with the methods proposed in references [20,22], the accuracy is improved by 2.04% and 3.77%, respectively; compared with the adaptive skeleton center point method proposed in reference [21], the accuracy is improved by 7.2%; compared with the method combining graph convolution with Long Short-Term Memory (LSTM) networks [10] and the multi-view depth motion map method STACOG [23], the accuracy is improved by 1.17% and 2.27%, respectively; compared with the enhanced data-driven algorithm proposed in reference [24] and the multimodal spatio-temporal action representation method [25], the accuracy is improved by 0.86% and 0.49%, respectively; and compared with the point cloud sequence-based method proposed in reference [26], the accuracy is improved by 3.76%. These comparisons show that the proposed algorithm achieves high recognition accuracy when using 3D human skeleton information for human behavior recognition and is strongly competitive with existing methods.

5. Conclusions

Most human action recognition methods based on graph convolutional networks primarily extract global features from the human action process, and they often lack the ability to capture local differences between similar actions. The varying levels of granularity in human skeleton data can represent different hierarchical semantic characteristics during the action process, and fusing motion features at different granularity levels can effectively improve the network’s performance in recognizing similar actions.
This paper proposes a multi-scale spatio-temporal graph convolutional method that integrates multi-granularity features for human action recognition. By initializing human skeleton data into data streams of different granularities and employing a spatio-temporal graph convolutional network with multi-scale adjacency matrices, the network’s spatio-temporal representation capability is effectively enhanced. Additionally, an adaptive cross-scale fusion layer is introduced to guide the model in learning discriminative feature representations between similar actions using fine-grained features, thereby improving the accuracy of the network in recognizing similar actions.
Experimental results on the MSR Action 3D dataset demonstrate that our proposed algorithm outperforms existing methods in terms of overall accuracy for different action recognition tasks, particularly showing significant improvements in the accuracy of recognizing some highly similar actions. However, there are still many challenges in the field of action recognition. For instance, when dealing with different populations, the skeletal spatial scales of the same action can vary significantly between adults and children, which can affect the model’s recognition accuracy. Normalizing the scale of skeletal spatial structure is an issue that needs to be considered in action recognition. Additionally, the difference in action speed also poses a challenge. Similar body movements performed at different speeds can correspond to different behaviors. For example, slow movements may represent Tai Chi exercises, while fast movements could indicate pushing and shoving or fighting. Considering the temporal fine-grained aspects of actions is another research direction for future action recognition.

Author Contributions

Conceptualization, Y.W. and T.S.; methodology, Y.Y.; investigation and validation, Y.W.; writing—original draft preparation, Z.H.; writing—review and editing, T.S. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the following projects: Basic and Frontier Research Program of Chongqing Science and Technology Bureau (Grant No. CSTC2021JCYJ-MSXMX0348), Science and Technology Research of Chongqing Municipal Education Commission (Grant No. KJQN202305804), the first batch of the young backbone teacher-training plan for Chongqing Public Transport Vocational College.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pilarski, P.M.; Butcher, A.; Johanson, M.; Botvinick, M.M.; Bolt, A.; Parker, A.S. Learned human-agent decision-making, communication and joint action in a virtual reality environment. arXiv 2019, arXiv:1905.02691. [Google Scholar]
  2. Shi, L.; Zhou, Y.; Wang, J.; Wang, Z.; Chen, D.; Zhao, H.; Yang, W.; Szczerbicki, E. Compact global association based adaptive routing framework for personnel behavior understanding. Future Gener. Comput. Syst. 2023, 141, 514–525. [Google Scholar] [CrossRef]
  3. Sudha, M.R.; Sriraghav, K.; Jacob, S.G.; Manisha, S. Approaches and applications of virtual reality and gesture recognition: A review. Int. J. Ambient. Comput. Intell. (IJACI) 2017, 8, 1–18. [Google Scholar] [CrossRef]
  4. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  5. Liu, C.; Ying, J.; Yang, H.; Hu, X.; Liu, J. Improved human action recognition approach based on two-stream convolutional neural network model. Vis. Comput. 2021, 37, 1327–1341. [Google Scholar] [CrossRef]
  6. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with attention-enhanced graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13086–13095. [Google Scholar]
  7. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
  8. Wei, S.; Song, Y.; Zhang, Y. Human skeleton tree recurrent neural network with joint relative motion feature for skeleton based action recognition. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE Computer Society Press: Los Alamitos, CA, USA, 2017; pp. 91–95. [Google Scholar]
  9. Zheng, W.; Li, L.; Zhang, Z.; Huang, Y.; Wang, L. Relational network for skeleton-based action recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo(ICME), Shanghai, China, 8–12 July 2019; IEEE Computer Society Press: Los Alamitos, CA, USA, 2019; pp. 826–831. [Google Scholar]
  10. Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian graph convolution lstm for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  12. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 12026–12035. [Google Scholar]
  13. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA 15–20 June 2019; IEEE Computer Society Press: Los Alamitos, CA, USA, 2019; pp. 3590–3598. [Google Scholar]
  14. Li, C.; Cui, Z.; Zheng, W.; Xu, C.; Yang, J. Spatio-temporal graph convolution for skeleton based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  15. Du, Y.; Wang, W.; Wang, L. Dynamic Skeleton Graph Convolution Networks for Action Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2011–2020. [Google Scholar]
  16. Liao, R.; Zhao, Z.; Urtasun, R.; Zemel, R.S. Lanczosnet: Multi-scale deep graph convolutional networks. arXiv 2019, arXiv:1901.01484. [Google Scholar]
  17. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  18. Li, W.; Liu, X.; Liu, Z.; Du, F.; Zou, Q. Skeleton-based action recognition using multi-scale and multi-stream improved graph convolutional network. IEEE Access 2020, 8, 144529–144542. [Google Scholar] [CrossRef]
  19. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 143–152. [Google Scholar]
  20. Yang, Y.; Deng, C.; Gao, S.; Liu, W.; Tao, D.; Gao, X. Discriminative multi-instance multitask learning for 3D action recognition. IEEE Trans. Multimed. 2017, 19, 519–529. [Google Scholar] [CrossRef]
  21. Ran, X.Y.; Liu, K.; Li, G.; Ding, W.W.; Chen, B. Human action recognition algorithm based on adaptive skeleton center. J. Image Graph. 2018, 23, 519–525. [Google Scholar]
  22. Agahian, S.; Negin, F.; Köse, C. Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition. Vis. Comput. 2019, 35, 591–607. [Google Scholar] [CrossRef]
  23. Bulbul, M.F.; Tabussum, S.; Ali, H.; Zheng, W.; Lee, M.Y.; Ullah, A. Exploring 3D human action recognition using STACOG on multi-view depth motion graphs sequences. Sensors 2021, 21, 3642–3651. [Google Scholar] [CrossRef] [PubMed]
  24. Zhang, C.; Liang, J.; Li, X.; Xia, Y.; Di, L.; Hou, Z.; Huan, Z. Human action recognition based on enhanced data guidance and key node spatial temporal graph convolution. Multimed. Tools Appl. 2022, 81, 8349–8366. [Google Scholar] [CrossRef]
  25. Wu, Q.; Huang, Q.; Li, X. Multimodal human action recognition based on spatio-temporal action representation recognition model. Multimed. Tools Appl. 2022, 81, 16409–16430. [Google Scholar] [CrossRef]
  26. You, K.; Hou, Z.; Liang, J.; Lin, E.; Shi, H.; Zhong, Z. A 4D strong spatio-temporal feature learning network for behavior recognition of point cloud sequences. Multimed. Tools Appl. 2024, 83, 1–19. [Google Scholar] [CrossRef]
Figure 1. Adjacency matrix topology diagram. (ac) respectively represent the topological graphs of first-order, second-order, and third-order adjacency matrices used to connect human skeletal joints, while (df) respectively represent the topological graphs after constructing multi-scale adjacency matrices.
Figure 2. Framework of multi-scale spatio-temporal graph convolutional network model incorporating multi-granularity features. (a) represents the overall framework of the proposed network, and (b) represents the framework of the multi-scale spatio-temporal convolutional module.
Figure 3. Three granularity representation methods for MSR Action 3D. The blue nodes represent the original coarse-grained joints, and the red nodes represent the newly added fine-grained joints.
Figure 4. The structure of cross-scale feature fusion layer (CSFL).
Figure 5. Skeleton graphs of different granularities.
Figure 6. The confusion matrix of the MSR Action 3D dataset. The darker the background color of each grid in the figure, the higher the recognition rate it represents.
Table 1. Comparison of training accuracy using multi-scale adjacency matrix method (%).
Model Method    K = 2    K = 3    K = 4    K = 5    K = 6    K = 8    K = 10
MS-TGCN         94.09    93.31    92.52    92.91    92.12    92.91    92.12
MS-TGCN-D       94.28    93.70    93.31    92.91    94.88    92.13    91.73
Table 2. Accuracy of recognition using different granularity data on MS-TGCN-D network.
Number of Joints    Accuracy (%)
20                  94.88
23                  94.09
25                  93.31
Table 3. Accuracy of each behavior’s recognition using data of different granularities.
MSR Action 3D Behavior Type                   Accuracy of Behavior Recognition (%)
                                              MS-TGCN-D (20 Joints)   MS-TGCN-D (23 Joints)   MS-TGCN-D (25 Joints)
Raise your hand high (HiW)                    100     81.8    81.8
Wave your hand in front of your chest (HoW)   100     100     100
Hammering (H)                                 92.3    75.0    75.0
Hand catch (HCh)                              100     100     100
Forward punch (FP)                            100     94.2    90.9
High throw (HT)                               100     97.6    88.9
Drawing a fork (DX)                           92.3    100     100
Drawing a tick (DT)                           100     100     100
Drawing a circle (DC)                         93.8    93.8    93.8
Hand clap (HCp)                               100     100     100
Two-hand wave (HW)                            100     100     100
Punch from the side (SB)                      86.7    100     100
Bending down (B)                              58.3    77.8    77.8
Kick forward (FK)                             100     100     100
Kick sideways (SK)                            100     100     100
Jogging (J)                                   100     100     100
Swing a tennis racket (TSw)                   93.8    92.1    83.3
Overhand serve (TSr)                          100     95.9    93.8
Swing a golf club (GS)                        100     100     100
Picking up and throwing (PT)                  75      73.6    75
Table 4. Recognition accuracy of MS-TGCN-D (CSFL) model by fusing different numbers of joint points under different proportional parameters.
Fused Granularities (Joints)    Proportional Parameter λ    Accuracy (%)
20                              \                           94.88
20 + 23                         0.1                         94.09
20 + 23                         0.2                         94.28
20 + 23                         0.3                         93.70
20 + 25                         0.1                         93.31
20 + 25                         0.2                         94.88
20 + 25                         0.3                         92.92
20 + 23 + 25                    0.1                         95.67
20 + 23 + 25                    0.2                         92.92
20 + 23 + 25                    0.3                         93.70
Table 5. Comparison of recognition accuracy with other methods on the MSR Action 3D dataset.
Method                  Accuracy (%)
Yang et al. [20]        93.63
Ran et al. [21]         88.47
Agahian et al. [22]     91.90
Zhao et al. [10]        94.50
STACOG [23]             93.40
Zhang et al. [24]       94.81
Wu et al. [25]          95.18
You et al. [26]         91.91
Ours                    95.67
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

