CN116665300A - Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network - Google Patents
- Publication number
- CN116665300A (application CN202310609183.6A)
- Authority
- CN
- China
- Prior art keywords
- skeleton
- feature fusion
- module
- space
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which comprises the following steps: S1, acquiring an original data set of human skeleton action sequences, and performing data preprocessing and data enhancement; S2, processing the preprocessed and enhanced skeleton data to obtain its second-order skeleton information; S3, aggregating the joint motion stream and the bone motion stream to form a limb stream; S4, constructing the space-time self-adaptive feature fusion graph convolution network; S5, inputting the joint stream, bone stream and limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result. The method extracts context information at different scales more fully and combines a larger amount of joint data with more salient features to predict human behaviour, which helps to improve the prediction accuracy of human actions.
Description
Technical Field
The invention relates to the technical field of computer vision and deep learning, in particular to a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network.
Background
Skeleton data are a time series of two-dimensional or three-dimensional coordinates of multiple human skeletal joints, which can be extracted from video images using pose estimation methods or acquired directly with sensor devices. Compared with traditional RGB video recognition methods, action recognition based on skeleton data effectively reduces the influence of interfering factors such as illumination changes, environmental background and occlusion during recognition, and adapts better to dynamic environments and complex backgrounds.
Currently, a typical approach to skeleton-based action recognition is to build graph convolutional networks (GCNs). However, current mainstream GCN-based models have the following disadvantages. (1) Limited feature extraction capability. Modules close to the input (lower modules) have relatively small receptive fields, so higher-level modules view the input skeleton sequence more globally than lower-level modules; for temporal modelling it is therefore difficult to capture skeleton action semantics effectively when every layer of the network uses only a fixed convolution kernel size or dilation rate. (2) The multi-stream fusion of behaviour patterns is simplistic. Classical multi-stream frameworks usually obtain the final prediction by directly adding the softmax scores of each stream, but in practice the predictive quality of the streams differs noticeably; simple score addition rarely yields accurate predictions and the parameter and computation cost is large. (3) Generating an adjacency matrix with semantically meaningful edges is particularly important for this task; traditional spatial topologies are constrained by physical connectivity, and edge extraction remains a challenging problem.
Disclosure of Invention
The invention aims to solve the above problems and provides a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which extracts context information at different scales more fully, combines a larger amount of joint data with more salient features to predict human behaviour without increasing the computational cost, and helps to improve the prediction accuracy of human actions.
In order to solve the above technical problems, the technical solution of the invention is as follows:
A skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network comprises the following steps:
S1, performing data preprocessing and data enhancement on a large-scale human action recognition data set of original skeleton action sequences;
S2, processing the enhanced skeleton data X ∈ R^(C×T×N) to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively.
The second-order skeleton information consists of the bone stream X_bone, the joint motion stream X_joint-motion and the bone motion stream X_bone-motion, computed as follows:
X_bone = x[:, :, i] - x[:, :, i_nei], i = 1, 2, ..., N, x ∈ X_joint
X_joint-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_joint
X_bone-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_bone
where i denotes the i-th joint, i_nei denotes the joint adjacent to the i-th joint in the same frame, and t denotes the t-th frame of the sequence.
S3, the original data set contains 25 human joints; the joint motion stream and the bone motion stream are aggregated along the channel dimension to form the limb stream, which contains only the 22 joints located on the limbs.
S4, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result.
Preferably, the space-time self-adaptive feature fusion graph convolution network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence; the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence.
Preferably, each layer of the feature extraction module comprises a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module connected in sequence; meanwhile, the skeleton data X_in is input to a 1×1 convolution layer whose output is multiplied by the output of the spatial attention graph convolution module and fed to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
Preferably, the spatial attention graph convolution module comprises two parallel branches, each consisting of a 1×1 convolution layer and a temporal pooling module; the pooled outputs of the two branches are subtracted, passed sequentially through a Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map. Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
Preferably, the temporal adaptive feature fusion module comprises four branches, an attention feature fusion module M and its complement 1-M; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches further contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer. ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1
The outputs of the four branches are aggregated by a Concat function to obtain the multi-scale temporal feature X_1. The initial skeleton data X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by the initial skeleton feature X to obtain the initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain the temporal attention fusion feature. The initial attention fusion feature and the temporal attention fusion feature are added to output the feature X', which can be expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·).
Preferably, the data preprocessing divides the whole skeleton sequence into 20 segments on average during training and randomly selects one frame from each segment to form a new 20-frame sequence.
Preferably, the data enhancement randomly rotates the three-dimensional skeleton sequence during training to improve robustness to view changes.
The invention has the following characteristics and beneficial effects:
the method adopts a space-time self-adaptive feature fusion graph convolution network model, firstly carries out multi-scale self-adaptive feature extraction on the aggregated space-time topology to obtain a larger receptive field, and then utilizes an attention mechanism to carry out feature fusion. The time modeling module can adaptively realize topology feature fusion and help complete modeling of actions. Based on the existing multi-stream processing method, a body part-based stream processing method, called limb stream, is proposed, which can realize a richer and finer representation. Under the condition of not increasing the calculated amount, the joint data with more quantity and more obvious characteristics are combined to realize the prediction of the human body behaviors, and the final prediction accuracy of the human body behaviors is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a skeleton action recognition method based on a space-time adaptive feature fusion graph convolution network;
FIG. 2 is a diagram of a spatio-temporal adaptive feature fusion graph convolution network framework of the present invention;
FIG. 3 is a schematic diagram of a spatial-temporal adaptive feature fusion graph convolution network model according to the present invention;
FIG. 3 (a) is a schematic diagram of the structure of a single flow state input of the spatio-temporal adaptive feature fusion graph convolution network of the present invention;
FIG. 3 (b) is a schematic structural diagram of the feature extraction module of the present invention;
FIG. 3 (c) is a schematic structural diagram of the spatial attention graph convolution module of the present invention;
fig. 3 (d) is a schematic structural diagram of the time adaptive feature fusion module of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention provides a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, as shown in figs. 1-3, comprising the following steps:
S1, performing data preprocessing and data enhancement on a large-scale human action recognition data set of original skeleton action sequences.
In one embodiment, the data set is a publicly available skeleton data set captured with depth-sensing cameras; it contains 56,800 skeleton action sequences covering 60 action categories, and each human body has 25 skeleton joints. The data preprocessing divides the whole skeleton action sequence into 20 segments on average during training and randomly selects one frame from each segment to form a new 20-frame sequence. The data enhancement randomly rotates the three-dimensional skeleton action sequence during training to improve robustness to view changes.
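For clarity, a minimal sketch of this frame sampling and rotation augmentation is given below; it assumes a (C, T, N) array layout and a rotation about the vertical axis with a ±30° range, both of which are illustrative assumptions rather than details taken from the original disclosure.

```python
import numpy as np

def sample_20_frames(skeleton, num_segments=20):
    """Split a (C, T, N) skeleton sequence into equal segments and randomly
    pick one frame per segment (the temporal sampling described in S1)."""
    C, T, N = skeleton.shape
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    picks = [np.random.randint(lo, max(lo + 1, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]
    return skeleton[:, picks, :]                      # (C, 20, N)

def random_rotate(skeleton, max_deg=30.0):
    """Randomly rotate 3D joint coordinates about the vertical axis
    (one possible form of the view-change augmentation)."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    return np.einsum('ij,jtn->itn', rot, skeleton)    # expects C == 3 (x, y, z)

# Example: one NTU-style sample with 3 channels, 300 frames, 25 joints.
x = np.random.randn(3, 300, 25).astype(np.float32)
x = random_rotate(sample_20_frames(x))
print(x.shape)  # (3, 20, 25)
```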
S2, the enhanced skeleton data X ∈ R^(C×T×N) is processed to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively.
It should be noted that the processing of the enhanced skeleton data X consists of forming vector differences between pairs of nodes from the joint data. This processing method is a conventional technique and is therefore not described in detail.
Specifically, the second-order skeleton information comprises the bone stream X_bone, the joint motion stream X_joint-motion and the bone motion stream X_bone-motion, computed as follows:
X_bone = x[:, :, i] - x[:, :, i_nei], i = 1, 2, ..., N, x ∈ X_joint
X_joint-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_joint
X_bone-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_bone
where i denotes the i-th joint, i_nei denotes the joint adjacent to the i-th joint in the same frame, and t denotes the t-th frame of the sequence.
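The bone and motion streams defined above can be computed directly from the joint coordinates. The sketch below assumes a (C, T, N) tensor; the parent list standing in for the neighbouring joints i_nei is an illustrative choice for a 25-joint NTU-style skeleton, not the topology fixed by the invention.

```python
import torch

# Illustrative parent indices for a 25-joint NTU-style skeleton; the actual
# neighbour pairs i_nei are defined by the dataset's skeleton topology.
PARENTS = [1, 20, 20, 2, 20, 4, 5, 6, 20, 8, 9, 10,
           0, 12, 13, 14, 0, 16, 17, 18, 1, 22, 7, 24, 11]

def bone_stream(x_joint, parents=PARENTS):
    """X_bone[:, :, i] = x[:, :, i] - x[:, :, i_nei] for every joint i."""
    return x_joint - x_joint[:, :, parents]

def motion_stream(x):
    """X_motion[:, t, :] = x[:, t+1, :] - x[:, t, :]; the last frame is zero-padded."""
    motion = torch.zeros_like(x)
    motion[:, :-1, :] = x[:, 1:, :] - x[:, :-1, :]
    return motion

x_joint = torch.randn(3, 20, 25)          # (C, T, N) joint stream
x_bone = bone_stream(x_joint)             # second-order: bone stream
x_joint_motion = motion_stream(x_joint)   # joint motion stream
x_bone_motion = motion_stream(x_bone)     # bone motion stream
```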
For the skeleton action recognition task, the first-order skeleton information (joint coordinates), the second-order skeleton information (directions and lengths of bones) and their motion information all contribute to recognition, and combining more data with more salient features improves recognition accuracy.
S3, the original data set contains 25 human joints; the joint motion stream and the bone motion stream are aggregated to form the limb stream, which contains only the 22 joints located on the four limbs.
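A possible assembly of the limb stream is sketched below. The text does not enumerate the 22 limb joints, so the index list used here (all joints except three assumed trunk joints) is purely hypothetical.

```python
import torch

# Hypothetical choice of the 22 limb joints: all 25 NTU joints except three
# assumed trunk joints (indices 0, 1, 20). The actual selection is defined by the invention.
LIMB_JOINTS = [i for i in range(25) if i not in (0, 1, 20)]

def limb_stream(x_joint_motion, x_bone_motion, limb_joints=LIMB_JOINTS):
    """Aggregate the joint motion and bone motion streams along the channel
    dimension and keep only the limb joints, giving a (2C, T, 22) tensor."""
    fused = torch.cat([x_joint_motion, x_bone_motion], dim=0)   # channel-wise aggregation
    return fused[:, :, limb_joints]

x_limb = limb_stream(torch.randn(3, 20, 25), torch.randn(3, 20, 25))
print(x_limb.shape)  # torch.Size([6, 20, 22])
```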
S4, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result, as shown in fig. 2.
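Score-level fusion by weighted addition can be sketched as follows; the per-stream weights are placeholders that would in practice be tuned or learned, not values from the original disclosure.

```python
import torch

def fuse_scores(stream_logits, weights):
    """Weighted addition of per-stream softmax scores (each tensor: batch x classes)."""
    fused = sum(w * torch.softmax(s, dim=1) for s, w in zip(stream_logits, weights))
    return fused.argmax(dim=1)          # final predicted class per sample

joint_logits, bone_logits, limb_logits = (torch.randn(4, 60) for _ in range(3))
pred = fuse_scores([joint_logits, bone_logits, limb_logits], weights=[0.6, 0.6, 0.4])
print(pred)
```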
In this embodiment, the space-time self-adaptive feature fusion graph convolution network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence; the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence. Ten layers of feature extraction modules are used for the joint and bone streams, and eight layers are used for the limb stream.
As shown in fig. 3 (a), taking a single-stream input as an example, the data of each stream pass through feature extraction modules with different numbers of layers, and finally global average pooling and a fully connected layer produce the output scores.
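Under the assumptions above, the single-stream pipeline of fig. 3 (a) (stacked feature extraction modules, global average pooling, fully connected layer) might be organised as in the following sketch; the inner block is reduced to a plain 1×1 convolution so the example stays runnable, and the channel schedule is the sequence listed in this embodiment.

```python
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """Stand-in for one feature extraction module (spatial attention graph
    convolution + temporal adaptive feature fusion); a plain 1x1 convolution
    is used here only to keep the sketch self-contained."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):                         # x: (batch, C, T, N)
        return torch.relu(self.bn(self.conv(x)))

class SingleStreamNet(nn.Module):
    """Stacked feature extraction modules -> global average pooling -> FC."""
    def __init__(self, in_channels=3, num_classes=60,
                 channels=(64, 64, 64, 64, 128, 256)):   # channel schedule from the text
        super().__init__()
        blocks, c_prev = [], in_channels
        for c in channels:
            blocks.append(SimpleBlock(c_prev, c))
            c_prev = c
        self.backbone = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_prev, num_classes)

    def forward(self, x):                          # x: (batch, C, T, N)
        feat = self.backbone(x).mean(dim=(2, 3))   # global average pooling over T and N
        return self.fc(feat)                       # class scores; softmax applied at fusion time

print(SingleStreamNet()(torch.randn(2, 3, 20, 25)).shape)  # torch.Size([2, 60])
```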
As shown in fig. 3 (b), each layer of the feature extraction module comprises a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module connected in sequence; meanwhile, the skeleton data X_in is input to a 1×1 convolution layer whose output is multiplied by the output of the spatial attention graph convolution module and fed to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
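The wiring just described can be rendered roughly as below; SpatialAttentionGC and TemporalAFF are reduced to lightweight stubs for the modules of fig. 3 (c) and fig. 3 (d) so that the sketch remains self-contained, and the 1×1 projection on the residual path is an implementation assumption for the case where the channel count changes.

```python
import torch
import torch.nn as nn

class SpatialAttentionGC(nn.Module):
    """Stub for the spatial attention graph convolution module of fig. 3 (c)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.proj(x)

class TemporalAFF(nn.Module):
    """Stub for the temporal adaptive feature fusion module of fig. 3 (d)."""
    def forward(self, x):
        return x

class FeatureExtractionBlock(nn.Module):
    """Fig. 3 (b): X_in passes through a 1x1 convolution and is multiplied by the
    output of the spatial attention graph convolution, then BN+ReLU; a residual
    connection adds X_in back before the temporal adaptive feature fusion module."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.gate = nn.Conv2d(c_in, c_out, 1)        # 1x1 convolution applied to X_in
        self.sagc = SpatialAttentionGC(c_in, c_out)
        self.bn_relu = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # 1x1 projection on the residual path when channel counts differ (an assumption).
        self.residual = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.taff = TemporalAFF()

    def forward(self, x):                             # x = X_in: (batch, C, T, N)
        y = self.bn_relu(self.gate(x) * self.sagc(x))
        y = y + self.residual(x)                      # residual connection with X_in
        return self.taff(y)

print(FeatureExtractionBlock(64, 64)(torch.randn(2, 64, 20, 25)).shape)
```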
As shown in fig. 3 (c), the spatial attention graph convolution module takes the skeleton data X_in as input. The module comprises two parallel branches, each consisting of a 1×1 convolution layer and a temporal pooling module; the pooled outputs of the two branches are subtracted, passed sequentially through a Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map. Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
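One possible realisation of this channel-wise topology construction is sketched below. Two points are assumptions not fixed by the text: the temporal pooling TP is taken as a mean over the time axis, and the pooled branch outputs are subtracted pairwise across joints (by broadcasting) so that the resulting feature map has the same N×N shape as the adjacency matrix A, which is itself a placeholder here.

```python
import torch
import torch.nn as nn

class SpatialAttentionGraphConv(nn.Module):
    """Fig. 3 (c): two parallel 1x1-conv + temporal-pooling branches; their pooled
    outputs are subtracted (broadcast over joint pairs), passed through Tanh and a
    1x1 convolution to build Q(X_in), and A_cwt = alpha * Q(X_in) + A."""
    def __init__(self, c_in, c_mid, c_out, num_joints=25):
        super().__init__()
        self.phi = nn.Conv2d(c_in, c_mid, 1)                  # branch 1: 1x1 convolution
        self.psi = nn.Conv2d(c_in, c_mid, 1)                  # branch 2: 1x1 convolution
        self.sigma = nn.Conv2d(c_mid, c_out, 1)               # final 1x1 convolution
        self.alpha = nn.Parameter(torch.zeros(1))             # learnable scale alpha
        self.register_buffer('A', torch.eye(num_joints))      # placeholder adjacency matrix A

    def forward(self, x):                                      # x: (batch, C, T, N)
        tp_phi = self.phi(x).mean(dim=2)                       # temporal pooling TP -> (batch, c_mid, N)
        tp_psi = self.psi(x).mean(dim=2)
        diff = tp_phi.unsqueeze(-1) - tp_psi.unsqueeze(-2)     # pairwise subtraction -> (batch, c_mid, N, N)
        q = self.sigma(torch.tanh(diff))                       # Tanh then 1x1 conv -> (batch, c_out, N, N)
        return self.alpha * q + self.A                         # channel-wise topology A_cwt

a_cwt = SpatialAttentionGraphConv(64, 8, 64)(torch.randn(2, 64, 20, 25))
print(a_cwt.shape)  # torch.Size([2, 64, 25, 25])
```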
As shown in fig. 3 (d), the temporal adaptive feature fusion module comprises four branches; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches further contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer. ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1
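The expression for the intermediate value t is given only in the figures of the original document; the helper below therefore uses an ECA-style rule that grows with log2(C_l) and is consistent with the stated roles of abs, C_l, γ = 2 and b = 1, but the exact formula is an assumption.

```python
import math

def dynamic_kernel_size(c_l, gamma=2, b=1):
    """Assumed ECA-style rule: t grows with log2 of the layer's channel dimension
    C_l, and ks is forced to be odd via ks = t if t % 2 else t + 1."""
    t = int(abs((math.log2(c_l) + b) / gamma))
    return t if t % 2 else t + 1

for c_l in (64, 128, 256):
    print(c_l, dynamic_kernel_size(c_l))   # 64 -> 3, 128 -> 5, 256 -> 5 under this assumed rule
```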
The outputs of the four branches are aggregated by a Concat function to obtain the multi-scale temporal feature X_1. The initial skeleton feature X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by X to obtain the initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain the temporal attention fusion feature. The initial attention fusion feature and the temporal attention fusion feature are added to output the feature X', which can be expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·) to perform the initial feature fusion of the input features X and X_1. After the sigmoid activation function, the output values lie between 0 and 1, and the network learns the respective weights through training.
S5, all experiments in this embodiment are performed under the PyTorch deep learning framework and trained on two NVIDIA A800 GPUs. The training parameters are as follows: the initial learning rate is set to 0.1, the weight decay to 0.0004, the parameters are optimised with stochastic gradient descent (SGD) using a Nesterov momentum of 0.9, the maximum number of training epochs is set to 80, and the learning rate is divided by 10 at the 35th and 55th epochs. Model training is well known to those skilled in the art and is not described in detail here.
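The stated hyperparameters correspond, for instance, to the following PyTorch configuration; the model and data loader are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 60)                       # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0004)
# Divide the learning rate by 10 at the 35th and 55th epochs; train for 80 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 55], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(80):
    # for x, y in train_loader:                 # per-batch training loop (placeholder)
    #     optimizer.zero_grad()
    #     loss = criterion(model(x), y)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()
```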
The following are the specific experiments and their descriptions:
This example is compared with advanced models on the NTU RGB+D 60 and NTU RGB+D 120 data sets; as shown in Tables 1 and 2, our model achieves state-of-the-art results on almost all benchmarks.
Table 1: Comparison of top-1 accuracy (%) with state-of-the-art methods on the NTU RGB+D 60 dataset
Table 2: Comparison of top-1 accuracy (%) with state-of-the-art methods on the NTU RGB+D 120 dataset
S6, this embodiment demonstrates the effectiveness of the multi-modal adaptive feature fusion network; all ablation experiments are performed on the NTU RGB+D 60 and NTU RGB+D 120 Cross-Subject benchmarks.
Table 3 demonstrates the effectiveness of the temporal adaptive feature fusion module of fig. 3 (d).
In Table 3, the backbone network is the CTR-GCN model, on which the present invention is developed; the accuracy is improved by 0.7% and 0.8% on the Cross-Subject benchmarks of NTU RGB+D 60 and NTU RGB+D 120, respectively. The temporal adaptive fusion module therefore guides the model to learn action classification better.
Table 4 compares the performance of the invention with that of conventional independent streams on different data streams.
In Table 4, the advanced method CTR-GCN uses the joint stream, bone stream, joint motion stream and bone motion stream, with reproduced results of 89.8%, 90.2%, 87.4% and 86.9%, respectively. The invention uses the joint stream, bone stream and limb stream; relative to CTR-GCN, the joint stream and bone stream results are improved by 0.4% and 0.5%, respectively, and the limb stream, being a fusion of the joint motion stream and the bone motion stream, performs better than either of them.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments, including their components, without departing from the principles and spirit of the invention, and such variations still fall within the scope of the invention.
Claims (9)
1. A skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which is characterized by comprising the following steps:
S1, acquiring an original data set of human skeleton action sequences, and performing data preprocessing and data enhancement;
S2, processing the skeleton data X ∈ R^(C×T×N) obtained after preprocessing and data enhancement to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively; the second-order skeleton information comprises a bone stream, a joint motion stream and a bone motion stream;
S3, aggregating the joint motion stream and the bone motion stream along the channel dimension to form a limb stream;
S4, constructing a space-time self-adaptive feature fusion graph convolution network, wherein the network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence;
S5, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result.
2. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 1, wherein the preprocessing of the original data set in step S1 is as follows: the whole skeleton action sequence is divided into 20 segments on average, and one frame is randomly selected from each segment to form a new 20-frame sequence.
3. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 2, wherein in step S1 the preprocessed skeleton action sequence is subjected to data enhancement by randomly rotating the skeleton action sequence.
4. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 1, wherein the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence.
5. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 4, wherein each layer of the feature extraction module comprises a 1×1 convolution layer, a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module; the skeleton data X_in is simultaneously input to the 1×1 convolution layer and the spatial attention graph convolution module, the two outputs are multiplied and input to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
6. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 5, wherein the spatial attention graph convolution module comprises two parallel branches, each comprising a 1×1 convolution layer, a temporal pooling module and a Tanh module; the outputs of the temporal pooling modules of the two branches are subtracted, then passed sequentially through the Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map; Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
7. The skeleton action recognition method of claim 5, wherein the temporal adaptive feature fusion module comprises four branches, an attention feature fusion module M and its complement 1-M; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer;
the outputs of the four branches are aggregated by a Concat function to obtain a multi-scale temporal feature X_1; the initial skeleton data X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by the initial skeleton feature X to obtain an initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain a temporal attention fusion feature; the initial attention fusion feature and the temporal attention fusion feature are added to output a feature X', expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·).
8. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 7, wherein ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1.
9. The method of claim 7, wherein the original data set comprises 25 human joints and the limb stream comprises a total of 22 joints located on the four limbs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310609183.6A CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310609183.6A CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116665300A true CN116665300A (en) | 2023-08-29 |
Family
ID=87714718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310609183.6A Pending CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665300A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854155A (en) * | 2024-03-07 | 2024-04-09 | 华东交通大学 | Human skeleton action recognition method and system |
CN117854155B (en) * | 2024-03-07 | 2024-05-14 | 华东交通大学 | Human skeleton action recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |