CN116665300A - Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network - Google Patents
- Publication number
- CN116665300A (application CN202310609183.6A)
- Authority
- CN
- China
- Prior art keywords
- skeleton
- feature fusion
- module
- space
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which comprises the following steps: S1, acquiring an original data set of human skeleton action sequences, and performing data preprocessing and data enhancement; S2, processing the preprocessed and enhanced skeleton data to obtain its second-order skeleton information; S3, aggregating the joint motion stream and the bone motion stream to form a limb stream; S4, constructing the space-time self-adaptive feature fusion graph convolution network; S5, inputting the joint stream, bone stream and limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result. The method extracts context information at different scales more fully and combines a larger amount of joint data with more salient features to predict human behaviour, which helps to improve the prediction accuracy of human actions.
Description
Technical Field
The invention relates to the technical field of computer vision and deep learning, in particular to a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network.
Background
Skeleton data are a time series of two-dimensional or three-dimensional coordinates of multiple human skeletal joints, which can be extracted from video images using pose estimation methods or acquired directly with sensor devices. Compared with traditional RGB video recognition methods, action recognition based on skeleton data effectively reduces the influence of interfering factors such as illumination changes, environmental background and occlusion during recognition, and adapts better to dynamic environments and complex backgrounds.
Currently, a typical approach to skeleton-based action recognition is to build graph convolutional networks (GCNs). However, current mainstream GCN-based models have the following disadvantages. (1) Limited feature extraction capability. Modules close to the input (lower modules) have relatively small receptive fields, so higher-level modules view the input skeleton sequence more globally than lower-level modules; for temporal modelling it is therefore difficult to capture skeleton action semantics effectively when every layer of the network uses only a fixed convolution kernel size or dilation rate. (2) The multi-stream fusion of behaviour patterns is simplistic. Classical multi-stream frameworks usually obtain the final prediction by directly adding the softmax scores of each stream, but in practice the predictive quality of the streams differs noticeably; simple score addition rarely yields accurate predictions and the parameter and computation cost is large. (3) Generating an adjacency matrix with semantically meaningful edges is particularly important for this task; traditional spatial topologies are constrained by physical connectivity, and edge extraction remains a challenging problem.
Disclosure of Invention
The invention aims to solve the above problems and provides a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which extracts context information at different scales more fully, combines a larger amount of joint data with more salient features to predict human behaviour without increasing the computational cost, and helps to improve the prediction accuracy of human actions.
In order to solve the above technical problems, the technical solution of the invention is as follows:
A skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network comprises the following steps:
S1, performing data preprocessing and data enhancement on a large-scale human action recognition data set of original skeleton action sequences;
S2, processing the enhanced skeleton data X ∈ R^(C×T×N) to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively.
The second-order skeleton information consists of the bone stream X_bone, the joint motion stream X_joint-motion and the bone motion stream X_bone-motion, computed as follows:
X_bone = x[:, :, i] - x[:, :, i_nei], i = 1, 2, ..., N, x ∈ X_joint
X_joint-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_joint
X_bone-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_bone
where i denotes the i-th joint, i_nei denotes the joint adjacent to the i-th joint in the same frame, and t denotes the t-th frame of the sequence.
S3, the original data set contains 25 human joints; the joint motion stream and the bone motion stream are aggregated along the channel dimension to form the limb stream, which contains only the 22 joints located on the limbs.
S4, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result.
Preferably, the space-time self-adaptive feature fusion graph convolution network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence; the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence.
Preferably, each layer of the feature extraction module comprises a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module connected in sequence; meanwhile, the skeleton data X_in is input to a 1×1 convolution layer whose output is multiplied by the output of the spatial attention graph convolution module and fed to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
Preferably, the spatial attention graph convolution module comprises two parallel branches, each consisting of a 1×1 convolution layer and a temporal pooling module; the pooled outputs of the two branches are subtracted, passed sequentially through a Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map. Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
Preferably, the temporal adaptive feature fusion module comprises four branches, an attention feature fusion module M and its complement 1-M; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches further contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer. ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1
The outputs of the four branches are aggregated by a Concat function to obtain the multi-scale temporal feature X_1. The initial skeleton data X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by the initial skeleton feature X to obtain the initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain the temporal attention fusion feature. The initial attention fusion feature and the temporal attention fusion feature are added to output the feature X', which can be expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·).
Preferably, the data preprocessing divides the whole skeleton sequence into 20 segments on average during training and randomly selects one frame from each segment to form a new 20-frame sequence.
Preferably, the data enhancement randomly rotates the three-dimensional skeleton sequence during training to improve robustness to view changes.
The invention has the following characteristics and beneficial effects:
the method adopts a space-time self-adaptive feature fusion graph convolution network model, firstly carries out multi-scale self-adaptive feature extraction on the aggregated space-time topology to obtain a larger receptive field, and then utilizes an attention mechanism to carry out feature fusion. The time modeling module can adaptively realize topology feature fusion and help complete modeling of actions. Based on the existing multi-stream processing method, a body part-based stream processing method, called limb stream, is proposed, which can realize a richer and finer representation. Under the condition of not increasing the calculated amount, the joint data with more quantity and more obvious characteristics are combined to realize the prediction of the human body behaviors, and the final prediction accuracy of the human body behaviors is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a skeleton action recognition method based on a space-time adaptive feature fusion graph convolution network;
FIG. 2 is a diagram of a spatio-temporal adaptive feature fusion graph convolution network framework of the present invention;
FIG. 3 is a schematic diagram of a spatial-temporal adaptive feature fusion graph convolution network model according to the present invention;
FIG. 3 (a) is a schematic diagram of the structure of a single flow state input of the spatio-temporal adaptive feature fusion graph convolution network of the present invention;
FIG. 3 (b) is a schematic structural diagram of the feature extraction module of the present invention;
FIG. 3 (c) is a schematic structural diagram of the spatial attention graph convolution module of the present invention;
fig. 3 (d) is a schematic structural diagram of the time adaptive feature fusion module of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention provides a skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, as shown in figs. 1-3, comprising the following steps:
S1, performing data preprocessing and data enhancement on a large-scale human action recognition data set of original skeleton action sequences.
In one embodiment, the data set is a publicly available skeleton data set captured with depth-sensing cameras; it contains 56,800 skeleton action sequences covering 60 action categories, and each human body has 25 skeleton joints. The data preprocessing divides the whole skeleton action sequence into 20 segments on average during training and randomly selects one frame from each segment to form a new 20-frame sequence. The data enhancement randomly rotates the three-dimensional skeleton action sequence during training to improve robustness to view changes.
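For clarity, a minimal sketch of this frame sampling and rotation augmentation is given below; it assumes a (C, T, N) array layout and a rotation about the vertical axis with a ±30° range, both of which are illustrative assumptions rather than details taken from the original disclosure.

```python
import numpy as np

def sample_20_frames(skeleton, num_segments=20):
    """Split a (C, T, N) skeleton sequence into equal segments and randomly
    pick one frame per segment (the temporal sampling described in S1)."""
    C, T, N = skeleton.shape
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    picks = [np.random.randint(lo, max(lo + 1, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]
    return skeleton[:, picks, :]                      # (C, 20, N)

def random_rotate(skeleton, max_deg=30.0):
    """Randomly rotate 3D joint coordinates about the vertical axis
    (one possible form of the view-change augmentation)."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    return np.einsum('ij,jtn->itn', rot, skeleton)    # expects C == 3 (x, y, z)

# Example: one NTU-style sample with 3 channels, 300 frames, 25 joints.
x = np.random.randn(3, 300, 25).astype(np.float32)
x = random_rotate(sample_20_frames(x))
print(x.shape)  # (3, 20, 25)
```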
S2, the enhanced skeleton data X ∈ R^(C×T×N) is processed to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively.
It should be noted that the processing of the enhanced skeleton data X consists of forming vector differences between pairs of nodes from the joint data. This processing method is a conventional technique and is therefore not described in detail.
Specifically, the second-order skeleton information comprises the bone stream X_bone, the joint motion stream X_joint-motion and the bone motion stream X_bone-motion, computed as follows:
X_bone = x[:, :, i] - x[:, :, i_nei], i = 1, 2, ..., N, x ∈ X_joint
X_joint-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_joint
X_bone-motion = x[:, t+1, :] - x[:, t, :], t = 1, 2, ..., T, x ∈ X_bone
where i denotes the i-th joint, i_nei denotes the joint adjacent to the i-th joint in the same frame, and t denotes the t-th frame of the sequence.
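The bone and motion streams defined above can be computed directly from the joint coordinates. The sketch below assumes a (C, T, N) tensor; the parent list standing in for the neighbouring joints i_nei is an illustrative choice for a 25-joint NTU-style skeleton, not the topology fixed by the invention.

```python
import torch

# Illustrative parent indices for a 25-joint NTU-style skeleton; the actual
# neighbour pairs i_nei are defined by the dataset's skeleton topology.
PARENTS = [1, 20, 20, 2, 20, 4, 5, 6, 20, 8, 9, 10,
           0, 12, 13, 14, 0, 16, 17, 18, 1, 22, 7, 24, 11]

def bone_stream(x_joint, parents=PARENTS):
    """X_bone[:, :, i] = x[:, :, i] - x[:, :, i_nei] for every joint i."""
    return x_joint - x_joint[:, :, parents]

def motion_stream(x):
    """X_motion[:, t, :] = x[:, t+1, :] - x[:, t, :]; the last frame is zero-padded."""
    motion = torch.zeros_like(x)
    motion[:, :-1, :] = x[:, 1:, :] - x[:, :-1, :]
    return motion

x_joint = torch.randn(3, 20, 25)          # (C, T, N) joint stream
x_bone = bone_stream(x_joint)             # second-order: bone stream
x_joint_motion = motion_stream(x_joint)   # joint motion stream
x_bone_motion = motion_stream(x_bone)     # bone motion stream
```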
For the skeleton action recognition task, the first-order skeleton information (joint coordinates), the second-order skeleton information (directions and lengths of bones) and their motion information all contribute to recognition, and combining more data with more salient features improves recognition accuracy.
S3, the original data set contains 25 human joints; the joint motion stream and the bone motion stream are aggregated to form the limb stream, which contains only the 22 joints located on the four limbs.
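A possible assembly of the limb stream is sketched below. The text does not enumerate the 22 limb joints, so the index list used here (all joints except three assumed trunk joints) is purely hypothetical.

```python
import torch

# Hypothetical choice of the 22 limb joints: all 25 NTU joints except three
# assumed trunk joints (indices 0, 1, 20). The actual selection is defined by the invention.
LIMB_JOINTS = [i for i in range(25) if i not in (0, 1, 20)]

def limb_stream(x_joint_motion, x_bone_motion, limb_joints=LIMB_JOINTS):
    """Aggregate the joint motion and bone motion streams along the channel
    dimension and keep only the limb joints, giving a (2C, T, 22) tensor."""
    fused = torch.cat([x_joint_motion, x_bone_motion], dim=0)   # channel-wise aggregation
    return fused[:, :, limb_joints]

x_limb = limb_stream(torch.randn(3, 20, 25), torch.randn(3, 20, 25))
print(x_limb.shape)  # torch.Size([6, 20, 22])
```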
S4, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result, as shown in fig. 2.
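Score-level fusion by weighted addition can be sketched as follows; the per-stream weights are placeholders that would in practice be tuned or learned, not values from the original disclosure.

```python
import torch

def fuse_scores(stream_logits, weights):
    """Weighted addition of per-stream softmax scores (each tensor: batch x classes)."""
    fused = sum(w * torch.softmax(s, dim=1) for s, w in zip(stream_logits, weights))
    return fused.argmax(dim=1)          # final predicted class per sample

joint_logits, bone_logits, limb_logits = (torch.randn(4, 60) for _ in range(3))
pred = fuse_scores([joint_logits, bone_logits, limb_logits], weights=[0.6, 0.6, 0.4])
print(pred)
```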
In this embodiment, the space-time self-adaptive feature fusion graph convolution network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence; the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence. Ten layers of feature extraction modules are used for the joint and bone streams, and eight layers are used for the limb stream.
As shown in fig. 3 (a), taking a single-stream input as an example, the data of each stream pass through feature extraction modules with different numbers of layers, and finally global average pooling and a fully connected layer produce the output scores.
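Under the assumptions above, the single-stream pipeline of fig. 3 (a) (stacked feature extraction modules, global average pooling, fully connected layer) might be organised as in the following sketch; the inner block is reduced to a plain 1×1 convolution so the example stays runnable, and the channel schedule is the sequence listed in this embodiment.

```python
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """Stand-in for one feature extraction module (spatial attention graph
    convolution + temporal adaptive feature fusion); a plain 1x1 convolution
    is used here only to keep the sketch self-contained."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):                         # x: (batch, C, T, N)
        return torch.relu(self.bn(self.conv(x)))

class SingleStreamNet(nn.Module):
    """Stacked feature extraction modules -> global average pooling -> FC."""
    def __init__(self, in_channels=3, num_classes=60,
                 channels=(64, 64, 64, 64, 128, 256)):   # channel schedule from the text
        super().__init__()
        blocks, c_prev = [], in_channels
        for c in channels:
            blocks.append(SimpleBlock(c_prev, c))
            c_prev = c
        self.backbone = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_prev, num_classes)

    def forward(self, x):                          # x: (batch, C, T, N)
        feat = self.backbone(x).mean(dim=(2, 3))   # global average pooling over T and N
        return self.fc(feat)                       # class scores; softmax applied at fusion time

print(SingleStreamNet()(torch.randn(2, 3, 20, 25)).shape)  # torch.Size([2, 60])
```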
As shown in fig. 3 (b), each layer of the feature extraction module comprises a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module connected in sequence; meanwhile, the skeleton data X_in is input to a 1×1 convolution layer whose output is multiplied by the output of the spatial attention graph convolution module and fed to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
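The wiring just described can be rendered roughly as below; SpatialAttentionGC and TemporalAFF are reduced to lightweight stubs for the modules of fig. 3 (c) and fig. 3 (d) so that the sketch remains self-contained, and the 1×1 projection on the residual path is an implementation assumption for the case where the channel count changes.

```python
import torch
import torch.nn as nn

class SpatialAttentionGC(nn.Module):
    """Stub for the spatial attention graph convolution module of fig. 3 (c)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.proj(x)

class TemporalAFF(nn.Module):
    """Stub for the temporal adaptive feature fusion module of fig. 3 (d)."""
    def forward(self, x):
        return x

class FeatureExtractionBlock(nn.Module):
    """Fig. 3 (b): X_in passes through a 1x1 convolution and is multiplied by the
    output of the spatial attention graph convolution, then BN+ReLU; a residual
    connection adds X_in back before the temporal adaptive feature fusion module."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.gate = nn.Conv2d(c_in, c_out, 1)        # 1x1 convolution applied to X_in
        self.sagc = SpatialAttentionGC(c_in, c_out)
        self.bn_relu = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # 1x1 projection on the residual path when channel counts differ (an assumption).
        self.residual = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.taff = TemporalAFF()

    def forward(self, x):                             # x = X_in: (batch, C, T, N)
        y = self.bn_relu(self.gate(x) * self.sagc(x))
        y = y + self.residual(x)                      # residual connection with X_in
        return self.taff(y)

print(FeatureExtractionBlock(64, 64)(torch.randn(2, 64, 20, 25)).shape)
```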
As shown in fig. 3 (c), the spatial attention graph convolution module takes the skeleton data X_in as input. The module comprises two parallel branches, each consisting of a 1×1 convolution layer and a temporal pooling module; the pooled outputs of the two branches are subtracted, passed sequentially through a Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map. Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
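One possible realisation of this channel-wise topology construction is sketched below. Two points are assumptions not fixed by the text: the temporal pooling TP is taken as a mean over the time axis, and the pooled branch outputs are subtracted pairwise across joints (by broadcasting) so that the resulting feature map has the same N×N shape as the adjacency matrix A, which is itself a placeholder here.

```python
import torch
import torch.nn as nn

class SpatialAttentionGraphConv(nn.Module):
    """Fig. 3 (c): two parallel 1x1-conv + temporal-pooling branches; their pooled
    outputs are subtracted (broadcast over joint pairs), passed through Tanh and a
    1x1 convolution to build Q(X_in), and A_cwt = alpha * Q(X_in) + A."""
    def __init__(self, c_in, c_mid, c_out, num_joints=25):
        super().__init__()
        self.phi = nn.Conv2d(c_in, c_mid, 1)                  # branch 1: 1x1 convolution
        self.psi = nn.Conv2d(c_in, c_mid, 1)                  # branch 2: 1x1 convolution
        self.sigma = nn.Conv2d(c_mid, c_out, 1)               # final 1x1 convolution
        self.alpha = nn.Parameter(torch.zeros(1))             # learnable scale alpha
        self.register_buffer('A', torch.eye(num_joints))      # placeholder adjacency matrix A

    def forward(self, x):                                      # x: (batch, C, T, N)
        tp_phi = self.phi(x).mean(dim=2)                       # temporal pooling TP -> (batch, c_mid, N)
        tp_psi = self.psi(x).mean(dim=2)
        diff = tp_phi.unsqueeze(-1) - tp_psi.unsqueeze(-2)     # pairwise subtraction -> (batch, c_mid, N, N)
        q = self.sigma(torch.tanh(diff))                       # Tanh then 1x1 conv -> (batch, c_out, N, N)
        return self.alpha * q + self.A                         # channel-wise topology A_cwt

a_cwt = SpatialAttentionGraphConv(64, 8, 64)(torch.randn(2, 64, 20, 25))
print(a_cwt.shape)  # torch.Size([2, 64, 25, 25])
```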
As shown in fig. 3 (d), the temporal adaptive feature fusion module comprises four branches; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches further contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer. ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1
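The expression for the intermediate value t is given only in the figures of the original document; the helper below therefore uses an ECA-style rule that grows with log2(C_l) and is consistent with the stated roles of abs, C_l, γ = 2 and b = 1, but the exact formula is an assumption.

```python
import math

def dynamic_kernel_size(c_l, gamma=2, b=1):
    """Assumed ECA-style rule: t grows with log2 of the layer's channel dimension
    C_l, and ks is forced to be odd via ks = t if t % 2 else t + 1."""
    t = int(abs((math.log2(c_l) + b) / gamma))
    return t if t % 2 else t + 1

for c_l in (64, 128, 256):
    print(c_l, dynamic_kernel_size(c_l))   # 64 -> 3, 128 -> 5, 256 -> 5 under this assumed rule
```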
The outputs of the four branches are aggregated by a Concat function to obtain the multi-scale temporal feature X_1. The initial skeleton feature X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by X to obtain the initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain the temporal attention fusion feature. The initial attention fusion feature and the temporal attention fusion feature are added to output the feature X', which can be expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·) to perform the initial feature fusion of the input features X and X_1. After the sigmoid activation function, the output values lie between 0 and 1, and the network learns the respective weights through training.
S5, all experiments in this embodiment are performed under the PyTorch deep learning framework and trained on two NVIDIA A800 GPUs. The training parameters are as follows: the initial learning rate is set to 0.1, the weight decay to 0.0004, the parameters are optimised with stochastic gradient descent (SGD) using a Nesterov momentum of 0.9, the maximum number of training epochs is set to 80, and the learning rate is divided by 10 at the 35th and 55th epochs. Model training is well known to those skilled in the art and is not described in detail here.
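The stated hyperparameters correspond, for instance, to the following PyTorch configuration; the model and data loader are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 60)                       # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=0.0004)
# Divide the learning rate by 10 at the 35th and 55th epochs; train for 80 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 55], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(80):
    # for x, y in train_loader:                 # per-batch training loop (placeholder)
    #     optimizer.zero_grad()
    #     loss = criterion(model(x), y)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()
```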
The following are the specific experiments and their descriptions:
This example is compared with advanced models on the NTU RGB+D 60 and NTU RGB+D 120 data sets; as shown in Tables 1 and 2, our model achieves state-of-the-art results on almost all benchmarks.
Table 1: Comparison of top-1 accuracy (%) with state-of-the-art methods on the NTU RGB+D 60 dataset
Table 2: Comparison of top-1 accuracy (%) with state-of-the-art methods on the NTU RGB+D 120 dataset
S6, this embodiment demonstrates the effectiveness of the multi-modal adaptive feature fusion network; all ablation experiments are performed on the NTU RGB+D 60 and NTU RGB+D 120 Cross-Subject benchmarks.
Table 3 demonstrates the effectiveness of the temporal adaptive feature fusion module of fig. 3 (d).
In Table 3, the backbone network is the CTR-GCN model, on which the present invention is developed; the accuracy is improved by 0.7% and 0.8% on the Cross-Subject benchmarks of NTU RGB+D 60 and NTU RGB+D 120, respectively. The temporal adaptive fusion module therefore guides the model to learn action classification better.
Table 4 compares the performance of the invention with that of conventional independent streams on different data streams.
In Table 4, the advanced method CTR-GCN uses the joint stream, bone stream, joint motion stream and bone motion stream, with reproduced results of 89.8%, 90.2%, 87.4% and 86.9%, respectively. The invention uses the joint stream, bone stream and limb stream; relative to CTR-GCN, the joint stream and bone stream results are improved by 0.4% and 0.5%, respectively, and the limb stream, being a fusion of the joint motion stream and the bone motion stream, performs better than either of them.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments, including their components, without departing from the principles and spirit of the invention, and such variations still fall within the scope of the invention.
Claims (9)
1. A skeleton action recognition method based on a space-time self-adaptive feature fusion graph convolution network, which is characterized by comprising the following steps:
S1, acquiring an original data set of human skeleton action sequences, and performing data preprocessing and data enhancement;
S2, processing the skeleton data X ∈ R^(C×T×N) obtained after preprocessing and data enhancement to obtain second-order skeleton information, where X denotes the joint stream X_joint (the first-order skeleton information) and C, T, N are the feature dimension of a joint, the number of frames in the sequence, and the number of joints, respectively; the second-order skeleton information comprises a bone stream, a joint motion stream and a bone motion stream;
S3, aggregating the joint motion stream and the bone motion stream along the channel dimension to form a limb stream;
S4, constructing a space-time self-adaptive feature fusion graph convolution network, wherein the network model comprises a space-time self-adaptive feature fusion module, a global average pooling layer, a fully connected layer and a softmax classifier connected in sequence;
S5, inputting the joint stream X_joint, the bone stream and the limb stream data separately into the space-time self-adaptive feature fusion graph convolution network for training, obtaining the corresponding initial prediction results and softmax scores, and finally fusing them by weighted addition to output the final prediction result.
2. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 1, wherein the preprocessing of the original data set in step S1 is as follows: the whole skeleton action sequence is divided into 20 segments on average, and one frame is randomly selected from each segment to form a new 20-frame sequence.
3. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 2, wherein in step S1 the preprocessed skeleton action sequence is subjected to data enhancement by randomly rotating the skeleton action sequence.
4. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 1, wherein the space-time self-adaptive feature fusion module comprises feature extraction modules whose output channel dimensions are 64, 64, 64, 64, 128 and 256 in sequence.
5. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 4, wherein each layer of the feature extraction module comprises a 1×1 convolution layer, a spatial attention graph convolution module, a BN+ReLU layer and a temporal adaptive feature fusion module; the skeleton data X_in is simultaneously input to the 1×1 convolution layer and the spatial attention graph convolution module, the two outputs are multiplied and input to the BN+ReLU layer, and X_in is added to the output of the BN+ReLU layer through a residual connection before being input to the temporal adaptive feature fusion module.
6. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 5, wherein the spatial attention graph convolution module comprises two parallel branches, each comprising a 1×1 convolution layer, a temporal pooling module and a Tanh module; the outputs of the temporal pooling modules of the two branches are subtracted, then passed sequentially through the Tanh module and a 1×1 convolution layer to construct a feature map, and the feature map is added to a predefined adjacency matrix A to obtain A_cwt, which satisfies the following formula:
A_cwt = αQ(X_in) + A
where α is a learnable parameter and A_cwt is a channel-wise topology map; Q is defined as follows:
Q(X_in) = σ(TP(φ(X_in)) - TP(ψ(X_in)))
where σ, φ and ψ are 1×1 convolution layers and TP is the temporal pooling module.
7. The skeleton action recognition method of claim 5, wherein the temporal adaptive feature fusion module comprises four branches, an attention feature fusion module M and its complement 1-M; each branch contains a 1×1 convolution layer to reduce the channel dimension, and the first three branches contain two dynamic temporal convolutions with kernel size ks×1, dynamic dilated convolutions with dilation rates of 1 and dr respectively, and a max pooling layer;
the outputs of the four branches are aggregated by a Concat function to obtain a multi-scale temporal feature X_1; the initial skeleton data X is input to the attention feature fusion module M in residual form, and the output of M is multiplied by the initial skeleton feature X to obtain an initial attention fusion feature; the multi-scale temporal feature X_1 is input to the attention feature fusion module 1-M, and the output of 1-M is multiplied by X_1 to obtain a temporal attention fusion feature; the initial attention fusion feature and the temporal attention fusion feature are added to output a feature X', expressed as:
X' = M(X) ⊗ X + (1 - M(X_1)) ⊗ X_1
where ⊗ denotes element-wise multiplication and M(·) combines a global context g(·) and a local context l(·).
8. The skeleton action recognition method based on the space-time self-adaptive feature fusion graph convolution network according to claim 7, wherein ks is calculated from an intermediate value t, where abs denotes the absolute value, C_l is the output channel dimension of the l-th layer feature extraction module, and γ and b are set to 2 and 1 respectively; the dynamic convolution kernel and the dynamic dilation rate are obtained from t as follows:
ks = t if t % 2 else t + 1.
9. The method of claim 7, wherein the original data set comprises 25 human joints and the limb stream comprises a total of 22 joints located on the four limbs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310609183.6A CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310609183.6A CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116665300A true CN116665300A (en) | 2023-08-29 |
Family
ID=87714718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310609183.6A Pending CN116665300A (en) | 2023-05-29 | 2023-05-29 | Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665300A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854155A (en) * | 2024-03-07 | 2024-04-09 | 华东交通大学 | Human skeleton action recognition method and system |
CN117854155B (en) * | 2024-03-07 | 2024-05-14 | 华东交通大学 | Human skeleton action recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |