
CN113673313B - Gesture recognition method based on hierarchical convolutional neural network - Google Patents

Gesture recognition method based on hierarchical convolutional neural network

Info

Publication number
CN113673313B
CN113673313B
Authority
CN
China
Prior art keywords
gesture
network
stage
segmentation
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110769676.7A
Other languages
Chinese (zh)
Other versions
CN113673313A (en)
Inventor
周智恒
张明月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110769676.7A priority Critical patent/CN113673313B/en
Publication of CN113673313A publication Critical patent/CN113673313A/en
Application granted
Publication of CN113673313B publication Critical patent/CN113673313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a gesture recognition method based on a hierarchical convolutional neural network, which comprises the following steps: S1, preparing a training data set and model prediction data; S2, constructing a gesture pose estimation network, mainly comprising a gesture mask segmentation network, a preliminary gesture pose estimation network, and a gesture hierarchy network; S3, inputting the training data into the gesture pose estimation network for learning and outputting a predicted pose result; S4, continuously comparing the network's predictions with the corresponding label data in the training data set, calculating the corresponding loss value, feeding the loss value back to the gesture pose estimation network, and continuously correcting the network's parameters; and determining the gesture recognition result for a target according to the pose predicted for each video frame. The method has the advantage of high accuracy and has great application value in fields such as human-computer interaction, virtual reality, and sign language recognition.

Description

Gesture recognition method based on hierarchical convolutional neural network
Technical Field
The invention belongs to the technical field of artificial intelligence (AI), and particularly relates to a gesture recognition method based on a hierarchical convolutional neural network.
Background
Gesture recognition refers to accurately recognizing the positions of gesture key points in an image. Gesture recognition technology plays an important role in fields such as human-computer interaction, virtual reality (VR), and augmented reality (AR). Owing to the development of depth sensors, the technology has transitioned over the past few years from traditional data-glove hardware solutions to computer vision solutions, and large depth-image datasets have driven a wave of progress. However, RGB images are far more abundant in real life, while depth-based approaches are limited by problems such as the detection distance of depth sensors and the low resolution of depth images.
In recent years, with the rapid development of deep neural networks, some researchers have begun to study gesture recognition on RGB images based on deep neural networks. Such methods mainly comprise three stages: first, predicting a gesture mask with a convolutional neural network; then, obtaining a key-point heat map using the gesture mask; and finally, regressing the gesture key-point positions from the heat map.
Existing gesture recognition methods mainly innovate in the first two stages. Methods commonly used in the first stage include: 1. obtaining a gesture bounding box with an object detection algorithm, cropping the hand, and finally binarizing; this approach leads to network redundancy, is very complex, and can also produce large errors in the mask. 2. predicting the gesture mask from a dataset with many annotated gesture masks; because gesture masks require a large amount of annotation data, this approach is labor- and time-intensive. Methods commonly used in the second stage include: 1. directly extracting feature information of the gesture mask with a convolutional neural network and obtaining a gesture key-point heat map from that information. 2. obtaining gesture masks through a cascade of several structure networks and obtaining gesture heat maps for those masks through a cascade of several estimation networks. The second method builds on the first, and the repeated cascading improves accuracy to a certain extent.
Chinese patent application 202110345732.4 provides a gesture recognition key-frame extraction method, apparatus, and readable storage medium. That method is suited to the preprocessing stage of gesture recognition, is difficult to train end to end for gesture recognition, and, since it adopts a skin detection algorithm, is easily disturbed by skin-colored backgrounds; it differs from the present method.
Chinese patent application CN202010042559.6 provides a training method, apparatus, device, and storage medium for a hand gesture recognition model. That method effectively addresses the high annotation cost of gesture masks, but still incurs a certain error in multi-scale gesture mask segmentation and pose estimation; its implementation differs from the present method.
Disclosure of Invention
The main purpose of the invention is to provide a gesture recognition method based on a hierarchical convolutional neural network that innovates mainly in the first and second stages, so as to solve the problems of low recognition accuracy, high gesture segmentation mask cost, and difficult small-scale recognition in existing methods.
The invention is realized by at least one of the following technical schemes.
A gesture recognition method based on a hierarchical convolutional neural network comprises the following steps:
S1, acquiring a training data set and model prediction data;
S2, inputting the training data into a gesture pose estimation network for learning, and outputting a predicted pose result;
the gesture pose estimation network comprises a gesture mask segmentation network, a preliminary gesture pose estimation network, and a gesture hierarchy network;
gesture masks are obtained with the gesture mask segmentation network and fused;
a gesture key-point heat map is predicted with the preliminary gesture pose estimation network; from the key-point heat map, the gesture hierarchy network estimates pose information for each finger and for the palm separately, fuses the thumb with the palm part, fuses the remaining fingers, obtains pose information for these two parts, and finally fuses the two parts into the overall hand pose information;
S3, comparing the prediction of the gesture pose estimation network with the corresponding label data in the training data set and calculating the corresponding loss value; feeding the loss value back to the gesture pose estimation network for parameter correction;
S4, inputting video frames into the corrected gesture pose estimation network, and determining the gesture recognition result for the target in each video frame according to the predicted pose.
Preferably, the gesture mask segmentation network mainly comprises two parallel structure prediction models; each model comprises a first stage and a second stage, each stage applying different 1×1, 5×5, and 1×1 convolution operations, and the output is binary-classified to produce the gesture segmentation mask;
each structure prediction model adopts the synthesized gesture mask as label data, and prediction of the gesture segmentation mask is realized through a cross-entropy loss function.
Preferably, a VGG19 network structure is adopted to extract feature information at different scales from the hand image; the multi-scale features are input into the two structure prediction models of the gesture mask segmentation network respectively; the gesture segmentation mask output by the structure prediction models is fused with the feature map output by the last convolution layer of VGG19, and the fused information is input into the preliminary gesture pose estimation network.
Preferably, the loss function $L_{mask(1,2)}$ of the gesture mask segmentation network is:

$$L_{mask(1,2)}=-\sum_{t=1}^{2}\sum_{g\in G}\sum_{p\in I}\Big[S(p\mid g)\log\hat{S}_t(p\mid g)+\big(1-S(p\mid g)\big)\log\big(1-\hat{S}_t(p\mid g)\big)\Big]$$

where $t$ denotes the stage of the structure prediction model, $t=1$ and $t=2$ denoting the first and second stages respectively (the second stage simultaneously receives the mask segmentation map output by the first stage and the feature map output by VGG19 and performs re-segmentation); $G$ denotes the segmentation map set, comprising a segmentation map for each finger, a palm segmentation map, and a whole-hand segmentation map; $g$ denotes any element of the segmentation map set; $p$ denotes a pixel and $I$ the set of hand pixels; $S(p\mid g)$ denotes the synthesized hand segmentation map; and $\hat{S}_t(p\mid g)$ denotes the predicted hand segmentation map.
Preferably, the preliminary gesture pose estimation network comprises two key-point prediction models; each model comprises a first stage and a second stage, each stage applying different 1×1, 5×5, and 1×1 convolution operations, and coordinate regression is performed on the output to realize preliminary gesture pose estimation;
the preliminary gesture pose estimation network uses a summed mean-squared-error loss function to minimize the loss between the key-point labels in the data set and the predicted key points, so as to update the network parameters;
each key-point prediction model adopts skip connections; the hand segmentation map and the feature map output by VGG19 are input into the first and second stages of the two key-point prediction models respectively, and the second stage simultaneously fuses the output of the first stage.
Preferably, the loss function $L_{2d}$ of the preliminary gesture pose estimation network is:

$$L_{2d}=\sum_{t=1}^{T}\sum_{k=1}^{21}\sum_{p\in I}\left\|\hat{C}_t(p\mid k)-C(p\mid k)\right\|_2^2$$

where $T$ denotes the number of pose estimation stages; $k$ denotes any one of the 21 hand key points; $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid k)$ denotes the sample hand pose information; and $\hat{C}_t(p\mid k)$ denotes the pose information predicted by the preliminary gesture pose estimation network;

$C(p\mid k)$ is expressed as:

$$C(p\mid k)=\exp\!\left(-\frac{\left\|p-x_k\right\|_2^2}{\sigma_{KCM}^2}\right)$$

where $x_k$ denotes the true coordinate of the $k$-th key point, and $\sigma_{KCM}$ denotes the hyperparameter regulating the Gaussian width.
Preferably, the gesture hierarchy network adopts a hierarchical structure, taking the prediction of the preliminary gesture pose estimation network, the hand segmentation map, and the feature map as inputs; it estimates pose information for each finger and for the palm separately, fuses the thumb with the palm part, fuses the remaining fingers, obtains pose information for these two parts, and finally fuses the two parts into the overall hand pose information.
Preferably, the value of the pose loss function $L_h$ is calculated from the pose labels in the data set and the poses predicted by the gesture hierarchy network:

$$L_h=\sum_{s=1}^{S}\sum_{j\in J}\sum_{p\in I}\left\|\hat{C}_s(p\mid j)-C(p\mid j)\right\|_2^2$$

where $s$ denotes the stage of the hierarchy network; $j$ denotes any one of the hierarchical parts (the fingers and the palm); $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid j)$ denotes the sample hand pose information; and $\hat{C}_s(p\mid j)$ denotes the pose information predicted by the gesture hierarchy network.
Preferably, the objective loss function of the gesture pose estimation network is:

$$L=\lambda L_{mask(1,2)}+L_{2d}+L_h$$

where $L$ is the value of the objective loss function, and $\lambda$ denotes the weight of the gesture mask segmentation network's loss value; the parameters of the gesture pose estimation network are adjusted through the value of the objective loss function.
Preferably, the data sets include the OneHand 10K and Panoptic data sets.
Compared with the prior art, the invention has the following advantages:
The invention provides a gesture recognition method based on a hierarchical convolutional neural network in which features at different scales are input into parallel structure prediction models to produce gesture segmentation masks, and the multi-branch masks are then fused. This effectively addresses the high cost of gesture segmentation masks and the difficulty of recognizing small-scale targets. A hierarchical network structure is applied to the resulting gesture heat maps: the thumb-and-palm part and the remaining fingers are fused separately and the two parts are then fused together, further improving the accuracy of the recognized pose information. Experiments show that the invention offers higher accuracy, low cost, and related advantages.
Drawings
FIG. 1 is a flow chart of a gesture recognition method of the present invention;
FIG. 2 is an overall block diagram of a network of the gesture recognition method of the present invention;
FIG. 3 is a diagram of the gesture mask segmentation network of the present invention;
FIG. 4 is a schematic representation of the preliminary gesture pose estimation network of the present invention;
FIG. 5 is a diagram of the gesture hierarchy network of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit it. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
The gesture recognition method based on the hierarchical convolutional neural network shown in fig. 1 comprises the following steps:
s1, preparing a training data set and model prediction data;
Training data set: two datasets were employed, the OneHand 10K and Panoptic datasets.
The OneHand 10K dataset comprises 11703 hand images shot in natural scenes, divided into a training set (10000 images) and a test set (1703 images), and is provided with gesture mask segmentation labels and gesture pose annotations. The dataset is characterized by complex and varied backgrounds, diverse gesture types, differing gesture image sizes, and severe gesture occlusion.
The Panoptic dataset contains 14817 indoor-scene images, each hand image annotated with 21 key points. The dataset was randomly divided into training, validation, and test sets in an 8:1:1 ratio.
Model prediction data: a hand image is an image containing the limb parts of the hand (thumb, index finger, middle finger, ring finger, little finger, and palm). The prediction data may be images acquired by various types of cameras, locally stored images, or hand images obtained from the network; this disclosure does not limit the source. The input image may be resized to a target size, for example 368×368.
S2, constructing a gesture pose estimation network; the gesture pose estimation network includes a gesture mask segmentation network, a preliminary gesture pose estimation network, and a gesture hierarchy network.
The gesture mask segmentation network mainly comprises two parallel structure prediction models; each model comprises a first stage and a second stage, each stage applying different 1×1, 5×5, and 1×1 convolution operations, and the output is binary-classified to produce the gesture segmentation mask.
The structure prediction models adopt the synthesized gesture mask as label data, and prediction of the gesture segmentation mask is realized through a cross-entropy loss function.
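For illustration, a minimal PyTorch sketch of one such structure prediction stage follows; the channel widths and the sigmoid-plus-threshold binary classification are assumptions, since the text only specifies the 1×1, 5×5, 1×1 convolution pattern and a binary output.

```python
import torch
import torch.nn as nn

class StructurePredictionStage(nn.Module):
    """One stage of a structure prediction model: 1x1 -> 5x5 -> 1x1
    convolutions, followed by per-pixel binary classification."""

    def __init__(self, in_channels: int, mid_channels: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),  # hand/background score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid gives a soft mask in [0, 1]; thresholding at 0.5
        # yields the binary gesture segmentation mask.
        return torch.sigmoid(self.layers(x))
```

In the two-stage arrangement described above, the second stage would receive the first stage's mask concatenated with the VGG19 feature map as its input channels.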
The structure prediction model obtains confidence maps for the different limb segments of the hand from the hand pose information provided in the Panoptic dataset, and generates the gesture segmentation map corresponding to the hand image from those confidence maps; the gesture segmentation map comprises hand-region and non-hand-region images.
A hand segmentation map is synthesized from the hand pose information provided in the dataset and the confidence maps of the different limb segments; the synthesized hand segmentation map supervises the gesture mask segmentation network's prediction, and the value of the network's loss function is obtained at the same time.
The loss function $L_{mask(1,2)}$ of the gesture mask segmentation network is:

$$L_{mask(1,2)}=-\sum_{t=1}^{2}\sum_{g\in G}\sum_{p\in I}\Big[S(p\mid g)\log\hat{S}_t(p\mid g)+\big(1-S(p\mid g)\big)\log\big(1-\hat{S}_t(p\mid g)\big)\Big]$$

where $t$ denotes the stage of the structure prediction model, $t=1$ and $t=2$ denoting the first and second stages respectively (the second stage simultaneously receives the mask segmentation map output by the first stage and the feature map output by VGG19 and performs re-segmentation); $G$ denotes the segmentation map set, comprising a segmentation map for each finger, a palm segmentation map, and a whole-hand segmentation map; $g$ denotes any element of the segmentation map set; $p$ denotes a pixel and $I$ the set of hand-segmentation-image pixels; $S(p\mid g)$ denotes the synthesized hand segmentation map; and $\hat{S}_t(p\mid g)$ denotes the predicted hand segmentation map.
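As a hedged sketch of this supervision, the synthesized per-part masks act as binary cross-entropy targets for the masks predicted at both stages; the dictionary-based data layout below is an assumption made only for clarity.

```python
import torch.nn.functional as F

def mask_segmentation_loss(stage_preds, synth_targets):
    """stage_preds: one dict per stage mapping each part g (five fingers,
    palm, whole hand) to a predicted soft mask of shape (B, 1, H, W);
    synth_targets: dict mapping each part to its synthesized mask."""
    loss = 0.0
    for preds in stage_preds:              # sum over stages t = 1, 2
        for part, s_hat in preds.items():  # sum over parts g in G
            loss = loss + F.binary_cross_entropy(s_hat, synth_targets[part])
    return loss
```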
The preliminary gesture pose estimation network comprises two key-point prediction models; each model comprises a first stage and a second stage, each stage applying different 1×1, 5×5, and 1×1 convolution operations, and coordinate regression is performed on the output to realize preliminary gesture pose estimation.
The preliminary gesture pose estimation network uses a summed mean-squared-error loss function to minimize the loss between the key-point labels in the dataset and the predicted key points, so as to update the network parameters.
As shown in fig. 4, the key-point prediction models adopt skip connections; the hand segmentation map and the feature map are input into the first and second stages of the two key-point prediction models respectively, and the second stage simultaneously fuses the output of the first stage.
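A minimal sketch of this two-stage, skip-connected arrangement is given below; concatenation as the fusion operator and the channel counts are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class KeypointStage(nn.Module):
    """One key-point prediction stage: 1x1 -> 5x5 -> 1x1 convolutions
    producing one heat map per hand key point."""

    def __init__(self, in_channels: int, num_keypoints: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_keypoints, 1),
        )

    def forward(self, x):
        return self.net(x)

class PreliminaryPoseNet(nn.Module):
    def __init__(self, feat_channels: int, num_keypoints: int = 21):
        super().__init__()
        base = feat_channels + 1  # VGG19 features + hand segmentation map
        self.stage1 = KeypointStage(base, num_keypoints)
        self.stage2 = KeypointStage(base + num_keypoints, num_keypoints)

    def forward(self, features, hand_mask):
        x = torch.cat([features, hand_mask], dim=1)
        h1 = self.stage1(x)
        # skip connection: stage 2 also sees stage 1's heat maps
        h2 = self.stage2(torch.cat([x, h1], dim=1))
        return h1, h2
```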
and estimating the gesture according to the gesture label and the gesture preliminary gesture in the data set, and calculating the value of the gesture loss function.
The loss function $L_{2d}$ of the preliminary gesture pose estimation network is:

$$L_{2d}=\sum_{t=1}^{T}\sum_{k=1}^{21}\sum_{p\in I}\left\|\hat{C}_t(p\mid k)-C(p\mid k)\right\|_2^2$$

where $T$ denotes the number of pose estimation stages; $k$ denotes any one of the 21 hand key points; $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid k)$ denotes the sample hand pose information; and $\hat{C}_t(p\mid k)$ denotes the pose information predicted by the preliminary gesture pose estimation network;

$C(p\mid k)$ is expressed as:

$$C(p\mid k)=\exp\!\left(-\frac{\left\|p-x_k\right\|_2^2}{\sigma_{KCM}^2}\right)$$

where $x_k$ denotes the true coordinate of the $k$-th key point, and $\sigma_{KCM}$ denotes the hyperparameter regulating the Gaussian width.
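The target confidence map $C(p\mid k)$ can be synthesized directly from this definition. A small illustrative sketch follows; the map size and the $\sigma_{KCM}$ value are assumptions, as the text does not state them.

```python
import numpy as np

def keypoint_confidence_map(xk, h, w, sigma_kcm=7.0):
    """Gaussian target map for one key point: xk is the true (x, y)
    coordinate x_k; sigma_kcm regulates the Gaussian width."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - xk[0]) ** 2 + (ys - xk[1]) ** 2   # ||p - x_k||^2
    return np.exp(-d2 / sigma_kcm ** 2)

# e.g. a 46x46 target map for a key point at pixel (20, 30)
C = keypoint_confidence_map((20, 30), 46, 46)
```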
2D gesture pose estimation is very important for subsequent research on 3D gesture pose estimation: the more accurate the 2D pose information, the smaller the later 3D estimation error. A hierarchical network structure is applied to the gesture key-point heat map obtained in step S2 to improve the accuracy of the pose information. First, a six-branch network structure is adopted, with one branch each for the little finger, ring finger, middle finger, index finger, thumb, and palm, producing a heat map for each part. The thumb and palm branches are then fused, mainly because the thumb and palm are the most strongly correlated, and the remaining fingers are fused separately. Finally, the heat maps obtained from these two branches are merged.
The gesture hierarchy network adopts a hierarchical mode (a six-branch network structure corresponding to the little finger, ring finger, middle finger, index finger, thumb, and palm branches respectively), taking the prediction of the preliminary gesture pose estimation network, the hand segmentation map, and the feature map as inputs. The network estimates pose information for each finger and for the palm separately, fuses the thumb with the palm part, and fuses the remaining fingers, obtaining pose information for these two parts; because the parts within each group are strongly correlated, the pose information can be adjusted more finely. Finally, the two parts are fused into the overall hand pose information, as shown in figs. 2 and 5, where P-F_f1 to P-F_f3 in fig. 5 denote the 2D heat maps estimated by layers 1 to 3 of the palm branch respectively.
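The fusion order just described can be sketched as below; the fusion operator (concatenation followed by a 1×1 convolution) and the channel counts are illustrative assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class GestureHierarchy(nn.Module):
    """Six per-part branches (five fingers + palm); thumb fused with palm,
    the other four fingers fused together, then the two groups fused into
    whole-hand heat maps."""

    PARTS = ("little", "ring", "middle", "index", "thumb", "palm")

    def __init__(self, in_channels: int, part_channels: int = 32,
                 num_keypoints: int = 21):
        super().__init__()
        self.branches = nn.ModuleDict({
            p: nn.Conv2d(in_channels, part_channels, 3, padding=1)
            for p in self.PARTS
        })
        self.fuse_thumb_palm = nn.Conv2d(2 * part_channels, part_channels, 1)
        self.fuse_fingers = nn.Conv2d(4 * part_channels, part_channels, 1)
        self.fuse_hand = nn.Conv2d(2 * part_channels, num_keypoints, 1)

    def forward(self, x):
        h = {p: self.branches[p](x) for p in self.PARTS}
        tp = self.fuse_thumb_palm(torch.cat([h["thumb"], h["palm"]], dim=1))
        fg = self.fuse_fingers(torch.cat(
            [h["little"], h["ring"], h["middle"], h["index"]], dim=1))
        return self.fuse_hand(torch.cat([tp, fg], dim=1))
```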
The value of the pose loss function $L_h$ is calculated from the pose labels in the data set and the poses predicted by the gesture hierarchy network:

$$L_h=\sum_{s=1}^{S}\sum_{j\in J}\sum_{p\in I}\left\|\hat{C}_s(p\mid j)-C(p\mid j)\right\|_2^2$$

where $s$ denotes the stage of the hierarchy network; $j$ denotes any one of the hierarchical parts (the fingers and the palm); $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid j)$ denotes the sample hand pose information; and $\hat{C}_s(p\mid j)$ denotes the pose information predicted by the gesture hierarchy network.
The objective loss function of the gesture pose estimation network is obtained from the loss value of the gesture mask segmentation network, the loss value of the preliminary gesture pose estimation network, and the loss value of the gesture hierarchy network:

$$L=\lambda L_{mask(1,2)}+L_{2d}+L_h$$

where $L$ is the value of the objective loss function, and $\lambda$ denotes the weight of the gesture mask segmentation network's loss value.
and adjusting parameters of the gesture posture estimation network through the value of the target loss function, stopping updating the parameters of the gesture posture estimation network when the value of the target loss function reaches a minimum value, and finally obtaining a gesture posture estimation model.
In this embodiment, as shown in fig. 2, VGG19 is adopted as the basic network structure. First, feature maps at different scales are collected through VGG19; because features at different depths suit targets of different sizes, feature information is extracted at each scale and input separately into the gesture mask segmentation network. That network mainly consists of two parallel structure prediction models, each receiving feature information at a different scale from VGG19, and the structure prediction models' outputs are fused with the feature information from the last convolution layer of VGG19. The fused information is input into the preliminary gesture pose estimation network, whose key-point prediction models adopt skip connections: the hand segmentation map and the feature map are input into the first and second stages respectively, and the second stage simultaneously receives the output of the first stage. The prediction of the preliminary gesture pose estimation network, the hand segmentation map, and the feature map then serve as the inputs of the gesture hierarchy network, which estimates pose information for each finger and the palm separately, fuses the thumb with the palm part, fuses the remaining fingers, and finally fuses the two parts into the overall hand pose information.
Because gesture targets vary in size, small targets can be mis-segmented during gesture mask segmentation, and the VGG19 network model understands targets differently at different depths. The feature information finally produced by Block4 and Block5 is therefore extracted separately and input into the structure prediction models for gesture mask segmentation. As shown in fig. 3, the corresponding feature maps are output from module 4 (Block4) and module 5 (Block5) of VGG19; because feature maps of different scales recognize targets with different abilities, the two feature maps represent abstract features of the hand image at different scales.
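A small sketch of collecting the two scales from torchvision's VGG19 follows; the slice indices (through relu4_4 and relu5_4) are assumptions about where Block4 and Block5 end, since the text does not publish exact cut points.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(weights=None).features.eval()
block4 = vgg[:27]   # through relu4_4 (end of Block4, assumed cut point)
block5 = vgg[:36]   # through relu5_4 (end of Block5, assumed cut point)

x = torch.randn(1, 3, 368, 368)       # input at the 368x368 target size
with torch.no_grad():
    feat4 = block4(x)   # (1, 512, 46, 46): finer features for small hands
    feat5 = block5(x)   # (1, 512, 23, 23): deeper features for large hands
```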
S3, inputting the training data into the gesture pose estimation network for learning, and outputting a predicted pose result;
In the embodiment of the invention, the input images are 368×368 and the number of hand key points is 21. The gesture pose information is learned through four network structures (the feature extraction network, the gesture mask segmentation network, the preliminary gesture pose estimation network, and the gesture hierarchy network). The feature extraction network acquires feature maps at different scales and feeds them as multiple branches into the gesture mask segmentation network, whose multi-branch mask segmentation results are fused. The output of the previous step is input into the key-point prediction model and skip-connected to the output of the first module and to the next key-point prediction model, with each fused in turn. The gesture heat map obtained in the previous step is then re-divided into six parts, one for each finger and the palm, and the pose of each part is estimated.
S4, comparing the result recognized by the gesture pose estimation network with the corresponding label data in the training data set, calculating the corresponding loss, feeding the loss value back to the gesture pose estimation network, and correcting the network's parameters so that the pose prediction achieves the best effect; the gesture recognition result for the target is then determined according to the pose predicted for each video frame.
The parameters of the gesture pose estimation network are adjusted through the value of the objective loss function; when the value of the objective loss function reaches its minimum, parameter updating stops, and the gesture pose estimation model is finally obtained.
The value of the loss function is continuously reduced with a gradient descent optimization algorithm. The choice of gradient descent algorithm is important to network training, and a good one determines the final classification effect. Gradient descent algorithms fall mainly into the SGD family and the non-SGD (adaptive) family; compared with other adaptive algorithms, the Adam gradient descent algorithm converges faster and learns more effectively.
The invention adopts the Adam gradient descent algorithm, an algorithm for optimizing a stochastic objective function based on first-order gradients. The algorithm dynamically adjusts the learning rate of each parameter according to the first- and second-moment estimates of the gradient of the loss function with respect to that parameter, keeping the relation between the parameter change $\Delta\theta_t$ (the amount by which the parameters change in the current step) and the global learning rate $\alpha$ approximately $|\Delta\theta_t|\lesssim\alpha$, so that the number of updates needed for a parameter to approach the optimal solution can be inferred from $\alpha$; during training, when $\Delta\theta$ is far greater than $\alpha$, the sample can be judged to be a noise point or the current update judged not worth taking. Adam first computes the gradient $g_t$ and estimates its first and second moments:

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$$

$$u_t=\beta_2 u_{t-1}+(1-\beta_2)g_t^2$$

where $\theta$ is the parameter to be solved; $m_t$ is the first moment of the gradient $g_t$; $u_t$ is the second moment of $g_t$; $\beta_1$ is the first-moment decay coefficient; $\beta_2$ is the second-moment decay coefficient; and $g_t$ is the gradient of the objective function $f(\theta)$ (commonly the loss function) with respect to $\theta$. $m_0$ and $u_0$ are initialized as $d$-dimensional zero vectors, and the bias corrections of $m_t$ and $u_t$ are:

$$\hat{m}_t=\frac{m_t}{1-\beta_1^t},\qquad \hat{u}_t=\frac{u_t}{1-\beta_2^t}$$

where $\hat{m}_t$ is the bias correction of $m_t$ and $\hat{u}_t$ is the bias correction of $u_t$;

the update rule of Adam is:

$$\theta_{t+1}=\theta_t-\frac{\alpha\,\hat{m}_t}{\sqrt{\hat{u}_t}+\epsilon}$$

where $\theta_{t+1}$ is the parameter after the current update; $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$ are hyperparameters; $\alpha$ is set to 0.001, $\beta_1$ to 0.9, $\beta_2$ to 0.9999, and $\epsilon$ to 1e-8.
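In practice this corresponds to PyTorch's built-in Adam optimizer configured with the hyperparameters stated above; the model and loss below are stand-ins, since the full network assembly is only sketched piecewise in this text.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, 3, padding=1)   # stand-in for the pose network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,              # alpha
    betas=(0.9, 0.9999),   # beta_1, beta_2
    eps=1e-8,
)

# one illustrative training step: compute loss, backpropagate, update
x = torch.randn(1, 3, 368, 368)
target = torch.randn(1, 21, 368, 368)
optimizer.zero_grad()
loss = ((model(x) - target) ** 2).sum()  # stand-in for L = lambda*L_mask + L_2d + L_h
loss.backward()
optimizer.step()
```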
Specifically, determining the gesture recognition result for the target according to the pose predicted for each video frame comprises the following steps:
1) Acquiring a gesture target video and extracting pictures from the video frame by frame.
The gesture target video can be acquired by various cameras, or a locally stored or network-acquired target video can be used; pictures are extracted from the video frame by frame.
2) Predicting the gesture pose of each frame image through the gesture pose estimation network.
First, feature maps at different scales are extracted through the VGG19 network; the segmentation mask and the VGG19 feature maps are input into the preliminary gesture pose estimation network; the results of the feature extraction network, the gesture mask segmentation network, and the preliminary gesture pose estimation network are input into the gesture hierarchy network, finally yielding the predicted pose.
3) Determining the target gesture recognition result according to the pose.
Using the poses corresponding to the video frames together with the gesture pose templates and instructions stored in the device in advance, the instruction corresponding to each gesture can be identified, enabling intelligent control of the device.
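For illustration, a hedged end-to-end inference sketch follows: frames are read from a video with OpenCV, resized to the 368×368 input size, and passed through the trained network. Here `pose_net` and `decode_keypoints` are hypothetical placeholders for the trained gesture pose estimation model and a heat-map-to-coordinate decoding step, and "gesture.mp4" is an assumed file name.

```python
import cv2
import torch

cap = cv2.VideoCapture("gesture.mp4")   # assumed video file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (368, 368))
    # HWC uint8 -> NCHW float in [0, 1]
    x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        heatmaps = pose_net(x)               # hypothetical trained model
    keypoints = decode_keypoints(heatmaps)   # hypothetical, e.g. per-map argmax
    # match keypoints against stored gesture templates to emit a device command
cap.release()
```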
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (8)

1. A gesture recognition method based on a hierarchical convolutional neural network, characterized in that the method comprises the following steps:
S1, acquiring a training data set and model prediction data;
S2, inputting the training data into a gesture pose estimation network for learning, and outputting a predicted pose result;
the gesture pose estimation network comprises a gesture mask segmentation network, a preliminary gesture pose estimation network, and a gesture hierarchy network;
the gesture hierarchy network adopts a hierarchical mode, taking the prediction of the preliminary gesture pose estimation network, the hand segmentation map, and the feature map as inputs; the gesture hierarchy network estimates pose information for each finger and for the palm separately, fuses the thumb with the palm part, and fuses the remaining fingers, obtaining pose information for these two parts; because the parts within each group are correlated, the pose information can be adjusted more finely; finally, the two parts are fused into the overall hand pose information;
the value of the pose loss function $L_h$ is calculated from the pose labels in the training data set and the poses predicted by the gesture hierarchy network:

$$L_h=\sum_{s=1}^{S}\sum_{j\in J}\sum_{p\in I}\left\|\hat{C}_s(p\mid j)-C(p\mid j)\right\|_2^2$$

where $s$ denotes the stage of the hierarchy network; $j$ denotes any one of the hierarchical parts (the fingers and the palm); $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid j)$ denotes the sample hand pose information; and $\hat{C}_s(p\mid j)$ denotes the pose information predicted by the gesture hierarchy network;
gesture masks are obtained with the gesture mask segmentation network and fused;
a gesture key-point heat map is predicted with the preliminary gesture pose estimation network; heat maps for each finger and for the palm part are obtained separately through the gesture hierarchy network, the thumb is fused with the palm part, the remaining fingers are fused, pose information for the two parts is obtained, and finally the two parts are fused into the overall hand pose information;
S3, comparing the prediction of the gesture pose estimation network with the corresponding label data in the training data set and calculating the corresponding loss value; feeding the loss value back to the gesture pose estimation network for parameter correction;
S4, inputting video frames into the corrected gesture pose estimation network, and determining the gesture recognition result for the target in each video frame according to the predicted pose.
2. The gesture recognition method based on a hierarchical convolutional neural network according to claim 1, characterized in that the gesture mask segmentation network comprises two parallel structure prediction models; each model comprises a first stage and a second stage, the first stage adopting a 1×1 convolution operation and the second stage adopting 5×5 and 1×1 convolution operations, and the output is binary-classified to produce the gesture segmentation mask;
each structure prediction model adopts the synthesized gesture mask as label data, and prediction of the gesture segmentation mask is realized through a cross-entropy loss function.
3. The gesture recognition method based on a hierarchical convolutional neural network according to claim 1 or 2, characterized in that a VGG19 network structure is adopted to extract feature information at different scales from the hand image; the multi-scale features are input into the two structure prediction models of the gesture mask segmentation network respectively; the gesture segmentation mask output by the structure prediction models is fused with the feature map output by the last convolution layer of VGG19; and the fused information is input into the preliminary gesture pose estimation network.
4. The gesture recognition method based on a hierarchical convolutional neural network according to claim 3, characterized in that the loss function $L_{mask(1,2)}$ of the gesture mask segmentation network is:

$$L_{mask(1,2)}=-\sum_{t=1}^{2}\sum_{g\in G}\sum_{p\in I}\Big[S(p\mid g)\log\hat{S}_t(p\mid g)+\big(1-S(p\mid g)\big)\log\big(1-\hat{S}_t(p\mid g)\big)\Big]$$

where $t$ denotes the stage of the structure prediction model, $t=1$ and $t=2$ denoting the first and second stages respectively, the second stage simultaneously receiving the mask segmentation map output by the first stage and the feature map output by VGG19 for re-segmentation; $G$ denotes the segmentation map set, comprising a segmentation map for each finger, a palm segmentation map, and a whole-hand segmentation map; $g$ denotes any element of the segmentation map set; $p$ denotes a pixel and $I$ the set of hand pixels; $S(p\mid g)$ denotes the synthesized hand segmentation map; and $\hat{S}_t(p\mid g)$ denotes the predicted hand segmentation map.
5. The gesture recognition method based on a hierarchical convolutional neural network according to claim 4, characterized in that the preliminary gesture pose estimation network comprises two key-point prediction models; each model comprises a first stage and a second stage, each stage adopting 1×1, 5×5, and 1×1 convolution operations, and coordinate regression is performed on the output to realize preliminary gesture pose estimation;
the preliminary gesture pose estimation network uses a summed mean-squared-error loss function to minimize the loss between the key-point labels in the training data set and the predicted key points, so as to update the network parameters;
each key-point prediction model adopts skip connections; the hand segmentation map and the feature map output by VGG19 are input into the first and second stages of the two key-point prediction models respectively, and the second stage simultaneously fuses the output of the first stage.
6. The gesture recognition method based on a hierarchical convolutional neural network according to claim 5, characterized in that the loss function $L_{2d}$ of the preliminary gesture pose estimation network is:

$$L_{2d}=\sum_{t=1}^{T}\sum_{k=1}^{21}\sum_{p\in I}\left\|\hat{C}_t(p\mid k)-C(p\mid k)\right\|_2^2$$

where $T$ denotes the number of pose estimation stages; $k$ denotes any one of the 21 hand key points; $p$ denotes a pixel and $I$ the set of hand pixels; $C(p\mid k)$ denotes the sample hand pose information; and $\hat{C}_t(p\mid k)$ denotes the pose information predicted by the preliminary gesture pose estimation network;

$C(p\mid k)$ is expressed as:

$$C(p\mid k)=\exp\!\left(-\frac{\left\|p-x_k\right\|_2^2}{\sigma_{KCM}^2}\right)$$

where $x_k$ denotes the true coordinate of the $k$-th key point, and $\sigma_{KCM}$ denotes the hyperparameter regulating the Gaussian width.
7. The gesture recognition method based on a hierarchical convolutional neural network according to claim 6, characterized in that the objective loss function of the gesture pose estimation network is:

$$L=\lambda L_{mask(1,2)}+L_{2d}+L_h$$

where $L$ is the value of the objective loss function, and $\lambda$ denotes the weight of the gesture mask segmentation network's loss value; the parameters of the gesture pose estimation network are updated through the value of the objective loss function.
8. The gesture recognition method based on a hierarchical convolutional neural network according to claim 7, characterized in that the training data set comprises the OneHand 10K and Panoptic data sets.
CN202110769676.7A 2021-07-07 2021-07-07 Gesture recognition method based on hierarchical convolutional neural network Active CN113673313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769676.7A CN113673313B (en) 2021-07-07 2021-07-07 Gesture recognition method based on hierarchical convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110769676.7A CN113673313B (en) 2021-07-07 2021-07-07 Gesture recognition method based on hierarchical convolutional neural network

Publications (2)

Publication Number Publication Date
CN113673313A CN113673313A (en) 2021-11-19
CN113673313B true CN113673313B (en) 2024-04-09

Family

ID=78538680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769676.7A Active CN113673313B (en) 2021-07-07 2021-07-07 Gesture recognition method based on hierarchical convolutional neural network

Country Status (1)

Country Link
CN (1) CN113673313B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962332B (en) * 2021-11-29 2024-06-14 重庆理工大学 Salient target identification method based on self-optimizing fusion feedback
CN118675221A (en) * 2023-03-14 2024-09-20 北京字跳网络技术有限公司 Model construction and object identification methods, devices, equipment, media and products
CN118305818B (en) * 2024-06-07 2024-08-13 烟台大学 Bionic manipulator control method and system based on double-hand interaction attitude estimation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2
CN112836597A (en) * 2021-01-15 2021-05-25 西北大学 Multi-hand posture key point estimation method based on cascade parallel convolution neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN110188598A (en) * 2019-04-13 2019-08-30 大连理工大学 A kind of real-time hand Attitude estimation method based on MobileNet-v2
CN112836597A (en) * 2021-01-15 2021-05-25 西北大学 Multi-hand posture key point estimation method based on cascade parallel convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Industrial robot operation control method based on gesture recognition (基于手势识别的工业机器人操作控制方法); Jiang Suifeng; Li Yanchun; Xiao Nanfeng; Journal of Computer Applications (计算机应用), No. 12, pp. 3486-3498 *

Also Published As

Publication number Publication date
CN113673313A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN113673313B (en) Gesture recognition method based on hierarchical convolutional neural network
CN109800689B (en) Target tracking method based on space-time feature fusion learning
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN111191583B (en) Space target recognition system and method based on convolutional neural network
CN107103613B (en) A kind of three-dimension gesture Attitude estimation method
CN104200494B (en) Real-time visual target tracking method based on light streams
CN107016689A (en) A kind of correlation filtering of dimension self-adaption liquidates method for tracking target
EP3908964A1 (en) Detecting pose using floating keypoint(s)
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
CN113312973B (en) Gesture recognition key point feature extraction method and system
CN110111370B (en) Visual object tracking method based on TLD and depth multi-scale space-time features
CN112258557B (en) Visual tracking method based on space attention feature aggregation
CN111027586A (en) Target tracking method based on novel response map fusion
CN107657627B (en) Space-time context target tracking method based on human brain memory mechanism
CN103839280B (en) A kind of human body attitude tracking of view-based access control model information
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking
CN113888603A (en) Loop detection and visual SLAM method based on optical flow tracking and feature matching
GB2593718A (en) Image processing system and method
CN117115474A (en) End-to-end single target tracking method based on multi-stage feature extraction
CN115345902A (en) Infrared image dim target detection tracking method and system based on machine learning
Wang et al. Top-Down Meets Bottom-Up for Multi-Person Pose Estimation
CN117237984B (en) MT leg identification method, system, medium and equipment based on label consistency
CN117292421B (en) GRU-based continuous vision estimation deep learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant