CN107622257A - A kind of neural network training method and three-dimension gesture Attitude estimation method - Google Patents
A kind of neural network training method and three-dimension gesture Attitude estimation method
- Publication number
- CN107622257A
- Application number
- CN201710954487.0A
- Authority
- CN
- China
- Prior art keywords
- gesture
- neural network
- depth
- network
- Prior art date
- 2017-10-13
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a neural network training method and a three-dimensional gesture pose estimation method. The training method includes: S1: acquiring, by a depth camera, a data set comprising a plurality of gesture depth maps; S2: training a random forest learner with the data set of step S1; S3: segmenting the gesture depth maps in the data set of step S1 with the random forest learner to obtain gesture sub-maps, processing the gesture sub-maps to obtain processing maps, and shuffling the processing maps together with the gesture depth maps of the data set of step S1 before dividing them into a training set and a test set; S4: using the training set and test set obtained in step S3 to train a convolutional neural network, obtaining a network model. The three-dimensional gesture pose estimation method estimates the three-dimensional gesture pose in a single depth picture using this network model. The invention can accurately identify the specific positions and poses of the palm and fingers in a gesture.
Description
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a neural network training method and a three-dimensional gesture pose estimation method.
Background
In recent years, with the rapid development of computer vision and deep learning, virtual reality and augmented reality technologies have been gradually popularized and still have immeasurable prospects. As an important means of human-computer interaction, gesture recognition has attracted great attention in the computer vision community; however, because the human hand has many joints, a complicated shape and a high degree of freedom, and is prone to occlusion, rapid and accurate recognition of hand position and hand motion has always been a difficult problem.
Conventional gesture pose estimation methods can generally be divided into two categories: sensor based and image based. In sensor-based estimation, sensors such as accelerometers and gyroscopes are fixed at specific parts of a person's palm and fingers; the position and motion state of each part are acquired through the worn sensors, and the states of the palm and fingers are then calculated by kinematics, achieving gesture pose estimation. Because sensor equipment must be worn, this approach greatly limits gesture detection, and detection errors are generally large owing to sensor accuracy and changes of the wearing position. Image-based estimation typically applies edge- or region-based methods, such as edge detection and skin color detection, to images of human hands shot by an RGB camera: the approximate region of the hand in the image is determined first, and details such as fingers and wrists are then separated by image segmentation. Because a picture taken by an ordinary camera generally reflects only the planar information of a scene, if occlusion occurs between fingers, the motion details of the occluded fingers cannot be identified, so large errors remain.
The above background disclosure is only intended to assist understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application; it should not be used to evaluate the novelty and inventive step of the present application in the absence of clear evidence that the above content was disclosed at the filing date of the present patent application.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a neural network training method and a three-dimensional gesture pose estimation method, which can accurately identify the specific positions and poses of the palm and fingers in a gesture.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a neural network training method, which comprises the following steps:
s1: acquiring, by a depth camera, a data set comprising a plurality of gesture depth maps;
s2: training a random forest learner by using the data set of the step S1;
s3: segmenting the gesture depth maps in the data set of step S1 with the random forest learner to obtain gesture sub-maps, processing the gesture sub-maps to obtain processing maps, and shuffling the processing maps together with the gesture depth maps in the data set of step S1 before dividing them into a training set and a test set;
s4: using the training set and the test set obtained in step S3 to train a convolutional neural network, obtaining a trained network model.
Preferably, the processing of the gesture sub-maps in step S3 to obtain processing maps includes: S32: projecting the gesture sub-map along the three axes X, Y, Z to obtain three single-channel projection maps; wherein the processing maps include the projection maps of step S32.
Preferably, the processing of the gesture sub-maps in step S3 to obtain processing maps further includes: S33: down-sampling each of the three projection maps to obtain a plurality of down-sampled maps of different sizes; wherein the processing maps include the projection maps of step S32 and the down-sampled maps of step S33.
Preferably, step S1 specifically includes:
s11: acquiring a plurality of gesture depth maps of different people by adopting a plurality of depth cameras;
s12: labeling each gesture depth map, and storing a plurality of gesture depth maps and corresponding labeling information in a data set.
Preferably, the labeling of each gesture depth map in step S12 specifically includes: labeling coordinate information (x, y, d) at predetermined positions of the fingers and palm in each gesture depth map, wherein x and y are the horizontal and vertical coordinates on the gesture depth map and d is the pixel depth.
Preferably, the predetermined position of the finger comprises all joint points of the finger.
Preferably, step S4 specifically includes:
s41: randomly selecting m pictures and corresponding label information from the training set, and randomly selecting n pictures and corresponding label information from the testing set;
s42: the pictures pass through the convolutional layers in the network;
s43: pictures are passed through a pooling layer in the network;
s44: the output layer restores the picture;
s45: calculating the error between the network output and the label information, learning the network, and updating the network parameters;
s46: iterating steps S42-S45 repeatedly, continuously updating the parameters until they converge; the trained parameters are stored, finally yielding the trained network model.
Preferably, step S45 is specifically: the error between the network output and the label information is calculated by the formula

Error = (1/2)·||Ĵ − J||²

wherein Ĵ is the predicted label coordinates, composed of (ĵ1, ĵ2, ..., ĵn); J is the original label, composed of (j1, j2, ..., jn); and n is the number of labels;

assuming that the network parameter of a neuron in the network is ω, the network parameter is updated according to the formula

ω' = ω + ∂Error/∂ω.
the invention also discloses a three-dimensional gesture posture estimation method, and the three-dimensional gesture posture in a single depth picture is estimated by adopting the network model obtained by training of the neural network training method.
Compared with the prior art, the invention has the following beneficial effects. In the neural network training method, the gesture depth maps acquired by the depth camera accurately record the pose and position information of the palm and of each finger, and segmenting the gesture depth maps with the random forest learner helps to mine the feature information of the gesture in each picture. The resulting set of pictures is used to train a residual neural network whose convolution and pooling layers can learn features in regions of different scales of a picture, so the trained network model, applied to three-dimensional gesture pose estimation, weakens the influence of occlusion; and being image based, the method is not constrained by wearable equipment. Using a residual convolutional neural network avoids gradient vanishing during back propagation and parameter updating, so the network trains better. The three-dimensional gesture pose estimation method combines deep learning with a depth camera for gesture recognition, which reduces the influence of factors such as illumination changes and object occlusion.
In a further scheme, projecting the gesture sub-map along the horizontal, vertical and depth axes yields pictures giving a three-dimensional view of the scene, which makes it easier to mine the feature information of the gesture. Furthermore, down-sampling the gesture sub-maps into multi-scale pictures of different sizes allows pixel features and region features at different scales to be exploited better, so the trained network model identifies the specific positions and poses of the palm and fingers in a gesture more accurately, and can accurately recognize the detailed information of occluded parts of the gesture.
Drawings
FIG. 1 is a flow chart of the three-dimensional gesture pose estimation method according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of the mark points of the fingers and palm according to the preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of the steps of the neural network training method according to the preferred embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and preferred embodiments.
As shown in FIG. 1, the three-dimensional gesture pose estimation method of the preferred embodiment of the present invention includes the following steps:
s1: collecting a data set of a gesture depth map; the method specifically comprises the following steps:
s11: acquiring gesture depth pictures of different people with a plurality of depth cameras; for each gesture of each person, acquiring a plurality of pictures covering different angles and different poses, and sorting the acquired pictures into a picture library;
s12: labeling each picture in the picture library. In order to accurately locate the detailed positions and poses of the joints of the gesture, in this embodiment coordinate information (x, y, d) is marked at specific positions of the fingers and palm, where x and y are the horizontal and vertical coordinates on the gesture depth image and d is the pixel depth, which represents the gesture depth in the image. As shown in FIG. 2, a number of key points at specific positions of the fingers and palm are set as mark points; the hand in each picture is marked as the label of that picture in the picture library, and the picture names and corresponding labels are stored in a file. The key points marked in FIG. 2 include all the joint points of the five fingers and an important position point of the palm; by accurately predicting the position of each joint point, the pose of the hand can be accurately estimated. A sketch of one possible annotation format follows.
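As a concrete illustration, the following is a minimal sketch of such an annotation record, assuming a JSON-lines file and a 21-point hand model; the file layout, field names and joint count are assumptions for illustration, not the patent's actual storage format:

```python
# Hypothetical annotation format: one JSON record per labeled picture.
import json

NUM_KEYPOINTS = 21  # assumed: 4 joints per finger x 5 fingers + 1 palm point

def save_label(label_path, image_name, keypoints):
    """keypoints: list of NUM_KEYPOINTS (x, y, d) tuples (pixel coords + depth)."""
    assert len(keypoints) == NUM_KEYPOINTS
    record = {"image": image_name,
              "keypoints": [{"x": x, "y": y, "d": d} for (x, y, d) in keypoints]}
    with open(label_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```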
S2: training a random forest learner by using the data set obtained in the step S1;
S3: preprocessing the gesture depth maps in the data set; as shown in FIG. 3, this specifically includes the following steps:
s31: segmenting the gesture depth maps in the data set of step S1 with the random forest learner to obtain gesture sub-maps;
s32: projecting the gesture sub-map along the three axes X, Y, Z to obtain three single-channel projection maps;
s33: down-sampling each of the three projection maps to obtain a plurality of down-sampled maps of different sizes;
s34: sorting all the gesture depth maps in the data set of step S1, the projection maps obtained in step S32 and the down-sampled maps obtained in step S33, shuffling them, and dividing the data into a training set and a test set in the proportion 90% / 10%; a combined sketch of steps S32-S34 follows.
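The following is a combined sketch of steps S32-S34 under stated assumptions: the gesture sub-map is a single-channel numpy depth image, "projection along the X, Y, Z axes" is read as orthographic projections onto the three coordinate planes, and the down-sampling sizes are illustrative; only the 90%/10% split comes directly from the text.

```python
import random
import numpy as np
import cv2

def project_three_views(depth, d_max=256):
    """Return three single-channel projections of a depth sub-map (S32)."""
    h, w = depth.shape
    xy = depth.copy()                          # front view: the depth map itself
    xz = np.zeros((d_max, w), np.uint8)        # top view: column vs quantized depth
    yz = np.zeros((h, d_max), np.uint8)        # side view: row vs quantized depth
    ys, xs = np.nonzero(depth)
    ds = np.clip(depth[ys, xs].astype(int), 0, d_max - 1)
    xz[ds, xs] = 255
    yz[ys, ds] = 255
    return xy, xz, yz

def multiscale(img, sizes=((96, 96), (48, 48), (24, 24))):
    """Down-sample one projection to several sizes (S33); sizes are assumed."""
    return [cv2.resize(img, s, interpolation=cv2.INTER_AREA) for s in sizes]

def split_dataset(samples, train_ratio=0.9, seed=0):
    """Shuffle picture/label pairs and split 90% / 10% (S34)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```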
S4: using the training set and the test set obtained in the step S34 to train a convolutional neural network, and training to obtain a network model; the method specifically comprises the following steps:
s41: randomly selecting m pictures and corresponding label information from the training set, and randomly selecting n pictures and corresponding label information from the testing set;
s42: the pictures pass through the convolutional layers in the network. Assume the original size of a picture is l*l; k square matrices of the same size but different values are selected as convolution kernels, so the set of kernels can be represented as k*c*c, where k is the number of convolution kernels and c is the number of parameters along each dimension of a kernel (its side length). Each picture is convolved with each of the k kernels, yielding k pictures of the same size but with different pixel values. The new size lc*lc is given by:
lc*lc = (l-c+1)*(l-c+1)
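A quick sanity check of the convolution size formula (stride 1 and no padding are assumed, as the formula implies; the concrete numbers are only an example):

```python
def conv_output_size(l, c):
    """Valid convolution of an l x l picture with a c x c kernel (stride 1)."""
    return l - c + 1

assert conv_output_size(96, 5) == 92  # e.g. a 96x96 picture and 5x5 kernel -> 92x92
```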
s43: the pictures pass through a pooling layer in the network. Assume the size of a picture before entering the pooling layer is l*l; pooling slides a window of size p*p over the picture with stride f, and at each position one pixel is selected to represent all the pixels of the window. After the pooling layer the size of each picture becomes lp*lp, given by:
lp*lp = ((l-p)/f+1)*((l-p)/f+1)
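And the corresponding check for the pooling size formula (the example numbers are illustrative):

```python
def pool_output_size(l, p, f):
    """p x p window sliding with stride f over an l x l picture."""
    assert (l - p) % f == 0, "window positions must tile the picture exactly"
    return (l - p) // f + 1

assert pool_output_size(92, 2, 2) == 46  # 92x92 picture, 2x2 window, stride 2 -> 46x46
```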
s44: after the pictures have been processed by convolution, pooling and the like in the network, the output layer restores the predicted picture. Assume that h feature maps are output by the network; when they reach the entrance of the output layer, assume each has size le*le (le < li), where li is the side length of the picture that entered the network. The input to the output layer is thus h maps of size le*le, and through the output layer this is restored to the dimension lo*lo = li*li. A hedged sketch of one possible restoration follows.
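One possible restoration is bilinear up-sampling of the h feature maps back to the original side length li; the patent does not specify the exact restoration mechanism, so interpolation here is an assumption:

```python
import torch.nn.functional as F

def restore(feature_maps, li):
    """(batch, h, le, le) feature maps -> (batch, h, li, li)."""
    return F.interpolate(feature_maps, size=(li, li), mode="bilinear",
                         align_corners=False)
```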
S45: calculating the difference between the network output and the standard label, making the network learn, and updating the network parameters; the Euclidean error is calculated as follows:
Error = (1/2)·||Ĵ − J||² = (1/2)·|(x̂ − x)² + (ŷ − y)² + (d̂ − d)²|
wherein Ĵ is the predicted label coordinates, composed of (ĵ1, ĵ2, ..., ĵn); J is the original label, composed of (j1, j2, ..., jn); n is the number of labels; and ji = (xi, yi, di).
Assuming that the network parameter of a neuron in the network is ω, the network parameter is updated according to the following formula:
ω' = ω + ∂Error/∂ω
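In numpy, the error of step S45 can be rendered as below. Note that the patent's update formula adds the raw gradient; a conventional gradient-descent step, also sketched here, subtracts a learning-rate-scaled gradient instead, and that sign and learning rate are assumptions:

```python
import numpy as np

def euclidean_error(pred, label):
    """pred, label: (n, 3) arrays of predicted / annotated (x, y, d) joints."""
    return 0.5 * np.sum((pred - label) ** 2)

def descent_step(w, grad, lr=0.01):
    """Conventional variant of the update; the patent writes w' = w + dError/dw."""
    return w - lr * grad
```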
s46: iterating steps S42-S45 repeatedly, continuously updating the parameters until they converge; the trained parameters are stored, finally yielding the trained convolutional neural network model. A minimal loop sketch follows.
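A minimal training-loop sketch for steps S42-S46, assuming a PyTorch model and a data loader of picture/label pairs; all names and hyper-parameters are illustrative:

```python
import torch

def train(model, loader, epochs=50, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                           # iterate S42-S45 until convergence
        for x, j in loader:                           # pictures and label coordinates
            loss = 0.5 * ((model(x) - j) ** 2).sum()  # Euclidean error of step S45
            opt.zero_grad()
            loss.backward()                           # let the network learn
            opt.step()                                # update the parameters
    torch.save(model.state_dict(), "gesture_net.pt")  # store the trained parameters
```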
S5: estimating the three-dimensional gesture pose in a single depth picture with the convolutional neural network model obtained by training in step S4, as sketched below.
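A minimal inference sketch for step S5, assuming a trained PyTorch-style model; the segmentation/projection helpers, tensor layout and flat (x, y, d) output are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np
import torch

def estimate_pose(model, depth_map, segment_fn, project_fn):
    hand = segment_fn(depth_map)        # random-forest hand segmentation (step S3)
    views = project_fn(hand)            # three projections, resized to a common H x W
    x = torch.from_numpy(np.stack(views)).float().unsqueeze(0)  # shape 1 x 3 x H x W
    with torch.no_grad():
        joints = model(x)               # flat vector of predicted (x, y, d) coordinates
    return joints.reshape(-1, 3).cpu().numpy()
```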
The preferred embodiment of the invention also discloses a neural network training method, which comprises the steps S1 to S4.
According to the three-dimensional gesture pose estimation method of the preferred embodiment of the invention, a depth camera collects a large number of pictures; a random forest classifier segments the gesture foreground; gesture joint point information is marked manually; a convolutional neural network is trained on the data set; and the trained convolutional network is stored and used directly for three-dimensional gesture pose estimation on a single depth map. The method applies deep learning together with a depth camera to gesture recognition, which reduces the influence of factors such as illumination changes and object occlusion.
A gesture pose image is acquired by the depth camera, the depth information of the gesture being presented as a single-channel gray-scale image whose pixel values represent the distance between the object and the camera; from the acquired gesture depth image, the gesture pose skeleton is restored in the form of joint points. Because the convolution and pooling layers of the convolutional neural network can learn features in regions of different scales in the picture, occlusion effects can be reduced, and the image-based method is not constrained by wearable equipment.
The three-dimensional gesture pose estimation method of the preferred embodiment overcomes the bottleneck of traditional gesture pose estimation: by detecting on human hand depth maps shot by a depth camera with a new deep-learning-based convolutional neural network, it accurately recognizes the specific positions and poses of the palm and fingers in a gesture.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the concept of the invention, and all of them shall be deemed to fall within the protection scope of the invention.
Claims (9)
1. A neural network training method is characterized by comprising the following steps:
s1: acquiring, by a depth camera, a data set comprising a plurality of gesture depth maps;
s2: training a random forest learner by using the data set of the step S1;
s3: segmenting the gesture depth maps in the data set of step S1 with the random forest learner to obtain gesture sub-maps, processing the gesture sub-maps to obtain processing maps, and shuffling the processing maps together with the gesture depth maps in the data set of step S1 before dividing them into a training set and a test set;
s4: using the training set and the test set obtained in step S3 to train a convolutional neural network, obtaining a trained network model.
2. The neural network training method of claim 1, wherein processing the gesture sub-maps in step S3 to obtain processing maps comprises:
s32: projecting the gesture sub-map along the three axes X, Y, Z to obtain three single-channel projection maps;
wherein the processing maps include the projection maps of step S32.
3. The neural network training method of claim 2, wherein processing the gesture sub-maps in step S3 to obtain processing maps further comprises:
s33: down-sampling each of the three projection maps to obtain a plurality of down-sampled maps of different sizes;
wherein the processing maps further include the down-sampled maps of step S33.
4. The neural network training method according to claim 1, wherein step S1 specifically includes:
s11: acquiring a plurality of gesture depth maps of different people by adopting a plurality of depth cameras;
s12: labeling each gesture depth map, and storing a plurality of gesture depth maps and corresponding labeling information in a data set.
5. The neural network training method of claim 4, wherein labeling each gesture depth map in step S12 specifically includes: labeling coordinate information (x, y, d) at predetermined positions of the fingers and palm in each gesture depth map, wherein x and y are the horizontal and vertical coordinates on the gesture depth map, and d is the pixel depth.
6. The neural network training method of claim 5, wherein the predetermined positions of the finger include all joint points of the finger.
7. The neural network training method according to claim 4, wherein the step S4 specifically includes:
s41: randomly selecting m pictures and corresponding label information from the training set, and randomly selecting n pictures and corresponding label information from the testing set;
s42: the pictures pass through the convolutional layers in the network;
s43: pictures are passed through a pooling layer in the network;
s44: the output layer restores the picture;
s45: calculating the error between the network output and the label information, learning the network, and updating the network parameters;
s46: iterating steps S42-S45 repeatedly, continuously updating the parameters until they converge; and storing the trained parameters to finally obtain the trained network model.
8. The neural network training method according to claim 7, wherein step S45 specifically comprises: calculating the error between the network output and the label information according to the following formula:
Error = (1/2)·||Ĵ − J||² = (1/2)·|(x̂ − x)² + (ŷ − y)² + (d̂ − d)²|
wherein Ĵ is the predicted label coordinates, composed of (ĵ1, ĵ2, ..., ĵn); J is the original label, composed of (j1, j2, ..., jn); n is the number of labels; and ji = (xi, yi, di);
assuming that the network parameter of a neuron in the network is ω, updating the network parameter according to the following formula:
ω' = ω + ∂Error/∂ω.
9. A three-dimensional gesture pose estimation method, characterized in that a network model obtained by training with the neural network training method according to any one of claims 1 to 8 is used to estimate the three-dimensional gesture pose in a single depth picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710954487.0A CN107622257A (en) | 2017-10-13 | 2017-10-13 | A kind of neural network training method and three-dimension gesture Attitude estimation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107622257A true CN107622257A (en) | 2018-01-23 |
Family
ID=61092146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710954487.0A Pending CN107622257A (en) | 2017-10-13 | 2017-10-13 | A kind of neural network training method and three-dimension gesture Attitude estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107622257A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005769A (en) * | 2015-07-08 | 2015-10-28 | 山东大学 | Deep information based sign language recognition method |
CN106326860A (en) * | 2016-08-23 | 2017-01-11 | 武汉闪图科技有限公司 | Gesture recognition method based on vision |
CN106815578A (en) * | 2017-01-23 | 2017-06-09 | 重庆邮电大学 | A kind of gesture identification method based on Depth Motion figure Scale invariant features transform |
CN107103613A (en) * | 2017-03-28 | 2017-08-29 | 深圳市未来媒体技术研究院 | A kind of three-dimension gesture Attitude estimation method |
Non-Patent Citations (1)
Title |
---|
He Fangzi, "Research on Gesture Recognition Based on Kinect Depth Information", China Master's Theses Full-text Database (Information Science and Technology) *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019196099A1 (en) * | 2018-04-09 | 2019-10-17 | 深圳大学 | Method for positioning boundaries of target object in medical image, storage medium, and terminal |
CN108960036A (en) * | 2018-04-27 | 2018-12-07 | 北京市商汤科技开发有限公司 | 3 D human body attitude prediction method, apparatus, medium and equipment |
CN108717524A (en) * | 2018-04-28 | 2018-10-30 | 天津大学 | It is a kind of based on double gesture recognition systems and method for taking the photograph mobile phone and artificial intelligence system |
CN108717524B (en) * | 2018-04-28 | 2022-05-06 | 天津大学 | Gesture recognition system based on double-camera mobile phone and artificial intelligence system |
CN108681402A (en) * | 2018-05-16 | 2018-10-19 | Oppo广东移动通信有限公司 | Identify exchange method, device, storage medium and terminal device |
WO2019218880A1 (en) * | 2018-05-16 | 2019-11-21 | Oppo广东移动通信有限公司 | Interaction recognition method and apparatus, storage medium, and terminal device |
CN110555335B (en) * | 2018-05-30 | 2022-12-27 | 深圳市掌网科技股份有限公司 | Gesture recognition artificial intelligence training system and method thereof |
CN110555335A (en) * | 2018-05-30 | 2019-12-10 | 深圳市掌网科技股份有限公司 | Gesture recognition artificial intelligence training system and method thereof |
CN109035327B (en) * | 2018-06-25 | 2021-10-29 | 北京大学 | Panoramic camera attitude estimation method based on deep learning |
CN109035327A (en) * | 2018-06-25 | 2018-12-18 | 北京大学 | Panorama camera Attitude estimation method based on deep learning |
CN108960178A (en) * | 2018-07-13 | 2018-12-07 | 清华大学 | A kind of manpower Attitude estimation method and system |
CN109215080A (en) * | 2018-09-25 | 2019-01-15 | 清华大学 | 6D Attitude estimation network training method and device based on deep learning Iterative matching |
US11200696B2 (en) | 2018-09-25 | 2021-12-14 | Tsinghua University | Method and apparatus for training 6D pose estimation network based on deep learning iterative matching |
CN111104820A (en) * | 2018-10-25 | 2020-05-05 | 中车株洲电力机车研究所有限公司 | Gesture recognition method based on deep learning |
CN109858380A (en) * | 2019-01-04 | 2019-06-07 | 广州大学 | Expansible gesture identification method, device, system, gesture identification terminal and medium |
CN110197156A (en) * | 2019-05-30 | 2019-09-03 | 清华大学 | Manpower movement and the shape similarity metric method and device of single image based on deep learning |
CN111428555A (en) * | 2020-01-17 | 2020-07-17 | 大连理工大学 | Joint-divided hand posture estimation method |
CN111428555B (en) * | 2020-01-17 | 2022-09-20 | 大连理工大学 | Joint-divided hand posture estimation method |
CN111325166A (en) * | 2020-02-26 | 2020-06-23 | 南京工业大学 | Sitting posture identification method based on projection reconstruction and multi-input multi-output neural network |
CN111325166B (en) * | 2020-02-26 | 2023-07-07 | 南京工业大学 | Sitting posture identification method based on projection reconstruction and MIMO neural network |
CN111368733A (en) * | 2020-03-04 | 2020-07-03 | 电子科技大学 | Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal |
TWI777153B (en) * | 2020-04-21 | 2022-09-11 | 和碩聯合科技股份有限公司 | Image recognition method and device thereof and ai model training method and device thereof |
CN112085161A (en) * | 2020-08-20 | 2020-12-15 | 清华大学 | Graph neural network method based on random information transmission |
CN112085161B (en) * | 2020-08-20 | 2022-12-13 | 清华大学 | Graph neural network method based on random information transmission |
CN112836597A (en) * | 2021-01-15 | 2021-05-25 | 西北大学 | Multi-hand posture key point estimation method based on cascade parallel convolution neural network |
CN112836597B (en) * | 2021-01-15 | 2023-10-17 | 西北大学 | Multi-hand gesture key point estimation method based on cascade parallel convolution neural network |
CN113408443A (en) * | 2021-06-24 | 2021-09-17 | 齐鲁工业大学 | Gesture posture prediction method and system based on multi-view images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180123 |