Background
Human body posture recognition can be applied in fields such as human activity analysis, human-computer interaction and visual surveillance, and is currently a popular problem in computer vision. Human body posture recognition means detecting the positions of the parts of a human body in an image and calculating the orientation and scale of those parts. The recognition result may be two-dimensional or three-dimensional, and estimation methods are either model-based or model-free.
Chinese patent application publication No. CN101350064A discloses a two-dimensional human body posture estimation method and device. The method first detects a human body area in a two-dimensional image and determines the search ranges of the human body parts in that image. Then, within the search range of each human body part, the matching similarity between the part (trunk, head, hands, legs and feet) and its template is calculated so that each part is recognized, and the constraint relations between adjacent parts are combined to obtain the two-dimensional human body posture. The implementation steps are as follows:
Step 1: the human body area in the two-dimensional image is detected using existing methods such as the optical flow method, the inter-frame difference method and the background difference method.
Step 2: the search ranges of a plurality of human body parts within the human body area are determined.
(1) Face detection is performed in the human body area, and the position of the detected face is taken as the search range of the head;
(2) the search ranges of the left hand and the right hand are determined using the skin color characteristics of the detected face, and the search ranges of the trunk, the left arm and the right arm are then determined;
(3) the remaining portions of the human body area are determined as the search ranges of the left leg, the left foot, the right leg and the right foot.
Step 3: for each human body part, the matching similarity is calculated within the corresponding search range according to the template of that part, the optimal position of each part is determined, and the two-dimensional human body posture is obtained by combining the constraint relations between adjacent human body parts.
The above human body posture estimation method has the following disadvantages:
First, the existing methods used to detect the human body area in the two-dimensional image, such as the optical flow method, the inter-frame difference method and the background difference method, suffer from problems such as illumination changes, dynamic background changes and the slow multi-scale computation of optical flow. The detected human body area therefore contains large errors, which undermine the subsequent human body part detection and can cause the whole algorithm to fail;
Second, locating the head area by face detection fails when the face is partially or completely occluded; in addition, face detection algorithms usually achieve high accuracy only on frontal faces and perform poorly on profile faces;
Third, recognizing and locating human body parts by template matching has low accuracy. Factors such as changes in the size of body parts in the video image and differences in clothing degrade the matching, so that body parts are located incorrectly and the whole algorithm fails.
Disclosure of Invention
The invention aims to provide a method for recognizing human body posture in a two-dimensional video image with high recognition accuracy and high recognition speed.
To achieve this object, the invention adopts the following technical solution:
A method for recognizing human body posture in a two-dimensional video image, the method comprising the following steps:
a. dividing the original video image into a number of groups according to the scale space layering principle, the number of groups being determined from the resolution of the original video image;
b. for each group of video images, calculating a sampled image at one scale, that scale being one of the scales of the group; the sampled image is obtained by applying a sampling function to the group of video images, and each group contains a set number of sampled video images, that number being a natural number greater than 1;
c. calculating the HOG bottom layer feature descriptor of the sampled image in each group;
d. based on the HOG bottom layer feature descriptor of the sampled image in each group obtained in step c, calculating, according to a prediction formula, the HOG bottom layer feature descriptors corresponding to the sampled video images at each of the remaining scales within each group; the prediction formula uses the sizes of the two sampled images concerned and a set value;
e. detecting a human body target area in the original video image according to the HOG bottom layer feature descriptors of the sampled video images at all the different scales obtained in step c and step d, in combination with a trained SVM;
f. classifying the pixels of the human body target area detected in step e using a trained random forest classifier, and determining the limb part areas within the human body target area;
g. connecting the limb parts determined in step f to form a human body outline, thereby realizing human body posture recognition.
Preferably, in step b, each group of video images is sampled at the end scale of the scales of that group, and the sampled image corresponding to that end scale is calculated.
In the above method for recognizing human body posture in a two-dimensional video image, the random forest classifier in step f is preferably trained as follows:
acquiring artificially synthesized video images containing human body postures and real video images from the target test scene, each video image serving as a training sample;
labeling the background area and the human body target area in each training sample according to the set limb parts;
calculating the pixel features of each labeled area using a SURF operator, all the labeled areas and their pixel feature data forming a training data set;
training the random forest classifier using the training data set and an objective function;
wherein the objective function is defined for a classification node of a decision tree in the random forest and is built from the following quantities: a weight value; an information entropy calculation function; the pixel features of the labeled areas in the artificially synthesized video image training samples; the pixel features of the labeled areas in the real video image training samples; a statistical descriptor of the pixel features of each labeled limb part in the artificially synthesized video image training samples; a statistical descriptor of all pixel features in all labeled areas of the artificially synthesized video image training samples; a statistical descriptor of all pixel features in all labeled areas of the real video image training samples; and a distance between statistical descriptors.
Compared with the prior art, the invention has the advantages and positive effects that:
(1) When the HOG multi-scale bottom layer feature extraction method is used to detect the human body target in the original video image, the HOG bottom layer feature descriptor needs to be calculated for only one sampled image in each group of sampled images; the bottom layer feature descriptors of the other sampled images are obtained by feature prediction. This speeds up the calculation of multi-scale bottom layer features without reducing detection accuracy, and fundamentally solves the problems of large calculation amount and insufficient real-time performance faced by multi-scale human body target detection in practical applications.
(2) A random forest classifier is used to classify and recognize the limb parts of the human body, and a new objective function is used to train the decision tree nodes of the classifier, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. Training can therefore be completed mainly with human posture video image samples artificially synthesized by computer graphics, combined with a small amount of labeled real human posture video. This achieves generalization from artificially synthesized human posture samples to real human posture features and reduces the requirements on the training samples.
Other features and advantages of the present invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and examples.
First, the general processing idea by which the invention realizes human body posture recognition is briefly explained:
Recognizing the human body posture from a two-dimensional video image is divided into two steps. The first step detects a human body target area in the original video image. The second step classifies and recognizes the human body target area: the human limb parts, such as the head, hands, elbows, shoulders, hips, knees and feet, are recognized and connected to form a human body outline, whereby the human body posture is recognized. In the invention, the first step uses an HOG multi-scale bottom layer feature extraction method to detect the human body target area, which reduces the influence of background, illumination and the like and maintains scale invariance; the bottom layer feature extraction method is also improved to increase real-time performance. The second step uses a random forest classifier to recognize the limb parts, which improves classification accuracy; the objective function of the random forest is also improved, which increases the generalization capability of the classifier and reduces the complexity of the training samples required for training. A more specific implementation is described below.
Fig. 1 is a flowchart of a method for recognizing human body posture in a two-dimensional video image according to an embodiment of the present invention.
As shown in fig. 1, human body posture recognition in this embodiment is implemented by the following steps:
Step 101: the original video image is divided into a plurality of groups of images according to the scale space layering principle.
The original video image is divided into a number of groups according to the scale space layering principle, the number of groups being determined from the resolution of the original video image. The principles and methods of scale space layering of video images are prior art and are not described in detail here.
Step 102: a sampled image at a specific scale is calculated in each group, and the HOG bottom layer feature descriptor of that sampled image is calculated.
Each group of video images is sampled, and a sampled image at a specific scale is calculated. The specific scale is one of the scales of the group; preferably, it is the end scale of the group. The sampled image is obtained by applying the sampling function to the group of video images. The number of sampled video images contained in each group is a set natural number greater than 1. In general its value is 5 to 8, meaning that each group of video images contains 5 to 8 layers of sampled video images.
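The grouping and sampling described above can be sketched as follows. This is only an illustration under stated assumptions: each group is assumed to span one halving of resolution, the layers of a group are assumed to be geometrically spaced, and the names `num_groups`, `group_scales` and the 64-pixel minimum side are hypothetical and not taken from the embodiment.

```python
def num_groups(width, height, min_side=64):
    """Count the groups obtained by repeatedly halving the shorter image side.

    Assumption: a group is kept while the shorter side is at least min_side.
    """
    side = min(width, height)
    groups = 0
    while side >= min_side:
        groups += 1
        side //= 2
    return groups


def group_scales(group_index, layers_per_group):
    """Return the scales of the layers in one group, largest first.

    Assumption: layer k of group i has scale 2 ** -(i + k / layers_per_group).
    """
    return [2.0 ** -(group_index + k / layers_per_group)
            for k in range(layers_per_group)]


if __name__ == "__main__":
    for g in range(num_groups(640, 480)):
        print(g, [round(s, 3) for s in group_scales(g, 5)])
```

Under these assumptions, only one layer per group (for example the last one) would be resampled from the image directly; the other layers serve as target scales for the prediction in step 103.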
Then, the HOG (Histogram of Oriented Gradients) bottom layer feature descriptor of the sampled image at the selected scale within each group is calculated. The HOG bottom layer feature descriptor may be calculated by methods known in the art, which are not described in detail here.
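For readers unfamiliar with HOG, the core of the descriptor is a histogram of gradient orientations accumulated over a small cell. The following is a simplified sketch of that one step, not the full descriptor: block normalization and vote interpolation are omitted, and the function name `cell_histogram`, the 8-pixel cell and the 9 unsigned orientation bins are illustrative choices.

```python
import math


def cell_histogram(image, top, left, cell=8, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    image is a list of rows of grayscale values; the cell must lie inside it.
    """
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            # Centered differences, clamped at the image border.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Fold the orientation into [0, 180) degrees (unsigned gradient).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


if __name__ == "__main__":
    ramp = [[x for x in range(8)] for _ in range(8)]  # brightness grows to the right
    print(cell_histogram(ramp, 0, 0))
```

A full descriptor would concatenate such histograms over all cells of a detection window after normalizing them in overlapping blocks.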
Step 103: the HOG bottom layer feature descriptors of the sampled video images at the other specific scales in each group are calculated by a prediction algorithm.
For each group of video images, the HOG bottom layer feature descriptor of one sampled image is calculated in step 102. Then, on the basis of that calculated descriptor, the HOG bottom layer feature descriptors of the sampled video images at the other specific scales are obtained by prediction.
Specifically, the other specific scales are the scales of the group other than the scale whose HOG bottom layer feature descriptor has already been calculated in step 102, that is, each of the remaining scales of the group. The HOG bottom layer feature descriptor of the sampled video image at each of these scales is predicted with a formula that uses the sizes of the two sampled images concerned, the HOG bottom layer feature descriptor of the sampled image already calculated, and a power exponent.
The power exponent is a set value that can be determined by fitting in empirical verification. In this embodiment, a preferred value of the power exponent is 0.0042.
Because the power exponent in the formula is a fixed value, and one of the scales and its corresponding HOG bottom layer feature descriptor have been calculated in step 102, the HOG bottom layer feature descriptor corresponding to any other specified scale can be conveniently calculated with the formula. By analogy, the HOG bottom layer feature descriptors corresponding to the remaining scales in the group can be calculated, and thus the HOG bottom layer feature descriptors of the sampled video images contained in all the groups can be calculated.
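The prediction step can be illustrated with a short sketch. The exact form of the formula is an assumption here: the sketch resamples a feature channel to the target size and multiplies it by the size ratio raised to the negative of the power exponent, a power-law form commonly used for approximating features across nearby scales. The function name `predict_channel`, the nearest-neighbor resampling and the sign of the exponent are all hypothetical; only the default value 0.0042 comes from the embodiment.

```python
def predict_channel(channel, source_size, target_size, exponent=0.0042):
    """Predict a feature channel at target_size from one computed at source_size.

    channel is a list of rows of feature values. Assumed form:
    predicted = resample(channel) * (target_size / source_size) ** (-exponent).
    """
    ratio = target_size / source_size
    src_h, src_w = len(channel), len(channel[0])
    dst_h = max(1, round(src_h * ratio))
    dst_w = max(1, round(src_w * ratio))
    gain = ratio ** (-exponent)
    predicted = []
    for y in range(dst_h):
        sy = min(src_h - 1, int(y / ratio))  # nearest source row
        row = []
        for x in range(dst_w):
            sx = min(src_w - 1, int(x / ratio))  # nearest source column
            row.append(channel[sy][sx] * gain)
        predicted.append(row)
    return predicted


if __name__ == "__main__":
    computed = [[1.0] * 4 for _ in range(4)]
    print(predict_channel(computed, source_size=128, target_size=64))
```

The point of the sketch is the cost structure: predicting a layer touches only the already computed feature values, so no gradients are recomputed and the resampled image itself is never formed.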
Step 104: a human body target area in the video image is detected according to the HOG bottom layer feature descriptors of all the sampled video images at the different scales, in combination with a trained SVM.
Using the HOG bottom layer feature descriptors of the sampled video images contained in all the groups, calculated in step 102 and step 103, human body target areas at different scales can be detected. The specific method of detecting the human body target area with the HOG bottom layer feature descriptors and the trained SVM can be implemented with the prior art and is not described in detail here.
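As an illustration of how a trained SVM is typically applied to HOG descriptors, the sketch below scores candidate windows with a linear decision function and keeps those above a threshold. The use of a linear kernel is an assumption, since the embodiment only refers to a trained SVM, and the names `svm_score` and `detect_regions` as well as the toy weights are hypothetical.

```python
def svm_score(descriptor, weights, bias):
    """Linear SVM decision value: dot(weights, descriptor) + bias."""
    return sum(w * f for w, f in zip(weights, descriptor)) + bias


def detect_regions(candidates, weights, bias, threshold=0.0):
    """Keep candidate windows whose score exceeds the threshold.

    candidates is a list of (box, descriptor) pairs, where box is any
    description of the window such as (x, y, width, height).
    Returns (box, score) pairs sorted by descending score.
    """
    hits = []
    for box, descriptor in candidates:
        score = svm_score(descriptor, weights, bias)
        if score > threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    windows = [
        ((0, 0, 64, 128), [2.0, 0.0]),
        ((8, 0, 64, 128), [0.0, 2.0]),
        ((16, 0, 64, 128), [1.0, 0.0]),
    ]
    print(detect_regions(windows, weights=[1.0, -1.0], bias=-0.5))
```

In a multi-scale detector the candidate windows would be drawn from the descriptors of every layer of every group, and overlapping detections would usually be merged afterwards.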
Step 105: the pixels of the human body target area are classified with a random forest classifier, and the limb part areas are determined.
After the human body target area is determined in step 104, its pixels are classified by a trained random forest classifier to determine the limb part areas. The input of the random forest classifier is the pixel features. The parameters of the classifier are selected first; they include the number of decision trees in the forest, the number of attributes randomly selected at each internal node, and the minimum number of samples at a final node. The pixel features of the human body target area are then input to the classifier, which outputs the limb part area to which each pixel belongs, so that the limb part areas are determined. In this embodiment, the Speeded-Up Robust Features (SURF) operator is used to calculate the pixel features, and each pixel feature may be constructed as a 128-dimensional descriptor. The limb part areas comprise seven joint parts of the human body: foot, knee, hip, shoulder, elbow, hand and head.
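How a trained forest assigns a pixel to a limb part can be sketched as a majority vote over decision trees. The tree representation below (nested dictionaries with `index`, `threshold`, `left`, `right` and `label` keys) and the function names are hypothetical, and the toy features are two-dimensional rather than the 128-dimensional descriptors of the embodiment.

```python
from collections import Counter

# Part labels named in the text, plus the background label used in training.
PART_LABELS = ("background", "foot", "knee", "hip",
               "shoulder", "elbow", "hand", "head")


def classify_with_tree(tree, feature):
    """Walk one decision tree from the root to a leaf and return its label."""
    node = tree
    while "label" not in node:
        go_left = feature[node["index"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def classify_pixel(forest, feature):
    """Return the label that receives the most votes across the forest."""
    votes = Counter(classify_with_tree(tree, feature) for tree in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    stump = {"index": 0, "threshold": 0.5,
             "left": {"label": "foot"}, "right": {"label": "head"}}
    print(classify_pixel([stump, stump, stump], [0.9, 0.1]))
```

Applying `classify_pixel` to every pixel of the detected human body target area yields a label map from which the limb part areas can be read off.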
Step 106: the limb parts are connected to form a human body outline, thereby realizing human body posture recognition.
After the limb parts are determined in step 105, they are connected into a trunk in the order head, shoulder, hip, knee, foot, and the elbows and hands are connected on the two sides. The human body outline can thus be identified, and human body posture recognition based on a human body joint model is realized.
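The connection step can be sketched as follows, assuming that each limb part is reduced to the centroid of its pixels and that left and right sides are not distinguished. These simplifications, the function names and the arm chain through the shoulder are illustrative assumptions; only the head, shoulder, hip, knee, foot order comes from the text.

```python
TRUNK_CHAIN = ("head", "shoulder", "hip", "knee", "foot")
ARM_CHAIN = ("shoulder", "elbow", "hand")  # assumed attachment at the shoulder


def region_centroid(pixels):
    """Mean (x, y) position of a non-empty list of pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def build_skeleton(part_pixels):
    """Turn labeled pixel sets into joints and the edges that connect them.

    part_pixels maps a part name to a list of (x, y) pixels. Parts with no
    pixels are skipped, and an edge is added only if both its joints exist.
    """
    joints = {part: region_centroid(pixels)
              for part, pixels in part_pixels.items() if pixels}
    edges = []
    for chain in (TRUNK_CHAIN, ARM_CHAIN):
        for a, b in zip(chain, chain[1:]):
            if a in joints and b in joints:
                edges.append((a, b))
    return joints, edges


if __name__ == "__main__":
    parts = {"head": [(10, 0), (12, 2)], "shoulder": [(11, 10)],
             "hip": [(11, 30)], "elbow": [(20, 15)], "hand": []}
    print(build_skeleton(parts))
```

Drawing the returned edges between the joint positions gives a stick-figure outline of the recognized posture.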
In this embodiment, although the human body target area is detected using HOG bottom layer feature descriptors, the original video image is first divided into groups, and the number of sampled video images contained in each group, that is, the number of layers of the group, is determined. Within each group, the bottom layer feature calculation function is used to calculate the HOG bottom layer feature descriptor of only one sampled image. The HOG bottom layer feature descriptors of the sampled images at the other scales in the group are calculated with the prediction algorithm of step 103, whose computational complexity and amount of calculation are much smaller than those of the bottom layer feature calculation function. Moreover, with the prediction algorithm the HOG bottom layer feature descriptor of a sampled video image is obtained directly, without calculating the sampled video image corresponding to each scale, which further reduces the amount of calculation. The speed and real-time performance of HOG-based human body target detection are thereby improved, and the problems of large calculation amount and insufficient real-time performance faced by multi-scale human body target detection in practical applications are fundamentally solved.
In machine learning, a random forest is a classifier that contains multiple decision trees. The main reason for using it in posture recognition is its high classification accuracy. Four further factors also apply: first, the learning process is very fast; second, the complexity of the algorithm can be controlled adaptively through the depth of the internal decision trees; third, when the forest is built, the method internally produces an unbiased estimate of the generalization error; fourth, the method tolerates outliers and noise well and is not prone to overfitting. Its main disadvantage is that the training data are required to be similar to the test data, that is, both must have the same distribution, which limits the generalization capability of the classifier. Therefore, to obtain a highly accurate random forest classifier, the training samples must cover all the variations that may occur in future test data. In an actual test scene, however, sufficient training samples cannot be obtained because of factors such as changes in viewing angle, twisting of limbs, changes in clothing texture and changes in illumination.
In view of the above disadvantage of the random forest classifier, the above embodiment of the present invention improves the objective function used to train the decision tree nodes of the random forest classifier, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. Consequently, only some weakly labeled samples from the target test space are needed when selecting training samples, and the rest of the training data can be supplied by human posture video image samples artificially synthesized by computer graphics, which reduces the requirements on the training samples. The specific training process is as follows:
Artificially synthesized video images containing human body postures and real video images from the target test scene are acquired, each video image serving as a training sample. The artificially synthesized video images form the main body of the training set and are combined with a small number of real video images from the target test scene in which the limb parts and the background have been labeled.
The background area and the human body target area in each training sample are labeled according to the set limb parts. Specifically, each training sample is labeled with eight classes according to the human body joint parts: one class is the background, and the other seven are foot, knee, hip, shoulder, elbow, hand and head.
Each pixel feature in each labeled area is calculated using a SURF operator, and all the labeled areas and the corresponding pixel feature data form a training data set. Specifically, the SURF operator is used to calculate each pixel feature in each labeled area of the artificially synthesized video image training samples and of the real video image training samples, and each pixel feature is constructed as a 128-dimensional descriptor. The pixel features of the labeled areas in the artificially synthesized video image training samples and the pixel features of the labeled areas in the real video image training samples together constitute the training data set. The objective function described below is evaluated at a classification node of a decision tree in the random forest. At the same time, two statistical descriptors are calculated: one over all the 128-dimensional SURF descriptors in all labeled areas of the artificially synthesized video image training samples, and one over all the 128-dimensional SURF descriptors in all labeled areas of the real video image training samples.
Finally, the random forest classifier is trained using the training data set and the improved objective function. In the improved objective function, the weight value is a fixed value determined by experiment, preferably the value at which the classifier gives the best recognition effect. The information entropy calculation function takes a form known in the prior art. The function also uses a statistical descriptor of all pixel features within each labeled limb part of the artificially synthesized video image training samples, and a distance between statistical descriptors.
The objective function takes into account the entropy of the training samples and combines it with the degree of information difference between the training data and the target test data; the weighted sum of the two is used as the objective function for training the decision trees, which improves the generalization capability of the trained classifier. When the trained classifier is used to recognize human limb parts, higher recognition accuracy can be obtained.
In the above objective function, a particular distance is used to represent the information difference between the training data and the target test data. The invention is not limited to this distance; the Euclidean distance or other distances may also be used to represent the difference between the training data and the target test data.
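The structure of such an objective, a weighted sum of an entropy term and a distance term, can be sketched as follows. The sketch makes several assumptions that are not taken from the embodiment: the statistical descriptor is taken to be the mean feature vector, the distance is the Euclidean alternative mentioned above, the distance is measured between the synthesized and the real samples as a whole, and the names `entropy`, `mean_descriptor` and `node_objective` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the label distribution at a node."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_descriptor(features):
    """Assumed statistical descriptor: the per-dimension mean of the features."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def node_objective(labels, synthetic_features, real_features, weight):
    """Entropy of the samples plus weight times the synthetic-to-real gap."""
    gap = euclidean(mean_descriptor(synthetic_features),
                    mean_descriptor(real_features))
    return entropy(labels) + weight * gap


if __name__ == "__main__":
    labels = ["head", "head", "foot", "foot"]
    synthetic = [[0.0, 0.0], [2.0, 0.0]]
    real = [[1.0, 3.0], [1.0, 5.0]]
    print(node_objective(labels, synthetic, real, weight=0.5))
```

A split that lowers this value is one that both separates the limb part labels well and keeps the synthesized and real samples statistically close, which is the intuition behind training mainly on synthesized data.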
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.