CN104392223A - Method for recognizing human postures in two-dimensional video images

Info

Publication number: CN104392223A
Application number: CN201410734845.3A
Authority: CN
Inventors: 王传旭; 刘云; 闫春娟; 崔雪红; 李辉
Original assignee: Qingdao University of Science and Technology
Current assignee: Haier Robotics Qingdao Co ltd
Priority date: 2014-12-05
Filing date: 2014-12-05
Publication date: 2015-03-04
Anticipated expiration: 2034-12-05
Also published as: CN104392223B

Abstract

The invention discloses a method for recognizing human body posture in two-dimensional video images. The method comprises the following steps: dividing the original video image into groups by scale; for each group, computing a sampled image at one specified scale and computing the HOG of that sampled image; using the HOG of that one sampled image to predict the HOG of the sampled images at the other specified scales in the group; detecting human body target regions at different scales in the original video image from the resulting multi-scale HOG combined with a trained SVM classifier; classifying the pixels of the detected human body target region with a trained random forest classifier to determine the body part regions within it; and connecting the body parts to form a human body contour, thereby recognizing the human body posture. The method accelerates the computation of multi-scale low-level features without reducing detection accuracy, and improves both the speed and the accuracy of posture recognition.

Description

Human Pose Recognition Method in 2D Video Image

Technical Field

The invention belongs to the technical field of image processing and, in particular, relates to a method for recognizing human body posture in two-dimensional video images.

Background Art

Human body posture recognition can be applied to human activity analysis, human-computer interaction, visual surveillance and similar fields, and has recently become an active topic in computer vision. Human body posture recognition means detecting the position of each part of the human body in an image and computing its orientation and scale. The result of posture recognition may be two-dimensional or three-dimensional, and the estimation methods are either model-based or model-free.

Chinese patent application publication No. CN101350064A discloses a method and device for estimating two-dimensional human body posture. The method first detects the human body region in a two-dimensional image and determines the search range of each human body part in the image. Then, within the search range of each part, it computes the matching similarity against templates of the torso, head, hands, legs and feet to identify each part, and combines the constraints between adjacent parts to obtain the two-dimensional human body posture. The implementation steps are as follows:

Step 1: Detect the human body region in the two-dimensional image using existing methods such as optical flow, inter-frame differencing or background subtraction.

Step 2: Determine the search ranges of multiple human body parts within the human body region.

(1) Perform face detection in the human body region, and take the position of the detected face as the search range of the head;

(2) Use the skin color features of the detected face to determine the search ranges of the left and right hands, and from these determine the search ranges of the torso, left arm and right arm;

(3) Take the remaining part of the human body region as the search range of the left leg, left foot, right leg and right foot.

Step 3: Compute the matching similarity of each body part template within the corresponding search range, determine the optimal position of each body part, and combine the constraints between adjacent body parts to obtain the two-dimensional human body posture.

The above method for estimating human body posture has the following disadvantages:

First, detecting the human body region in a two-dimensional image with existing methods such as optical flow, inter-frame differencing or background subtraction suffers from illumination changes, dynamic background changes and the slow speed of multi-scale optical flow computation. These often cause large errors in the detected human body region, which undermine the subsequent body part detection and can make the overall algorithm fail;

Second, locating the head region by face detection fails when the face is partially or completely occluded. Moreover, face detection algorithms usually achieve high accuracy only on frontal faces and perform poorly on profile faces;

Third, recognizing and locating body parts by template matching has low accuracy. Body parts in video images vary in scale and clothing, among other factors, which degrades the accuracy of the matching algorithm, leads to wrongly located body parts and makes the whole algorithm fail.

Summary of the Invention

The purpose of the present invention is to provide a method for recognizing human body posture in two-dimensional video images with high recognition accuracy and high recognition speed.

To achieve this purpose, the present invention adopts the following technical solution:

A method for recognizing human body posture in two-dimensional video images, the method comprising the following steps:

a. Divide the original video image into groups according to the scale-space layering principle, the number of groups being determined from the resolution of the original video image;

b. For each group of video images, compute one sampled image with a sampling function, at a scale that is one of the scales belonging to that group; each group contains a set number of sampled video images, this number being a natural number greater than 1;

c. Compute the HOG low-level feature descriptor of the sampled image in each group;

d. Taking the HOG low-level feature descriptor of the one sampled image in each group obtained in step c as a basis, compute, according to a prediction formula, the HOG low-level feature descriptors of the sampled video images at the remaining scales in each group; the prediction formula takes as inputs the scales of the two sampled images concerned and a power exponent that is a set value;

e. Using the HOG low-level feature descriptors of the sampled video images at all scales obtained in steps c and d, combined with a trained SVM, detect the human body target region in the original video image;

f. Classify the pixels of the human body target region detected in step e with a trained random forest classifier, and determine the body part regions within the human body target region;

g. Connect the body parts determined in step f to form a human body contour, thereby recognizing the human body posture.

Preferably, in step b, each group of video images is sampled at the end scale of the group's set of scales, and the sampled image corresponding to that end scale is computed.

In the method described above, the random forest classifier in step f is preferably trained as follows:

Obtain synthetic video images containing human body postures and real video images from the target test scene, each video image serving as one training sample;

Label the background region and the human body target region in each training sample according to the set body parts;

Compute the pixel features of each labeled region with the SURF operator; all labeled regions and their pixel feature data constitute the training data set;

Train the random forest classifier using the training data set and an objective function;

Here, the objective function is evaluated at a classification node of a decision tree in the random forest. Its terms are: a weight; an information entropy function; the pixel features of the labeled regions in the synthetic training samples; the pixel features of the labeled regions in the real training samples; the statistical descriptor of the pixel features of a given labeled body part in the synthetic training samples; the statistical descriptor of all pixel features in all labeled regions of the synthetic training samples; the statistical descriptor of all pixel features in all labeled regions of the real training samples; and a distance between two of these statistical descriptors.

Compared with the prior art, the advantages and positive effects of the present invention are:

(1) When multi-scale HOG low-level feature extraction is used to detect human targets in the original video image, the HOG low-level feature descriptor needs to be computed for only one sampled image in each group. The descriptors of the remaining sampled images are obtained by feature prediction. This accelerates the computation of multi-scale low-level features without reducing detection accuracy, and addresses at its root the heavy computation and insufficient real-time performance that have kept multi-scale human target detection from practical use.

(2) A random forest classifier is used to classify and identify the body parts. During training, a new objective function is used to train the decision tree nodes, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. As a result, the classifier can be trained mainly on human posture video image samples synthesized by computer graphics, combined with a small number of labeled real human posture videos. This achieves generalization from synthetic posture samples to real posture features and lowers the requirements on training samples.

Other features and advantages of the present invention will become clearer after reading the detailed description in conjunction with the accompanying drawing.

Brief Description of the Drawings

Fig. 1 is a flow chart of one embodiment of the method of the present invention for recognizing human body posture in two-dimensional video images.

Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing and an embodiment.

First, the general approach of the present invention to human body posture recognition is briefly explained:

Recognizing human body posture from two-dimensional video images takes two stages. The first detects the human body target region in the original video image. The second classifies the human body target region to identify the body parts, namely joint regions such as the head, hands, elbows, shoulders, hips, knees and feet, and connects them to form a human body contour, thereby recognizing the posture. In the present invention, the first stage uses multi-scale HOG low-level feature extraction to reduce the influence of background, illumination and the like and to maintain scale invariance, and improves the feature extraction method to increase real-time performance. The second stage uses random forest classification trees to identify the body parts and improve classification accuracy, and improves the objective function of the random forest to increase the generalization ability of the classifier and reduce the complexity of the training samples required. The implementation is described in more detail below.

Fig. 1 shows a flow chart of one embodiment of the method of the present invention for recognizing human body posture in two-dimensional video images.

As shown in Fig. 1, this embodiment recognizes human body posture through the following steps:

Step 101: Divide the original video image into multiple groups of images according to the scale-space layering principle.

The original video image is divided into groups according to the scale-space layering principle, the number of groups being determined from the resolution of the original video image. The principle and method of layering video images by scale space are prior art and are not elaborated here.

Step 102: In each group, compute a sampled image at one specific scale, and compute the HOG low-level feature descriptor of that sampled image.

Each group of video images is sampled with a sampling function to compute one sampled image at a specific scale, namely one of the scales belonging to that group. Preferably, this is the end scale of the group. Each group contains a set number of sampled video images, this number being a natural number greater than 1. It generally takes a value of 5 to 8, meaning that each group of video images contains 5 to 8 layers of sampled video images.
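The grouping and per-group scale selection can be sketched as follows. This is a minimal illustration only: the source does not reproduce the formulas for the number of groups or for the scales, so the octave-style layout `2 ** -(o + k / layers_per_group)`, the `min_side` stopping rule and all function names are assumptions made for this example.

```python
import math

def build_scale_groups(width, height, layers_per_group=8, min_side=64):
    """Return a list of groups, each a list of scales (assumed octave layout).

    Assumption: group o covers scales 2^-(o + k/m) for k = 0..m-1, and
    grouping stops once the shorter image side would fall below min_side.
    """
    n_groups = int(math.floor(math.log2(min(width, height) / min_side))) + 1
    groups = []
    for o in range(n_groups):
        scales = [2.0 ** -(o + k / layers_per_group)
                  for k in range(layers_per_group)]
        groups.append(scales)
    return groups

def end_scales(groups):
    """Pick the end scale of each group: the only scale that is actually
    sampled and for which HOG is computed directly."""
    return [group[0] for group in groups]

groups = build_scale_groups(640, 480, layers_per_group=8, min_side=64)
print(len(groups), end_scales(groups))
```

With 5 to 8 layers per group, only one layer in each group requires image resampling and a full HOG computation; the remaining layers are left to the prediction step.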

Then, compute the HOG (Histogram of Oriented Gradients) low-level feature descriptor of the sampled image at the selected scale in each group. The HOG low-level feature descriptor can be computed with prior-art methods, which are not described in detail here.

Step 103: Compute, by a prediction algorithm, the HOG low-level feature descriptors of the sampled video images at the other specific scales in each group.

For each group of video images, step 102 yields the HOG low-level feature descriptor of one sampled image. Based on that descriptor, the HOG low-level feature descriptors of the sampled video images at the other specific scales are then predicted.

Specifically, the other specific scales are the scales of the group other than the one whose HOG low-level feature descriptor was already computed in step 102. Their descriptors are predicted with a formula (not reproduced in this text) that relates the HOG low-level feature descriptors of two sampled images through the scales of those two images and a power exponent.

The power exponent is a set value, which can be determined by fitting through empirical validation. In this embodiment, the preferred value of the exponent is 0.0042.

In this formula, the power exponent is fixed, and one scale and its HOG low-level feature descriptor are known from step 102. For any other specified scale, the corresponding HOG low-level feature descriptor can therefore be computed directly from the formula. Proceeding in the same way gives the descriptors for all remaining scales in the group, and hence the HOG low-level feature descriptors of all sampled video images contained in all groups.
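The prediction step can be illustrated with a short sketch. The source formula is not reproduced in this text, so the form used here, descriptor at the target scale = resampled reference descriptor × (s_target / s_ref) ** (-lam), is an assumption borrowed from the power-law scaling commonly used in fast feature pyramids. Only the exponent value 0.0042 comes from the source. The function name, the nearest-neighbour resampling of the cell grid and the grid layout are hypothetical.

```python
def predict_hog(hog_ref, s_ref, s_target, lam=0.0042):
    """Predict a HOG cell grid at s_target from the grid computed at s_ref.

    hog_ref is a rows x cols grid of cell histograms (lists of floats).
    Assumed power law: f(s_target) ~= resample(f(s_ref)) * (s_target/s_ref)**(-lam).
    """
    ratio = s_target / s_ref
    gain = ratio ** (-lam)
    rows, cols = len(hog_ref), len(hog_ref[0])
    new_rows = max(1, round(rows * ratio))
    new_cols = max(1, round(cols * ratio))
    predicted = []
    for r in range(new_rows):
        src_r = min(rows - 1, int(r / ratio))  # nearest source cell row
        row = []
        for c in range(new_cols):
            src_c = min(cols - 1, int(c / ratio))  # nearest source cell column
            row.append([v * gain for v in hog_ref[src_r][src_c]])
        predicted.append(row)
    return predicted

# A 4x4 grid of 2-bin cell histograms computed at scale 1.0.
reference = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]
half_scale = predict_hog(reference, s_ref=1.0, s_target=0.5)
print(len(half_scale), len(half_scale[0]), half_scale[0][0])
```

The sketch never touches image pixels at the target scale; it works on the reference descriptor only, which is where the saving in computation comes from.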

Step 104: Using the HOG low-level feature descriptors of the sampled video images at all scales, combined with a trained SVM, detect the human body target region in the video image.

With the HOG low-level feature descriptors of all sampled video images in all groups, as computed in steps 102 and 103, human body target regions at different scales can be detected. Detecting the human body target region from HOG low-level feature descriptors and a trained SVM can be done with prior-art methods, which are not described in detail here.
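The source leaves the detection step to the prior art. One common prior-art arrangement, sketched here under assumptions, slides a fixed-size window over each HOG cell grid and scores it with a linear SVM (`w . x + b`). The function names, the 8-pixel cell size and the dictionary keyed by scale are hypothetical and are not taken from the source.

```python
def detect_windows(hog, weights, bias, win_rows, win_cols, threshold=0.0):
    """Score every win_rows x win_cols window of a HOG cell grid with a
    linear SVM and return (row, col, score) for windows above threshold."""
    rows, cols = len(hog), len(hog[0])
    hits = []
    for r in range(rows - win_rows + 1):
        for c in range(cols - win_cols + 1):
            # Concatenate the cell histograms of the window into one vector.
            feat = [v
                    for dr in range(win_rows)
                    for dc in range(win_cols)
                    for v in hog[r + dr][c + dc]]
            score = sum(w * x for w, x in zip(weights, feat)) + bias
            if score > threshold:
                hits.append((r, c, score))
    return hits

def detect_multiscale(hog_by_scale, weights, bias, win_rows, win_cols, cell_px=8):
    """Run the window classifier on each scale and map hits back to
    (x, y, width, height, score) boxes in original-image pixels."""
    boxes = []
    for scale, hog in hog_by_scale.items():
        for r, c, score in detect_windows(hog, weights, bias, win_rows, win_cols):
            boxes.append((c * cell_px / scale, r * cell_px / scale,
                          win_cols * cell_px / scale, win_rows * cell_px / scale,
                          score))
    return boxes

toy_hog = [[[0.0], [0.0], [0.0]],
           [[0.0], [1.0], [1.0]],
           [[0.0], [1.0], [1.0]]]
print(detect_multiscale({0.5: toy_hog}, [1.0, 1.0, 1.0, 1.0], -3.5, 2, 2))
```

Because the window size is fixed in cells, running the same classifier over grids from several scales is what lets it find people of different sizes in the original image.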

Step 105: Classify the pixels of the human body target region with the random forest classifier to determine the body part regions.

After step 104 has determined the human body target region, the trained random forest classifier classifies the pixels of that region to determine the body part regions. The input of the random forest classifier is the pixel features. The classifier parameters are selected, including the number of decision trees in the forest, the number of attributes randomly selected at each internal node and the minimum number of samples at a leaf node. The pixel features of the human body target region are then fed to the classifier, which outputs the body part region to which each pixel belongs, thereby determining the body part regions. In this embodiment, the SURF (Speeded-Up Robust Features) operator is used to compute the pixel features, and each pixel feature is built as a 128-dimensional descriptor. The body part regions cover seven joint parts of the human body: feet, knees, hips, shoulders, elbows, hands and head.
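Per-pixel classification with an already trained forest can be sketched as a majority vote over decision trees. This is an illustration under assumptions: the tuple encoding of tree nodes, the function names and the toy two-dimensional features (in place of 128-dimensional SURF descriptors) are all hypothetical, and training is not shown.

```python
from collections import Counter

def tree_predict(node, x):
    """Walk one decision tree. An internal node is a tuple
    (feature_index, threshold, left, right); a leaf is a part label string."""
    while not isinstance(node, str):
        idx, thr, left, right = node
        node = left if x[idx] <= thr else right
    return node

def forest_predict(trees, x):
    """Majority vote of all trees for one pixel feature vector."""
    votes = Counter(tree_predict(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]

def label_region(trees, features):
    """features maps (row, col) -> descriptor; returns (row, col) -> part label."""
    return {pixel: forest_predict(trees, f) for pixel, f in features.items()}

toy_trees = [
    (0, 0.5, "head", "hand"),
    (1, 0.2, "head", "foot"),
    (0, 0.9, "head", "hand"),
]
toy_features = {(0, 0): [0.3, 0.5], (5, 2): [1.0, 0.1]}
print(label_region(toy_trees, toy_features))
```

The three parameters named in the text (number of trees, attributes tried per internal node, minimum samples per leaf) govern how such trees are grown during training; at inference time only the vote shown here is needed.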

Step 106: Connect the body parts to form a human body contour, thereby recognizing the human body posture.

After step 105 has determined the body parts, they are connected in the order head-shoulder-hip-knee-foot to form the trunk, and the elbows and hands are then attached on both sides. This marks out the human body contour and achieves human body posture recognition based on a human joint model.
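The connection order described above can be sketched as a fixed edge list applied to the centroids of the labeled pixels. This is a simplified illustration: using one centroid per part (so left and right limbs are not separated), the edge list constant and the function names are assumptions for this example, not details given in the source.

```python
# Connection order from the text: head-shoulder-hip-knee-foot forms the trunk,
# then elbow and hand are attached from the shoulder.
SKELETON_EDGES = [
    ("head", "shoulder"), ("shoulder", "hip"), ("hip", "knee"),
    ("knee", "foot"), ("shoulder", "elbow"), ("elbow", "hand"),
]

def part_centroids(pixel_labels):
    """pixel_labels maps (row, col) -> part label; returns part -> (row, col) centroid."""
    sums = {}
    for (r, c), part in pixel_labels.items():
        sr, sc, n = sums.get(part, (0.0, 0.0, 0))
        sums[part] = (sr + r, sc + c, n + 1)
    return {part: (sr / n, sc / n) for part, (sr, sc, n) in sums.items()}

def build_skeleton(pixel_labels):
    """Return line segments (part_a, point_a, part_b, point_b) for every
    skeleton edge whose two parts were both detected."""
    cents = part_centroids(pixel_labels)
    return [(a, cents[a], b, cents[b])
            for a, b in SKELETON_EDGES
            if a in cents and b in cents]

toy_labels = {(0, 0): "head", (0, 2): "head", (4, 1): "shoulder", (10, 1): "hip"}
for segment in build_skeleton(toy_labels):
    print(segment)
```

Edges whose endpoints are missing are skipped, so a partially occluded person still yields a partial contour rather than an error.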

In this embodiment, although HOG low-level feature descriptors are used to detect the human body target region, the original video image is only grouped, and the number of sampled video images per group (the number of layers per group) is fixed. Within each group, the low-level feature computation function is used for only one sampled image. The HOG low-level feature descriptors of the sampled images at the other scales in the group are obtained with the prediction algorithm of step 103, whose complexity and computation are far lower than those of the feature computation function. Moreover, with the prediction algorithm there is no need to compute the sampled video image for each scale: the HOG low-level feature descriptor of that sampled video image is obtained directly, which reduces computation further. This improves the speed and real-time performance of HOG-based human target detection and addresses at its root the heavy computation and insufficient real-time performance that have kept multi-scale human target detection from practical use.

In machine learning, a random forest is a classifier made up of multiple decision trees. The main reason for using it in posture recognition is its high classification accuracy. Four further factors also count: first, its learning process is fast; second, the complexity of the algorithm can be controlled adaptively through the depth of the internal decision trees; third, while the forest is being built it produces an internal unbiased estimate of the generalization error; fourth, it tolerates outliers and noise well and is not prone to overfitting. Its main disadvantage is that it requires the training data and test data to be similar, that is, to follow the same distribution, which limits the generalization ability of the classifier. Obtaining a high-accuracy random forest classifier therefore requires training samples that cover every variation the future test data may show. In real test scenes, however, changes in viewing angle, twisting of limbs, changes in clothing texture, changes in illumination and similar factors make it impossible to collect sufficient training samples.

To address this disadvantage, the above embodiment of the present invention improves the objective function used to train the decision tree nodes of the random forest classifier, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. Training sample selection then requires only a few weakly labeled samples from the target test space, while the remaining training data can be human posture video image samples synthesized by computer graphics, which lowers the requirements on training samples. The training process is as follows:

Obtain synthetic video images containing human body postures and real video images from the target test scene, each video image serving as one training sample. The synthetic video images form the bulk of the samples; they need only be combined with a small number of real video images from the target test scene in which the body parts and background have been labeled.

Label the background region and the human body target region in each training sample according to the set body parts. Specifically, each sample is labeled into eight parts according to the joint parts of the human body: one part is the background, and the other seven are the feet, knees, hips, shoulders, elbows, hands and head.

Compute the feature of every pixel in each labeled region with the SURF operator; all labeled regions and their corresponding pixel feature data constitute the training data set. Specifically, the SURF operator is used to compute the feature of every pixel in every labeled region of both the synthetic and the real training samples, and each pixel feature is built as a 128-dimensional descriptor. The pixel features of the labeled regions in the synthetic training samples and those of the labeled regions in the real training samples together constitute the training data set used at a classification node of a decision tree in the random forest. At the same time, two statistical descriptors are computed: one over all 128-dimensional SURF descriptors in all labeled regions of the synthetic training samples, and one over all 128-dimensional SURF descriptors in all labeled regions of the real training samples.

Finally, the random forest classifier is trained using the above training data set and the improved objective function. The expression of the improved objective function is not reproduced in this text; its terms are as follows.

The weight is a fixed value determined experimentally, preferably set to the value at which the classifier gives its best recognition performance (the preferred value is not reproduced in this text). The information entropy function uses a prior-art expression. The objective function also uses the statistical descriptor of all pixel features within a given labeled body part of the synthetic training samples, and a distance between two statistical descriptors.

The objective function takes into account both the entropy of the training samples and the degree of information difference between the training data and the target test data. The two are combined as a weighted sum to form the objective function for training the decision trees, which improves the generalization ability of the trained classifier. A classifier trained in this way achieves a higher recognition accuracy when identifying body parts.

The objective function uses one particular distance to express the degree of information difference between the training data and the target test data (the name of that distance is not reproduced in this text), but is not limited to it; the Euclidean distance or another distance may also be used.
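The weighted-sum structure of the objective can be sketched as follows. Because the source expression is missing, everything specific here is an assumption: the form `entropy + weight * distance`, the use of per-dimension means as the "statistical descriptor", the choice of which two descriptors are compared (synthetic versus real samples reaching the node) and all function names. The Euclidean distance is used because the text explicitly allows it as an alternative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the part labels reaching a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mean_descriptor(features):
    """Assumed statistical descriptor: the per-dimension mean of the features."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def node_objective(synth_labels, synth_feats, real_feats, weight):
    """Assumed objective for one candidate node: label entropy of the
    synthetic samples plus weight times the synthetic-to-real distance."""
    h = entropy(synth_labels)
    d = euclidean(mean_descriptor(synth_feats), mean_descriptor(real_feats))
    return h + weight * d

score = node_objective(
    synth_labels=["head", "head", "hand", "hand"],
    synth_feats=[[0.0, 0.0], [2.0, 0.0]],
    real_feats=[[1.0, 3.0], [1.0, 5.0]],
    weight=0.5,  # placeholder; the preferred value is not given in this text
)
print(score)
```

In this reading, a split is preferred when it both separates the body part labels well (low entropy) and keeps the synthetic and real feature statistics close, which is one way to encourage the consistent behaviour across the two domains that the text describes.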

The above embodiment serves only to illustrate the technical solution of the present invention and does not limit it. Although the invention has been described in detail with reference to the foregoing embodiment, a person of ordinary skill in the art may still modify the technical solution described in it or replace some of its technical features with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solution claimed by the present invention.

Claims

1. A method for recognizing human body posture in a two-dimensional video image is characterized by comprising the following steps:

a. dividing the original video image into groups according to the scale-space layering principle, the number of groups being determined from the resolution of the original video image;

b. sampling each group of video images with a sampling function and computing one sampled image at a scale that is one of the scales belonging to that group, wherein each group of video images contains a set number of sampled video images, this number being a natural number greater than 1;

c. computing the HOG low-level feature descriptor of the sampled image in each group;

d. based on the HOG low-level feature descriptor of the one sampled image in each group obtained in step c, computing, according to a prediction formula, the HOG low-level feature descriptors of the sampled video images at the remaining scales in each group, wherein the prediction formula takes as inputs the scales of the two sampled images concerned, a power exponent that is a set value, and the HOG low-level feature descriptor of one of the two sampled images, and yields the HOG low-level feature descriptor of the other;

e. detecting human body target regions at different scales in the original video image according to the HOG low-level feature descriptors of all the sampled video images from steps c and d, in combination with a trained SVM;

f. classifying the pixels of the human body target region detected in step e with a trained random forest classifier, and determining the body part regions in the human body target region;

g. connecting the body parts determined in step f to form a human body contour, thereby recognizing the human body posture.

2. The method for recognizing human body posture in a two-dimensional video image according to claim 1, wherein in step b each group of video images is sampled at the end scale of the group's set of scales, and the sampled image corresponding to the end scale is computed.

3. The method for recognizing human body posture in two-dimensional video image according to claim 1, wherein the random forest classifier in the step f is trained by the following method:

acquiring a manually synthesized video image comprising a human body posture and a real video image in a target test scene, wherein each video image is used as a training sample;

marking a background area and a human body target area in each training sample according to the set limb part;

calculating the pixel characteristics of each labeled region by using a SURF operator, wherein all labeled regions and pixel characteristic data thereof form a training data set;

training a random forest classifier using the training data set and an objective function;

wherein the objective function is evaluated at a classification node of a decision tree in the random forest, and its terms are: a weight; an information entropy function; the pixel features of the labeled regions in the synthetic video image training samples; the pixel features of the labeled regions in the real video image training samples; the statistical descriptor of the pixel features of a given labeled body part in the synthetic video image training samples; the statistical descriptor of all pixel features in all labeled regions of the synthetic video image training samples; the statistical descriptor of all pixel features in all labeled regions of the real video image training samples; and a distance between two of these statistical descriptors.