CN111444764A

CN111444764A - Gesture recognition method based on depth residual error network

Info

Publication number: CN111444764A
Application number: CN202010110942.0A
Authority: CN
Inventors: 张浩川; 谢胜利; 孙为军
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2020-07-24

Abstract

In order to solve the problems that gesture recognition in the prior art requires many conditions and the calculation amount of the model is relatively large, the present invention proposes a gesture recognition method based on a deep residual network, which includes the following steps: S1. The human hand is used as the detection target for target detection, and the detected human hand is stored as an image; S2. The hand position coordinates are obtained according to the human hand image collected in step S1, and then based on the depth residual error The joint recognition model of the network performs gesture key point detection to obtain the hand key point coordinates; S3. Input the gesture key point coordinates obtained in step S2 into the SoftMax classifier for separation, obtain the classification of various gestures, and finally recognize the gesture. The speed of recognizing gestures of the present invention is faster than other solutions.

Description

A gesture recognition method based on deep residual network

技术领域technical field

本发明涉及手势识别技术领域，特别涉及一种基于深度残差网络的手势识别方法。The invention relates to the technical field of gesture recognition, in particular to a gesture recognition method based on a deep residual network.

背景技术Background technique

手势识别这个术语指的是跟踪人类手势、识别其表示和转换为语义上有意义的命令的整个过程。手势识别的研究旨在设计和开发可以将用于设备控制的手势识别为输入并且通过将命令映射为输出的系统。一般而言，从手势交互信息采集的途径是接触式还是非接触式的，可将手势交互系统划分为基于接触式的传感器和基于非接触类的传感器的两类。The term gesture recognition refers to the entire process of tracking human gestures, recognizing their representations, and converting them into semantically meaningful commands. Gesture recognition research aims to design and develop systems that can recognize gestures for device control as inputs and by mapping commands to outputs. Generally speaking, whether the way to collect gesture interaction information is contact or non-contact, the gesture interaction system can be divided into two types: touch-based sensors and non-contact-based sensors.

基于接触式传感器的手势识别通常基于使用多个传感器的数据手套、加速度计、多点触摸屏等技术。2004年，Kevin等人设计了一种用于手势识别的无线仪器手套。2008年，北京航空航天大学的任程等人用头盔和数据手套研究了虚拟现实系统中的虚拟手。2015年，山东师范大学的吕蕾等人研究了基于数据手套的静态手势识别方法，能识别25种手势，正确率达98.9％。2007年，Bourke等人提出了一种用加速度计来检测在我们的日常活动中使用的正常手势的识别系统。2017年，电子科技大学的王琳琳等人研究了基于惯性传感器的手势交互方法，准确率达96.7％。2014年，中国科学院大学的薛姣等人研究了一种基于触摸屏的手势遥控系统，平均识别率达99％。Touch-sensor-based gesture recognition is typically based on technologies such as data gloves, accelerometers, multi-touch screens, etc. that use multiple sensors. In 2004, Kevin et al. designed a wireless instrument glove for gesture recognition. In 2008, Ren Cheng et al. of Beihang University studied virtual hands in virtual reality systems with helmets and data gloves. In 2015, Lu Lei and others from Shandong Normal University studied a static gesture recognition method based on data gloves, which can recognize 25 gestures with a correct rate of 98.9%. In 2007, Bourke et al. proposed a recognition system that uses an accelerometer to detect normal gestures used in our daily activities. In 2017, Wang Linlin and others from the University of Electronic Science and Technology of China studied a gesture interaction method based on inertial sensors, with an accuracy rate of 96.7%. In 2014, Xue Jiao and others from the University of Chinese Academy of Sciences studied a touch screen-based gesture remote control system with an average recognition rate of 99%.

基于非接触式传感器的手势识别通常基于使用光学传感、雷达探测等技术。2002年，提出了使用摄像头采集多尺度颜色特征的手势识别。2010年，清华大学的沙亮等人研究了基于无标记全手势视觉的人机交互技术，提出了一种使用通用摄像头的车载手势视觉交互系统的解决方案，复杂环境识别率达80％。2011年，微软公司公布了Kinect，该摄像头可以借助红外线来识别手势运动。2015年，江南大学的姜克等人使用Kinect研究了基于深度图像的3D手势识别，识别率达76.6％。2015年，谷歌ATAP部门公布了Project Soli，该项目采用微型雷达来识别手势运动，可以捕捉微小动作。Gesture recognition based on non-contact sensors is usually based on the use of optical sensing, radar detection and other technologies. In 2002, gesture recognition using a camera to capture multi-scale color features was proposed. In 2010, Sha Liang of Tsinghua University and others studied human-computer interaction technology based on unmarked full gesture vision, and proposed a solution for a vehicle-mounted gesture-visual interaction system using a universal camera, with a recognition rate of 80% in complex environments. In 2011, Microsoft announced the Kinect, a camera that uses infrared light to recognize gestures. In 2015, Jiang Ke et al. of Jiangnan University used Kinect to study 3D gesture recognition based on depth images, with a recognition rate of 76.6%. In 2015, Google's ATAP division unveiled Project Soli, which uses tiny radars to recognize gesture movements that can capture tiny movements.

虽然现有技术在不同的手势数据集上实现了较为准确的识别，但是仍然存在以下不足：(1)预处理和特征提取的一些关键参数需要凭人工经验设定。(2)对于各种环境下的手势，仅仅依靠某一独立的特征进行手势识别往往并不能够满足手势识别的要求。Although the prior art has achieved relatively accurate recognition on different gesture datasets, it still has the following deficiencies: (1) Some key parameters of preprocessing and feature extraction need to be set by human experience. (2) For gestures in various environments, gesture recognition only relying on an independent feature often cannot meet the requirements of gesture recognition.

人机交互技术正逐步从以计算机为中心转变为以人为中心，手势识别作为一种重要的人机交互方式受到广泛的关注。手势识别技术是最早通过可穿戴感应手套传感器获取手指的弯曲程度和手部的活动状态来判断用户的手势操作。这种技术需要佩戴专门的设备，成本高且交互方式不够自然，已逐渐被图像识别技术所取代。Human-computer interaction technology is gradually changing from computer-centric to human-centric, and gesture recognition as an important human-computer interaction method has received extensive attention. Gesture recognition technology is the first to judge the user's gesture operation by obtaining the bending degree of the finger and the activity state of the hand through the wearable inductive glove sensor. This kind of technology requires wearing special equipment, and the cost is high and the interaction method is not natural enough. It has been gradually replaced by image recognition technology.

传统图像识别技术通过图像的边缘、明亮程度、色彩等特征来分析辨别手势类型，容易受光线变换、遮挡盲区及复杂背景等因素影响，算法鲁棒性低。随着2006年深度学习的提出，基于深度学习的手势识别技术因其较低的硬件需求、更快的识别速度以及更高的识别精度成为研究热点。由于人手是复杂的变形体，手势具有多样性、多义性以及时间上的差异等特点，基于深度学习的手势识别技术目前也存在一些难点。Traditional image recognition technology analyzes and distinguishes gesture types through the edge, brightness, color and other characteristics of the image, which is easily affected by factors such as light transformation, blind spots and complex backgrounds, and the algorithm has low robustness. With the introduction of deep learning in 2006, gesture recognition technology based on deep learning has become a research hotspot because of its lower hardware requirements, faster recognition speed and higher recognition accuracy. Since the human hand is a complex deformable body, and gestures have the characteristics of diversity, ambiguity, and time differences, there are also some difficulties in gesture recognition technology based on deep learning.

在当前的多种不同方法在不同测试数据集的表现情况下，目前大多数方法都实现了孤立手场景下的手势识别，但复杂背景环境下的手势识别仍是一项挑战，为此，一些判别性的方法试图加入深度信息来降低手部渲染的难度。有文献最先提出实现两个强相互作用的手的关节运动跟踪方法。该方法使用54维参数空间表示两只手的各种可能外形结构，其中每一维参数表示具有26个自由度的运动结构，同时加入粒子群优化算法进行渐进随机优化，从而找到最能解释RGB-D传感器提供的观测结果中的双手外形结构；但该方法主要关注于实现手部重叠情况下模型进行精确识别的能力，对于实时手势检测的速度并未提及。也有方法同样希望解决出现客观遮挡或盲区下的手势识别问题，提出在手指上使用有区别的学习并将手指的区别点特征与手势进行关联，同时也考虑了图像的边缘、光流和碰撞等条件，在手与手、手与物体有互动的情况下能提供非常精确的识别效果；但由于识别需要的条件较多、模型的计算量较大，在实际的人机交互应用中并不适用。In the current performance of many different methods in different test datasets, most of the current methods have achieved gesture recognition in isolated hand scenes, but gesture recognition in complex background environments is still a challenge. For this reason, some Discriminative methods try to incorporate depth information to reduce the difficulty of hand rendering. Some literature first proposed a joint motion tracking method for two strongly interacting hands. The method uses a 54-dimensional parameter space to represent various possible shape structures of the two hands, in which each dimension parameter represents a motion structure with 26 degrees of freedom. At the same time, particle swarm optimization algorithm is added to perform asymptotic random optimization, so as to find the best explanation for RGB The shape structure of the hands in the observations provided by the -D sensor; however, this method mainly focuses on realizing the ability of the model to accurately recognize the overlapping hands, and the speed of real-time gesture detection is not mentioned. There are also methods that also hope to solve the problem of gesture recognition under objective occlusion or blind spots. It proposes to use differentiated learning on the fingers and associate the distinguishing point features of the fingers with gestures, and also consider the edge of the image, optical flow and collision, etc. However, due to the large number of conditions required for recognition and the large amount of calculation of the model, it is not applicable in actual human-computer interaction applications. .

发明内容SUMMARY OF THE INVENTION

为了解决现有技术中的手势识别需要的条件较多、模型的计算量较大的问题，本发明提出了一种基于深度残差网络的手势识别方法。In order to solve the problems that the gesture recognition in the prior art requires many conditions and the calculation amount of the model is large, the present invention proposes a gesture recognition method based on a deep residual network.

本发明解决上述技术问题所采取的技术方案是：一种基于深度残差网络的手势识别方法，其特征在于，包括以下步骤：The technical solution adopted by the present invention to solve the above-mentioned technical problems is: a gesture recognition method based on a deep residual network, which is characterized in that it includes the following steps:

S1.采集视频数据，将人的手部作为检测目标进行目标检测，并将检测到的人的手部作为图像进行存储；S1. Collect video data, use the human hand as a detection target for target detection, and store the detected human hand as an image;

S2.根据S1步骤采集到的人的手部图像获得手部位置坐标，再通过基于深度残差网络的关节识别模型进行手势关键点检测，得到手部关键点坐标；S2. Obtain hand position coordinates according to the human hand image collected in step S1, and then perform gesture key point detection through a joint recognition model based on a deep residual network to obtain hand key point coordinates;

S3.将S2步骤获得的手势关键点坐标输入到SoftMax分类器进行分离，得到各种手势的分类，最终识别手势。S3. Input the coordinates of the gesture key points obtained in step S2 into the SoftMax classifier for separation, to obtain the classification of various gestures, and finally recognize the gesture.

所述的S1步骤中将人的手部作为检测目标进行目标检测的具体方法是：The specific method of using the human hand as the detection target to perform target detection in the step S1 is:

S101.将人的手部视频作为图像经过卷积神经网络，将得到的特征图生成不同尺寸的候选框；S101. Use a human hand video as an image through a convolutional neural network, and generate candidate frames of different sizes from the obtained feature map;

S102.将S101步骤所述的候选框与训练样本标定框进行匹配与判断，得到目标样本；S102. Match and judge the candidate frame described in step S101 and the training sample calibration frame to obtain a target sample;

S103.由预测框与标定框的位置偏移损失和分类结果损失加权和得到的总损失函数来更新S101所述卷积神经网络的参数设置，得到更准确的用于人手目标检测的模型，最终得到含有人的手部的图像。S103. Update the parameter setting of the convolutional neural network described in S101 by the total loss function obtained by the weighted sum of the position offset loss of the prediction frame and the calibration frame and the loss of the classification result, so as to obtain a more accurate model for human-hand target detection, and finally An image containing a human hand is obtained.

所述的S102步骤中的匹配判断的准则为：当一个候选框与已标定的手部框的重叠面积比例大于其余所有的候选框重叠面积比时，认为该候选框匹配成功；当某一候选框与已标定的手部框的匹配度大于一定的阈值时，认为该候选框匹配成功；满足以上两个条件之一则判断为匹配成功，当候选框判断为成功后，则该候选框被激发为一个得到预测结果的正样本，反之则为负样本。The criterion for matching judgment in the step S102 is: when the overlap area ratio of a candidate frame and the calibrated hand frame is greater than the overlap area ratio of all other candidate frames, it is considered that the candidate frame is successfully matched; When the matching degree between the frame and the calibrated hand frame is greater than a certain threshold, it is considered that the candidate frame is successfully matched; if one of the above two conditions is met, it is judged that the matching is successful, and when the candidate frame is judged to be successful, the candidate frame is determined. The excitation is a positive sample that obtains the predicted result, otherwise it is a negative sample.

所述的S2步骤中基于深度残差网络的关节识别模型进行手势关键点检测的具体方法是：The specific method for detecting gesture key points based on the joint recognition model of the deep residual network in the step S2 is:

S201.在训练识别模型时，采集同一手势在同一时刻不同角度的二维图像，并用预训练检测器对上述测试视频数据中的所有手部关键点进行检测，得到不同角度下同一点的识别结果；S201. When training the recognition model, collect two-dimensional images of the same gesture at different angles at the same time, and use the pre-training detector to detect all the key points of the hand in the above-mentioned test video data, and obtain the recognition results of the same point at different angles ;

S202.将S201步骤获得的不同角度下同一点的识别结果，通过RANSAC算法构建每个关键点的三维位置以及整个手部的三维模型，使得每一时刻成为具有三维关节点特征的画面帧，以获得手部关节点标定的输出图像；S202. Construct the three-dimensional position of each key point and the three-dimensional model of the entire hand through the RANSAC algorithm based on the identification results of the same point at different angles obtained in step S201, so that each moment becomes a picture frame with three-dimensional joint point characteristics, so as to Obtain the output image of hand joint point calibration;

S203.此时将预训练检测器中的已标定手部图像与S202步骤获得的手部关节点标定的输出图像作为新的训练样本继续对上述模型进行训练更新，得到多视角引导对二维输入图像；S203. At this time, the calibrated hand image in the pre-training detector and the output image of the hand joint point calibration obtained in step S202 are used as new training samples to continue to train and update the above model, and obtain multi-view guidance for two-dimensional input. image;

S204.最后，采用Multiview Bootstrapping算法对S203步骤所述的多视角引导对二维输入图像进行识别，得到手部关键点坐标。S204. Finally, the Multiview Bootstrapping algorithm is used to identify the two-dimensional input image with the multi-view guidance described in step S203, and the coordinates of the key points of the hand are obtained.

本发明的有益效果是：本发明利用深度学习中深度残差网络的多种优化架构进行混合，实现了可以进行独立手势识别功能的方法，该方法在保证识别结果高精确度和高鲁棒性的同时，能达到流畅识别的速度；同时，本发明所述方法在应用识别时，使用单一摄像头或少量摄像头在复杂背景的场景下识别到的手势相比其他方法更多，传统图像识别技术通过图像的边缘、明亮程度、色彩等特征来分析辨别手势类型，容易受光线，角度变换等因素影响，算法鲁棒性低，本发明对采集图像进行预处理并在后续步骤进行关键点检测，角度、光线等因素变化对结果影响较小，识别响应时间相比其他方案更快。The beneficial effects of the present invention are as follows: the present invention utilizes multiple optimization architectures of the deep residual network in deep learning to mix, and realizes a method that can perform independent gesture recognition functions, which ensures high accuracy and high robustness of the recognition results. At the same time, the speed of smooth recognition can be achieved; at the same time, the method of the present invention uses a single camera or a small number of cameras to recognize more gestures in complex background scenes than other methods when applying recognition. The edge, brightness, color and other characteristics of the image are used to analyze and identify the gesture type, which is easily affected by factors such as light and angle transformation, and the algorithm has low robustness. The present invention preprocesses the collected image and performs key point detection in subsequent steps. Changes in factors such as , light and other factors have little impact on the results, and the recognition response time is faster than other solutions.

附图说明Description of drawings

图1为本发明的工作流程图。Fig. 1 is the working flow chart of the present invention.

图2为手势目标检测网络结构。Figure 2 shows the network structure of gesture target detection.

具体实施方式Detailed ways

下面结合附图对本发明进行进一步的说明。The present invention will be further described below with reference to the accompanying drawings.

如图1，所述的一种基于深度残差网络的手势识别方法，包括以下步骤：As shown in Figure 1, the described method for gesture recognition based on deep residual network includes the following steps:

步骤(1)：输入视频数据进行多尺度的人手检测，将人的手部作为检测目标进行目标检测，并将检测到的人的手部作为图像进行存储；Step (1): input video data to perform multi-scale human hand detection, use the human hand as a detection target to perform target detection, and store the detected human hand as an image;

人手检测是一个标准的目标检测问题。传统的目标检测方法是通过提取图片中不同颜色模块的感知信息来定位并将其中的目标物体进行分类。但对于计算机来说，其面对的是RGB像素矩阵，很难从图像中直接得到目标物(如猫、狗)的抽象概念并定位其位置，再加上有多个物体和杂乱的背景混合的影响，目标检测更加困难。虽然在传统视觉领域对某些特定的研究方向如人脸检测、行为检测等有很多常用特征集，但由于检测过程复杂，计算速度很难得到提升。基于深度学习的目标检测方法最早被提出时，需要区域提名及区域分类两步。区域提名即在输入图像进入卷积神经网络前先密集提取出图像中可能感兴趣的区域，然后再对各个提名区域进行识别分类。该方法对目标检测的准确度有很好的保证，相比于传统视觉方法，其检测速度有较大提升，但仍然无法满足实时性的要求。Hand detection is a standard object detection problem. The traditional object detection method is to locate and classify the target objects by extracting the perceptual information of different color modules in the picture. But for the computer, it is faced with the RGB pixel matrix, it is difficult to directly get the abstract concept of the target object (such as cat, dog) from the image and locate its position, plus there are multiple objects mixed with the cluttered background The impact of the target detection is more difficult. Although there are many common feature sets for some specific research directions such as face detection and behavior detection in the traditional vision field, the calculation speed is difficult to improve due to the complex detection process. When the target detection method based on deep learning was first proposed, it required two steps of region nomination and region classification. Region nomination is to densely extract possible regions of interest in the image before the input image enters the convolutional neural network, and then identify and classify each nominated region. This method has a good guarantee for the accuracy of target detection. Compared with the traditional visual method, its detection speed is greatly improved, but it still cannot meet the real-time requirements.

基于此，本步骤进一步优化目标检测模型，提出省去区域提名这一步，直接将整张输入图像放入深度残差网络后产生物体的位置坐标值。Based on this, this step further optimizes the target detection model, and proposes to omit the step of region nomination, and directly put the entire input image into the deep residual network to generate the position coordinate value of the object.

人手目标检测模型利用开源手部数据库进行训练。输入图像经过卷积神经网络，得到的特征图的点预先生成不同尺寸的候选框，之后与训练样本标定框进行匹配与判断，得到目标样本。该目标样本包括下述的正样本与负样本。The human hand object detection model is trained using an open source hand database. The input image is passed through the convolutional neural network, and the points of the obtained feature map generate candidate frames of different sizes in advance, and then match and judge with the calibration frame of the training sample to obtain the target sample. The target samples include the following positive samples and negative samples.

匹配判断的准则为：当一个候选框与已标定的手部框的重叠面积比例大于其余所有的候选框重叠面积比时，认为该候选框匹配成功；当某一候选框与已标定的手部框的匹配度大于一定的阈值时，认为该候选框匹配成功。满足以上两个条件之一则判断为匹配成功，当候选框判断为成功后，则该候选框被激发为一个得到预测结果的正样本，反之则为负样本。之后再计算由预测框与标定框的位置偏移损失和分类结果损失加权和得到的总损失函数来更新卷积神经网络的参数设置，得到更准确的用于人手目标检测的模型，具体检测流程如专利附图2所示。由该人手检测模型可获得输入图像中所有检测到的人手位置坐标，并以此为基础，进行下一步的手势关键点识别。The criterion for matching judgment is: when the overlap area ratio between a candidate frame and the calibrated hand frame is greater than the overlap area ratio of all other candidate frames, the candidate frame is considered to be successfully matched; when a candidate frame and the calibrated hand frame overlap When the matching degree of the box is greater than a certain threshold, the candidate box is considered to be successfully matched. If one of the above two conditions is satisfied, it is judged that the matching is successful. When the candidate frame is judged to be successful, the candidate frame is excited as a positive sample for obtaining the predicted result, otherwise, it is a negative sample. After that, the total loss function obtained by the weighted sum of the position offset loss of the prediction frame and the calibration frame and the loss of the classification result is calculated to update the parameter settings of the convolutional neural network to obtain a more accurate model for human target detection. The specific detection process As shown in Figure 2 of the patent. The hand detection model can obtain all the detected hand position coordinates in the input image, and based on this, the next gesture key point recognition is performed.

步骤(2)：在确认输入图像中存在人手且获得人手位置坐标后，我们需要对检测到的手的具体坐标进行识别分析。由于手的关节点较多同时手与手、手与一般物体间容易产生丰富的互动情景，因此，准确识别出来源于同一人的同一手部关节点是坐标识别的重要基础。Step (2): After confirming that there is a human hand in the input image and obtaining the position coordinates of the human hand, we need to identify and analyze the specific coordinates of the detected hand. Since there are many joints in the hand, and rich interaction scenarios are easily generated between the hand and the hand, and between the hand and general objects, it is an important basis for coordinate recognition to accurately identify the joints of the same hand from the same person.

本发明采用Multiview Bootstrapping算法通过多视角引导对二维输入图像实现复杂环境下的手势识别。The present invention adopts the Multiview Bootstrapping algorithm to realize gesture recognition under complex environment for the two-dimensional input image through multi-view guidance.

在训练识别模型时，使用多个摄像头捕捉同一手势在同一时刻不同角度的二维图像，并用由少量已标定的手部图像生成的预训练检测器对未标定的测试视频数据中的所有手部关键点进行检测。When training the recognition model, use multiple cameras to capture two-dimensional images of the same gesture at different angles at the same time, and use a pre-trained detector generated from a small number of calibrated hand images to detect all hands in the uncalibrated test video data. key points are detected.

对于不同角度下同一点的识别结果，通过RANSAC算法构建每个关键点的三维位置以及整个手的三维模型，使得每一时刻成为具有三维关节点特征的画面帧，并获得有手部关节点标定的输出图像。此时将最初的少量标定素材与预训练检测器检测得到的标定图像作为新的训练样本继续对模型进行训练更新。For the recognition results of the same point at different angles, the 3D position of each key point and the 3D model of the entire hand are constructed through the RANSAC algorithm, so that each moment becomes a picture frame with the characteristics of 3D joint points, and the calibration of hand joint points is obtained. the output image. At this time, the initial small amount of calibration material and the calibration image detected by the pre-training detector are used as new training samples to continue to train and update the model.

步骤(3)：由手部关键点检测器可以得到各个手部关键点位置坐标的向量形式输出，将其直接输入到SoftMax分类器进行分离，得到各种手势的分类，从而实现手势识别。Step (3): The hand key point detector can obtain the vector form output of the position coordinates of each hand key point, and directly input it to the SoftMax classifier for separation, and obtain the classification of various gestures, thereby realizing gesture recognition.

具体的，SoftMax分类器通过将需要分离的信号映射至相应的标签上，经过卷积神经网络训练后的信号将得到一个分类结果。将该结果与相应的标签数据进行比较，获得相对误差值，由多次神经网络训练得到不断缩小的分类误差并得到分类能力较好的模型。Specifically, the SoftMax classifier maps the signal to be separated to the corresponding label, and the signal after the convolutional neural network training will obtain a classification result. The result is compared with the corresponding label data to obtain the relative error value, and the classification error is continuously reduced and a model with better classification ability is obtained by multiple neural network training.

具体实施例I：SoftMax分类器使用在ImageNet预训练的ResNet-50作为特征提取模型，根据数据集的ID数设置输出层的维度。图像分类器的训练过程中，冻结ResNet-50网络中BN层、conv1层以及res2层的参数，参数更新使用的是mini-batch SGD。一个batch中的样本数量、最大迭代次数和动量分别设置为16、50和0.9。学习率使用阶级衰减策略，初始学习率为0.001，在40个epoch之后学习率衰减为0.0001，直至训练结束。Embodiment 1: The SoftMax classifier uses ResNet-50 pre-trained on ImageNet as a feature extraction model, and sets the dimension of the output layer according to the number of IDs in the dataset. During the training process of the image classifier, the parameters of the BN layer, the conv1 layer and the res2 layer in the ResNet-50 network were frozen, and the parameters were updated using mini-batch SGD. The number of samples in a batch, the maximum number of iterations and momentum are set to 16, 50 and 0.9, respectively. The learning rate uses a class decay strategy, the initial learning rate is 0.001, and the learning rate decays to 0.0001 after 40 epochs until the end of training.

本发明所述的SoftMax分类器的损失函数如下：The loss function of the SoftMax classifier of the present invention is as follows:

其中y_ik表示真实标签信息，p_ik表示模型预测样本，N是总训练样本数，K是样本的类别，正则化参数λ＝0.0005，对W＝{W₁…W₁₀₂₄}，其中W_i的维度等于输出维度。where y _ik represents the real label information, p _ik represents the model prediction sample, N is the total number of training samples, K is the category of the sample, the regularization parameter λ=0.0005, for W={W ₁ …W ₁₀₂₄ }, where the W _i dimension is equal to the output dimension.

以上所述，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易变化或替换，都属于本发明的保护范围之内。因此本发明的保护范围所述以权利要求的保护范围为准。The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art who is familiar with the technical scope disclosed by the present invention can easily change or replace them, all belonging to the scope of the present invention. within the protection scope of the present invention. Therefore, the protection scope of the present invention is described in accordance with the protection scope of the claims.

Claims

1. A gesture recognition method based on a depth residual error network is characterized by comprising the following steps:

s1, collecting video data, carrying out target detection by taking a hand of a person as a detection target, and storing the detected hand of the person as an image;

s2, obtaining hand position coordinates according to the hand image of the person collected in the step S1, and detecting gesture key points through a joint recognition model based on a depth residual error network to obtain hand key point coordinates;

and S3, inputting the gesture key point coordinates obtained in the step S2 into a SoftMax classifier for separation to obtain various gesture classifications, and finally recognizing the gestures.

2. The gesture recognition method based on the deep residual error network of claim 1, wherein the specific method for performing target detection with human hands as detection targets in the step S1 is as follows:

s101, taking a hand video of a person as an image, and generating candidate frames with different sizes from the obtained feature map through a convolutional neural network;

s102, matching and judging the candidate frame and the training sample calibration frame in the step S101 to obtain a target sample;

and S103, updating S101 the parameter setting of the convolutional neural network by a total loss function obtained by weighted sum of the position offset loss and the classification result loss of the prediction frame and the calibration frame to obtain a more accurate model for detecting the human hand target, and finally obtaining an image containing the human hand.

3. The method according to claim 1, wherein the matching judgment criterion in step S102 is as follows: when the overlapping area ratio of one candidate frame to the calibrated hand frame is larger than the overlapping area ratio of all the other candidate frames, the candidate frame is considered to be successfully matched; when the matching degree of a certain candidate frame and the calibrated hand frame is greater than a certain threshold value, the candidate frame is considered to be successfully matched; if one of the two conditions is satisfied, the matching is judged to be successful, when the candidate frame is judged to be successful, the candidate frame is excited to be a positive sample for obtaining the prediction result, otherwise, the candidate frame is a negative sample.

4. The gesture recognition method based on the depth residual error network as claimed in claim 1, wherein the specific method for performing gesture key point detection based on the joint recognition model of the depth residual error network in the step S2 is as follows:

s201, when a recognition model is trained, collecting two-dimensional images of the same gesture at different angles at the same time, and detecting all hand key points in the test video data by using a pre-training detector to obtain recognition results of the same point at different angles;

s202, constructing a three-dimensional position of each key point and a three-dimensional model of the whole hand by using the RANSAC algorithm according to the recognition results of the same point at different angles obtained in the step S201, so that each moment becomes a picture frame with three-dimensional joint point characteristics to obtain an output image calibrated by the hand joint points;

s203, taking the calibrated hand image in the pre-training detector and the output image calibrated by the hand joint point obtained in the step S202 as new training samples to continue training and updating the model to obtain a multi-view guide pair two-dimensional input image;

and S204, finally, identifying the two-dimensional input image by the multi-view guide in the step S203 by adopting a Multiview boosting algorithm to obtain the coordinates of the key points of the hand.

5. The method according to claim 1, wherein the loss function of the SoftMax classifier is as follows:

wherein y is_ikRepresenting genuine label information, p_ikRepresents model prediction samples, N is the total number of training samples, K is the class of samples, the regularization parameter λ is 0.0005, and W is { W }₁…W₁₀₂₄In which W is_iIs equal to the output dimension.