CN112836566A - Multitask neural network face key point detection method for edge equipment
- Publication number: CN112836566A
- Application number: CN202011386983.9A
- Authority
- CN
- China
- Prior art keywords
- face
- neural network
- key point
- convolutional neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/161 — Human faces: Detection; Localisation; Normalisation (G06V — Image or video recognition or understanding; G06V40/16 — Human faces, e.g. facial parts, sketches or expressions)
- G06N3/045 — Combinations of networks (G06N — Computing arrangements based on specific computational models; G06N3/04 — Architecture, e.g. interconnection topology)
- G06V40/168 — Human faces: Feature extraction; Face representation (G06V — Image or video recognition or understanding; G06V40/16 — Human faces, e.g. facial parts, sketches or expressions)
Abstract
The invention relates to the fields of deep learning, face recognition, and face key point detection, and provides a multitask neural network face key point detection method for edge equipment, which realizes face key point calibration and accurate face recognition on mobile devices. To this end, the technical scheme adopted by the invention is: a face image to be detected is input into a convolutional neural network, which outputs the detected face key point coordinates; the convolutional neural network loss function is then defined. The method applies mainly to face recognition and face key point detection.
Description
Technical Field
The invention relates to the fields of deep learning, face recognition, and face key point detection, and in particular to a multitask neural network face key point detection method for edge equipment.
Background
Face key point detection, also known as face localization or face alignment, aims to automatically locate a set of predefined fiducial points on a face (e.g., eye corners, nose tip, mouth corners). As a fundamental component of various face applications such as face recognition [1, 2], face verification [3], face morphing [4], and face editing [5], this problem has long interested the computer vision community and has made great progress over the past few years. However, owing to constraints on detection accuracy, processing speed, and model size, developing a practical face key point detection technique remains challenging.
The technical difficulty is that high-quality face images are hard to acquire in real scenes; that is, the state of a face in a natural environment is uncontrolled and unconstrained. Under different illumination conditions, the pose, expression, and appearance of a face vary greatly, and local occlusion sometimes occurs, as shown in Figure 1. The challenges in face calibration detection therefore fall mainly into the following four categories:
1. Local variation: facial expressions, local extreme illumination (e.g., highlights and shadows), occlusion, and the like locally disturb the face image, so some key points may be invisible or abnormally located.
2. Global variation: pose and image quality are two key factors that globally affect the appearance of the face in an image; when the global structure of the face is misestimated, most key points are localized inaccurately.
3. Data imbalance: uneven distribution across face types and attributes is quite common in the datasets available for training. Such imbalance is likely to prevent the algorithm or model from correctly characterizing the data, reducing detection accuracy.
4. Model efficiency: model size and computational cost also limit an algorithm's practicality. Because the computing power and memory of mobile phones and other embedded devices are limited, the detection algorithm must have low complexity and high processing speed.
In recent years, face key point localization has received extensive attention, and many classical algorithms have been developed. One family of methods first builds a face shape model that describes the facial feature points with low-dimensional parameters, then builds a face appearance model and updates the feature point positions according to how well the reconstructed face appearance matches the model. Active Appearance Models (AAMs) and Constrained Local Models (CLMs), proposed by Cootes et al. [6], are representative, making full use of facial position information. Active appearance models and their follow-up studies [7, 8, 9] attempt to jointly model overall appearance and shape, while CLMs and related algorithms [10, 11] learn local information by applying various shape constraints. Furthermore, the Tree-Structured Part Model (TSPM) [12] uses deformable part-based models for simultaneous detection, pose estimation, and key point localization. Another family, including Explicit Shape Regression (ESR) [13] and the Supervised Descent Method (SDM) [14], attempts to solve the problem in a regression manner. The main limitations of these methods are poor robustness in complex scenes, large computational cost, or high model complexity.
Deep learning is a newer research direction in machine learning that studies multilayer neural networks. The convolutional neural network (CNN), a deep learning model widely applied to image and audio signal processing, has achieved good results in face key point detection in recent years. Zhang et al. [15] established a multitask learning network (TCDCN) for jointly learning key point locations and pose attributes, but the multitask nature of TCDCN makes it hard to train in practical applications. Trigeorgis et al. [16] proposed the Mnemonic Descent Method (MDM), a coarse-to-fine recurrent convolutional model. Lv et al. [17] proposed a deep regression architecture with two-stage re-initialization (TSR), which segments a face into several parts to improve detection accuracy. The method of [18] builds a network with the pose angles (yaw, pitch, and roll) as attributes and estimates these three angles directly to aid key point detection, but its complexity makes its key point detection less than ideal. The pose-invariant face alignment algorithm (PIFA) proposed by Jourabloo et al. [19] estimates a three-dimensional-to-two-dimensional projection matrix via deep cascaded regression. The algorithm of [20] first models face depth with a Z-buffer and then fits a three-dimensional model to the two-dimensional image.
More recently, Kumar and Chellappa designed a single dendritic CNN, the Pose Conditioned Dendritic Convolutional Neural Network (PCD-CNN) [21], which combines a modular classification network on top of a base classification network to improve detection accuracy. Honari et al. [22] designed a sequential multitasking (SeqMT) network with an equivariant landmark transformation (ELT) loss term. The method of [23] proposes face calibration via a deeply-initialized coarse-to-fine ensemble of regression trees (ERT). To make face key point detection robust to intrinsic variation in image style, Dong et al. [24] developed a Style Aggregated Network (SAN) that combines original face images with style-aggregated images to train the key point detector. Wu et al. [25] proposed a boundary-aware face alignment algorithm (LAB) that treats boundary information as the geometry of a face to improve detection accuracy; extracting facial key points from boundary lines largely avoids ambiguity in the definition of face key points. Although deep learning algorithms have advanced considerably, many shortcomings remain, and in practical applications there is still much room to improve the accuracy, efficiency, and simplicity of detection algorithms [28].
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a multitask neural network face key point detection method for edge equipment, which realizes face key point calibration and accurate face recognition on mobile devices. To this end, the technical scheme adopted by the invention is: a face image to be detected is input into a convolutional neural network, and the convolutional neural network outputs the detected face key point coordinates; the convolutional neural network loss function is:

$$\mathcal{L}=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\sum_{c=1}^{C}\omega_{n}^{c}\sum_{k=1}^{K}\left(1-\cos\theta_{k}^{m}\right)\right)\left\|d_{n}^{m}\right\|_{2}^{2}$$

where $d_n^m$ denotes the distance of the nth key point for the mth input; N denotes the preset number of key points to be detected per face; M denotes the total number of samples in the training picture set; $\theta_1$, $\theta_2$, and $\theta_3$ (k = 1, 2, 3) denote the deviations between the actual and predicted values of the yaw, pitch, and roll angles; c indexes the different face categories, including frontal face, profile, head-up, head-down, expression, and occlusion; and the weight $\omega_n^c$ is adjusted according to the fraction of each sample category, taking the reciprocal of the category fraction as the weight.
The convolutional neural network is based on a MobileNet convolutional neural network.
The training data is subjected to data enhancement; the specific steps are:
1) flip each face picture, and rotate it in 5-degree steps between −30 and 30 degrees;
2) randomly occlude 20% of the face area in each picture.
A sub-network is introduced in the process of training the convolutional neural network to supervise model training; the sub-network is used only in the training stage, its input is the output of the fourth layer of the convolutional neural network, and its output is the three Euler angles of yaw, pitch, and roll, used to compute the loss function.
The invention has the characteristics and beneficial effects that:
1. The network design is very lightweight and supports multiple tasks; after a face image is input, the key points and face angles are obtained simultaneously.
2. The model is very small, saving memory, and is well suited to running on mobile platforms such as mobile phones; it runs fast, reaching a frame rate of 140 fps on mobile platforms.
3. Targeting the problems of geometric constraint and data imbalance, the invention designs a new loss function that solves both.
4. To enlarge the receptive field and better capture the global structure of the face, the invention designs a multi-scale fully connected layer for accurately locating key points in face images.
5. Compared with other face key point detection algorithms, the method couples three-dimensional pose estimation with two-dimensional distance measurement; the network structure is simple and intuitive, making forward computation and backpropagation easy; and it uses a single-stage network structure rather than a cascaded form, which improves computational efficiency and performance.
6. The algorithm is highly accurate under unconstrained pose, expression, illumination, occlusion, and other complex conditions. It surpasses other advanced methods (such as TSR [17], SAN [24], and LAB [25]) on the 300W (300 Faces in-the-Wild Challenge) and AFLW (Annotated Facial Landmarks in the Wild) face key point datasets. Figures 2-5 show examples of single faces from 300W and AFLW and examples of multi-face pictures, where green points mark the detected face key points.
Description of the drawings:
Fig. 1 is the overall model structure diagram of the method of the present invention, illustrating the architecture of the backbone network and the auxiliary network.
Fig. 2 shows example faces under different poses, expressions, illumination, occlusions, and image qualities.
Fig. 3 shows face key point detection results under extreme illumination, expression, occlusion, and blur disturbance.
Fig. 4 shows multi-face key point detection results against a complex background.
Fig. 5 shows further multi-face key point detection results against a complex background.
Detailed Description
The invention provides a practical deep-learning-based face key point detection method that can effectively calibrate face key points on a mobile terminal. The scheme is realized mainly with a convolutional neural network, so a network model must be designed first, consisting chiefly of the convolutional network structure and the loss function. The model's input is the face image to be detected, and its output is the coordinates of the detected face key points. The core of the method is therefore the model design, which we introduce through the loss function, the backbone network, the auxiliary network, and other implementation details.
Part 1: Loss function
When the amount of data is small, the accuracy of the algorithm depends mainly on the design of the loss function, and taking geometric information into account in the loss function helps address training quality. Since local expression changes hardly affect the projection, the degrees of freedom for scaling and two-dimensional translation can be dropped, and only three Euler angles need to be estimated: pitch, yaw, and roll.

Furthermore, in deep learning, data imbalance is another problem that often harms detection accuracy. Penalizing the loss values of rare training samples more heavily therefore helps handle the data imbalance problem.
In view of the above, we design the loss function as follows:

$$\mathcal{L}=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\sum_{c=1}^{C}\omega_{n}^{c}\sum_{k=1}^{K}\left(1-\cos\theta_{k}^{m}\right)\right)\left\|d_{n}^{m}\right\|_{2}^{2}$$

where $d_n^m$ denotes the distance of the nth key point for the mth input; N denotes the preset number of key points to be detected per face; M denotes the total number of samples in the training picture set; $\theta_1$, $\theta_2$, and $\theta_3$ (k = 1, 2, 3) denote the deviations between the actual and predicted values of the yaw, pitch, and roll angles — obviously, as the angular deviation grows, so does the penalty; c indexes the different face categories, such as frontal face, profile, head-up, head-down, expression, and occlusion; and the weight $\omega_n^c$ is adjusted according to the fraction of each sample category, the invention taking the reciprocal of the category fraction as the weight.
With this loss function, whether training is affected by three-dimensional pose variation or by data imbalance, our loss handles it, while local variation is dealt with through the distance measurement.
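Read concretely, the loss can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the function and tensor names are ours for illustration, and the category weights are assumed precomputed as the reciprocal of each sample category's fraction, as described above.

```python
import torch

def multitask_loss(pred_pts, gt_pts, pred_angles, gt_angles, class_weight):
    """Weighted keypoint loss: each sample's keypoint distances are scaled by
    its category weight times the angular penalty sum_k (1 - cos(theta_k)).

    pred_pts, gt_pts:       (M, N, 2) predicted / ground-truth keypoints
    pred_angles, gt_angles: (M, 3)    yaw, pitch, roll in radians
    class_weight:           (M,)      reciprocal of the sample's category fraction
    """
    # ||d_n^m||^2: squared distance of the n-th keypoint of the m-th sample
    dist = ((pred_pts - gt_pts) ** 2).sum(dim=2)                       # (M, N)
    # sum_k (1 - cos(theta_k)): penalty grows with the angular deviation
    angle_pen = (1.0 - torch.cos(pred_angles - gt_angles)).sum(dim=1)  # (M,)
    weight = class_weight * angle_pen                                  # (M,)
    return (weight.unsqueeze(1) * dist).sum() / pred_pts.shape[0]
```

A sample whose category is rare (small fraction, hence large reciprocal weight) or whose pose is badly estimated contributes more to the loss, matching the imbalance and geometry arguments above.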
Part 2: Backbone network
The backbone network uses a convolutional neural network from deep learning to extract features and predict key points (lower branch in Fig. 1). Because a human face has strong global structure, such as the symmetric spatial relationships among the eyes, mouth, and nose, exploiting this global structure helps localize key points more accurately. We use multi-scale feature maps, performing the convolution operations with different strides, to enlarge the receptive field. To map the abstract information learned by the preceding convolutional layers at different receptive-field sizes into a larger space and increase the representational capacity of the model, the final prediction is made by connecting the last three multi-scale feature maps through a fully connected layer. Detailed parameters of the backbone network are given in Table 1. A picture is converted into a 112 × 112 × 3 tensor as input, where 112 × 112 is the pixel size of the input image and 3 is the number of RGB channels; the output layer is our multi-scale fully connected layer, connected to the outputs of three convolutional layers. The input of each layer of the network is the output of the previous layer; the first two dimensions give the image size and the third gives the number of channels. Taking the second layer, 56 × 56 × 64, as an example: 56 is the previous layer's pixel size divided by the stride, i.e., 112 / 2 = 56, and the third dimension is the number of channels of the previous convolutional layer, i.e., 64.
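As an illustration of the multi-scale fully connected layer, the following PyTorch sketch convolves one backbone feature map at three strides and regresses the key point coordinates from the concatenation. The 14 × 14 × 64 feature size and the 32-channel branches are assumptions for illustration; the actual parameters are those of Table 1.

```python
import torch
import torch.nn as nn

class MultiScaleFC(nn.Module):
    """Convolve the backbone feature map at three strides, flatten the three
    resulting maps, and regress 2*N keypoint coordinates with one fully
    connected layer over their concatenation."""
    def __init__(self, in_ch=64, num_keypoints=68):
        super().__init__()
        self.conv_s1 = nn.Conv2d(in_ch, 32, 3, stride=1, padding=1)  # 14x14 out
        self.conv_s2 = nn.Conv2d(in_ch, 32, 3, stride=2, padding=1)  # 7x7 out
        self.conv_s3 = nn.Conv2d(in_ch, 32, 3, stride=4, padding=1)  # 4x4 out
        flat = 32 * (14 * 14 + 7 * 7 + 4 * 4)
        self.fc = nn.Linear(flat, 2 * num_keypoints)

    def forward(self, x):                        # x: (B, 64, 14, 14)
        f1 = self.conv_s1(x).flatten(1)          # fine scale, small receptive field
        f2 = self.conv_s2(x).flatten(1)          # medium scale
        f3 = self.conv_s3(x).flatten(1)          # coarse scale, large receptive field
        return self.fc(torch.cat([f1, f2, f3], dim=1))  # (B, 2N) coordinates
```

Larger strides see wider context, so the concatenation lets one linear layer weigh fine local evidence against the global face structure.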
Since the backbone network is the bottleneck in processing speed and model size, MobileNet [26, 27] blocks are used instead of conventional convolution operations. MobileNet is a lightweight convolutional neural network intended mainly for mobile and embedded vision applications; using it greatly reduces the computation of the backbone network and thus speeds up detection. In addition, the backbone can compress the network by adjusting MobileNet's width multiplier to meet different requirements, making the model smaller and faster; in the invention, the model still achieves good detection accuracy after being compressed by 80%.
Table 1: detailed parameters of the backbone network
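To make the MobileNet substitution concrete, here is a minimal sketch of one depthwise separable block with the width multiplier described above; the channel counts are placeholders for illustration, not the values of Table 1.

```python
import torch.nn as nn

def mobilenet_block(in_ch: int, out_ch: int, stride: int = 1,
                    width_mult: float = 1.0) -> nn.Sequential:
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.
    width_mult thins both channel counts; applying the same width_mult to
    every block is the compression knob mentioned in the text."""
    in_ch = max(1, int(in_ch * width_mult))
    out_ch = max(1, int(out_ch * width_mult))
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),      # depthwise: one filter per channel
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise: mix channels
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )
```

Splitting a standard convolution into depthwise and pointwise steps cuts its multiply-accumulate count roughly by a factor of the kernel area, which is where the speedup on edge devices comes from.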
Part 3: Auxiliary network
In the process of training the backbone network, a sub-network is introduced to supervise model training (upper branch in Fig. 1). This network is used only in the training phase; its input is the output of the fourth layer of the backbone network. The auxiliary network estimates three-dimensional rotation information, i.e., the three Euler angles of yaw, pitch, and roll, for each input face sample, thereby determining the head pose. It effectively improves the stability and robustness of key point detection. Its specific structure is shown in Table 2: the input is the three-dimensional array taken from the backbone, and the output is the three Euler angles of yaw, pitch, and roll, used to compute the loss function of Part 1, i.e., the loss function of the overall network.
Table 2: detailed parameters of the auxiliary network
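The following is a hedged PyTorch sketch of such an auxiliary head; the intermediate channel sizes are illustrative assumptions rather than the Table 2 parameters.

```python
import torch.nn as nn

class AuxiliaryNet(nn.Module):
    """Training-only branch: maps the backbone's 4th-layer feature map to the
    three Euler angles (yaw, pitch, roll) consumed by the loss function."""
    def __init__(self, in_ch: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(128, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # collapse spatial dimensions
        )
        self.fc = nn.Linear(32, 3)                # -> yaw, pitch, roll

    def forward(self, feat):                      # feat: (B, in_ch, H, W)
        return self.fc(self.features(feat).flatten(1))
```

Because the branch exists only for training supervision, it is simply dropped at inference time and adds no cost on the edge device.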
Part 4: Other details
To make the deep neural network model perform well, the hyperparameters of the network are tuned; the values in Table 3 can be used as a reference:

Table 3: hyperparameters for network training
In addition, to address the data imbalance problem, we also use a data enhancement strategy. Data enhancement, also called data augmentation, means having limited data produce value equivalent to more data without substantially increasing the amount of data. We mainly adopt the following two operations:
1) flip each face picture, and rotate it in 5-degree steps between −30 and 30 degrees;
2) randomly occlude 20% of the face area in each picture.
Adopting this data enhancement strategy expands the training dataset and thereby yields better detection results.
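A minimal Pillow sketch of the two operations follows, under stated assumptions: the shape and placement of the occluding patch are ours, since the text only specifies that 20% of the face area is randomly occluded.

```python
import random
from PIL import Image, ImageDraw

def augment(img: Image.Image) -> list:
    """Operation 1: flip, then rotate both versions in 5-degree steps
    over [-30, 30], yielding 26 variants per input picture."""
    flipped = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    return [base.rotate(angle)
            for base in (img, flipped)
            for angle in range(-30, 31, 5)]

def occlude(img: Image.Image) -> Image.Image:
    """Operation 2: cover a random rectangle amounting to 20% of the
    image area (assumes an RGB image)."""
    out = img.copy()
    w, h = out.size
    rw = max(1, int(0.5 * w))             # rectangle width: half the image
    rh = max(1, int(0.2 * w * h / rw))    # height chosen so area is 20%
    x, y = random.randint(0, w - rw), random.randint(0, h - rh)
    ImageDraw.Draw(out).rectangle([x, y, x + rw, y + rh], fill=(127, 127, 127))
    return out
```

Flipping plus thirteen rotation angles multiplies each picture into 26 samples, while random occlusion teaches the network to localize key points it cannot see directly.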
The invention provides a face key point detection method based mainly on a convolutional neural network algorithm from deep learning. The neural network model consists of a backbone network and an auxiliary network: the backbone takes MobileNet blocks as its main structure and introduces a multi-scale fully connected layer to enlarge the receptive field and strengthen the representation of facial structural features, while the auxiliary network effectively estimates rotation information to improve key point localization.
The invention addresses the problems of geometric constraint and data imbalance by proposing a new loss function, and the overall algorithm surpasses the most advanced methods in accuracy, model size, and running speed. From the detection results in Figs. 2-5, it can be observed that the invention still obtains satisfactory visual results even under extreme illumination, expression, occlusion, and blur interference.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Primary references
[1] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In CVPR, 2018.
[2] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[3] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. IEEE TPAMI, 38(10):1997-2009, 2016.
[4] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015.
[5] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, 2016.
[6] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE TPAMI, 23(6):681-685, 2001.
[7] I. Matthews and S. Baker. Active appearance models revisited. IJCV, 60(2):135-164, 2004.
[8] F. Kahraman, M. Gökmen, S. Darkner, and R. Larsen. An active illumination and appearance (AIA) model for face alignment. In CVPR, 2007.
[9] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In ECCV, 2008.
[10] P. Belhumeur, D. Jacobs, D. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, 2011.
[11] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In CVPR, 2010.
[12] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. IJCV, 107(2):177-190, 2014.
[14] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In CVPR, 2013.
[15] Z. Zhang, P. Luo, C. Loy, and X. Tang. Facial landmark detection via deep multi-task learning. In ECCV, 2014.
[16] G. Trigeorgis, P. Snape, M. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, 2016.
[17] J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In CVPR, 2017.
[18] H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson. Face alignment assisted by head pose estimation. In BMVC, 2015.
[19] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. In ICCV, 2015.
[20] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.
[21] A. Kumar and R. Chellappa. Disentangling 3D pose in a dendritic CNN for unconstrained 2D face alignment. In CVPR, 2018.
[22] S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz. Improving landmark localization with semi-supervised learning. In CVPR, 2018.
[23] R. Valle, J. Buenaposada, A. Valdés, and L. Baumela. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV, 2018.
[24] X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Style aggregated network for facial landmark detection. In CVPR, 2018.
[25] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.
[26] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. CoRR, abs/1801.04381, 2018.
[28] X. Guo, S. Li, J. Zhang, J. Ma, L. Ma, W. Liu, and H. Ling. PFLD: A practical facial landmark detector. CoRR, abs/1902.10859, 2019.
Claims (3)
1. A multitask neural network face key point detection method for edge equipment, characterized in that a face image to be detected is input into a convolutional neural network, and the convolutional neural network outputs the detected face key point coordinates; the convolutional neural network loss function is:

$$\mathcal{L}=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\sum_{c=1}^{C}\omega_{n}^{c}\sum_{k=1}^{K}\left(1-\cos\theta_{k}^{m}\right)\right)\left\|d_{n}^{m}\right\|_{2}^{2}$$

where $d_n^m$ denotes the distance of the nth key point for the mth input; N denotes the preset number of key points to be detected per face; M denotes the total number of samples in the training picture set; $\theta_1$, $\theta_2$, and $\theta_3$ (k = 1, 2, 3) denote the deviations between the actual and predicted values of the yaw, pitch, and roll angles; c indexes the different face categories, including frontal face, profile, head-up, head-down, expression, and occlusion; and the weight $\omega_n^c$ is adjusted according to the fraction of each sample category, taking the reciprocal of the category fraction as the weight.
2. The method of claim 1, wherein the convolutional neural network is based on a MobileNet convolutional neural network.
3. The multitask neural network face key point detection method for edge equipment according to claim 1, characterized in that data enhancement processing is carried out on the training data, with the following specific steps:
1) flip each face picture, and rotate it in 5-degree steps between −30 and 30 degrees;
2) randomly occlude 20% of the face area in each picture.
A sub-network is introduced in the process of training the convolutional neural network to supervise model training; the sub-network is used only in the training stage, its input is the output of the fourth layer of the convolutional neural network, and its output is the three Euler angles of yaw, pitch, and roll, used to compute the loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011386983.9A CN112836566A (en) | 2020-12-01 | 2020-12-01 | Multitask neural network face key point detection method for edge equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011386983.9A CN112836566A (en) | 2020-12-01 | 2020-12-01 | Multitask neural network face key point detection method for edge equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112836566A true CN112836566A (en) | 2021-05-25 |
Family
ID=75923432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011386983.9A Pending CN112836566A (en) | 2020-12-01 | 2020-12-01 | Multitask neural network face key point detection method for edge equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112836566A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239736A (en) * | 2017-04-28 | 2017-10-10 | 北京智慧眼科技股份有限公司 | Method for detecting human face and detection means based on multitask concatenated convolutional neutral net |
WO2019109526A1 (en) * | 2017-12-06 | 2019-06-13 | 平安科技(深圳)有限公司 | Method and device for age recognition of face image, storage medium |
WO2019128367A1 (en) * | 2017-12-26 | 2019-07-04 | 广州广电运通金融电子股份有限公司 | Face verification method and apparatus based on triplet loss, and computer device and storage medium |
CN108805977A (en) * | 2018-06-06 | 2018-11-13 | 浙江大学 | A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks |
CN110263774A (en) * | 2019-08-19 | 2019-09-20 | 珠海亿智电子科技有限公司 | A kind of method for detecting human face |
CN111160269A (en) * | 2019-12-30 | 2020-05-15 | 广东工业大学 | Face key point detection method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022257456A1 (en) * | 2021-06-10 | 2022-12-15 | 平安科技(深圳)有限公司 | Hair information recognition method, apparatus and device, and storage medium |
CN113782184A (en) * | 2021-08-11 | 2021-12-10 | 杭州电子科技大学 | Cerebral apoplexy auxiliary evaluation system based on facial key point and feature pre-learning |
CN114399803A (en) * | 2021-11-30 | 2022-04-26 | 际络科技(上海)有限公司 | Face key point detection method and device |
CN115984461A (en) * | 2022-12-12 | 2023-04-18 | 广州紫为云科技有限公司 | Face three-dimensional key point detection method based on RGBD camera |
CN115984461B (en) * | 2022-12-12 | 2024-10-25 | 广州紫为云科技有限公司 | Face three-dimensional key point detection method based on RGBD camera |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210525