CN104361328B

CN104361328B - Face image normalization method based on an adaptive multi-column deep model

Info

Publication number: CN104361328B
Application number: CN201410677837.XA
Authority: CN
Inventors: 刘艳飞; 周祥东; 周曦
Original assignee: Chongqing Zhongke Yuncong Technology Co Ltd
Current assignee: Chongqing Zhongke Yuncong Technology Co Ltd
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2018-11-02
Anticipated expiration: 2034-11-21
Also published as: CN104361328A

Abstract

The present invention relates to a face image normalization method based on an adaptive multi-column deep model. The method comprises the following steps: S1: establishing the adaptive multi-column deep model; S2: training the adaptive multi-column deep model; S3: normalizing the target face image. The method linearly combines multiple columns of deep models to jointly correct the various adverse factors that affect face recognition, and uses a nonlinear optimization method to adaptively compute the optimal weight of each column, i.e., the correction weight for each factor is adjusted adaptively according to the input image. Compared with traditional methods that correct faces with a single deep neural network model, the method provided by the invention is more robust to the various variation factors.

Description

Face image normalization method based on an adaptive multi-column deep model

Technical Field

The invention belongs to the technical field of face recognition, and relates to a face image normalization method based on an adaptive multi-column deep model.

Background

As a typical biometric recognition technology, face recognition is favored for its naturalness, high acceptability and unobtrusive acquisition, and has broad application prospects in national public security, military security, financial security, human-computer interaction and other fields. However, even the best face recognition systems can basically meet the requirements of general applications only when users cooperate and acquisition conditions are close to ideal. In an unconstrained environment (uncooperative users and non-ideal acquisition conditions), face recognition is difficult because face features are unstable and strongly affected by external conditions such as illumination and occlusion. Face image normalization refers to correcting a face image affected by illumination change, angle change, expression change, occlusion and other factors into a frontal face under standard conditions, which can effectively address these difficulties. Existing face image normalization methods usually remove or correct only one or two factors, and can be divided into face pose correction and face illumination correction.

Pose correction can generally be divided into two categories: 2D-based pose correction and 3D-based pose correction. 2D-based methods rectify the angle by 2D image matching or by encoding the test image using basis functions or samples. For example, C. D. Castillo and D. W. Jacobs, "Wide-baseline stereo for face recognition with large pose variation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 537-544, calculate the similarity between two faces using stereo matching. A. Li, S. Shan, and W. Gao, "Coupled bias-variance tradeoff for cross-pose face recognition," IEEE Transactions on Image Processing, vol. 21, pp. 305-315, 2012, represent the face image to be tested as a linear combination of training images and perform face recognition using the linear regression coefficients as features. 3D methods usually first acquire 3D face data, or estimate a 3D model from the 2D image, and then match it with a 2D test face image. S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, and S. Shan, "Morphable displacement field based image matching for face recognition across pose," in ECCV, Springer, 2012, pp. 102-115, first generate a virtual view of the test face image through a series of 3D displacement fields sampled from a 3D face database, and then match the synthesized face with the registered face. Similarly, A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, and M. Rohith, "Fully automatic pose-invariant face recognition via 3D pose normalization," in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 937-944, use a view-based Active Appearance Model (AAM) to match the 3D model to the 2D image.

Face illumination preprocessing methods can be divided into three types: illumination normalization methods, methods that model illumination variation, and methods that extract illumination-invariant features. Illumination normalization preprocesses the face image with image processing techniques to obtain a face image under uniform illumination conditions. Histogram Equalization (HE), Gamma correction and Logarithmic Transformation (LT) are the most common illumination normalization methods. Methods that model illumination variation typically make assumptions about the effect of illumination on the face image and then use these assumptions to model or remove that effect. They can be divided into statistical model-based methods and physical model-based methods. Statistical model-based methods learn, from images of each person under different illumination conditions, a linear subspace that approximates the illumination variation; common linear subspace methods include Eigenface, Fisherface and piecewise linear subspace methods. Physical model-based methods make an assumption about the reflectance properties of the object surface (such as the Lambertian surface assumption) and derive from it a face image model under different illumination conditions; typical examples are the illumination cone and spherical harmonic models. One drawback of the model-based approach is that a large number of face pictures under different lighting conditions, together with 3D model information, are required for training. Moreover, the human face is not a perfect Lambertian surface. These disadvantages limit the application of illumination variation modeling methods.
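As an illustration of the three illumination normalization operations named above, the following is a minimal NumPy sketch for 8-bit grayscale images. The function names, the default `gamma=0.5` and the rescaling of the logarithm to the 0-255 range are assumptions made for this example and are not taken from the patent.

```python
import numpy as np


def histogram_equalization(img):
    """Spread the grey-level histogram of an 8-bit image over the full range."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # CDF value at the darkest level present
    denom = max(img.size - cdf_min, 1)   # guard against flat images
    lut = np.round((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]                      # apply the lookup table per pixel


def gamma_correction(img, gamma=0.5):
    """Apply s = r ** gamma on intensities normalized to [0, 1]."""
    normalized = img.astype(np.float64) / 255.0
    return np.round(255.0 * normalized ** gamma).astype(np.uint8)


def log_transform(img):
    """Apply s = log(1 + r), rescaled so that 255 maps to 255."""
    scaled = np.log1p(img.astype(np.float64)) / np.log(256.0)
    return np.round(255.0 * scaled).astype(np.uint8)
```

With `gamma` below 1, dark regions are brightened; the logarithmic transform compresses bright regions in a similar way.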
Methods that extract illumination-invariant features are based on the simple idea of finding a face representation that does not change with illumination. They include gradient feature methods, the Quotient Image (QI) method and Retinex model-based methods. Typical gradient features are edge maps, image gradients and Local Binary Patterns (LBP). These features do not depend on a physical model and are simple to implement, but they improve recognition only to a limited extent and are effective only when the illumination change is not severe. The quotient image is the ratio of the test image to a composite image (a linear combination of three images taken under different illumination conditions); because this ratio depends only on the reflectance of the object under the Lambertian reflectance model, it can be regarded as an illumination invariant. The Retinex model is a simplified version of the Lambertian model. Compared with the quotient image method, Retinex-based methods need only one image to extract its illumination invariants, and the image does not need to be aligned. Retinex-based illumination preprocessing methods include Single Scale Retinex (SSR), Multi-Scale Retinex (MSR), Self-Quotient Image (SQI), Morphological Quotient Image (MQI) and the like.

The above face image normalization methods have certain limitations. For example, acquiring 3D data requires additional computation and resources; deriving a 3D model from 2D data is an ill-posed problem; and statistical illumination models are usually obtained in a constrained environment and do not generalize well to practical applications. More importantly, these methods can remove the influence of only one or two adverse factors. In practice, influencing factors such as illumination, angle, expression and occlusion usually interact, and the effect of one factor is usually correlated with the others. When only one or two factors are processed separately, the remaining factors make it difficult to achieve a practical effect. Therefore, a more effective face image normalization method is needed.

Disclosure of Invention

In view of the above, the object of the present invention is to provide a face image normalization method based on an adaptive multi-column deep model.

To achieve this object, the invention provides the following technical solution:

A face image normalization method based on an adaptive multi-column deep model comprises the following steps:

S1: establishing the adaptive multi-column deep model;

S2: training the adaptive multi-column deep model; the training includes S21: training the deep neural networks; S22: prediction of the training data weights; S23: training the weight prediction module;

S3: normalizing the target face image.

Further, the training of the deep neural networks in S21 includes S211: training the stacked sparse denoising autoencoder (SSDA); S212: training the deep convolutional neural network (DCNN).

Further, the training of the SSDA comprises the following steps:

S2111: the SSDA consists of K denoising autoencoders (DAs), so training the SSDA amounts to training each DA, and a DA can be trained by optimizing the following sparsity-regularized reconstruction loss function:

where y denotes a clean training image, x denotes the image obtained by contaminating y with noise, and $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ is the set of N training pairs; $\hat{y}(x)$ is the output of the DA, i.e. the reconstruction from x, which approximates y; Θ denotes the parameters to be optimized; λ, β and ρ are hyper-parameters; J is the number of hidden nodes; $\hat{\rho}$ is the vector of average output values of the hidden-layer nodes; the sparsity-inducing term $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between the target activation value ρ and the average activation value $\hat{\rho}_j$ of the j-th hidden node; by choosing a small target activation value ρ, the components of $\hat{\rho}$ are made as small as possible;

S2112: after the first DA is trained, the hidden-layer outputs for the clean image and the contaminated image are used as the clean image and the contaminated image, respectively, to train the second DA, and the process is repeated until all K DAs are trained;

S2113: the parameters of the entire network are fine-tuned using the standard back-propagation algorithm by minimizing the following loss function:

where $W^{(l)}$ denotes the parameters of layer l of the SSDA.

Further, the training of the DCNN comprises the following steps:

S2121: the parameters are initialized layer by layer:

where Y is the set of target (ground truth) images corresponding to the input image set X; $X^1$, $X^2$ and $X^3$ are the outputs of the three locally connected layers; O, P and Q are fixed binary matrices that sum the pixel values at the same position across the feature maps generated by $W^1 X$, $W^2 X^1$ and $W^3 X^2$, so that $O W^1 X$, $P W^2 X^1$ and $Q W^3 X^2$ have the same size as Y;

S2122: the parameters are updated with stochastic gradient descent by minimizing the following reconstruction error loss function:

where $W = \{W^1, W^2, W^3, W^4\}$.

Further, the prediction of the training data weights in S22 solves for the optimal weight vector $s = [s_1, \ldots, s_C]^T$ of the training data D, which is obtained by solving a quadratic optimization problem in $s$ over the combination of the C DNN output images.

Further, the weight prediction module trained in S23 is a Radial Basis Function (RBF) neural network, whose parameters are trained with a standard neural network parameter optimization method, taking the feature vector Φ as input and s as output. After step S22, a new training sample set (Φ, s) is generated, where $\Phi = [f_1, \ldots, f_C]$ is the feature vector obtained by concatenating the output values of the SSDA hidden-layer nodes or of the DCNN feature extraction layers, and $f_i$ is the vector of output values of the hidden layer (SSDA) or feature extraction layer (DCNN) of the i-th column DNN, $i = 1, \ldots, C$.

Further, the target face image normalization in S3 includes the schemes AMC-SSDA, AMC-DCNN and AMC-SSDA + DCNN.

The beneficial effects of the invention are as follows: the invention provides a face normalization method based on an adaptive multi-column deep model, which jointly corrects the various adverse factors affecting face recognition by linearly combining multiple columns of deep models, while adaptively computing the optimal weight of each column with a nonlinear optimization method, i.e., the correction weight for each factor is adjusted adaptively according to the input image. Compared with traditional methods that correct faces with a single deep neural network model, the method based on the adaptive multi-column deep model is more robust to the various variation factors.

Drawings

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:

FIG. 1 is a diagram of the adaptive multi-column deep model architecture;

FIG. 2 is a view of the SSDA architecture;

FIG. 3 is a diagram of the structure of DCNN.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

The invention provides a face image normalization method based on an adaptive multi-column deep model, which comprises the following steps:

Step one, establishing the adaptive multi-column deep model. The adaptive multi-column deep model is formed by linearly combining several Deep Neural Networks (DNNs), as shown in FIG. 1. Each DNN (one column in FIG. 1) removes the influence of a particular type of factor (one of illumination, angle, occlusion, expression and the like, or a combination of several of them), and its parameters are trained on data affected by that factor. Given an uncorrected input image, each DNN corrects the image for its factor and outputs a restored image. The weight prediction module predicts the optimal combination weights from the outputs (features) of the hidden-layer nodes of each DNN, so the weights are determined adaptively by the input image, and the final restored image is the weighted average of the DNN output images. The DNN may be a Stacked Sparse Denoising Autoencoder (SSDA), a Deep Convolutional Neural Network (DCNN), or a combination of the two. The SSDA treats a face picture affected by various factors as a noisy image, removes those factors by denoising, and recovers a frontal, unoccluded face with normal illumination and natural expression. The DCNN first extracts invariant features through a deep convolutional network and then performs face reconstruction, directly reconstructing a frontal, unoccluded face with normal illumination and natural expression.
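The data flow of step one can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each column is modeled as a hypothetical callable returning a restored image and its hidden-layer features, `predict_weights` stands in for the weight prediction module, and the weights are divided by their sum so that the result is a weighted average.

```python
import numpy as np


def combine_columns(column_outputs, weights):
    """Weighted average of the restored images produced by the C columns.

    column_outputs: array-like of shape (C, H, W)
    weights:        array-like of shape (C,)
    """
    outputs = np.asarray(column_outputs, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    if outputs.shape[0] != w.shape[0]:
        raise ValueError("one weight per column is required")
    total = w.sum()
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Contract the column axis: sum_c (w_c / total) * outputs[c]
    return np.tensordot(w / total, outputs, axes=1)


def normalize_face(image, columns, predict_weights):
    """Run every column, predict input-dependent weights, combine the outputs."""
    restored, features = [], []
    for column in columns:
        output, hidden = column(image)       # hypothetical column interface
        restored.append(output)
        features.append(np.ravel(hidden))
    phi = np.concatenate(features)           # concatenated feature vector
    weights = predict_weights(phi)           # adaptive: depends on the input
    return combine_columns(restored, weights)
```

Because the weights are computed from features of the input image itself, two different inputs generally receive different combinations of the same trained columns.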

Step two, training the model. The training process of the adaptive multi-column deep model comprises the training of the deep neural networks (SSDA, DCNN), the prediction of the training data weights, and the training of the weight prediction module, as follows:

Step 21, training the deep neural networks. The training of the deep neural networks includes training of the SSDA and training of the DCNN.

Step 211, training of the SSDA. The SSDA is composed of several Denoising Autoencoders (DAs). A DA is a two-layer neural network (an input layer, a hidden layer and an output layer) consisting of one encoder and one decoder. An SSDA formed by K DAs is a multi-layer neural network comprising an input layer, 2K-1 hidden layers and an output layer. Let the k-th DA comprise an encoder and a decoder; the SSDA can then be divided into an encoding part, running from the encoder of the first DA to the encoder of the K-th DA, and a decoding part, running from the decoder of the K-th DA back to the decoder of the first DA, as shown in FIG. 2.

Since the SSDA is composed of several DAs, training the SSDA amounts to training each DA. Let y denote a clean training image and x the image obtained by contaminating y with noise. Given N training pairs $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, a DA can be trained by optimizing the following sparsity-regularized reconstruction loss function:

Here $\hat{y}(x)$ is the output of the DA (the reconstruction from x), which approximates y; $\Theta = \{W, b, W', b'\}$ are the parameters to be optimized; λ, β and ρ are hyper-parameters; J is the number of hidden nodes; $\hat{\rho}$ is the vector of average output values of the hidden-layer nodes; and the sparsity-inducing term $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between the target activation value ρ and the average activation value $\hat{\rho}_j$ of the j-th hidden node. Choosing a small target activation value ρ makes the components of $\hat{\rho}$ as small as possible, so that during optimization the hidden-layer outputs are 0 most of the time, which yields sparsity. After the first DA is trained, the hidden-layer outputs for the clean image and the contaminated image are used as the clean image and the contaminated image, respectively, to train the second DA, and the process is repeated until all K DAs are trained. Finally, the parameters of the entire network are fine-tuned using the standard back-propagation algorithm by minimizing the following loss function:

where $W^{(l)}$ denotes the parameters of layer l of the SSDA.
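The two loss functions referred to in step 211 are not reproduced in this text. A standard sparse denoising autoencoder objective that is consistent with the symbols defined there would take the following form; the exact scaling of each term (the $1/N$ and $\lambda/2$ factors) and the notation $h(x_i)$ for the hidden-layer output are assumptions, not taken from the patent.

```latex
\begin{align}
L_{\mathrm{DA}}(D;\Theta) &= \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert y_i-\hat{y}(x_i)\bigr\rVert_2^2
  + \beta\sum_{j=1}^{J}\mathrm{KL}\bigl(\rho\,\|\,\hat{\rho}_j\bigr)
  + \frac{\lambda}{2}\Bigl(\lVert W\rVert_F^2+\lVert W'\rVert_F^2\Bigr),\\
\mathrm{KL}\bigl(\rho\,\|\,\hat{\rho}_j\bigr) &= \rho\log\frac{\rho}{\hat{\rho}_j}
  + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j},
\qquad \hat{\rho}=\frac{1}{N}\sum_{i=1}^{N}h(x_i),\\
L_{\mathrm{SSDA}}(D;\Theta) &= \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert y_i-\hat{y}(x_i)\bigr\rVert_2^2
  + \frac{\lambda}{2}\sum_{l=1}^{2K}\bigl\lVert W^{(l)}\bigr\rVert_F^2 .
\end{align}
```

The first line is the per-DA pre-training loss (reconstruction error, sparsity penalty, weight decay); the last line is the fine-tuning loss over the 2K weight matrices of the stacked network, where the sparsity term is dropped.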

Step 212, training of the DCNN. The DCNN is composed of a feature extraction layer and a reconstruction layer, as shown in FIG. 3. The feature extraction layer consists of 3 locally connected layers and 2 pooling layers and is used to extract invariant features. The locally connected layers filter the input image (or the feature map output by the previous layer) with the sparse weight matrices $W^1$, $W^2$ and $W^3$ to generate a feature map (or a new feature map), and the pooling layers downsample the feature maps output by the locally connected layers with the matrices $V^1$ and $V^2$, reducing the number of parameters to be learned while retaining the more robust features. The reconstruction layer is a fully connected layer that transforms the extracted features into a frontal face image under standard conditions through the weight matrix $W^4$.

The training of the DCNN is divided into two processes: initialization of the parameters and updating of the parameters. The parameters can be initialized layer by layer as shown in the following formula:

where Y is the set of target (ground truth) images corresponding to the input image set X; $X^1$, $X^2$ and $X^3$ are the outputs of the three locally connected layers; O, P and Q are fixed binary matrices that sum the pixel values at the same position across the feature maps generated by $W^1 X$, $W^2 X^1$ and $W^3 X^2$, so that $O W^1 X$, $P W^2 X^1$ and $Q W^3 X^2$ have the same size as Y. According to the above formula, the input picture X is first passed through the linear transformation $W^1$, without pooling, to approximate Y. Once $W^1$ has been initialized, it can be used to compute $X^1$; the second expression can then be used to initialize $W^2$, and the procedure is repeated until all parameter matrices are initialized.

After the parameters have been initialized, they are updated with stochastic gradient descent by minimizing the following reconstruction error loss function:

where $W = \{W^1, W^2, W^3, W^4\}$.
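The initialization and update formulas of step 212 are likewise not reproduced in this text. One form consistent with the description (each layer is initialized so that its summed feature maps approximate Y, and all layers are then updated jointly on the reconstruction error) is given below as a sketch; the least-squares form of each objective, the use of the Frobenius norm and the notation $\bar{Y}(X;W)$ for the network output are assumptions.

```latex
\begin{align}
W^{1} &\leftarrow \arg\min_{W^{1}} \bigl\lVert Y - O\,W^{1}X \bigr\rVert_F^2, \\
W^{2} &\leftarrow \arg\min_{W^{2}} \bigl\lVert Y - P\,W^{2}X^{1} \bigr\rVert_F^2, \\
W^{3} &\leftarrow \arg\min_{W^{3}} \bigl\lVert Y - Q\,W^{3}X^{2} \bigr\rVert_F^2, \\
W^{4} &\leftarrow \arg\min_{W^{4}} \bigl\lVert Y - W^{4}X^{3} \bigr\rVert_F^2, \\
E(X; W) &= \bigl\lVert Y - \bar{Y}(X; W) \bigr\rVert_F^2,
\qquad W = \{W^{1}, W^{2}, W^{3}, W^{4}\}.
\end{align}
```

The first four lines are solved in order, each using the layer output computed from the previously initialized matrices; the last line is the reconstruction error minimized by stochastic gradient descent.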

Step 22, prediction of the training data weights. This step solves for the optimal weight vector $s = [s_1, \ldots, s_C]^T$ of the training data D, which can be obtained by solving a quadratic optimization problem in $s$ over the combination of the C DNN output images.
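A quadratic problem of this kind can be sketched as a box-constrained least-squares fit of the clean image by the column outputs. The objective $\tfrac{1}{2}\lVert \hat{Y}s - y \rVert^2$, the bounds $0 \le s_c \le 1$ and the projected-gradient solver below are assumptions chosen for illustration; the source states only that a quadratic optimization problem is solved.

```python
import numpy as np


def optimal_weights(column_outputs, target, n_iter=500):
    """Approximate argmin_s 0.5 * ||Y_hat @ s - y||^2 subject to 0 <= s_c <= 1.

    column_outputs: list of C restored images (same shape as target)
    target:         the clean ground-truth image
    """
    # Each column output becomes one column of Y_hat, shape (pixels, C).
    Y_hat = np.stack([np.ravel(o) for o in column_outputs], axis=1).astype(np.float64)
    y = np.ravel(target).astype(np.float64)
    n_columns = Y_hat.shape[1]
    s = np.full(n_columns, 1.0 / n_columns)      # start from a plain average

    # Step size 1/L, where L is the largest eigenvalue of Y_hat^T Y_hat.
    lipschitz = np.linalg.norm(Y_hat, 2) ** 2
    if lipschitz == 0.0:
        return s
    step = 1.0 / lipschitz

    for _ in range(n_iter):
        grad = Y_hat.T @ (Y_hat @ s - y)         # gradient of the quadratic
        s = np.clip(s - step * grad, 0.0, 1.0)   # project back onto the box
    return s
```

Running this once per training image yields the target weight vectors s that the weight prediction module is subsequently trained to reproduce.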

Step 23, training the weight prediction module. After the training of the DNNs (SSDA or DCNN) and the weight prediction for the training data are completed, a new training sample set (Φ, s) can be generated, where $\Phi = [f_1, \ldots, f_C]$ is the feature vector obtained by concatenating the output values of all SSDA hidden-layer nodes (or DCNN feature extraction layers), and $f_i$, $i = 1, \ldots, C$, is the vector of output values of the hidden layer (SSDA) or feature extraction layer (DCNN) of the i-th column DNN. The weight prediction module is trained on these samples. It is a Radial Basis Function (RBF) neural network, whose parameters can be trained with a standard neural network parameter optimization method, taking the feature vector Φ as input and s as output.
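A small RBF network mapping feature vectors Φ to weight vectors s might look as follows. This sketch swaps in a simple closed-form fit in place of the unspecified "standard neural network parameter optimization method": Gaussian basis functions are centered on the training samples and the output layer is solved by ridge-regularized least squares. The class name, `gamma` and `ridge` are hypothetical.

```python
import numpy as np


class RBFWeightPredictor:
    """Gaussian RBF network: features Phi -> combination weights s."""

    def __init__(self, gamma=1.0, ridge=1e-8):
        self.gamma = gamma      # width parameter of the Gaussian basis
        self.ridge = ridge      # small regularizer for numerical stability
        self.centers = None
        self.out_weights = None

    def _design(self, Phi):
        # Squared Euclidean distance from every sample to every center.
        diff = Phi[:, None, :] - self.centers[None, :, :]
        return np.exp(-self.gamma * (diff ** 2).sum(axis=2))

    def fit(self, Phi, S):
        Phi = np.atleast_2d(np.asarray(Phi, dtype=np.float64))
        S = np.asarray(S, dtype=np.float64)
        self.centers = Phi.copy()                # assumption: one center per sample
        G = self._design(Phi)
        A = G.T @ G + self.ridge * np.eye(G.shape[1])
        self.out_weights = np.linalg.solve(A, G.T @ S)
        return self

    def predict(self, Phi):
        Phi = np.atleast_2d(np.asarray(Phi, dtype=np.float64))
        return self._design(Phi) @ self.out_weights
```

At test time, `predict` is called on the concatenated hidden-layer features of a new image, and the resulting weights are used to combine the column outputs.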

Step three, normalizing the target face image. Depending on which deep neural networks are used and how they are combined, three target face image normalization schemes are available:

(1) AMC-SSDA. The basic idea of face correction based on the adaptive multi-column stacked sparse denoising autoencoder (AMC-SSDA) is to regard a face image affected by illumination change, expression change, pose change, occlusion and the like as a noise-contaminated image, and the frontal face image under standard conditions as the clean image. Several stacked sparse denoising autoencoders (SSDAs) are trained for denoising; each SSDA corresponds to one influencing factor or a combination of influencing factors and removes the influence of that factor or those factors. All SSDAs are then linearly combined to form the AMC-SSDA, which performs joint denoising, removing the noise introduced by illumination change, expression change, pose change and occlusion and thereby correcting the face.

(2) AMC-DCNN. Replacing the SSDAs in the AMC-SSDA with DCNNs yields the AMC-DCNN. Unlike the SSDA, which corrects the face by denoising, each DCNN is trained to reconstruct the frontal face image under standard conditions, and the multi-column linear combination of the DCNNs performs joint reconstruction to normalize the face image.

(3) AMC-SSDA + DCNN. The columns consist of both SSDAs and DCNNs, and their outputs are linearly combined to normalize the face image.

The deep neural network selected by the model in the embodiment of the present invention is SSDA or DCNN, but in practical cases, the model is not limited to SSDA or DCNN.

Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims

1. A face image normalization method based on an adaptive multi-column deep model, characterized by comprising the following steps:

S1: establishing the adaptive multi-column deep model; the adaptive multi-column deep model is formed by linearly combining a plurality of Deep Neural Networks (DNNs), each DNN is used for removing the influence of a specific type of factor, including one or more of illumination change, expression change, pose change and occlusion, and the parameters of the DNN are trained on data affected by that factor; an uncorrected image is input, each DNN corrects the image for its factor and outputs a restored image, the weight prediction module predicts the optimal combination weights according to the outputs of the hidden-layer nodes of each DNN, the weights being determined adaptively by the input image, and the final restored image is the weighted average of the DNN output images; wherein the deep neural network DNN is a Stacked Sparse Denoising Autoencoder (SSDA), a Deep Convolutional Neural Network (DCNN), or a combination of the two; the SSDA treats the face image affected by various factors as a noisy image, removes those factors by denoising, and recovers a frontal, unoccluded face with normal illumination and natural expression; the DCNN first extracts invariant features through a deep convolutional network and then performs face reconstruction, directly reconstructing a frontal, unoccluded face with normal illumination and natural expression;

S2: training the adaptive multi-column deep model;

the training includes S21: training the deep neural networks DNN; S22: prediction of the training data weights; S23: training the weight prediction module;

S3: normalizing the target face image;

the training of the deep neural networks in S21 includes S211: training the stacked sparse denoising autoencoder SSDA; S212: training the deep convolutional neural network DCNN;

the training of the SSDA comprises the following steps:

S2111: the SSDA is composed of a plurality of Denoising Autoencoders (DAs); a DA is a two-layer neural network with an input layer, a hidden layer and an output layer, and consists of 1 encoder and 1 decoder; an SSDA formed by K DAs is a multi-layer neural network comprising an input layer, 2K-1 hidden layers and an output layer; the k-th DA comprises an encoder and a decoder, and the SSDA is divided into an encoding part, running from the encoder of the first DA to the encoder of the K-th DA, and a decoding part, running from the decoder of the K-th DA back to the decoder of the first DA; since the SSDA consists of K denoising autoencoders DA, training the SSDA amounts to training each DA, and a DA can be trained by optimizing the following sparsity-regularized reconstruction loss function:

wherein $X = \{x_1, x_2, \ldots, x_N\}$ denotes the set of input images affected by various factors, and $Y = \{y_1, y_2, \ldots, y_N\}$ denotes the set of target (ground truth) images corresponding to the input image set; the N training pairs are $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$; $\hat{y}(x)$ is the output of the DA; $\Theta = \{W_S, b, W_S', b'\}$ are the parameters to be optimized; λ, β and ρ are hyper-parameters; J is the number of hidden nodes; $\hat{\rho}$ is the vector of average output values of the hidden-layer nodes; the sparsity-inducing term $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the Kullback-Leibler divergence between the target activation value ρ and the average activation value $\hat{\rho}_j$ of the j-th hidden node; by choosing a small target activation value ρ, the components of $\hat{\rho}$ are made as small as possible;

wherein $W^{(l)}$ denotes the parameters of layer l of the SSDA;

S2121: the DCNN comprises a feature extraction layer and a reconstruction layer, wherein the feature extraction layer comprises 3 locally connected layers and 2 pooling layers and is used for extracting invariant features; the locally connected layers filter the input image, or the feature map output by the previous layer, with the sparse weight matrices $W^1$, $W^2$ and $W^3$ to generate a feature map or a new feature map, and the pooling layers downsample the feature maps output by the locally connected layers with the matrices $V^1$ and $V^2$, reducing the number of parameters to be learned while retaining the more robust features; the reconstruction layer is a fully connected layer that converts the extracted features into a frontal face image under standard conditions through the weight matrix $W^4$; the parameters are initialized layer by layer:

wherein $X^1$, $X^2$ and $X^3$ are the outputs of the three locally connected layers; O, P and Q are fixed binary matrices that sum the pixel values at the same position across the feature maps generated by $W^1 X$, $W^2 X^1$ and $W^3 X^2$, so that $O W^1 X$, $P W^2 X^1$ and $Q W^3 X^2$ have the same size as Y;

S2122: the parameters are updated with stochastic gradient descent by minimizing the following reconstruction error loss function:

wherein $W = \{W^1, W^2, W^3, W^4\}$; the prediction of the training data weights in S22 solves for the optimal weight vector $s = [s_1, \ldots, s_C]^T$ of the training data D, which is obtained by solving a quadratic optimization problem in $s$ over the combination of the output images of the C DNNs; the weight prediction module trained in S23 is a Radial Basis Function (RBF) neural network, whose parameters are trained with a neural network parameter optimization method, taking the feature vector Φ as input and s as output; after step S22, a new training sample set (Φ, s) is generated, where $\Phi = [f_1, \ldots, f_C]$ is the feature vector obtained by concatenating the output values of the SSDA hidden-layer nodes or of the DCNN feature extraction layers;

the target face image normalization in S3 comprises the schemes AMC-SSDA, AMC-DCNN and AMC-SSDA + DCNN; in AMC-SSDA, face correction based on the adaptive multi-column stacked sparse denoising autoencoder regards a face image affected by illumination change, expression change, pose change and occlusion as a noise-contaminated image and the frontal face image under standard conditions as the clean image, a plurality of stacked sparse denoising autoencoders (SSDAs) are trained for denoising, each SSDA corresponds to one influencing factor or a combination of influencing factors and removes the influence of that factor or those factors, and all SSDAs are then linearly combined to form the AMC-SSDA, which performs joint denoising, removing the noise introduced by illumination change, expression change, pose change and occlusion and thereby correcting the face; in AMC-DCNN, the SSDAs in the AMC-SSDA are replaced with DCNNs, and, unlike the SSDA, which corrects the face by denoising, each DCNN is trained to reconstruct the frontal face image under standard conditions, the multi-column linear combination of the DCNNs performing joint reconstruction to normalize the face image; in AMC-SSDA + DCNN, the columns consist of both SSDAs and DCNNs, whose outputs are linearly combined to normalize the face image.