CN111428661A - Method for processing face image based on intelligent human-computer interaction - Google Patents
Method for processing face image based on intelligent human-computer interaction
- Publication number
- CN111428661A (application number CN202010232764.9A)
- Authority
- CN
- China
- Prior art keywords
- face
- facial
- predicted
- image
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
A method for processing face images based on intelligent human-computer interaction, belonging to the field of artificial intelligence. The facial features related to user information are divided into three aspects — overall facial features, expression features and visual tracking — and three-dimensional face information is obtained by analyzing the three in combination. The invention binds specific actions in the user's facial features to behavior information, so that human-computer interaction is realized through facial feature information. The captured face is subjected to facial region analysis to locate the basic facial features, visual tracking is performed on the eye region, and facial expressions are bound to behavior information to realize human-computer interaction, which reduces the selection error of facial recognition points and improves the accuracy of face recognition.
Description
Technical Field
The invention belongs to the field of artificial intelligence and relates to a method for processing face images, in particular to a method that acquires face images and video information in real time and performs face point detection, expression recognition and visual tracking.
Background
Human-computer interaction is the process of information exchange between humans and systems, and it can take place in many different types of systems. The earliest human-computer interaction was realized by manually entering machine-language instructions, with computer language as the interaction medium. With the development of graphics processing, the interaction medium gradually shifted to graphical interfaces, which offer a better interaction experience, make interaction more convenient for the user, and feed more accurate information back to the user. With the development of related technologies such as pervasive computing and deep learning, the variety of human-computer interaction has been continuously enriched: the interaction media have diversified, the main methods have grown to include speech recognition, gesture recognition, tracking and the like, the interaction forms have become diverse, and the amount of information that can be transmitted has greatly increased.
As face analysis technology matures, facial features are being widely applied in various settings, bringing the problem of face image processing to the forefront. The invention performs face point detection, expression recognition and visual tracking by locating facial feature regions, and applies the three methods in combination, overcoming the individual limitations of each method and realizing an intelligent human-computer interaction process.
Face point detection and recognition is one of the important applications in the field of deep learning. Face point recognition refers to locating key points on a face after the face has been detected: once the facial feature region is located, the data are preprocessed, features are extracted with a recognition algorithm, and face recognition is completed; the process is shown in Fig. 1. With the rapid development of deep learning, face point recognition technology has steadily matured.
Expression recognition is an important direction for computers to understand human emotion and an important aspect of human-computer interaction: by analyzing expressions, the system can capture user information and make decisions, directly judging the user's emotional and psychological state. Convolutional neural networks are commonly used for this analysis; through their multiple convolutional and pooling layers, high-level, multi-level features of the whole face or of local regions can be extracted, yielding good expression image features. Experience shows that convolutional neural networks outperform other types of neural networks in image recognition, so the best expression recognition results can be achieved with a convolutional neural network.
Visual tracking is another important aspect of the human-computer interaction process. It makes it convenient to observe the user's focus, facilitates analysis of the user's region of interest, and allows the user's selections and preferences to be analyzed. The human eye serves as an input source for the computer: by tracking the user's line of sight, the gaze range is determined and the corresponding human-computer interaction is completed.
Taken together, the above analysis shows that each of the three methods addresses only one aspect of face recognition, so applying any one of them alone has shortcomings that lead to recognition errors. The invention therefore proposes a new face recognition method combining the three: facial region analysis is performed on the captured face to locate the basic facial features, visual tracking is performed on the eye region, and facial expressions are bound to behavior information to realize human-computer interaction, reducing the selection error of facial recognition points and improving the accuracy of face recognition.
Disclosure of Invention
The invention aims to solve the problem of processing a face image, and provides a method for realizing face image processing by acquiring face images and video information in real time, and performing face point detection, expression recognition and visual tracking.
The method for implementing the invention is described as follows:
the invention discloses a face image processing technology based on convolutional neural networks to realize intelligent human-computer interaction; the specific implementation comprises the following three processes:
(I) Method for detecting face points
The method adopts a three-layer convolutional neural network structure: the first-layer convolutional neural network is established with an absolute-value rectification and parameter-sharing mechanism, and the second-layer and third-layer convolutional neural networks are obtained with a multi-level regression idea. Because the face pose changes greatly, detection is unstable and the relative positions of face points and detection points may vary over a large range, producing a large relative error; the input area of the first-level network is therefore chosen large, so as to cover as many predicted positions as possible. The output of the first-level network constrains the selection of subsequent detection areas, so the second-level and third-level detection areas shrink correspondingly. The selection condition for a detection area is: a circular area containing 75% of all predicted positions obtained by the previous network, centered at the point where the density of the previous network's predicted positions is highest.
The predicted positions are obtained again in the new detection area, and the process is repeated many times until the detection area shrinks to 1% of the first-level detection area; the predicted positions so obtained are the predicted positions of each point, and several networks with different input areas are thereby obtained at each level.
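For illustration only, the following is a minimal numpy sketch of this region-selection rule; the k-nearest-neighbor density estimate used to locate the densest point is an assumption, since the patent does not specify how that point is found.

```python
import numpy as np

def next_detection_region(preds, keep_frac=0.75, k=10):
    """Shrink the detection area from the previous network's predictions.

    preds: (N, 2) array of predicted (x, y) face-point positions.
    Returns (center, radius) of the circular area containing keep_frac of
    the predictions, centered where the prediction density is highest.
    """
    # Pairwise distances; density ~ inverse of distance to the k-th neighbor.
    d = np.linalg.norm(preds[:, None, :] - preds[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, min(k, len(preds) - 1)]
    center = preds[np.argmin(kth)]          # densest predicted position
    # Smallest radius whose circle contains keep_frac of all predictions.
    r = np.sort(np.linalg.norm(preds - center, axis=1))
    radius = r[int(np.ceil(keep_frac * len(preds))) - 1]
    return center, radius
```

The driving loop would call this repeatedly, re-running the networks inside each new region, until the region's area falls to 1% of the first-level detection area.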
The final predicted position x of a face point can be formally expressed as a cascade of n levels; the mathematical expression of the predicted position x is as follows:

$$x = \frac{1}{l_1}\sum_{k=1}^{l_1} x_k^{(1)} \;+\; \sum_{i=2}^{n} \frac{1}{l_i}\sum_{k=1}^{l_i} \Delta x_k^{(i)}$$

where x is the predicted position, $l_i$ is the number of predicted positions at level i, and the predicted positions at level i are denoted $x_1^{(i)}, \ldots, x_{l_i}^{(i)}$, i.e. $x_1^{(1)}$ is the first predicted position at level 1; $\Delta x_k^{(i)}$ denotes the change of the k-th of the $l_i$ predicted positions at level i relative to the corresponding prediction at level i−1.
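As a small sketch of how the cascade terms combine (assuming, as a data-layout convention, that level 1 supplies absolute positions and levels 2..n supply the corrections Δx):

```python
import numpy as np

def cascade_position(level1_preds, corrections):
    """Combine cascade predictions into the final face-point position x.

    level1_preds: (l_1, 2) absolute predictions from the level-1 networks.
    corrections:  list over levels i = 2..n of (l_i, 2) arrays holding the
                  changes Δx_k^(i) predicted at level i.
    """
    x = level1_preds.mean(axis=0)       # first term: average over level 1
    for delta in corrections:           # second term: levels 2..n
        x = x + delta.mean(axis=0)      # average correction per level
    return x
```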
The method adopts a three-layer convolutional neural network design. The first layer comprises three deep convolutional networks with different detection areas: the F1 network's detection area covers the whole face, the EN1 network's covers only the eyes and nose, and the NM1 network's covers only the nose and mouth. The three networks predict different areas of the same face simultaneously using the prediction method above; averaging their predicted values reduces the variance and yields the first-layer predicted positions of the facial feature points, avoiding prediction results that deviate from reality because a local feature is too prominent. Following the regression idea used for the first-layer predicted positions, corresponding second-layer and third-layer predicted positions are obtained for each of the three networks F1, EN1 and NM1. Because the input areas of the second and third levels are strictly limited by the first level's prediction results, the predicted positions of the second and third levels can reach very high precision, but are also strictly constrained.
(II) Facial expression recognition method
In the facial expression recognition method, an end-to-end learning model is provided: the model synthesizes face images from the two angles of pose and expression, and performs facial expression recognition with the pose held fixed. The model consists of a generator, two discriminators and a classifier. Before an image enters the model it is preprocessed, applying a face detection algorithm with a base library containing 68 landmark points. After preprocessing, the face image is input to the generator G to produce an identity representation: there exists a rule f(x) such that each input image has a definite and unique identity representation, which is then concatenated with an expression code e and a pose code p to represent the changes of the face. By applying a min-max (adversarial) algorithm between the generator G and the discriminator D, and adding the corresponding labels at the decoder input, newly labeled face images with different poses and expressions are obtained. Two discriminator structures are used, denoted Datt and Di: the discriminator Datt evaluates the disentanglement of the identity representation, and Di improves the quality of the generated images. After the face image is synthesized, the classifier Cexp completes the facial expression recognition task; specifically, a deep learning algorithm is applied in the classifier, keeping the feature information of each facial expression while the key classification factors gradually stabilize across the representation layers.
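A minimal PyTorch-style skeleton of this generator / two-discriminator / classifier layout is sketched below; every layer size, the code widths of the identity, expression and pose vectors, and the seven-class output of Cexp are assumptions for illustration, as the patent does not specify the architecture.

```python
import torch
import torch.nn as nn

IMG, IDC, EXPC, POSEC = 64, 64, 8, 8   # assumed image size and code widths

class Generator(nn.Module):
    """Encoder f(x) -> identity code; decoder([id, e, p]) -> face image."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),   # 32 -> 16
            nn.Flatten(), nn.Linear(64 * 16 * 16, IDC))
        self.dec = nn.Sequential(
            nn.Linear(IDC + EXPC + POSEC, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh())

    def forward(self, x, e, p):
        identity = self.enc(x)                  # unique identity code f(x)
        z = torch.cat([identity, e, p], dim=1)  # concatenate id with e and p
        return self.dec(z), identity

def conv_head(out_dim):
    return nn.Sequential(
        nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Flatten(), nn.Linear(64 * 16 * 16, out_dim))

G = Generator()
D_att = conv_head(EXPC + POSEC)  # judges pose/expression attributes (disentanglement)
D_i = conv_head(1)               # real/fake score, improves image quality
C_exp = conv_head(7)             # expression classifier (7 classes assumed)

x = torch.randn(4, 1, IMG, IMG)  # a batch of preprocessed face crops
e, p = torch.randn(4, EXPC), torch.randn(4, POSEC)
fake, idc = G(x, e, p)
print(fake.shape, D_i(fake).shape, C_exp(fake).shape)
```

Training would alternate min-max updates between G and the two discriminators, with Cexp trained on the synthesized, pose-normalized faces.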
(III) Visual tracking method
The invention adopts a detection-based tracking algorithm: the image gradient vector field of the captured face is analyzed, and the relation between a possible center and the directions of the image gradients is described by a mathematical method. A possible center c is set; at each pixel position $x_j$ a gradient vector $g_j$ is given, and through normalization the displacement $d_j$ from c to $x_j$ has the same direction as the gradient when c is the true center. By evaluating the normalized displacements $d_j$ relative to the center position together with the gradient vectors $g_j$, the optimal center $c^*$ of the eye region in the face image, i.e. the pupil center position, is obtained as:

$$c^{*} = \arg\max_{c}\; \frac{1}{N}\sum_{j=1}^{N}\left(d_j^{\top} g_j\right)^{2}, \qquad d_j = \frac{x_j - c}{\lVert x_j - c\rVert_{2}}$$

For a possible center c, N different gradients are selected, corresponding to the normalized displacements $d_1,\ldots,d_N$ and gradient vectors $g_1,\ldots,g_N$; that is, the normalized displacement corresponding to the j-th gradient is $d_j$ and its gradient vector is $g_j$. When the objective function attains its maximum, the corresponding position variable is the optimal center position $c^*$. The displacement $d_j$ is scaled to unit length so that different positions in the face image receive the same weight; to improve the robustness of the method to linear changes of illumination and contrast, the gradient vector $g_j$ is also scaled to unit length. The objective function then attains its maximum at the pupil center position. In addition, considering only gradient vectors with significant magnitude reduces the complexity of the algorithm.
The method provided by the invention has the following advantages when being applied to the field of face image processing:
the invention realizes man-machine interaction by identifying facial features.
The facial features related to the user information are divided into overall facial features, expression features and visual tracking, and three-dimensional face information can be obtained through combined analysis of the overall facial features, the expression features and the visual tracking.
And thirdly, binding specific actions in the facial features of the user into the behavior information, realizing man-machine interaction through the facial feature information and reducing the selection error of the facial recognition points.
Drawings
FIG. 1 is a schematic diagram of a human-computer interaction processing process based on face detection;
FIG. 2 is a block diagram of the model;
Detailed Description
The method for processing face images based on intelligent human-computer interaction according to the present invention will now be described in further detail with reference to the model block diagram shown in FIG. 2 and an implementation example.
The invention discloses a method for processing face images based on intelligent human-computer interaction, realized by acquiring face images and video information in real time and performing face point detection, expression recognition and visual tracking. The general working framework of the invention as applied to the field of face recognition is as follows: a face is captured; its key points are located by the three-layer convolutional neural network; facial expressions are recognized with the end-to-end deep learning model, using different poses and expressions for face image synthesis and expression recognition; and the eye center is located from image gradients to realize visual tracking. After these three features are obtained, they are combined in a three-layer neural network for training, so that the machine responds reasonably according to the combined features.
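Purely as an illustration of this data flow, the sketch below strings the three components together; face_points, expression, eye_centers and combine_net are hypothetical stand-ins for the cascade CNN, the GAN-based recognizer, the gradient-based tracker and the final combining network, none of which are named in the patent.

```python
def process_frame(frame, face_points, expression, eye_centers, combine_net):
    """One frame of the overall pipeline: detect, recognize, track, combine."""
    landmarks = face_points(frame)             # (I) cascade CNN face points
    expr = expression(frame, landmarks)        # (II) expression recognition
    gaze = eye_centers(frame, landmarks)       # (III) visual tracking
    return combine_net(landmarks, expr, gaze)  # fused response to the user
```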
The technical scheme and implementation steps of the invention — (I) the face point detection method, (II) the facial expression recognition method and (III) the visual tracking method — are as set forth in the Disclosure of Invention above.
Claims (2)
1. A method for processing face images based on intelligent human-computer interaction is characterized in that the specific implementation method comprises the following three processes:
(I) Method for detecting face points
Adopting a three-layer convolutional neural network structure: establishing the first-layer convolutional neural network with an absolute-value rectification and parameter-sharing mechanism, and obtaining the second-layer and third-layer convolutional neural networks with a multi-level regression idea;
the input area of the first-level network is selected to be large so as to cover as many predicted positions as possible; the second-level and third-level detection areas are correspondingly reduced, the selection condition of a detection area being: a circular area containing 75% of all predicted positions obtained by the previous network, centered at the point where the density of the previous network's predicted positions is highest;
the prediction position is obtained again in the new prediction area, the process is repeated for many times until the detection area is reduced to 1% of the first-stage detection area, the obtained prediction position is the prediction position of each point, and then a plurality of networks of different input areas of each level are obtained;
the final predicted position x of the face point is represented as a cascade of n levels, the mathematical representation of the predicted position x being:

$$x = \frac{1}{l_1}\sum_{k=1}^{l_1} x_k^{(1)} \;+\; \sum_{i=2}^{n} \frac{1}{l_i}\sum_{k=1}^{l_i} \Delta x_k^{(i)}$$

where x is the predicted position, $l_i$ is the number of predicted positions at level i, the predicted positions at level i are denoted $x_1^{(i)},\ldots,x_{l_i}^{(i)}$, i.e. $x_1^{(1)}$ is the first predicted position at level 1, and $\Delta x_k^{(i)}$ denotes the change of the k-th of the $l_i$ predicted positions at level i relative to the corresponding prediction at level i−1;
(II) Facial expression recognition method
in the facial expression recognition method, an end-to-end learning model is provided: the model synthesizes face images from the two angles of pose and expression, and performs facial expression recognition with the pose held fixed; the model consists of a generator, two discriminators and a classifier;
preprocessing is carried out before the image is transmitted into the model, and a face detection algorithm is applied to a basic library containing 68 mark points for face detection; after preprocessing, inputting facial images into a generator G to generate identity marks, wherein each input image has a determined and unique identity mark, and then cascading the identity marks with an expression code e and a posture code p to express the change of the face; applying a maximum and minimum algorithm between the generator G and the discriminator D, and adding corresponding labels at the input end of a decoder to obtain new labels of face images with different postures and expressions;
two discriminator structures are respectively represented by Datt and Di, wherein the discriminator Datt is used for identifying and representing the entanglement degree of the mark, and the Di is used for improving the quality of the generated image; after the face image is synthesized, the classifier Cexp is used for completing the facial expression recognition task of the face image;
(III) Visual tracking method
adopting a detection-based tracking algorithm: analyzing the image gradient vector field of the captured face and describing by a mathematical method the relation between a possible center and the directions of the image gradients; a possible center c is set, a gradient vector is given at each position, and through normalization the displacement has the same direction as the gradient;
by evaluating the normalized displacements $d_j$ relative to the center position together with the gradient vectors $g_j$, the optimal center $c^*$ of the eye region in the face image, i.e. the pupil center position, is obtained as

$$c^{*} = \arg\max_{c}\; \frac{1}{N}\sum_{j=1}^{N}\left(d_j^{\top} g_j\right)^{2}, \qquad d_j = \frac{x_j - c}{\lVert x_j - c\rVert_{2}};$$

for a possible center c, N different gradients are selected, corresponding to the normalized displacements $d_1,\ldots,d_N$ and gradient vectors $g_1,\ldots,g_N$, i.e. the normalized displacement corresponding to the j-th gradient is $d_j$ and its gradient vector is $g_j$; when the objective function attains its maximum, the corresponding position variable is the optimal center position $c^*$; the displacement $d_j$ is scaled to unit length so that different positions in the face image receive the same weight, and the gradient vector $g_j$ is also scaled to unit length, whereby the objective function attains its maximum at the pupil center position.
2. The method of claim 1, wherein:
the first layer of convolutional neural network comprises three deep convolutional networks with different detection areas, wherein F1 network detection areas cover the whole face, EN1 network detection areas only cover the eyes and nose areas, and NM1 network detection areas only cover the nose and mouth areas; the three networks simultaneously predict different areas of the same face, the obtained predicted values of the three networks are averaged, and the corresponding second-layer predicted positions and the corresponding third-layer predicted positions are obtained respectively according to the three networks F1, FN1 and NM1 by adopting a regression idea according to the first-layer predicted positions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010232764.9A CN111428661A (en) | 2020-03-28 | 2020-03-28 | Method for processing face image based on intelligent human-computer interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010232764.9A CN111428661A (en) | 2020-03-28 | 2020-03-28 | Method for processing face image based on intelligent human-computer interaction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111428661A true CN111428661A (en) | 2020-07-17 |
Family
ID=71549134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010232764.9A Pending CN111428661A (en) | 2020-03-28 | 2020-03-28 | Method for processing face image based on intelligent human-computer interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111428661A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150169938A1 (en) * | 2013-12-13 | 2015-06-18 | Intel Corporation | Efficient facial landmark tracking using online shape regression method |
CN105868689A (en) * | 2016-02-16 | 2016-08-17 | 杭州景联文科技有限公司 | Cascaded convolutional neural network based human face occlusion detection method |
CN108875624A (en) * | 2018-06-13 | 2018-11-23 | 华南理工大学 | Method for detecting human face based on the multiple dimensioned dense Connection Neural Network of cascade |
Non-Patent Citations (3)
Title |
---|
FABIAN TIMM等: "Accurate Eye Centre Localisation by Means of Gradients", 《HTTPS://WWW.RESEARCHGATE.NET/PUBLICATION/221415814》 * |
FEIFEI ZHANG等: "Joint Pose and Expression Modeling for Facial Expression Recognition", 《CVPR》 * |
YI SUN等: "Deep Convolutional Network Cascade for Facial Point Detection", 《HTTP://MMLAB.IE.CUHK.EDU.HK/CNN FACEPOINT.HTM》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Boulahia et al. | Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition | |
Song et al. | Constructing stronger and faster baselines for skeleton-based action recognition | |
EP3559860B1 (en) | Compact language-free facial expression embedding and novel triplet training scheme | |
Du et al. | Representation learning of temporal dynamics for skeleton-based action recognition | |
Várkonyi-Kóczy et al. | Human–computer interaction for smart environment applications using fuzzy hand posture and gesture models | |
Geetha et al. | A vision based dynamic gesture recognition of indian sign language on kinect based depth images | |
Shi et al. | Improving CNN performance accuracies with min–max objective | |
CN108804453A (en) | A kind of video and audio recognition methods and device | |
Kollias et al. | On line emotion detection using retrainable deep neural networks | |
Yang et al. | Facial expression recognition based on dual-feature fusion and improved random forest classifier | |
Tur et al. | Evaluation of hidden markov models using deep cnn features in isolated sign recognition | |
Li et al. | Visual object tracking via multi-stream deep similarity learning networks | |
Al Farid et al. | Single Shot Detector CNN and Deep Dilated Masks for Vision-Based Hand Gesture Recognition From Video Sequences | |
Ghaleb et al. | Multimodal fusion based on information gain for emotion recognition in the wild | |
CN113076916A (en) | Dynamic facial expression recognition method and system based on geometric feature weighted fusion | |
CN111814604A (en) | Pedestrian tracking method based on twin neural network | |
Xu et al. | Emotion recognition research based on integration of facial expression and voice | |
CN111428661A (en) | Method for processing face image based on intelligent human-computer interaction | |
Xie et al. | Towards Hardware-Friendly and Robust Facial Landmark Detection Method | |
CN113887509B (en) | Rapid multi-modal video face recognition method based on image set | |
Deramgozin et al. | Attention-enabled lightweight neural network architecture for detection of action unit activation | |
Zhou et al. | ULME-GAN: a generative adversarial network for micro-expression sequence generation | |
Srinivas et al. | E-CNN-FFE: An Enhanced Convolutional Neural Network for Facial Feature Extraction and Its Comparative Analysis with FaceNet, DeepID, and LBPH Methods | |
CN116682168B (en) | Multi-modal expression recognition method, medium and system | |
Ameer et al. | Deep Transfer Learning for Lip Reading Based on NASNetMobile Pretrained Model in Wild Dataset |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200717 |