
CN114582002B - Facial expression recognition method combining attention module and second-order pooling mechanism - Google Patents


Info

Publication number
CN114582002B
CN114582002B
Authority
CN
China
Prior art keywords
face
image
facial expression
coordinates
layer
Prior art date
Legal status
Active
Application number
CN202210403298.5A
Other languages
Chinese (zh)
Other versions
CN114582002A (en)
Inventor
周婷 (Zhou Ting)
陈劲全 (Chen Jinquan)
余卫宇 (Yu Weiyu)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202210403298.5A
Publication of CN114582002A
Application granted
Publication of CN114582002B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method combining an attention module and a second-order pooling mechanism, and relates to the field of deep learning. The method comprises the following steps: acquiring a face image; preprocessing the face image, wherein the preprocessing comprises face detection and alignment, data enhancement and image normalization; and extracting features from the preprocessed face image to complete expression classification. The invention effectively converts face pictures disturbed by illumination, head pose, occlusion and other factors in natural environments into front-facing pictures with proper contrast and no occlusion, thereby alleviating the interference that expression-independent factor variables in real-world environments cause to expression recognition.

Description

Facial expression recognition method combining attention module and second-order pooling mechanism
Technical Field
The invention relates to the field of deep learning, in particular to a facial expression recognition method combining an attention module and a second-order pooling mechanism.
Background
Facial expression is an important channel for conveying information in human communication. Advances in facial expression recognition can effectively promote related fields such as pattern recognition and image processing, and the technology has high research value; its application scenarios include security monitoring, fatigue-driving monitoring, criminal investigation and human-computer interaction. With the rapid growth of large-scale image data and computer hardware (especially GPUs), deep learning has achieved breakthrough results in image understanding: deep neural networks have strong feature expression capability, can learn discriminative features, and are gradually being applied to automatic facial expression recognition tasks. According to the type of data processed, deep facial expression recognition methods can be roughly divided into two main categories: static-image-based and video-based deep facial expression recognition networks.
Current advanced static-image-based deep facial expression recognition methods mainly include diversified network inputs, cascade networks, multi-task networks, multi-network fusion and generative adversarial networks, while video-based deep facial expression recognition mainly uses basic temporal networks such as LSTM and C3D to analyze the temporal information carried in video sequences, or uses facial key point trajectories to capture the dynamic changes of face components in consecutive frames, fusing spatial and temporal networks in parallel. In addition, by combining other expression models such as facial action units and other multimedia modalities such as audio and human physiological signals, expression recognition can be extended to scenarios of greater practical value.
Since 2013, expression recognition competitions such as FER2013 and EmotiW have collected relatively abundant training samples from challenging real-world scenes, facilitating the transition of facial expression recognition from laboratory-controlled environments to natural environments. In terms of study objects, the field is rapidly developing from posed laboratory expressions to spontaneous real-world expressions, from long-lasting exaggerated expressions to transient micro-expressions, and from basic expression classification to complex expression analysis.
As facial expression recognition tasks gradually shift from laboratory-controlled environments to challenging real-world environments, current deep facial expression recognition systems must address several issues:
1) overfitting caused by a lack of sufficient training data;
2) interference caused by expression-independent factor variables (such as illumination, head pose and identity) in real-world environments;
3) how to improve the recognition accuracy of facial expression recognition systems in real environments.
Disclosure of Invention
In view of the above, the present invention provides a facial expression recognition method combining an attention module and a second order pooling mechanism to solve the above-mentioned problems in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a facial expression recognition method combining an attention module and a second-order pooling mechanism comprises the following steps:
Acquiring a face image;
Preprocessing a face image, wherein the face image preprocessing comprises face detection and alignment, data enhancement and image normalization;
And extracting the characteristics of the preprocessed face image, and finishing expression classification.
Optionally, the face detection and alignment includes face detection, key point positioning and face alignment, specifically:
the face detection module takes the facial expression picture as input and outputs the detected face region;
the face key point coordinates are located within the detected face region: a face key point detection interface in the dlib library loads a five-point key point detection model to obtain the coordinates of the five facial key points;
and face alignment is carried out using the five key point coordinates.
Optionally, the face alignment calculation process is as follows:
First, the center coordinates of the left and right eyes are calculated from the four eye-corner coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4):
(x_le, y_le) = ((x1 + x2)/2, (y1 + y2)/2), (x_re, y_re) = ((x3 + x4)/2, (y3 + y4)/2)
After the two eye centers are obtained, they are connected and the angle θ between this line and the horizontal is calculated; the rotation center (x_center, y_center) is then obtained by averaging the three points given by the left eye center, the right eye center and the under-nose point (x5, y5):
x_center = (x_le + x_re + x5)/3, y_center = (y_le + y_re + y5)/3
Combining the rotation center (x_center, y_center) and the angle θ, the affine transformation matrix is obtained from OpenCV's affine-matrix interface, and OpenCV's warp function is then called to apply the affine transformation to the image, yielding a face-aligned photo.
Optionally, the data enhancement dynamically applies random geometric or color transformations to the input image at the data reading stage through the transforms.Compose() interface of the deep learning framework PyTorch, and the transformed images are then fed into the network for training, thereby realizing data expansion.
Optionally, the image normalization divides the pixel values of the image by 255, so that every pixel value of the normalized image lies in [0, 1].
Optionally, facial expression feature extraction is realized with an 18-layer ResNet, and a softmax layer added at the end of the network normalizes the network output into probability values over the 7 expression classes; the class with the maximum probability is the classification result.
Optionally, feature extraction and expression classification are implemented with an end-to-end deep neural network structured as follows: the first layer is a convolution layer with a 7×7 kernel and 64 channels; the second layer is a pooling layer with a 3×3 pooling kernel and 64 channels; eight residual structures fused with the convolutional attention module follow, outputting a 512-channel feature map; a second-order pooling layer is then connected to realize feature aggregation; and finally a fully connected layer and a softmax layer produce the classification result.
Compared with the prior art, the invention discloses a facial expression recognition method combining an attention module and a second-order pooling mechanism, which has the following beneficial effects:
(1) Through face detection and alignment and image normalization, face pictures disturbed by illumination, head pose, occlusion and other factors in natural environments are effectively converted into front-facing pictures with proper contrast and no occlusion, addressing the interference that expression-independent factor variables in real-world environments cause to expression recognition.
(2) Through data enhancement, operations such as random cropping, rotation, flipping, noise addition and color changes are applied dynamically to the input images at the data reading stage during network training, expanding the data to many times its original size and obtaining better data diversity. This effectively alleviates the overfitting caused by a lack of sufficient training data.
(3) The ResNet structure is improved to make it better suited to expression feature extraction: adding the convolutional block attention module (CBAM) makes the network focus more on the features of the object to be recognized, and adding the second-order pooling mechanism extracts second-order expression features that better capture how facial muscles deform, thereby boosting the network's ability to extract facial expression features and effectively improving the recognition accuracy of the facial expression recognition system in real environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic overall flow chart of the present invention;
FIG. 2 is a face detection result diagram of the present invention;
FIGS. 3a-3b are face key point detection result diagrams of the present invention;
FIGS. 4a-4b are before-and-after diagrams of the face alignment of the present invention;
FIG. 5 is a structure diagram of the improved ResNet of the present invention;
FIG. 6 is a diagram of the ResBlock + CBAM structure of the present invention;
FIG. 7 is a block diagram of a channel attention module of the present invention;
Fig. 8 is a block diagram of a spatial attention module according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a facial expression recognition method combining an attention module and a second-order pooling mechanism, which comprises two stages: data preprocessing, and feature extraction with expression classification. The input of the algorithm is a facial expression picture and the output is the classification result for that picture; the classification covers seven categories: anger, disgust, fear, happiness, sadness, surprise and neutral.
Data preprocessing: the data preprocessing module adopted by this scheme comprises three steps: face detection and alignment, data enhancement and image normalization. Face detection and alignment itself comprises face detection, key point positioning and face alignment. Face detection is first achieved using the HOG-plus-linear-classifier face detector interface (dlib.get_frontal_face_detector) and the CNN-based face detector interface (dlib.cnn_face_detection_model_v1) in the dlib library. The face detection module takes the facial expression picture as input and outputs the detected face region; the detection result is shown in fig. 2.
Next, according to the result of the previous step, the face key point coordinates are located: the face key point detection interface (dlib.shape_predictor) in the dlib library loads a five-point key point detection model to obtain the five key point coordinates of the face; the positions of the five points are shown in figs. 3a-3b.
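A minimal sketch of how these two dlib calls fit together (the image path and the model file name shape_predictor_5_face_landmarks.dat are assumptions, the latter being dlib's publicly distributed five-point model):

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()   # HOG + linear classifier detector
predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):                # upsample once to find smaller faces
    shape = predictor(gray, rect)             # 5 points: 4 eye corners + under-nose point
    points = [(shape.part(i).x, shape.part(i).y) for i in range(5)]
```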
After the coordinates are obtained, the key point coordinates are used to align the face, implemented as follows: first, the center coordinates of the left and right eyes are calculated from the four eye-corner coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4):
(x_le, y_le) = ((x1 + x2)/2, (y1 + y2)/2), (x_re, y_re) = ((x3 + x4)/2, (y3 + y4)/2)
After the two eye centers are obtained, they are connected and the angle θ between this line and the horizontal is calculated; the rotation center (x_center, y_center) is then obtained by averaging the three points given by the left eye center, the right eye center and the under-nose point (x5, y5):
x_center = (x_le + x_re + x5)/3, y_center = (y_le + y_re + y5)/3
Given the rotation center coordinates and the angle θ, the affine transformation matrix can be obtained from OpenCV's affine-matrix interface, OpenCV's warp function is then called to apply the affine transformation, and the image is resized to 224×224 pixels; the result of aligning the face of FIG. 4a is shown in FIG. 4b.
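A minimal sketch of this alignment step, assuming the five points are ordered as two corners of one eye, two corners of the other, then the under-nose point (cv2.getRotationMatrix2D and cv2.warpAffine are the OpenCV interfaces referred to above):

```python
import cv2
import numpy as np

def align_face(img, points, size=224):
    p = np.asarray(points, dtype=np.float32)
    left_eye = p[0:2].mean(axis=0)                    # (x_le, y_le)
    right_eye = p[2:4].mean(axis=0)                   # (x_re, y_re)
    nose = p[4]                                       # (x5, y5)
    # angle between the eye-to-eye line and the horizontal
    theta = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))
    # rotation center = mean of the two eye centers and the under-nose point
    center = tuple(((left_eye + right_eye + nose) / 3.0).tolist())
    M = cv2.getRotationMatrix2D(center, theta, 1.0)   # affine transformation matrix
    rotated = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
    return cv2.resize(rotated, (size, size))          # adjust to 224×224
```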
The data enhancement dynamically applies random geometric or color transformations to the input images at the data reading stage through the transforms.Compose() interface of the deep learning framework PyTorch, and the transformed images are then fed into the network for training, realizing data expansion. Image normalization divides the pixel values of the image by 255; every pixel value of the normalized image lies between 0 and 1.
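One possible composition of this pipeline (the particular transforms and their parameters are illustrative choices from the standard torchvision API, matching the operations named above):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random cropping
    transforms.RandomRotation(10),                        # random rotation
    transforms.RandomHorizontalFlip(),                    # random flipping
    transforms.ColorJitter(0.4, 0.4, 0.4),                # color transformation
    transforms.ToTensor(),  # also divides pixel values by 255, giving [0, 1]
])
```

Note that transforms.ToTensor() performs exactly the ÷255 normalization described above, so no separate normalization step is needed.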
Feature extraction and expression classification: this module uses an improved 18-layer ResNet, shown in fig. 5, to extract facial expression features, and a softmax layer added at the end of the network normalizes the network output into probability values over the 7 expression classes; the class with the maximum probability is the classification result, realizing expression classification.
The whole module uses an end-to-end deep neural network to realize feature extraction and expression classification; its structure is shown in fig. 5. The first layer is a convolution layer with a 7×7 kernel and 64 channels; the second layer is a pooling layer with a 3×3 pooling kernel and 64 channels; eight residual structures fused with the convolutional attention module follow, outputting a 512-channel feature map; a second-order pooling layer is connected to realize feature aggregation; and finally a fully connected layer and a softmax layer produce the classification result. The network's input is the facial expression picture obtained after the data preprocessing of step 1; the multi-layer convolution and pooling layers extract facial expression features from low level to high level, the fully connected layer then maps the features to a 1×7 vector, and the softmax layer normalizes the seven class scores into the final classification probabilities.
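A minimal sketch of the classification head described above (assuming the backbone and pooling already yield one feature vector per image; the 512-dim input here is a stand-in for whatever the pooling layer outputs):

```python
import torch
import torch.nn as nn

LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

head = nn.Linear(512, 7)              # fully connected layer: features -> 7 class scores
features = torch.randn(1, 512)        # stand-in for pooled backbone features
probs = torch.softmax(head(features), dim=1)  # softmax: scores -> probabilities
print(LABELS[probs.argmax(dim=1).item()])     # the maximum value is the classification result
```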
(1) Residual structure incorporating convolution attention module (ResBlock + CBAM module in FIG. 6)
The structure of the ResBlock + CBAM module is shown in fig. 6, where F is the feature extracted by the preceding convolution layer. F is input to the channel attention module, which computes the channel attention map M_C; M_C is multiplied with the input F to obtain the channel attention module's output F1. F1 is then input to the spatial attention module, which computes the spatial attention map M_S; M_S is multiplied with F1 to obtain the final output F2 of the convolutional attention module, and F2 is fed into the subsequent convolution layers to continue feature learning. By adding the convolutional attention module to the basic residual block of ResNet, the network can be made to pay more attention to the object to be recognized.
The convolutional block attention module (CBAM) consists of a channel attention module and a spatial attention module: given an intermediate feature map, the CBAM module sequentially infers attention maps along the channel and spatial dimensions, and multiplies each attention map with the input feature map for adaptive feature refinement.
The channel attention module structure is shown in fig. 7: the feature map extracted by the previous layer undergoes global average pooling and global max pooling simultaneously to compress the spatial dimensions; the two resulting one-dimensional vectors are each fed through a shared two-layer fully connected network, the two branches are summed element by element, and a sigmoid activation finally produces the channel attention map M_C.
The structure of the spatial attention module is shown in fig. 8, with the channel attention module's output as its input feature map. Mean pooling and max pooling are applied along the channel dimension, and the two extracted feature maps (each with 1 channel) are concatenated into a 2-channel feature map. A convolution layer with a 7×7 kernel then reduces this to 1 channel, and a sigmoid activation produces the spatial attention map M_S.
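A minimal PyTorch sketch of the two attention modules and their use inside a residual block (the reduction ratio 16 and the 3×3 residual convolutions are common assumptions; the 7×7 spatial convolution follows the text):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared two-layer network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling branch
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1) # channel attention map M_C
        return x * m_c                                 # F1 = M_C * F

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # 7×7 conv down to 1 channel

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)              # channel-wise mean (1 channel)
        mx = x.amax(dim=1, keepdim=True)               # channel-wise max (1 channel)
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S
        return x * m_s                                 # F2 = M_S * F1

class ResBlockCBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.cbam = nn.Sequential(ChannelAttention(channels), SpatialAttention())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.cbam(self.bn2(self.conv2(out)))     # F -> F1 -> F2
        return self.relu(out + x)                      # residual connection
```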
(2) Second order pooling mechanism
Global second-order pooling computes the covariance matrix (second-order information) of the feature maps to obtain values representing their data distribution. Suppose the preceding convolutions yield a set of feature maps F_i (i = 1, 2, …, C) of size H×W, where C is the number of channels. The idea of global covariance pooling is to view each feature map as a random variable, every spatial element being one sample of that variable. Each feature map F_i is straightened into a vector f_i of shape (H×W, 1), and the covariance matrix of the set of feature maps is calculated as
Σ(i, j) = (1/(HW − 1)) · Σ_k (f_i(k) − μ_i)(f_j(k) − μ_j), for i, j = 1, …, C,
where μ_i is the mean of f_i and k runs over the H×W spatial positions.
The physical meaning of the covariance matrix is clear: its i-th row represents the statistical correlation of channel i with all channels.
Second-order covariance pooling effectively exploits the inter-channel correlations learned by the deep neural network and contains richer feature information, so replacing the global average pooling layer of ResNet with a global second-order pooling layer can improve the network's feature expression capability. The implementation details are as follows: first, a 1×1 convolution kernel reduces the 512-channel feature map output by the convolution layer before the second-order pooling layer to 256 channels; the covariance matrix of this set of features is then computed and matrix square-root normalization is applied; keeping the upper triangle of the symmetric 256×256 matrix yields a 32896-dimensional feature (256×257/2 entries), realizing the second-order pooling operation.
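A minimal sketch of this pooling layer (the eigendecomposition-based matrix square root is one common implementation choice, not confirmed by the text; the 512→256 reduction and the 32896-dim upper-triangle output follow the description above):

```python
import torch
import torch.nn as nn

class SecondOrderPooling(nn.Module):
    def __init__(self, in_channels=512, reduced=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)  # 1×1 conv

    def forward(self, x):                               # x: B × 512 × H × W
        x = self.reduce(x)                              # B × 256 × H × W
        b, c, h, w = x.shape
        f = x.reshape(b, c, h * w)                      # straighten each channel
        f = f - f.mean(dim=2, keepdim=True)             # center per channel
        cov = f @ f.transpose(1, 2) / (h * w - 1)       # B × 256 × 256 covariance
        # matrix square-root normalization via eigendecomposition
        eigval, eigvec = torch.linalg.eigh(cov)
        sqrt_cov = eigvec @ torch.diag_embed(eigval.clamp(min=0).sqrt()) \
                   @ eigvec.transpose(1, 2)
        iu = torch.triu_indices(c, c)                   # upper triangle: 256*257/2 = 32896
        return sqrt_cov[:, iu[0], iu[1]]                # B × 32896
```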
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. The facial expression recognition method combining the attention module and the second-order pooling mechanism is characterized by comprising the following steps of:
Acquiring a face image;
Preprocessing a face image, wherein the face image preprocessing comprises face detection and alignment, data enhancement and image normalization;
extracting features of the preprocessed face image, and finishing expression classification;
The face detection and alignment comprises face detection, key point positioning and face alignment, and specifically comprises the following steps:
the face detection module takes the facial expression picture as input and outputs the detected face region;
Positioning the coordinates of key points of the human face according to the human face detection area, and importing a five-point key point detection model by using a human face key point detection interface in a dlib library to obtain the coordinates of the five-point key points of the human face;
Carrying out face alignment by utilizing coordinates of the five key points;
The face alignment calculation process is as follows:
First, the center coordinates of the left and right eyes are calculated from the four eye-corner coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4):
(x_le, y_le) = ((x1 + x2)/2, (y1 + y2)/2), (x_re, y_re) = ((x3 + x4)/2, (y3 + y4)/2)
After the two eye centers are obtained, they are connected and the angle θ between this line and the horizontal is calculated; the rotation center (x_center, y_center) is then obtained by averaging the three points given by the left eye center, the right eye center and the under-nose point (x5, y5):
x_center = (x_le + x_re + x5)/3, y_center = (y_le + y_re + y5)/3
Combining the rotation center (x_center, y_center) and the angle θ, the affine transformation matrix is obtained from OpenCV's affine-matrix interface, and OpenCV's warp function is called to apply the affine transformation to the image, yielding a face-aligned photo;
Feature extraction and expression classification are realized with an end-to-end deep neural network structured as follows: the first layer is a convolution layer with a 7×7 kernel and 64 channels; the second layer is a pooling layer with a 3×3 pooling kernel and 64 channels; eight residual structures fused with the convolutional attention module follow, outputting a 512-channel feature map; a second-order pooling layer is connected to realize feature aggregation; and finally a fully connected layer and a softmax layer produce the classification result.
2. The facial expression recognition method combining an attention module and a second-order pooling mechanism according to claim 1, wherein the data enhancement dynamically applies random geometric or color transformations to the input image at the data reading stage through the transforms.Compose() interface of the deep learning framework PyTorch, and then feeds the transformed images into the network for training, realizing data expansion.
3. The facial expression recognition method combining an attention module and a second-order pooling mechanism according to claim 1, wherein the image normalization divides the pixel values of the image by 255, so that every pixel value of the normalized image lies in [0, 1].
4. The facial expression recognition method combining an attention module and a second-order pooling mechanism according to claim 1, wherein extraction of facial expression features is realized with an 18-layer ResNet, and a softmax layer added at the end of the network normalizes the network output into probability values over the 7 expression classes; the class with the maximum probability is the classification result.
CN202210403298.5A 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism Active CN114582002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210403298.5A CN114582002B (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210403298.5A CN114582002B (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Publications (2)

Publication Number Publication Date
CN114582002A CN114582002A (en) 2022-06-03
CN114582002B (en) 2024-07-09

Family

ID=81784744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210403298.5A Active CN114582002B (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Country Status (1)

Country Link
CN (1) CN114582002B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565159B (en) * 2022-09-28 2023-03-28 华中科技大学 Construction method and application of fatigue driving detection model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874861A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of face antidote and system
CN108805040A (en) * 2018-05-24 2018-11-13 复旦大学 It is a kind of that face recognition algorithms are blocked based on piecemeal

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053372B2 (en) * 2012-06-28 2015-06-09 Honda Motor Co., Ltd. Road marking detection and recognition
CN109344693B (en) * 2018-08-13 2021-10-26 华南理工大学 Deep learning-based face multi-region fusion expression recognition method
CN111243050B (en) * 2020-01-08 2024-02-27 杭州未名信科科技有限公司 Portrait simple drawing figure generation method and system and painting robot
CN111783622A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Method, device and equipment for recognizing facial expressions and computer-readable storage medium
CN112541422B (en) * 2020-12-08 2024-03-12 北京科技大学 Expression recognition method, device and storage medium with robust illumination and head posture
CN112766158B (en) * 2021-01-20 2022-06-03 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN113076916B (en) * 2021-04-19 2023-05-12 山东大学 Dynamic facial expression recognition method and system based on geometric feature weighted fusion
CN113869229B (en) * 2021-09-29 2023-05-09 电子科技大学 Deep learning expression recognition method based on priori attention mechanism guidance
CN114299578A (en) * 2021-12-28 2022-04-08 杭州电子科技大学 Dynamic human face generation method based on facial emotion analysis


Also Published As

Publication number Publication date
CN114582002A (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant