Dynamic expression recognition method based on temporal relation reasoning
Technical Field
The invention belongs to the field of dynamic expression recognition in image processing and machine vision, and particularly relates to a dynamic expression recognition method based on temporal relation reasoning.
Background
Facial expression recognition is an important research topic in computer vision, and most existing work focuses on static expression recognition, which takes a single-frame expression image as its research object. However, a facial expression is a dynamic process, and a single frame cannot completely capture a person's emotional change. By contrast, dynamic expression recognition, which takes an expression video or an expression image sequence as its research object, can exploit the rich texture and motion information associated with expression changes and therefore describes the emotional change process more completely. Nevertheless, owing to the small scale of expression data sets, unbalanced class distributions, annotation deviations, pose changes, illumination changes, individual differences in emotional expression, conflicts between expressions and speaking, and similar problems, dynamic expression recognition still faces many challenges.
Research on dynamic expression recognition mainly comprises two parts, expression sequence feature extraction and temporal relation modeling, and considerable progress has been made in recent years. At present, 3D convolutional neural networks are generally used to recognize dynamic expressions. This approach is simple and direct, but has the following problems and disadvantages: (1) on the input side, the common practice is to densely sample video frames and extract features from 16 consecutive frames, which greatly limits the length of the input sequence and makes the approach inapplicable to expression sequences with long time spans; (2) on the feature extraction side, conventional methods usually apply weight-shared convolution kernels to extract global features from the expression image, whereas facial expression motions in different facial regions have different structures and texture information. Although the single-scale region layer applies different convolution kernels to different local regions in order to exploit local information, its local regions are all of the same size and therefore cannot support multi-scale local region feature learning; as a result, local information about facial expression changes is not fully utilized and subsequent recognition accuracy suffers.
In general, existing dynamic expression recognition methods cannot adapt to long temporal inputs and achieve low recognition accuracy because local facial region features are not fully utilized.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dynamic expression recognition method based on temporal relation reasoning, which aims to solve the problems that existing dynamic expression recognition methods cannot adapt to long temporal inputs and suffer from low recognition accuracy.
In order to achieve the above object, the present invention provides a dynamic expression recognition method based on temporal relation reasoning, which comprises:
(1) performing multi-scale temporal sparse sampling on the expression image sequence to obtain a plurality of expression sequence segments of different scales, applying data enhancement to the segments, and then converting them to a fixed size;
(2) constructing a dynamic expression recognition model;
the dynamic expression recognition model comprises a multi-scale regional feature extraction network and a temporal relation reasoning module which are sequentially connected;
the multi-scale regional feature extraction network comprises: a first feature layer, a second feature layer, a third feature layer, a fourth feature layer, a fifth feature layer and a sixth feature layer, connected in sequence;
the first feature layer comprises a squeeze-and-excitation feature extraction module and a multi-scale region module which are sequentially connected; the squeeze-and-excitation feature extraction module is used for extracting features of the input image to obtain a feature map; the multi-scale region module comprises a convolution layer and three region layers of different scales; the convolution layer is used for performing a convolution operation on the feature map output by the squeeze-and-excitation feature extraction module; each region layer is used for dividing the feature map output by the convolution layer into a plurality of regions of fixed size and convolving each region with a different convolution kernel;
the second feature layer comprises a squeeze-and-excitation feature extraction module and a multi-scale region module which are sequentially connected, and is used for further extracting features from the feature map output by the first feature layer to obtain a feature map with richer information;
the third, fourth and fifth feature layers each consist of a squeeze-and-excitation feature extraction module and are used for extracting features from the feature map output by the preceding layer to obtain higher-level features;
the sixth feature layer is a mean pooling layer and is used for reducing the dimension of the features output by the fifth feature layer to obtain semantic features of the expression images;
the temporal relation reasoning module is used for constructing temporal relations between adjacent expression image frames from the semantic features of the expression images output by the multi-scale regional feature extraction network;
(3) inputting the expression sequence segments obtained in step (1) into the dynamic expression recognition model for training to obtain a trained dynamic expression recognition model;
(4) inputting the expression image sequence to be recognized into the trained dynamic expression recognition model to obtain a dynamic expression recognition result.
Further, the data enhancement in step (1) includes random horizontal flipping and random cropping.
Further, the data-enhanced expression sequence segments are converted into a fixed size of 224 × 224 pixels in step (1).
Further, the convolution layer in the multi-scale region module and the three region layers of different scales form a residual structure.
Further, the three region layers of different scales divide the feature map into 8 × 8, 4 × 4, and 2 × 2 region blocks, respectively.
Further, the squeeze-and-excitation feature extraction module comprises a depthwise separable convolution submodule and a squeeze-and-excitation submodule;
the depthwise separable convolution submodule comprises a depthwise convolution layer and a pointwise convolution layer with a 1 × 1 kernel; the squeeze-and-excitation submodule comprises a global average pooling layer, a first fully connected layer, a nonlinear activation layer, a second fully connected layer, a sigmoid activation layer and a scale normalization layer.
Further, the temporal relation reasoning module comprises a first-layer perceptron, a second-layer perceptron and a third-layer perceptron;
the first-layer perceptron has 512 nodes, the second-layer perceptron has 256 nodes, and the number of nodes of the third-layer perceptron equals the number of expression categories.
Further, the loss function of the dynamic expression recognition model is:

$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log G_i$$

where $C$ denotes the total number of expression categories, $y_i$ denotes the ground-truth label of category $i$, and $G_i$ denotes the posterior probability of category $i$ obtained by normalizing the output of the temporal relation reasoning module.
Through the above technical scheme, compared with the prior art, the present invention can obtain the following beneficial effects:
(1) The invention performs multi-scale sparse sampling on the input expression images to obtain a plurality of expression sequence segments with different time scales. This makes the method applicable to expression sequences with long time spans and, for expressions that change at different speeds, allows the complete fluctuation process from a stable expression, through the change, back to a stable expression to be captured, so that the sampled images are closer to the real course of the expression change and the accuracy of subsequent expression recognition is improved.
(2) The multi-scale regional feature extraction network constructed by the invention extracts local facial region features more effectively and enhances the network's ability to represent expression image features; meanwhile, the network structure uses the depthwise separable convolution module and the multi-scale region module constructed by the invention, which, compared with conventional convolution modules, reduces the amount of computation and improves model performance.
(3) The invention adopts a temporal relation reasoning module to construct the temporal relations between frames of the expression image sequence; compared with a long short-term memory network, this not only speeds up model training but also improves the accuracy of dynamic expression recognition.
Drawings
FIG. 1 is a flow chart of a dynamic expression recognition method based on temporal relation reasoning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-scale regional feature extraction network structure provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a single-scale region layer provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a multi-scale region module provided in an embodiment of the present invention;
FIG. 5 shows dynamic expression recognition results obtained by the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the dynamic expression recognition method based on temporal relation reasoning provided by the present invention comprises:
(1) performing multi-scale temporal sparse sampling on the expression image sequence to obtain a plurality of expression sequence segments of different scales, applying data enhancement to the segments, and then converting them to a fixed size;
Specifically, the invention performs multi-scale sparse sampling on the input expression image sequence to obtain a plurality of expression sequence segments with different time scales. This makes the method applicable to long expression sequences and, for expressions that change at different speeds, captures the complete fluctuation process from a stable expression, through the change, back to a stable expression, so that the sampled images are closer to the real course of the expression change. As shown in fig. 1, for an expression image sequence of 12 frames, when 2 frames are sampled, the sequence is divided into 2 equal parts and 1 frame is sampled at random from each part; when 3 frames are sampled, the sequence is divided into 3 equal parts and 1 frame is sampled at random from each part; similarly, when 4 frames are sampled, the sequence is divided into 4 equal parts before random sampling. Data enhancement, including random horizontal flipping and random cropping, is then applied to the sampled frames, and the enhanced expression sequence segments are converted to a fixed size of 224 × 224 pixels.
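As a minimal sketch of this multi-scale temporal sparse sampling, the following Python snippet divides a 12-frame sequence at scales 2, 3 and 4 as described above; the function name and the use of NumPy are illustrative assumptions rather than part of the invention:

```python
import numpy as np

def sparse_sample(sequence_length: int, num_segments: int, rng=None):
    """Divide the frame indices into equal parts and draw one random
    frame index from each part (temporal sparse sampling)."""
    rng = rng or np.random.default_rng()
    bounds = np.linspace(0, sequence_length, num_segments + 1, dtype=int)
    return [int(rng.integers(bounds[k], bounds[k + 1]))
            for k in range(num_segments)]

# Multi-scale sampling of a 12-frame sequence, as in fig. 1: each scale
# yields one expression sequence segment of a different temporal scale.
segments = {n: sparse_sample(12, n) for n in (2, 3, 4)}
```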
(2) Constructing a dynamic expression recognition model;
the dynamic expression recognition model comprises a multi-scale regional feature extraction network and a temporal relation reasoning module which are connected in sequence; as shown in fig. 2, the multi-scale regional feature extraction network comprises: the first feature layer 1, the second feature layer 2, the third feature layer 3, the fourth feature layer 4, the fifth feature layer 5 and the sixth feature layer 6, connected in sequence;
the first feature layer 1 comprises a squeeze-and-excitation feature extraction module and a multi-scale region module which are sequentially connected; the squeeze-and-excitation feature extraction module is used for extracting features of the input image to obtain a feature map, and comprises a depthwise separable convolution submodule and a squeeze-and-excitation submodule; the depthwise separable convolution submodule comprises a depthwise convolution layer and a pointwise convolution layer with a 1 × 1 kernel;
In this way, decomposing the standard convolution into a depthwise convolution and a pointwise convolution substantially reduces the amount of computation and the number of parameters compared with the standard convolution operation. Assume the input feature map has size $D_F \times D_F \times M$ and the standard convolution kernel has size $D_K \times D_K \times N$; the computational cost of the standard convolution is $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$. For the same input, the depthwise separable convolution costs $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$. The ratio of the two, i.e. the reduction in computation, is:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
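For intuition, a small worked example under assumed layer sizes (the concrete numbers below are illustrative and not specified by the invention):

```python
# Assumed sizes: 3 x 3 kernels, 32 input channels, 64 output channels,
# 112 x 112 feature map.
D_K, M, N, D_F = 3, 32, 64, 112

standard = D_K * D_K * M * N * D_F * D_F       # standard convolution cost
separable = (D_K * D_K * M * D_F * D_F         # depthwise convolution
             + M * N * D_F * D_F)              # pointwise convolution

print(separable / standard)    # ~0.1267, i.e. roughly an 8x reduction
print(1 / N + 1 / D_K ** 2)    # 0.015625 + 0.111... = 0.1267, same ratio
```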
Therefore, the depthwise separable convolution submodule reduces the computational cost and the number of parameters of the model, while the squeeze-and-excitation submodule enhances the feature representation capability of the model through a lightweight gating mechanism that adaptively weights channel importance, thereby improving model performance;
The squeeze-and-excitation submodule comprises a global average pooling layer, a first fully connected layer, a nonlinear activation layer, a second fully connected layer, a sigmoid activation layer and a scale normalization layer. Its function can be divided into two parts: global information embedding and adaptive recalibration. Because a convolution kernel covers only a local receptive field during the convolution operation and cannot exploit semantic information outside that field, global information embedding compresses the global spatial information of each channel into a scalar by a squeeze operation; this scalar can be regarded as the importance weight of the corresponding channel. Adaptive recalibration then multiplies the original input channel-wise by these importance weights to obtain a new feature output; this process is called excitation.
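A minimal PyTorch sketch of such a squeeze-and-excitation submodule follows; the class name and the channel reduction ratio of 16 are assumptions for illustration, not values fixed by the invention:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze: compress each channel to a scalar by global average
    pooling. Excitation: two fully connected layers with a sigmoid gate
    produce channel weights, which rescale the input channel-wise."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.act = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                    # squeeze: (B, C)
        w = self.gate(self.fc2(self.act(self.fc1(w))))
        return x * w.view(b, c, 1, 1)             # scale normalization
```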
Facial expression motions in different facial regions have different structures and texture information, so when convolution kernels are used to extract features, different convolution kernels should process different local regions. The structure of the existing single-scale region layer is shown in fig. 3; because its local regions are all of the same size, it cannot support multi-scale local region feature learning. On this basis, the invention designs a multi-scale region module, whose structure is shown in fig. 4; it comprises a convolution layer and three region layers of different scales. The convolution layer performs a convolution operation on the feature map output by the depthwise separable convolution submodule; each region layer divides the feature map output by the convolution layer into a plurality of regions of fixed size and convolves each region with a different convolution kernel, yielding region features at multiple scales. In the invention, the three region layers of different scales divide the feature map output by the convolution layer into 8 × 8, 4 × 4 and 2 × 2 region blocks, respectively. In addition, to avoid the vanishing-gradient problem, the multi-scale region module adopts a residual structure: the features output by the convolution layer are summed with the multi-scale region features output by the three region layers, combining local and global features. A sketch of this module follows.
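A minimal PyTorch sketch of the region layer and the multi-scale region module, assuming 3 × 3 channel-preserving convolutions (kernel sizes at this granularity are not specified in the text) and feature map sides divisible by 8:

```python
import torch
import torch.nn as nn

class RegionLayer(nn.Module):
    """Split the feature map into a (grid x grid) mesh of regions and
    convolve each region with its own, non-shared kernel."""
    def __init__(self, channels: int, grid: int):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(grid * grid))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.grid
        out_rows = []
        for r, row in enumerate(x.chunk(g, dim=2)):        # split height
            cells = row.chunk(g, dim=3)                    # split width
            out_rows.append(torch.cat(
                [self.convs[r * g + c](cell) for c, cell in enumerate(cells)],
                dim=3))
        return torch.cat(out_rows, dim=2)

class MultiScaleRegionModule(nn.Module):
    """Convolution layer followed by region layers at 8x8, 4x4 and 2x2;
    their outputs are summed with the convolution output (residual)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.regions = nn.ModuleList(
            RegionLayer(channels, g) for g in (8, 4, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        return y + sum(region(y) for region in self.regions)
```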
The second feature layer 2 comprises a squeeze-and-excitation feature extraction module and a multi-scale region module which are sequentially connected, and is used for further extracting features from the feature map output by the first feature layer to obtain a feature map with richer information;
the third, fourth and fifth feature layers each consist of a squeeze-and-excitation feature extraction module and are used for extracting features from the feature map output by the preceding layer to obtain higher-level features;
the sixth feature layer 6 is a mean pooling layer and is used for reducing the dimension of the features output by the fifth feature layer to obtain semantic features of the expression images;
The temporal relation reasoning module is used for constructing temporal relations between adjacent expression image frames from the semantic features of the expression images output by the multi-scale regional feature extraction network. It comprises a first-layer perceptron, a second-layer perceptron and a third-layer perceptron; the first-layer perceptron has 512 nodes, the second-layer perceptron has 256 nodes, and the number of nodes of the third-layer perceptron equals the number of expression categories;
The temporal relation between given three frames of the expression sequence is defined as follows:

$$T_3(V) = h_{\varphi}\Big(\sum_{i<j<k} g_{\theta}(f_i, f_j, f_k)\Big)$$

where the input is an expression sequence $V = \{f_1, f_2, \ldots, f_n\}$, and $f_i$ and $f_j$ denote the feature representations of the $i$-th and $j$-th frames of the sequence, i.e. the feature output of the sixth feature layer 6 of the multi-scale regional feature extraction network. The $g_{\theta}$ function and the $h_{\varphi}$ function are used to fuse the features of different ordered frames; the first-layer and second-layer perceptrons represent the $g_{\theta}$ function, and the third-layer perceptron represents the $h_{\varphi}$ function.
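A minimal PyTorch sketch of this temporal relation reasoning head for three-frame relations; the input feature dimension and the enumeration of all frame triples are illustrative assumptions:

```python
import itertools
import torch
import torch.nn as nn

class TemporalRelationReasoning(nn.Module):
    """g_theta: first- and second-layer perceptrons (512 and 256 nodes)
    fusing each ordered frame triple; h_phi: third-layer perceptron
    mapping the summed relation features to expression-category scores."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.g_theta = nn.Sequential(
            nn.Linear(3 * feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True))
        self.h_phi = nn.Linear(256, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, n, feat_dim) semantic features from feature layer 6
        relations = [self.g_theta(torch.cat(
                         [frames[:, i], frames[:, j], frames[:, k]], dim=1))
                     for i, j, k in
                     itertools.combinations(range(frames.size(1)), 3)]
        return self.h_phi(torch.stack(relations).sum(dim=0))
```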
(3) Inputting the expression sequence segments obtained in step (1) into the dynamic expression recognition model for training to obtain a trained dynamic expression recognition model;
Specifically, the parameters of the network are optimized with the SGD algorithm. The category prediction loss of the temporal relation reasoning module for the expression sequence is computed as a cross entropy, with loss function:

$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log G_i$$

where $C$ denotes the total number of expression categories, $y_i$ denotes the ground-truth label of category $i$, and $G_i$ denotes the posterior probability of category $i$ obtained by normalizing the output of the temporal relation reasoning module;
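A minimal sketch of one SGD training step under these definitions; the dummy backbone, the batch shapes and the choice of 7 expression categories are assumptions carried over for illustration (nn.CrossEntropyLoss applies the normalization and cross entropy described above):

```python
import torch
import torch.nn as nn

# Stand-in for the multi-scale regional feature extraction network: any
# module mapping a frame to a semantic feature vector works here.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
head = TemporalRelationReasoning(feat_dim=1024, num_classes=7)

params = list(backbone.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()   # cross entropy over normalized outputs

clips = torch.randn(8, 4, 3, 224, 224)   # batch of 4-frame segments
labels = torch.randint(0, 7, (8,))       # ground-truth categories

b, n = clips.shape[:2]
feats = backbone(clips.flatten(0, 1)).view(b, n, -1)   # (B, n, feat_dim)
loss = criterion(head(feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```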
the model parameters to be optimized comprise multi-scale regional characteristic extraction network parameters W, g
θA function weight parameter theta and
function weight parameter
If the parameter W is optimized by using a random gradient descent algorithm, the gradient of the corresponding parameter W can be expressed as:
by optimizing the parameter W in the above manner, it can be ensured that the learning of the parameter W is based on the whole expression sequence, rather than the partial expression sequence of a certain time period.
(4) Inputting the expression image sequence to be recognized into the trained dynamic expression recognition model to obtain a dynamic expression recognition result.
A visualization of some recognition results is shown in fig. 5. Below each expression image sequence are the two classification results with the highest classification confidence produced by the model of the method; the number after each result is its classification confidence: the first is the probability of the finally recognized category, and the second is the probability of the category with which it is most easily confused.
To verify the effectiveness of the method for dynamic expression recognition, the method is compared with existing mainstream methods on the same data set, using the number of parameters, computational cost and overall accuracy as evaluation indexes. The training and test sets used by the different methods on this data set are completely consistent. The experimental results are shown in table 1, where Baseline is the standard model algorithm provided for the data set.
TABLE 1 comparison of computational efficiency between different methods
From the comparison results, among single-model methods the proposed method achieves the highest overall accuracy with a smaller computational cost and fewer parameters than the other methods. In particular, compared with MRE-CNN, the comparison method with the highest overall accuracy, the accuracy of the proposed method is 1.09 percentage points higher while both the number of parameters and the computational cost are reduced by two orders of magnitude. Therefore, the method effectively improves the accuracy of dynamic expression recognition while greatly reducing the computational cost and the number of parameters.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.