CN112036260A - Expression recognition method and system for multi-scale sub-block aggregation in a natural environment
- Publication number: CN112036260A
- Application number: CN202010795929.3A
- Authority: CN (China)
- Prior art keywords: scale, sub-blocks, fusion, expression
- Legal status: Granted
Classifications
- G06V40/174—Facial expression recognition
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V40/169—Holistic features and representations, i.e. based on the facial image taken as a whole
Abstract
The invention discloses an expression recognition method and system for multi-scale sub-block aggregation in a natural environment. The method comprises the following steps: predefining multi-scale parameters, inputting an expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture; sampling sub-blocks of the expression picture according to the attention-area parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers; fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale; and extracting the global features of the expression picture, aggregating the single-scale fused feature vectors of all scales with the global features, and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result. The method does not rely on manual selection or facial feature points, and it improves expression recognition accuracy under natural conditions.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an expression recognition method and system for multi-scale sub-block aggregation in a natural environment.
Background
Facial expression is one of the important ways in which humans communicate emotion. Expression recognition is a key technology for realizing natural human-computer interaction, and it has broad application prospects in the fields of computer vision and affective computing.
Existing expression recognition methods achieve high accuracy on posed expressions captured in laboratory environments, but their accuracy on spontaneous expressions in natural environments remains low. There are three main reasons. First, face images in natural environments vary in resolution and exhibit large head-pose changes; these factors lower the accuracy of feature-point extraction, which in turn degrades feature alignment and ultimately reduces recognition accuracy. Second, expression information in natural environments is easily disturbed by head-pose changes, illumination changes, and partial occlusion, and a single feature or model can hardly cope with all of these challenges. Third, spontaneous expressions are weaker in intensity than posed extreme expressions and have smaller inter-class distances, so they are more easily confused.
To address these problems, an effective approach is to extract expressive local sub-block regions from the face region and to recognize the expression by fusing local and global features. On the one hand, local sub-blocks suppress the effect of partial occlusion well, offer some robustness to head-pose changes, and avoid the large amount of redundant information carried by global features; on the other hand, global features compensate for the limited representation capability of local sub-blocks, and fusing multiple kinds of features helps meet these compound challenges.
However, when extracting local sub-blocks, existing methods rely on manual selection or facial feature points and cannot adaptively search for important sub-blocks in a face image; when feature-point extraction is inaccurate, expression recognition accuracy suffers as well. Moreover, in setting the sub-block scale, usually only a single scale is considered and all sub-blocks are treated as contributing equally to expression classification, so recognition accuracy remains low.
Disclosure of Invention
In view of at least one defect or improvement requirement in the prior art, the invention provides an expression recognition method and system for multi-scale sub-block aggregation in a natural environment that do not rely on manual selection or facial feature points and that improve expression recognition accuracy under natural conditions.
To achieve the above object, according to a first aspect of the present invention, there is provided an expression recognition method for multi-scale sub-block aggregation in a natural environment, including:
S1, predefining multi-scale parameters, inputting an expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture, wherein the attention-area parameters are the translation parameters of each sub-block at each scale;
S2, sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers;
S3, fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale;
and S4, extracting the global features of the expression picture, aggregating the single-scale fused feature vectors of all scales with the global features, and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result.
Preferably, the regression convolutional neural network is obtained by modifying the number of neurons of the last fully-connected layer of a VGG network, a Resnet network, or a Googlenet network to $D\sum_{i=1}^{S} N_i$, where $S$ represents the number of predetermined sub-block scales, $N_i$ denotes the number of sub-blocks generated at each scale, and $D$ denotes the dimension of the attention-area parameter generating a sub-block, preferably $D = 2$.
Preferably, the number of sub-blocks generated at different scales remains consistent.
Preferably, the parameters in the stacked convolutional layer corresponding to each subblock of each scale are independently trained, and a parameter sharing mechanism is not introduced, so that different subblocks can be ensured to extract the optimal features.
Preferably, the fusion is one of direct fusion, weighted fusion, or splicing fusion: direct fusion fuses the features of all sub-blocks of the same scale into one feature vector by summation, weighted fusion fuses them into one feature vector by attention-mechanism weighted summation, and splicing fusion concatenates the features of all sub-blocks of the same scale end to end into one feature vector.
Preferably, step S3 further includes inputting the single-scale fused feature vector corresponding to each scale into a fully-connected layer to obtain an expression classification result for each single-scale fused feature vector, and minimizing a Softmax cross-entropy loss function $L_{sf}$ to train the regression convolutional neural network, the stacked convolutional layers, and the fusion parameters.
Preferably, an attention-area-parameter constraint loss is superimposed on the cross-entropy loss function $L_{sf}$.
Preferably, the aggregation is one of direct aggregation or weighted aggregation: direct aggregation sums the single-scale fused feature vectors of all scales and the global features into one aggregated feature, and weighted aggregation sums them with weights produced by an attention mechanism.
Preferably, the weighted aggregation is an attention-based aggregation method: the single-scale fused feature vector and the global feature are each regarded as an object to be fused, the weight of each object to be fused is obtained through the attention mechanism, and the attention-weighted aggregated feature vector is computed as $z = \sum_j \alpha_j h_j$, where $h_j$ represents a single-scale fused feature vector or the global feature and $\alpha_j$ is the corresponding weight, obtained with the attention mechanism.
According to a second aspect of the present invention, there is provided an expression recognition system for multi-scale sub-block aggregation in a natural environment, including:
the multi-scale sub-block generation module, used for predefining multi-scale parameters, inputting the expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture, wherein the attention-area parameters are the translation parameters of each sub-block at each scale;
the feature extraction module, used for sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers, and also used for extracting the global features of the expression picture;
the single-scale feature fusion module, used for fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale;
and the multi-scale feature aggregation and recognition module, used for aggregating the single-scale fused feature vectors of all scales with the global features and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result.
In general, compared with the prior art, the invention has the following beneficial effects:
(1) The invention uses the attention mechanism to automatically search for sub-blocks in the image, without relying on manual selection or facial feature points; expression recognition accuracy is therefore not affected by feature-point extraction accuracy, and the accuracy of multi-scale sub-block aggregated expression recognition under natural conditions is improved.
(2) The invention extracts sub-blocks of different scales to represent the expression, and sub-blocks of different scales represent different granularities. Smaller sub-blocks have limited representation capability but better suppress local occlusion and head-pose changes; larger sub-blocks are less robust to local occlusion and head-pose changes but characterize the expression more strongly. By fusing sub-block features of different granularities, the different features reinforce one another and the overall recognition accuracy of the model improves, outperforming expression recognition based on single-scale sub-blocks.
(3) The importance of different sub-block features is measured through an attention mechanism, and the contribution of features at different scales to expression recognition is discovered by adaptive learning, further improving recognition accuracy.
Drawings
FIG. 1 is a flow chart of a facial expression recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the fusion of multi-scale features and recognition according to various embodiments of the present invention;
fig. 3 is a schematic diagram of a facial expression recognition system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The expression recognition method for multi-scale sub-block aggregation in a natural environment according to the embodiment of the present invention, as shown in fig. 1, includes steps S1 to S4.
S1: predefining multi-scale parameters, inputting the expression picture into a regression convolutional neural network, and acquiring attention area parameters of the expression picture, wherein the attention area parameters are translation parameters of each sub-block of each scale.
The predefined multi-scale parameters are several preset values that define the size of a sub-block relative to the original image. For example, predefined multi-scale parameters of 0.2, 0.4, 0.6, and 0.8 mean that sub-blocks of four scales are extracted from the original image, the side lengths of the sub-blocks at the different scales being 0.2, 0.4, 0.6, and 0.8 of the side length of the original image, respectively.
A sub-block refers to a local image region in an expression picture, which is determined by scale and translation parameters, and may be represented by a set of all pixels in the region.
The attention-area parameters are used to locate the attention areas that can express the expression, which are then sampled in step S2; more attention resources are invested in these areas to obtain finer detail about the target of interest while suppressing other useless information. Sub-blocks in the image are searched automatically with an attention mechanism, without manual selection or facial feature points, so recognition accuracy is not affected by feature-point extraction accuracy and expression recognition accuracy is improved.
The attention area parameter is the translation parameter of each sub-block of each scale. The translation parameter is the offset of the geometric center of the sub-block relative to the geometric center of the original image in the vertical direction and the horizontal direction.
S2: sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters obtained in step S1 to obtain the sub-blocks of each scale, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of each sampled sub-block with its corresponding stacked convolutional layer.
The sub-blocks are sampled according to the attention-area parameters acquired in step S1: the translation parameters determine the position in the original image of each pixel of a sub-block, and the pixel value at that position of the original image is taken as the pixel value at the current pixel position of the sub-block. The output sub-blocks of each scale represent the attention areas of the expression picture.
Stacked convolutional layers are built for each sub-block of each scale, i.e., each sub-block of each scale has its own stacked convolutional layer for extracting the features of that sub-block. A stacked convolutional layer is implemented as several convolutional layers connected in series; the stacked convolutional front end of a network such as VGG, Resnet, or Googlenet can be used.
S3: fusing the features of all sub-blocks at the same scale to obtain the single-scale fused feature vector corresponding to each scale. That is, the features of all sampled sub-blocks at scale 1 are fused into the single-scale fused feature vector for scale 1, the features of all sampled sub-blocks at scale 2 are fused into the single-scale fused feature vector for scale 2, and so on, until the features of all sampled sub-blocks at scale S are fused into the single-scale fused feature vector for scale S.
S4: extracting the global features of the expression picture, aggregating the single-scale fused feature vectors of all scales with the global features, and inputting the aggregated vector into a fully-connected network to obtain the expression recognition result. That is, the single-scale fused feature vectors for scales 1 through S output in step S3 are aggregated together with the global features.
The global features of the expression picture are features extracted from the whole original image, in contrast to the sub-block features. The fully-connected network of step S4 constitutes the multi-scale fusion classification model.
Alternative implementations of each step are described in detail below.
Further, the regression convolutional neural network in step S1 may be one of the existing classical convolutional neural networks VGG, Resnet, or Googlenet; only the number of neurons in the network's last fully-connected layer needs to be modified to $D\sum_{i=1}^{S} N_i$, where $S$ represents the number of preset sub-block scales, $N_i$ represents the number of sub-blocks generated at each scale, and $D$ represents the dimension of the attention-area parameter that generates a sub-block. Taking $D = 2$, the translation parameters serve as affine transformation parameters: from a preset scale $s_i$ and the output attention-area parameters $(t_x, t_y)$, the affine transformation matrix $A = \begin{bmatrix} s_i & 0 & t_x \\ 0 & s_i & t_y \end{bmatrix}$ can be obtained.
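As a concrete illustration, the following is a minimal PyTorch sketch of such a sub-block generation network, assuming S = 4 scales, N_i = 4 sub-blocks per scale, and D = 2 (matching the embodiment described further below). The class name, the tanh squashing of the offsets, and training VGG-16 from scratch are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

S, N, D = 4, 4, 2                      # scales, sub-blocks per scale, params per sub-block
SCALES = [0.2, 0.4, 0.6, 0.8]          # predefined multi-scale parameters

class SubBlockGenerator(nn.Module):
    """Regression CNN that predicts the translation parameters of every sub-block."""
    def __init__(self):
        super().__init__()
        self.backbone = models.vgg16(weights=None)
        in_features = self.backbone.classifier[-1].in_features
        # Resize the last fully-connected layer to D * sum(N_i) = 32 outputs.
        self.backbone.classifier[-1] = nn.Linear(in_features, S * N * D)

    def forward(self, x):
        # tanh keeps the offsets in [-1, 1] (an assumption; the patent only
        # fixes the output dimension of this layer).
        t = torch.tanh(self.backbone(x))
        return t.view(-1, S, N, D)     # (batch, scale, sub-block, (tx, ty))
```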
Preferably, the number of sub-blocks generated at each scale remains consistent for subsequent operations. For example, if the splicing fusion is adopted in the subsequent step S3, the number of sub-blocks generated at each scale needs to be consistent.
Further, the affine sampling formula for the sub-blocks in S2 is
$$\begin{pmatrix} x^{src} \\ y^{src} \end{pmatrix} = \begin{bmatrix} s_i & 0 & t_x \\ 0 & s_i & t_y \end{bmatrix} \begin{pmatrix} x^{sub} \\ y^{sub} \\ 1 \end{pmatrix},$$
where $(x^{src}, y^{src})$ is a point in the coordinates of the original image and $(x^{sub}, y^{sub})$ is the corresponding point in the coordinates of the sub-block.
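With PyTorch's spatial-transformer primitives, this sampling step can be sketched as follows; the 48 x 48 output size is an arbitrary illustrative choice, and the sketch reuses the constants defined above.

```python
import torch.nn.functional as F

def sample_subblock(img, scale, txy, out_size=48):
    """Warp-crop one sub-block. img: (B, C, H, W); scale: preset s_i;
    txy: (B, 2) translation parameters for this sub-block."""
    B, C = img.size(0), img.size(1)
    theta = img.new_zeros(B, 2, 3)
    theta[:, 0, 0] = scale             # assemble [[s_i, 0, tx],
    theta[:, 1, 1] = scale             #           [0, s_i, ty]]
    theta[:, :, 2] = txy
    # affine_grid maps each sub-block pixel to original-image coordinates;
    # grid_sample then reads the pixel value found there (bilinear by default).
    grid = F.affine_grid(theta, (B, C, out_size, out_size), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)
```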
Further, all the stacked convolutional layers in S2 may adopt the convolutional front end of one of the existing classical convolutional neural networks VGG, Resnet, or Googlenet: each extracted sub-block is input to its network, and the output of the network's last convolutional layer is taken as the extracted feature of that sub-block.
Preferably, in S2 the sub-block features are extracted by constructing stacked convolutional layers with unshared parameters for sub-blocks of different scales and regions. The parameters of the stacked convolutional layers may or may not be shared; when they are not shared, the extracted features are as diverse as possible, which improves the representation capability of the final aggregated feature.
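A sketch of this unshared-parameter arrangement, reusing the S and N constants above; the tiny two-layer stack stands in for the VGG, Resnet, or Googlenet front ends actually named by the patent.

```python
def make_stacked_conv(in_ch=3, width=64):
    """One stacked convolutional layer: a short series of convolutions
    ending in a pooled feature vector (illustrative depth and width)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B, width)
    )

# One independent extractor per sub-block of every scale; no parameters are
# shared, so each sub-block is free to learn its own optimal features.
extractors = nn.ModuleList(
    [nn.ModuleList([make_stacked_conv() for _ in range(N)]) for _ in range(S)]
)
```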
Further, the fusion method in S3 may be one of direct fusion, weighted fusion, or splicing fusion. Direct fusion sums the features of all sub-blocks of the same scale into one feature vector; weighted fusion combines them into one feature vector by attention-mechanism weighted summation; splicing fusion concatenates them end to end into one feature vector. If direct fusion is selected, no additional trainable parameters are required for fusion, and the fused feature dimension is smaller than that of splicing fusion.
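The three fusion options then reduce to a few tensor operations each (a sketch; in the weighted variant the weights would come from an attention module such as the one sketched under step S4 below):

```python
def fuse_direct(feats):                # feats: list of N tensors, each (B, C)
    """Direct fusion: element-wise summation."""
    return torch.stack(feats, dim=1).sum(dim=1)

def fuse_weighted(feats, weights):     # weights: (B, N) attention weights
    """Weighted fusion: attention-weighted summation."""
    return (torch.stack(feats, dim=1) * weights.unsqueeze(-1)).sum(dim=1)

def fuse_concat(feats):
    """Splicing fusion: end-to-end concatenation -> (B, N * C)."""
    return torch.cat(feats, dim=1)
```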
Preferably, step S3 further includes inputting the single-scale fused feature vector corresponding to each scale into a fully-connected layer to obtain an expression classification result for each single-scale fused feature vector. Adding a fully-connected layer in S3 yields a classification model for each single-scale fused feature vector, which classifies the expression; this fully-connected layer is distinct from the fully-connected layer of step S4. When training these single-scale classification models, a Softmax cross-entropy loss function $L_{sf}$ is minimized to train the parameters of S1, S2, and S3, i.e., the parameters of the regression convolutional neural network of step S1, the stacked convolutional layers of step S2, and the fusion of step S3. This training yields expression recognition models for several single-scale sub-block fused features.
Preferably, a constraint loss on the attention-area parameters can be superimposed on the cross-entropy loss, so that sub-blocks of the same scale are kept as far apart as possible and overlap as little as possible, thereby capturing as much expression information as possible.
In particular, the superimposed loss function may be expressed as
$$L = L_{sf} + \sum_{i=1}^{S} \sum_{m \neq n} \max\left(0,\; \sigma - \left\| t^{i}_{m} - t^{i}_{n} \right\|_2 \right),$$
where $t^{i}_{m}$ denotes the translation parameters of the $m$-th sub-block at scale $i$ and $\sigma$ is a hyperparameter controlling the difference margin between the sub-block parameters.
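Under that hinge reading, the combined loss could be sketched as below; the margin value, the weighting lam, and the mean reduction are assumptions, since the text only fixes the role of σ.

```python
def constraint_loss(t, sigma=0.3):
    """t: (B, S, N, 2) sub-block translation parameters. Penalize pairs of
    same-scale sub-blocks whose centres are closer than the margin sigma."""
    loss = t.new_zeros(())
    for m in range(N):
        for n in range(m + 1, N):
            dist = (t[:, :, m] - t[:, :, n]).norm(dim=-1)   # (B, S) distances
            loss = loss + torch.clamp(sigma - dist, min=0).mean()
    return loss

# logits, labels, t, and the weighting lam are assumed to come from the
# models sketched above.
total = F.cross_entropy(logits, labels) + lam * constraint_loss(t)
```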
Further, the aggregation method in S4 may be one of direct aggregation or weighted aggregation.
Preferably, the weighted aggregation may employ an attention-based aggregation method: each feature vector is regarded as an object to be fused, and different objects to be fused are given different weights through the attention mechanism.
Specifically, the multi-instance aggregated feature $z$ may be calculated as
$$z = \sum_{j} \alpha_j h_j,$$
where $h_j$ represents a single-scale fused feature vector or the global feature of the expression picture, and $\alpha_j$ is the weight corresponding to that feature, calculated as
$$\alpha_j = \frac{\exp\left(w^{\top} \tanh(V h_j)\right)}{\sum_{k} \exp\left(w^{\top} \tanh(V h_k)\right)}.$$
The weights $\alpha_j$ are obtained with an attention module which, as shown in fig. 3, comprises two fully-connected layers: the first, with parameter $V$, is followed by a tanh activation function, and the second, with parameter $w$, is followed by a Softmax function. By applying multi-instance attention during multi-scale feature fusion, the contribution of features at different scales to expression recognition is discovered through adaptive learning, further improving recognition accuracy.
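The attention module just described (fully-connected layer V, tanh, fully-connected layer w, Softmax) maps directly to code; the hidden width of 128 and the bias-free layers are illustrative assumptions.

```python
class AttentionAggregate(nn.Module):
    """Computes z = sum_j alpha_j * h_j over the objects to be fused."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.V = nn.Linear(dim, hidden, bias=False)   # first FC layer, then tanh
        self.w = nn.Linear(hidden, 1, bias=False)     # second FC layer, then Softmax

    def forward(self, h):                             # h: (B, J, dim), J = S + 1
        scores = self.w(torch.tanh(self.V(h)))        # (B, J, 1)
        alpha = torch.softmax(scores, dim=1)          # attention weights alpha_j
        return (alpha * h).sum(dim=1)                 # aggregated feature z: (B, dim)
```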
Preferably, the multi-instance aggregated feature is followed by a fully-connected layer. When training the multi-scale fusion classification model, a Softmax cross-entropy loss function is used to optimize the parameters of S1, S2, S3, and S4, i.e., the regression convolutional neural network of step S1, the stacked convolutional layers of step S2, the fusion parameters of step S3, and the fully-connected-layer parameters of step S4, excluding the fully-connected-layer parameters of step S3.
Preferably, in order to obtain a better model, the parameters of S1, S2, and S3 trained with the single-scale fused-feature classification models can be used to initialize the corresponding parameters when training the multi-scale fusion classification model.
According to the expression recognition method for multi-scale sub-block aggregation in a natural environment of the above embodiment, multi-scale sub-blocks are generated through an attention mechanism, sub-blocks of different scales and regions are fused hierarchically, and the different roles that sub-blocks of different granularities play in expression recognition are exploited to improve the accuracy and robustness of expression recognition.
The expression recognition method for multi-scale sub-block aggregation in a natural environment according to another embodiment of the present invention is as follows. The expression picture is input to a multi-scale sub-block generation network whose backbone is a VGG-16 network, with the last fully-connected layer of the VGG-16 changed to 32 neurons that output the attention-area parameters of the expression picture. The preset multi-scale parameters are a set of fixed values: 0.2, 0.4, 0.6, and 0.8. With $N_i = 4$, $S = 4$, and $D = 2$, four sub-blocks at different positions are generated at each scale, giving the translation parameters of 16 sub-blocks in total. Sub-blocks are obtained by sampling the input original image with these 16 sets of translation parameters, and the features of each sub-block are extracted by a feature extraction network whose backbone is the convolutional part of a VGG-16 network. The features extracted from same-scale sub-blocks in different regions are summed into one feature vector, yielding 4 single-scale fused feature vectors. Together with the features extracted from the original image, these form 5 feature vectors; the weight of each feature is calculated by the attention-based feature fusion framework, and the weighted features are passed through a fully-connected layer to output the expression classification result.
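Tying the illustrative pieces above together, one plausible forward pass of this embodiment reads as follows; global_net and classifier are assumed components (for instance, a VGG-16 convolutional trunk with pooling, and a single fully-connected layer), and global_net must output vectors of the same dimension as the sub-block extractors.

```python
def recognize(img, generator, extractors, global_net, aggregate, classifier):
    """End-to-end sketch: sub-block generation, per-scale fusion, aggregation."""
    t = generator(img)                                  # (B, 4, 4, 2) translations
    fused = []
    for i, s in enumerate(SCALES):
        feats = [extractors[i][j](sample_subblock(img, s, t[:, i, j]))
                 for j in range(N)]
        fused.append(fuse_direct(feats))                # 4 single-scale fused vectors
    h = torch.stack(fused + [global_net(img)], dim=1)   # add global feature: (B, 5, C)
    return classifier(aggregate(h))                     # expression class scores
```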
The expression recognition system for multi-scale sub-block aggregation in a natural environment according to an embodiment of the present invention comprises:
the multi-scale sub-block generation module, used for predefining multi-scale parameters, inputting the expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture, wherein the attention-area parameters are the translation parameters of each sub-block at each scale;
the feature extraction module, used for sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers, and also used for extracting the global features of the expression picture;
the single-scale feature fusion module, used for fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale;
and the multi-scale feature aggregation and recognition module, used for aggregating the single-scale fused feature vectors of all scales with the global features and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result.
The implementation principle and technical effect of the expression recognition system are similar to those of the expression recognition method, and are not repeated here.
It should be noted that, in any of the above embodiments, the steps need not be executed in the order of their sequence numbers; unless the execution logic requires a particular order, they may be executed in any other feasible order.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for recognizing expressions through multi-scale sub-block aggregation in a natural environment is characterized by comprising the following steps:
S1, predefining multi-scale parameters, inputting an expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture, wherein the attention-area parameters are the translation parameters of each sub-block at each scale;
S2, sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers;
S3, fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale;
and S4, extracting the global features of the expression picture, aggregating the single-scale fused feature vectors of all scales with the global features, and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result.
2. The method for recognizing expressions through multi-scale sub-block aggregation in a natural environment according to claim 1, wherein the regression convolutional neural network is obtained by modifying the number of neurons of the last fully-connected layer of a VGG network, a Resnet network, or a Googlenet network to $D\sum_{i=1}^{S} N_i$, wherein $S$ denotes the number of predetermined sub-block scales, $N_i$ denotes the number of sub-blocks generated at each scale, and $D$ denotes the dimension of the attention-area parameter generating a sub-block.
3. The method for recognizing the expression of multi-scale sub-block aggregation in a natural environment as claimed in claim 2, wherein the number of sub-blocks generated in different scales is kept consistent.
4. The method for recognizing the expression of multi-scale sub-block aggregation in a natural environment according to claim 1, wherein the parameters of the stacked convolutional layers corresponding to each sub-block of each scale are independently trained.
5. The method for recognizing expressions through multi-scale sub-block aggregation in a natural environment according to claim 1, wherein the fusion is one of direct fusion, weighted fusion, or splicing fusion: direct fusion fuses the features of all sub-blocks of the same scale into one feature vector by summation, weighted fusion fuses them into one feature vector by attention-mechanism weighted summation, and splicing fusion concatenates the features of all sub-blocks of the same scale end to end into one feature vector.
6. The method for recognizing expressions through multi-scale sub-block aggregation in a natural environment according to claim 1, wherein step S3 further includes inputting the single-scale fused feature vector corresponding to each scale into a fully-connected layer to obtain an expression classification result for each single-scale fused feature vector;
and minimizing a Softmax cross-entropy loss function $L_{sf}$ to train the regression convolutional neural network, the stacked convolutional layers, and the fusion parameters.
7. The method according to claim 6, wherein an attention-area-parameter constraint loss is superimposed on the cross-entropy loss function $L_{sf}$.
8. The method for recognizing expressions through multi-scale sub-block aggregation in a natural environment according to claim 1, wherein the aggregation is one of direct aggregation or weighted aggregation: direct aggregation sums the single-scale fused feature vectors of all scales and the global features into one aggregated feature, and weighted aggregation sums them with weights produced by an attention mechanism.
9. The method according to claim 8, wherein the weighted aggregation is an attention-based aggregation method: the single-scale fused feature vector and the global feature are each regarded as an object to be fused, the weight of each object to be fused is obtained through the attention mechanism, and the attention-weighted aggregated feature vector is computed as $z = \sum_j \alpha_j h_j$, wherein $h_j$ denotes a single-scale fused feature vector or the global feature and $\alpha_j$ denotes its corresponding weight, obtained with the attention mechanism.
10. An expression recognition system for multi-scale sub-block aggregation in a natural environment is characterized by comprising:
the multi-scale sub-block generation module, used for predefining multi-scale parameters, inputting an expression picture into a regression convolutional neural network, and obtaining attention-area parameters of the expression picture, wherein the attention-area parameters are the translation parameters of each sub-block at each scale;
the feature extraction module, used for sampling sub-blocks of the expression picture according to the attention-area parameters and the multi-scale parameters, constructing a stacked convolutional layer for each sub-block of each scale, and extracting the features of all sub-blocks with the stacked convolutional layers, and also used for extracting the global features of the expression picture;
the single-scale feature fusion module, used for fusing the features of all sub-blocks at the same scale to obtain a single-scale fused feature vector corresponding to each scale;
and the multi-scale feature aggregation and recognition module, used for aggregating the single-scale fused feature vectors of all scales with the global features and inputting the aggregated feature into a fully-connected network to obtain the expression recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010795929.3A CN112036260B (en) | 2020-08-10 | 2020-08-10 | Expression recognition method and system for multi-scale sub-block aggregation in natural environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112036260A true CN112036260A (en) | 2020-12-04 |
CN112036260B CN112036260B (en) | 2023-03-24 |
Family
ID=73576828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010795929.3A Active CN112036260B (en) | 2020-08-10 | 2020-08-10 | Expression recognition method and system for multi-scale sub-block aggregation in natural environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112036260B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130215264A1 (en) * | 2010-07-08 | 2013-08-22 | The Regents Of The University Of California | End-to-end visual recognition system and methods |
CN107578007A (en) * | 2017-09-01 | 2018-01-12 | 杭州电子科技大学 | A kind of deep learning face identification method based on multi-feature fusion |
CN109409222A (en) * | 2018-09-20 | 2019-03-01 | 中国地质大学(武汉) | A kind of multi-angle of view facial expression recognizing method based on mobile terminal |
CN109508654A (en) * | 2018-10-26 | 2019-03-22 | 中国地质大学(武汉) | Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks |
CN110070511A (en) * | 2019-04-30 | 2019-07-30 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110097136A (en) * | 2019-05-09 | 2019-08-06 | 杭州筑象数字科技有限公司 | Image classification method neural network based |
US20190320974A1 (en) * | 2018-04-19 | 2019-10-24 | University Of South Florida | Comprehensive and context-sensitive neonatal pain assessment system and methods using multiple modalities |
CN110580461A (en) * | 2019-08-29 | 2019-12-17 | 桂林电子科技大学 | Facial expression recognition algorithm combined with multilevel convolution characteristic pyramid |
CN110674774A (en) * | 2019-09-30 | 2020-01-10 | 新疆大学 | Improved deep learning facial expression recognition method and system |
Non-Patent Citations (4)
Title |
---|
ANDREW TAO ET AL.: "Hierarchical Multi-Scale Attention for Semantic Segmentation", arXiv, 21 May 2020, pages 1-11 |
ZHOUXIA WANG ET AL.: "Multi-label Image Recognition by Recurrently Discovering Attentional Regions", 2017 IEEE International Conference on Computer Vision (ICCV), 25 December 2017, pages 3-6 |
LI ZHENGHAO: "Facial Expression Recognition Based on a Deep Attention Network" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, 15 January 2020, pages 138-1881 |
HU YUHAN: "Research on Domain-Adaptive Expression Recognition" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, 15 July 2020, pages 24-44 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112954633A (en) * | 2021-01-26 | 2021-06-11 | 电子科技大学 | Parameter constraint-based dual-network architecture indoor positioning method |
CN112954633B (en) * | 2021-01-26 | 2022-01-28 | 电子科技大学 | Parameter constraint-based dual-network architecture indoor positioning method |
CN113095185A (en) * | 2021-03-31 | 2021-07-09 | 新疆爱华盈通信息技术有限公司 | Facial expression recognition method, device, equipment and storage medium |
CN113095185B (en) * | 2021-03-31 | 2024-09-10 | 新疆爱华盈通信息技术有限公司 | Facial expression recognition method, device, equipment and storage medium |
CN113111940A (en) * | 2021-04-13 | 2021-07-13 | 东南大学 | Expression recognition method based on feature fusion |
CN113408503A (en) * | 2021-08-19 | 2021-09-17 | 明品云(北京)数据科技有限公司 | Emotion recognition method and device, computer readable storage medium and equipment |
CN113408503B (en) * | 2021-08-19 | 2021-12-21 | 明品云(北京)数据科技有限公司 | Emotion recognition method and device, computer readable storage medium and equipment |
CN114299041A (en) * | 2021-12-31 | 2022-04-08 | 之江实验室 | Electronic choledochoscope image auxiliary diagnosis method based on deep multi-instance learning |
Also Published As
Publication number | Publication date |
---|---|
CN112036260B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108520535B (en) | Object classification method based on depth recovery information | |
CN110276316B (en) | Human body key point detection method based on deep learning | |
WO2023056889A1 (en) | Model training and scene recognition method and apparatus, device, and medium | |
Chen et al. | Fsrnet: End-to-end learning face super-resolution with facial priors | |
CN112036260B (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN111401384B (en) | Transformer equipment defect image matching method | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
Wang et al. | Small-object detection based on yolo and dense block via image super-resolution | |
CN111783748B (en) | Face recognition method and device, electronic equipment and storage medium | |
CN113221663B (en) | Real-time sign language intelligent identification method, device and system | |
CN112101262B (en) | Multi-feature fusion sign language recognition method and network model | |
Tuzel et al. | Global-local face upsampling network | |
CN110378208B (en) | Behavior identification method based on deep residual error network | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN110175248B (en) | Face image retrieval method and device based on deep learning and Hash coding | |
CN114758288A (en) | Power distribution network engineering safety control detection method and device | |
CN106650617A (en) | Pedestrian abnormity identification method based on probabilistic latent semantic analysis | |
Liu et al. | Pose-adaptive hierarchical attention network for facial expression recognition | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN113378949A (en) | Dual-generation confrontation learning method based on capsule network and mixed attention | |
Zhou et al. | Attention transfer network for nature image matting | |
CN117333908A (en) | Cross-modal pedestrian re-recognition method based on attitude feature alignment | |
CN110348395B (en) | Skeleton behavior identification method based on space-time relationship | |
CN111898566A (en) | Attitude estimation method, attitude estimation device, electronic equipment and storage medium | |
Xiong et al. | SPEAL: Skeletal Prior Embedded Attention Learning for Cross-Source Point Cloud Registration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |