
CN117423154A - Micro-expression recognition method, recognition model training method, device, equipment and medium - Google Patents

Micro-expression recognition method, recognition model training method, device, equipment and medium Download PDF

Info

Publication number
CN117423154A
Authority
CN
China
Prior art keywords
image
text
initial
recognition model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311604155.1A
Other languages
Chinese (zh)
Inventor
满园园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202311604155.1A priority Critical patent/CN117423154A/en
Publication of CN117423154A publication Critical patent/CN117423154A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a micro-expression recognition method, a recognition model training method, a device, equipment and a medium. In the method, the recognition model is trained with both image-modality data and text-modality data. During training, the text decoder decodes and reconstructs the image features in the image-modality data, and the image decoder decodes and reconstructs the text features in the text-modality data, so that the image decoder learns the features of the corresponding text modality and the text decoder learns the features of the image modality; each decoder thus learns features of multiple modalities. This improves the reconstruction accuracy for the different modalities during decoding and reconstruction, and training based on the reconstruction loss improves the ability of the image encoder and the text encoder to represent the data of their respective modalities, so that the recognition accuracy is improved when the trained recognition model is used for micro-expression recognition.

Description

Micro-expression recognition method, recognition model training method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a micro-expression recognition method, a recognition model training method, a device, equipment and a medium.
Background
A micro-expression is a subtle, involuntary facial expression characterized by short duration and low intensity. Existing micro-expression recognition methods fall mainly into two categories: methods based on hand-crafted features and methods based on deep learning. Classification based on hand-crafted features is unstable during recognition, while for recognition based on deep features the existing micro-expression databases contain only a small number of samples, and training the recognition model with only single-modality information on such limited samples results in low recognition accuracy. How to improve the recognition accuracy of the recognition model for micro-expression recognition is therefore an urgent problem to be solved.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a micro-expression recognition method, a recognition model training method, a device, equipment and a medium to solve the problem of low recognition accuracy of the recognition model in the micro-expression recognition process.
A first aspect of an embodiment of the present application provides a training method of a microexpressive recognition model based on an offline guest-meeting service, where the training method of the recognition model includes:
acquiring a target image and a target text corresponding to the target image, and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder;
And performing supervision training on the initial recognition model by using the target image and the target text to obtain a trained recognition model.
A second aspect of an embodiment of the present application provides a method for identifying a micro-expression based on an offline guest-meeting service, where the method includes:
acquiring an image to be identified and a text to be identified corresponding to the image to be identified;
inputting the image to be recognized and the text to be recognized into the recognition model trained by the training method of the recognition model in the first aspect, and outputting a microexpressive recognition result.
A third aspect of the embodiments of the present application provides a training device for a microexpressive recognition model based on an offline guest-meeting service, where the training device for the recognition model includes:
the first acquisition module is used for acquiring a target image, a target text corresponding to the target image and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder;
And the training module is used for performing supervision training on the initial recognition model by using the target image and the target text to obtain a trained recognition model.
A fourth aspect of the embodiments of the present application provides a micro-expression recognition device based on an offline guest-meeting service, where the recognition device includes:
the second acquisition module is used for acquiring the image to be identified;
the use module is used for inputting the image to be identified into the identification model trained by the training method of the identification model in the first aspect, and outputting a microexpressive identification result.
A fifth aspect of embodiments of the present application provides a terminal device comprising a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the training method of the recognition model as described in the first aspect or the recognition method as described in the second aspect when executing the computer program.
A sixth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the training method of the recognition model as described in the first aspect above or the recognition method as described in the second aspect above.
Compared with the prior art, the invention has the beneficial effects that:
the method comprises the steps of obtaining a target image and a target text corresponding to the target image, and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding coding features output by the initial text encoder, the initial text decoder is used for decoding the coding features output by the initial image encoder, and the target image and the target text are used for performing supervision training on the initial recognition model to obtain a trained recognition model. In the method, the image mode data and the text mode data are used for training the recognition model, during training, the text decoder is used for decoding and reconstructing image features in the image mode data, the image decoder is used for decoding and reconstructing text features in the text mode data, so that the image decoder can learn the features of the corresponding text mode, the text decoder can learn the features of the image mode, the decoder can learn the features of multiple modes, during decoding and reconstructing, the reconstruction precision of different modes can be improved, during training based on reconstruction loss, the feature representation capability of the image encoder and the text encoder on the corresponding mode data is improved, and further, during micro-expression recognition by using the trained recognition model, the recognition precision can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a micro-expression recognition method and a recognition model training method according to an embodiment of the present invention;
fig. 2 is a flow chart of a training method of a microexpressive recognition model based on an offline guest-meeting service according to an embodiment of the invention;
fig. 3 is a flow chart of a micro-expression recognition method based on an offline guest-meeting service according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a training device of a microexpressive recognition model based on an offline guest-meeting service according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a microexpressive recognition device based on an offline guest-meeting service according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The micro-expression recognition method and the recognition model training method provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, in which a client communicates with a server. The client includes, but is not limited to, a handheld computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) and other terminal devices. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
Referring to fig. 2, a flow chart of a training method of a micro-expression recognition model based on an offline guest-receiving service according to an embodiment of the present invention is shown, where the training method of the micro-expression recognition model based on the offline guest-receiving service may be applied to a server in fig. 1, and the server is connected to a corresponding client, as shown in fig. 2, and the training method of the micro-expression recognition model based on the offline guest-receiving service may include the following steps.
S201: the method comprises the steps of obtaining a target image and a target text corresponding to the target image, and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder.
In step S201, a target image and a target text corresponding to the target image are acquired, where the target image and the target text are training data sets, the target image is a facial image containing a micro expression, the target text is a text description of the corresponding target image, that is, each target image corresponds to one target text, the corresponding target image and the target text form a description pair, and when the recognition model is trained, the input target image and the target text are images and texts in the same description pair. The target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder.
In this embodiment, consider the scenario of insurance sales: when a salesperson meets a customer in person, the salesperson introduces a prepared insurance scheme to the customer and presents a large amount of material according to the customer's questions and requirements. To improve the efficiency of the salesperson, the customer's response to each insurance product is generally observed through an artificial intelligence system, so that the salesperson can recommend the corresponding insurance product to the customer; when the customer is observed through the artificial intelligence system, the customer's degree of inclination toward each insurance product can generally be determined by observing the customer's micro-expressions.
In the method, video of the salesperson's face-to-face meeting with the customer is captured, and the corresponding target images are obtained from the video data; the target images contain image data of different micro-expressions. The apex frame of each micro-expression sequence in the video data is located, the video frame corresponding to the apex frame is extracted as a target image, and a text description of the target image is produced to obtain the target text corresponding to the target image.
It should be noted that, when producing the text description of the target image, this embodiment uses facial Action Units (AUs) to describe the micro-expression in the target image. AUs are the units of the Facial Action Coding System; for example, in that system the two AUs corresponding to happiness are AU6 and AU12, where the text information corresponding to AU6 is cheek raising and the text information corresponding to AU12 is raising of the mouth corners. Text expressions of different micro-expressions can therefore be obtained from the AUs, and the corresponding text descriptions are used as the target texts.
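To make the AU-based text generation concrete, here is a minimal sketch of how a target text might be assembled from detected AUs; the mapping dictionary and helper function are hypothetical and only cover the AU6/AU12 example above.

```python
# Hypothetical sketch: assemble a target text from facial Action Units (AUs).
# The mapping only covers the AU6/AU12 happiness example above; a real system
# would cover the full Facial Action Coding System.
AU_TEXT = {
    "AU6": "cheek raising",
    "AU12": "raising of the mouth corners",
}

def describe_micro_expression(active_aus):
    """Concatenate the text descriptions of the detected AUs into a target text."""
    parts = [AU_TEXT[au] for au in active_aus if au in AU_TEXT]
    return ", ".join(parts)

# e.g. a happy micro-expression detected in the apex frame
print(describe_micro_expression(["AU6", "AU12"]))  # cheek raising, raising of the mouth corners
```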
The target image and the target text are then annotated with feature labels and classification labels: the image feature tag and the image classification tag are determined for the target image, and the text feature tag and the text classification tag are determined for the target text. These serve as the ground-truth labels from which the loss of the initial recognition model is calculated.
The initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, wherein the initial image decoder is used for decoding coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder. The initial recognition model comprises different branches, and different modal data are processed respectively.
It should be noted that the initial recognition model further includes a classification network, which is used for classifying the coding features output by the image encoder and the coding features output by the text encoder; by calculating the corresponding classification loss, the feature extraction accuracy of the image encoder and the text encoder is improved.
It should be noted that the initial image encoder in the initial recognition model may include a convolution processing layer, a deconvolution processing layer and a skip-connection layer. The convolution processing layer extracts convolution features from the target image, the deconvolution processing layer restores low-level features from the convolution features, and the skip-connection layer combines different features of the target image by concatenating them along the channel dimension, so that multi-scale feature fusion can be realized and the image coding features are obtained. The initial image decoder may include three convolution blocks: the first two each consist of a deconvolution layer, a BN layer and a PReLU layer arranged in sequence, and the third consists of a convolution layer and an activation function layer; the decoder is used to upsample the encoded features and reconstruct them.
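As an illustration of the decoder structure just described (two deconvolution blocks followed by a convolution block), below is a minimal PyTorch sketch; the channel widths, kernel sizes and the choice of Sigmoid as the final activation are assumptions rather than details taken from this publication.

```python
import torch
import torch.nn as nn

class ImageDecoderSketch(nn.Module):
    """Sketch of the three-block decoder: two ConvTranspose+BN+PReLU blocks, then Conv+activation.
    Channel widths and the final Sigmoid are illustrative assumptions."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, kernel_size=4, stride=2, padding=1),  # upsample x2
            nn.BatchNorm2d(64),
            nn.PReLU(),
        )
        self.block2 = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # upsample x2
            nn.BatchNorm2d(32),
            nn.PReLU(),
        )
        self.block3 = nn.Sequential(
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # activation function layer
        )

    def forward(self, x):
        return self.block3(self.block2(self.block1(x)))

# e.g. decoding an 8x8 feature map back toward a 32x32 reconstruction
features = torch.randn(1, 128, 8, 8)
print(ImageDecoderSketch()(features).shape)  # torch.Size([1, 1, 32, 32])
```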
The initial text encoder and the initial text decoder may use the same type of neural network model or different types. For example, the initial text encoder and the initial text decoder may both use convolutional neural network models, or the initial text encoder may use a long short-term memory (LSTM) network and the initial text decoder a recurrent neural network, and so on.
S202: and performing supervision training on the initial recognition model by using the target image and the target text to obtain a trained recognition model.
In step S202, the initial recognition model is supervised and trained by using the target image and the target text until the loss value of the loss function obtained by the initial recognition model meets the preset condition, and the training is completed, so as to obtain a trained recognition model.
In this embodiment, the target image is input into the initial image encoder, which outputs the image coding features of the target image, and the target text is input into the initial text encoder, which outputs the text coding features of the target text. The image coding features are then input into the initial text decoder, which outputs a reconstructed image, and the text coding features are input into the initial image decoder, which outputs a reconstructed text. A corresponding loss value is calculated from the reconstructed image, the reconstructed text, the image coding features, the text coding features, the image feature tag of the target image and the text feature tag of the target text; when the loss value meets the preset condition, training of the initial recognition model is completed and the trained recognition model is obtained.
If the initial recognition model includes a classification network, inputting the image coding feature and the text coding feature into the classification network, outputting a classification result of the target image and a classification result of the target text, calculating a classification loss value according to the classification result of the target image, the classification result of the target text, and the classification label of the target image and the classification label of the target text, and completing training of the initial recognition model according to the corresponding classification loss and the loss value, thereby obtaining a trained recognition model.
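To make the data flow concrete, the following is a minimal sketch of one training step under the cross-decoding arrangement described above; the module and loss-function arguments are placeholders supplied by the caller, not the publication's actual implementation, and the reconstructions are compared against the raw inputs here for simplicity rather than against the feature tags.

```python
def training_step(image, text, labels,
                  img_encoder, txt_encoder, img_decoder, txt_decoder, classifier,
                  recon_loss_fn, cls_loss_fn):
    """One supervised step: cross-modal reconstruction plus classification.
    All modules and loss functions are placeholders provided by the caller."""
    img_feat = img_encoder(image)        # image coding features
    txt_feat = txt_encoder(text)         # text coding features

    recon_image = txt_decoder(img_feat)  # text decoder reconstructs an image from image features
    recon_text = img_decoder(txt_feat)   # image decoder reconstructs a text from text features

    # Reconstruction loss (compared to raw inputs for simplicity) plus classification loss on both branches.
    loss_rec = recon_loss_fn(recon_image, image) + recon_loss_fn(recon_text, text)
    loss_cls = cls_loss_fn(classifier(img_feat), labels) + cls_loss_fn(classifier(txt_feat), labels)
    return loss_rec + loss_cls
```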
Optionally, performing supervised training on the initial recognition model by using the target image and the target text to obtain a trained recognition model, including:
encoding the target image by using an initial image encoder to obtain a first encoding characteristic, and encoding the target text by using an initial text encoder to obtain a second encoding characteristic;
calculating a semantic loss value according to the first coding feature and the second coding feature;
adopting an initial image decoder to decode the second coding feature to obtain a second decoding result, and adopting an initial text decoder to decode the first coding feature to obtain a first decoding result;
Calculating a cross-modal reconstruction loss value according to the first decoding result, the second decoding result, the image feature tag and the text feature tag;
and performing supervision training on the initial recognition model according to the cross-modal reconstruction loss value and the semantic loss value to obtain a trained recognition model.
In this embodiment, when training the initial recognition model, the target image is input into the initial image encoder to perform encoding processing on the target image, the first encoding feature is output, the target text is input into the initial text encoder to perform encoding processing on the target text, and the second encoding feature is output.
The first coding feature is a facial feature capable of representing the object to be identified in the target image. When extracting the first coding feature, convolution processing can be used, i.e. feature extraction is realized through convolution; for example, the convolution processing includes multiple convolution layers, i.e. multiple convolution calculations, and the features of the target image are gradually extracted through convolution with a 3×3 convolution kernel. If the target image is 572×572×1, it becomes 570×570×64 after one convolution (3×3 kernel). Each convolution operation generates feature maps, composed of patches, that preserve the relationships between pixels in the target image. It should be noted that the first coding feature may include a color feature, a texture feature, a shape feature, a spatial-relationship feature, and the like.
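The dimension example above can be checked with a short PyTorch sketch (a single 3×3 convolution without padding applied to a 572×572 single-channel image):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)  # 3x3 kernel, no padding, stride 1
image = torch.randn(1, 1, 572, 572)   # a single-channel 572x572 target image
features = conv(image)
print(features.shape)                 # torch.Size([1, 64, 570, 570])
```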
The initial text encoder may be a convolutional neural network, a long short-term memory (LSTM) network or a Transformer. For example, the initial text encoder may include a multi-layer LSTM network, in which a first LSTM layer acquires the feature of each word in the target text and a second LSTM layer acquires the feature of the whole target text from the per-word features.
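Below is a minimal sketch of such a two-level LSTM text encoder, with a word-level layer followed by a text-level layer; the vocabulary size, embedding dimension and hidden dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    """First LSTM layer extracts per-word features; second LSTM layer summarises them
    into a feature for the whole target text. Sizes are illustrative."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        word_feats, _ = self.word_lstm(self.embed(token_ids))   # per-word features
        _, (h_n, _) = self.sent_lstm(word_feats)                 # whole-text feature
        return h_n[-1]                                           # second coding feature

tokens = torch.randint(0, 1000, (1, 12))    # a 12-token target text
print(TextEncoderSketch()(tokens).shape)     # torch.Size([1, 128])
```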
A semantic loss value is calculated from the first coding feature and the second coding feature. The semantic loss value measures the difference between the features of the two modalities of the same micro-expression and is obtained by computing the distance between the first coding feature and the second coding feature with a distance function. In this embodiment, calculating the semantic loss value allows text features to be embedded into the image encoder and image features to be embedded into the text encoder, thereby improving the representation capability for the micro-expression.
And decoding the second coding feature by using an initial image decoder to obtain a second decoding result, and decoding the first coding feature by using an initial text decoder to obtain a first decoding result.
The initial image decoder may include three convolution blocks, where the first two each consist of a deconvolution layer, a BN layer and a PReLU layer arranged in sequence, and the third consists of a convolution layer and an activation function layer. The initial image decoder is used to upsample the second coding feature and reconstruct it into a predicted text similar to the target text, i.e. the second decoding result.
The initial text decoder may be constructed on the basis of a Transformer decoder, including a self-attention sub-layer, a cross-attention sub-layer and a feed-forward sub-layer, and outputs a reconstructed image, i.e. the first decoding result, from the first coding feature.
A cross-modal reconstruction loss value is calculated from the first decoding result, the second decoding result, the image feature tag and the text feature tag. When calculating the cross-modal reconstruction loss value, the difference between the reconstructed image and the target image and the difference between the reconstructed text and the target text may be calculated. The cross-modal reconstruction loss enables the image decoder to learn the features of the corresponding text modality and the text decoder to learn the features of the image modality, so the decoders learn multi-modal features; this improves the reconstruction accuracy for the different modalities during decoding and reconstruction, and training based on the reconstruction loss improves the ability of the image encoder and the text encoder to represent the data of their respective modalities, so that the recognition accuracy is improved when the trained recognition model is used for micro-expression recognition.
Supervision training is then performed on the initial recognition model according to the cross-modal reconstruction loss value and the semantic loss value to obtain the trained recognition model: a total loss is determined from the cross-modal reconstruction loss value and the semantic loss value, and during training the initial image encoder, the initial image decoder, the initial text encoder and the initial text decoder are jointly trained with a gradient-descent method, yielding the trained image encoder, image decoder, text encoder and text decoder and thus the trained recognition model. The gradient-descent method may be stochastic gradient descent, batch gradient descent, or the like.
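A minimal sketch of the joint gradient-descent update over the four modules is given below; the tiny linear stand-ins, the use of plain SGD and the learning rate are assumptions for illustration only.

```python
import itertools
import torch
import torch.nn as nn

# Tiny linear stand-ins for the four trainable modules (the real architectures are described above).
img_encoder, img_decoder = nn.Linear(16, 8), nn.Linear(8, 16)
txt_encoder, txt_decoder = nn.Linear(16, 8), nn.Linear(8, 16)

# Joint optimisation: a single optimiser over the parameters of all four modules.
params = itertools.chain(img_encoder.parameters(), img_decoder.parameters(),
                         txt_encoder.parameters(), txt_decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)  # stochastic gradient descent; learning rate is illustrative

image_vec, text_vec = torch.randn(2, 16), torch.randn(2, 16)
# Placeholder total loss with cross decoding: text decoder gets image features, image decoder gets text features.
total_loss = ((txt_decoder(img_encoder(image_vec)) - image_vec) ** 2).mean() \
           + ((img_decoder(txt_encoder(text_vec)) - text_vec) ** 2).mean()

optimizer.zero_grad()
total_loss.backward()  # backpropagate the total loss
optimizer.step()       # one joint gradient-descent update of all four modules
```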
Optionally, performing supervised training on the initial recognition model by using the target image and the target text, and obtaining the trained recognition model further includes:
classifying the first coding feature and the second coding feature respectively to obtain a first classification result of the first coding feature and a second classification result of the second coding feature;
calculating a classification loss value according to the first classification result, the second classification result, the image classification label and the text classification label;
according to a preset weight value, weighting and summing the cross-modal reconstruction loss value, the semantic loss value and the classification loss value to obtain a total loss value;
And adjusting parameters of the initial image encoder, the initial image decoder, the initial text encoder and the initial text decoder in the initial recognition model according to the total loss value until the total loss value meets a preset condition, so as to obtain a trained recognition model.
In this embodiment, the first coding feature and the second coding feature are classified respectively to obtain a first classification result of the first coding feature and a second classification result of the second coding feature, and when classifying, the first coding feature and the second coding feature output by the encoder can be classified by a classification layer in the initial recognition model, so as to obtain the first classification result of the first coding feature and the second classification result of the second coding feature.
A classification loss value is calculated from the first classification result, the second classification result, the image classification label and the text classification label. In the classification loss, j denotes the micro-expression class of the target image and the target text and n denotes the total number of classes; the first classification result is the predicted probability that the i-th target image belongs to class j, the second classification result is the predicted probability that the i-th target text belongs to class j, and the image classification label and the text classification label provide the corresponding ground-truth values; i indexes the description pair formed by the i-th target image and the target text, and m is the number of description pairs of target images and target texts.
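Assuming a standard cross-entropy over the predicted class probabilities of the image branch and the text branch, a per-pair classification loss of this kind might be sketched as follows; the exact formula in the publication may differ.

```python
import torch
import torch.nn.functional as F

def classification_loss(image_logits, text_logits, class_labels):
    """Cross-entropy over both branches, averaged over the m description pairs.
    image_logits, text_logits: (m, n) unnormalised scores; class_labels: (m,) class indices."""
    return F.cross_entropy(image_logits, class_labels) + F.cross_entropy(text_logits, class_labels)

# e.g. m = 4 description pairs, n = 3 micro-expression classes
m, n = 4, 3
loss = classification_loss(torch.randn(m, n), torch.randn(m, n), torch.randint(0, n, (m,)))
print(loss.item())
```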
And according to a preset weight value, carrying out weighted summation on the cross-modal reconstruction loss value, the semantic loss value and the classification loss value to obtain a total loss value, and according to the total loss value, adjusting parameters of an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder in the initial recognition model until the total loss value meets a preset condition to obtain a trained recognition model, wherein the preset condition can be stopping training when the total loss value is smaller than a preset threshold value, or stopping training when the training times meet a preset training times threshold value, and the like.
When the cross-modal reconstruction loss value, the semantic loss value and the classification loss value are weighted and summed, the preset weight of each loss value can be set according to the importance of the different losses. For example, in this embodiment the cross-modal reconstruction loss value can be given a larger weight and the semantic loss value and the classification loss value smaller weights, so the preset weights of the cross-modal reconstruction loss value, the semantic loss value and the classification loss value may be 0.5, 0.3 and 0.2 respectively. In this implementation, the corresponding total loss value is determined from the cross-modal reconstruction loss value, the semantic loss value and the classification loss value, which improves the accuracy of the total loss value and thus the accuracy of the recognition model obtained by subsequent training.
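With the example weights above (0.5, 0.3 and 0.2), the weighted combination reduces to a single line; the numeric loss values below are placeholders.

```python
# Example weights from above; loss_rec, loss_sem, loss_cls are the three scalar losses.
w_rec, w_sem, w_cls = 0.5, 0.3, 0.2
loss_rec, loss_sem, loss_cls = 0.8, 0.4, 0.6   # placeholder values for illustration
total_loss = w_rec * loss_rec + w_sem * loss_sem + w_cls * loss_cls
print(total_loss)   # 0.64
```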
In another embodiment, the total loss value may also be calculated in other ways, for example by taking the sum of the classification loss value and the semantic loss value as the total loss value.
Optionally, calculating the semantic loss value according to the first encoding feature and the second encoding feature includes:
mapping the first coding feature and the second coding feature into a preset subspace to obtain a first mapping feature of the first coding feature and a second mapping feature of the second coding feature;
and calculating the mean square error of the first mapping feature and the second mapping feature to obtain a semantic loss value.
In this embodiment, when calculating the semantic loss value, the first coding feature and the second coding feature are features of different modalities and their dimensions may not be equal, so the first coding feature and the second coding feature can be mapped into a preset subspace, which serves as a common space; the first mapping feature of the mapped first coding feature and the second mapping feature of the mapped second coding feature then have the same dimension, so the distance difference between the first mapping feature and the second mapping feature can be calculated and used as the semantic loss value. The semantic loss value can be written as L_corr = (1/m) Σ_{i=1..m} ‖ṽ_i − t̃_i‖², where L_corr is the semantic loss value, i indexes the description pair formed by the i-th target image and the target text, m is the number of description pairs, ṽ_i is the first mapping feature obtained from the first coding feature, and t̃_i is the second mapping feature obtained from the second coding feature.
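A minimal sketch of this semantic loss, using two linear projection heads into a shared subspace followed by the mean squared error between the mapped features; the feature dimensions and the use of linear projections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed projection heads mapping image features (dim 256) and text features (dim 128)
# into a shared 64-dimensional subspace.
img_proj = nn.Linear(256, 64)
txt_proj = nn.Linear(128, 64)

def semantic_loss(image_features, text_features):
    """MSE between the mapped (subspace) features of the two modalities, averaged over pairs."""
    return F.mse_loss(img_proj(image_features), txt_proj(text_features))

loss = semantic_loss(torch.randn(4, 256), torch.randn(4, 128))   # 4 description pairs
print(loss.item())
```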
In the training process, multiple pairs of target images and target texts are used, so that a distance value is obtained for each description pair, and the mean square error over these pairs is calculated as the semantic loss value.
Optionally, calculating the cross-modal reconstruction loss value according to the first decoding result, the second decoding result, the image feature tag and the text feature tag includes:
determining an image reconstruction value in the first decoding result according to the first decoding result;
determining a text reconstruction value in the second decoding result according to the second decoding result;
calculating a first sub-loss value according to the image reconstruction value and the image characteristic label;
calculating a second sub-loss value according to the text reconstruction value and the text feature tag;
and calculating a cross-modal reconstruction loss value according to the first sub-loss value and the second sub-loss value.
In this embodiment, when the cross-modal reconstruction loss value is calculated from the first decoding result, the second decoding result, the image feature tag and the text feature tag, the image reconstruction value in the first decoding result is determined from the first decoding result, where the image reconstruction value may be a pixel value in the reconstructed image; the text reconstruction value in the second decoding result is determined from the second decoding result, where the text reconstruction value is the word vector of each word in the target text.
A first sub-loss value is calculated from the image reconstruction value and the image feature tag: the feature tag of the target image is compared with the image pixel values in the first decoding result, the difference between them is determined, and this difference is taken as the first sub-loss value.
A second sub-loss value is calculated from the text reconstruction value and the text feature tag: the text reconstruction value is the word vector of each word in the target text, the text feature tag is the word-vector label of the target text, the vector distance between the text reconstruction value and the text feature tag is calculated, and this distance is taken as the second sub-loss value. The first sub-loss value and the second sub-loss value are then added to obtain the cross-modal reconstruction loss value. In the cross-modal reconstruction loss, L_rec denotes the cross-modal reconstruction loss value, i indexes the description pair formed by the i-th target image and the target text, m is the number of description pairs, v_i is the image feature tag of the i-th target image and v̂_i is the image reconstruction value in the first decoding result, and t_i is the text feature tag of the i-th target text and t̂_i is the text reconstruction value in the second decoding result.
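A minimal sketch of the cross-modal reconstruction loss as the sum of the two sub-losses, assuming mean squared error for both the image comparison and the word-vector comparison; the exact distance used in the publication may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_reconstruction_loss(recon_image, image_feature_tag, recon_text, text_feature_tag):
    """First sub-loss: reconstructed image vs. image feature tag (pixel values).
    Second sub-loss: reconstructed word vectors vs. text feature tag (word-vector labels)."""
    first_sub_loss = F.mse_loss(recon_image, image_feature_tag)
    second_sub_loss = F.mse_loss(recon_text, text_feature_tag)
    return first_sub_loss + second_sub_loss

loss = cross_modal_reconstruction_loss(torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32),
                                        torch.randn(12, 64), torch.randn(12, 64))
print(loss.item())
```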
The method comprises the steps of obtaining a target image, a target text corresponding to the target image and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, and the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder; the initial image decoder is used for decoding the coding features output by the initial text encoder, and the initial text decoder is used for decoding the coding features output by the initial image encoder. The target image and the target text are then used to perform supervision training on the initial recognition model to obtain a trained recognition model. In this method, the recognition model is trained with both image-modality data and text-modality data. During training, the text decoder decodes and reconstructs the image features in the image-modality data, and the image decoder decodes and reconstructs the text features in the text-modality data, so that the image decoder learns the features of the corresponding text modality and the text decoder learns the features of the image modality; each decoder thus learns features of multiple modalities. This improves the reconstruction accuracy for the different modalities during decoding and reconstruction, and training based on the reconstruction loss improves the ability of the image encoder and the text encoder to represent the data of their respective modalities, so that the recognition accuracy is improved when the trained recognition model is used for micro-expression recognition.
Referring to fig. 3, a flowchart of a micro-expression recognition method according to an embodiment of the invention is shown in fig. 3, and the micro-expression recognition method may include the following steps:
S301: acquiring an image to be identified;
S302: inputting the image to be identified into the trained identification model, and outputting the micro-expression identification result.
In this embodiment, an image to be recognized and a text to be recognized corresponding to the image to be recognized are obtained, wherein the image to be recognized is a facial image including micro expressions.
The image to be recognized is input into the trained recognition model, and the micro-expression recognition result is output, where the trained recognition model is obtained by training based on the training method of the recognition model described above. The trained recognition model comprises a trained image encoder, which performs feature coding on the image to be recognized to obtain the image coding features of the image to be recognized; the image coding features are classified by the classification layer in the trained recognition model to obtain the micro-expression recognition result for the image to be recognized.
In this embodiment, the image encoder in the trained recognition model has learned both image features and text features, so the trained image encoder can be used to perform micro-expression recognition on the image to be recognized, yielding a micro-expression recognition result with higher recognition accuracy.
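A minimal sketch of this inference path, with placeholder stand-ins for the trained image encoder and the classification layer:

```python
import torch
import torch.nn as nn

# Stand-ins for the trained image encoder and classification layer (n = 3 classes here).
trained_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128), nn.ReLU())
classification_layer = nn.Linear(128, 3)

def recognize(image_to_identify):
    """Encode the facial image and classify the coding features into a micro-expression class."""
    with torch.no_grad():
        coding_features = trained_image_encoder(image_to_identify)
        probs = classification_layer(coding_features).softmax(dim=-1)
    return probs.argmax(dim=-1)   # predicted micro-expression class index

print(recognize(torch.rand(1, 1, 32, 32)))
```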
Referring to fig. 4, fig. 4 is a schematic structural diagram of a training device for a micro-expression recognition model based on an offline guest-meeting service according to an embodiment of the present invention. The units included in the present embodiment are used to perform the steps in the embodiment corresponding to fig. 2. Please refer to the related description in the embodiment corresponding to fig. 2. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 4, the training apparatus 40 for recognition model includes: a first acquisition module 41, a training module 42.
The first obtaining module 41 is configured to obtain a target image, a target text corresponding to the target image and an initial recognition model, where the target image includes an image feature tag and an image classification tag, the target text includes a text feature tag and a text classification tag, and the initial recognition model includes an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder; the initial image decoder is configured to decode the text encoding features output by the initial text encoder, and the initial text decoder is configured to decode the image encoding features output by the initial image encoder.
The training module 42 is configured to perform supervised training on the initial recognition model by using the target image and the target text, so as to obtain a trained recognition model.
Optionally, the training module 42 includes:
and the coding unit is used for coding the target image by adopting an initial image coder to obtain a first coding characteristic, and coding the target text by adopting an initial text coder to obtain a second coding characteristic.
The first calculating unit is used for calculating a semantic loss value according to the first coding feature and the second coding feature.
And the decoding unit is used for decoding the second coding feature by adopting the initial image decoder to obtain a second decoding result, and decoding the first coding feature by adopting the initial text decoder to obtain a first decoding result.
The second calculating unit is used for calculating a cross-modal reconstruction loss value according to the first decoding result, the second decoding result, the image feature tag and the text feature tag.
The obtaining unit is used for performing supervision training on the initial recognition model according to the cross-modal reconstruction loss value and the semantic loss value to obtain a trained recognition model.
Optionally, the training module 42 further includes:
the classification unit is used for classifying the first coding feature and the second coding feature respectively to obtain a first classification result of the first coding feature and a second classification result of the second coding feature.
And the third calculation unit is used for calculating the classification loss value according to the first classification result, the second classification result, the image classification label and the text classification label.
And the weighted summation unit is used for carrying out weighted summation on the cross-modal reconstruction loss value, the semantic loss value and the classification loss value according to a preset weight value to obtain a total loss value.
And the adjusting unit is used for adjusting parameters of the initial image encoder, the initial image decoder, the initial text encoder and the initial text decoder in the initial recognition model according to the total loss value until the total loss value meets the preset condition, so as to obtain the trained recognition model.
Optionally, the first computing unit includes:
and the mapping subunit is used for mapping the first coding feature and the second coding feature into a preset subspace to obtain a first mapping feature of the first coding feature and a second mapping feature of the second coding feature.
The semantic loss value calculating subunit is used for calculating the mean square error of the first mapping feature and the second mapping feature to obtain a semantic loss value.
Optionally, the second computing unit includes:
and the first determination subunit is used for determining an image reconstruction value in the first decoding result according to the first decoding result.
And the second determining subunit is used for determining a text reconstruction value in the second decoding result according to the second decoding result.
And the first calculating subunit is used for calculating a first sub-loss value according to the image reconstruction value and the image characteristic label.
And the second calculating subunit is used for calculating a second sub-loss value according to the text reconstruction value and the text feature tag.
And the third calculation subunit is used for calculating the cross-modal reconstruction loss value according to the first sub-loss value and the second sub-loss value.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a micro-expression recognition device based on an offline guest-meeting service according to an embodiment of the present invention. The terminal in this embodiment includes units for executing the steps in the embodiment corresponding to fig. 3. Refer specifically to the related description in the embodiment corresponding to fig. 3. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 5, the recognition apparatus 50 includes: a second acquisition module 51 and a use module 52.
The second obtaining module 51 is configured to obtain an image of the micro-expression of the object to be identified and a text describing the micro-expression of the object to be identified.
The use module 52 is configured to input the image to be recognized and the text to be recognized into the trained recognition model, and output a microexpressive recognition result.
It should be noted that, because the content of information interaction and execution process between the modules is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and details are not repeated herein.
Fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 6, the terminal device of this embodiment includes: at least one processor (only one shown in fig. 6), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor executing the computer program to perform the steps of any of the various microexpressive recognition method, recognition model training method embodiments described above.
The terminal device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that fig. 6 is merely an example of a terminal device and is not limiting of the terminal device, and that the terminal device may comprise more or less components than shown, or may combine some components, or different components, e.g. may further comprise a network interface, a display screen, an input device, etc.
The processor may be a CPU, but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory includes a readable storage medium, an internal memory, and the like, where the internal memory is the memory of the terminal device and provides an environment for the operation of the operating system and the computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of the terminal device; in other embodiments, it may be an external storage device of the terminal device, for example, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the terminal device. Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used to store the operating system, application programs, a boot loader (BootLoader), data and other programs, such as the program code of the computer program, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated; in practical applications, the above functions may be allocated to different functional units and modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above apparatus, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods in the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
The implementation of all or part of the flow of the method in the foregoing embodiment may also be implemented by a computer program product, which when executed on a terminal device, causes the terminal device to implement the steps in the foregoing method embodiment.
Each of the foregoing embodiments has its own emphasis in description; for parts that are not detailed or illustrated in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A training method for a micro-expression recognition model based on an offline guest reception service, characterized by comprising:
acquiring a target image, a target text corresponding to the target image, and an initial recognition model, wherein the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition model comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is configured to decode coding features output by the initial text encoder, and the initial text decoder is configured to decode coding features output by the initial image encoder;
and performing supervised training on the initial recognition model by using the target image and the target text to obtain a trained recognition model.
2. The recognition model training method according to claim 1, wherein the performing supervised training on the initial recognition model by using the target image and the target text to obtain a trained recognition model comprises:
encoding the target image by using the initial image encoder to obtain a first coding feature, and encoding the target text by using the initial text encoder to obtain a second coding feature;
calculating a semantic loss value according to the first coding feature and the second coding feature;
decoding the second coding feature by using the initial image decoder to obtain a second decoding result, and decoding the first coding feature by using the initial text decoder to obtain a first decoding result;
calculating a cross-modal reconstruction loss value according to the first decoding result, the second decoding result, the image feature tag and the text feature tag;
and performing supervised training on the initial recognition model according to the cross-modal reconstruction loss value and the semantic loss value to obtain the trained recognition model.
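The following sketch, offered only as an illustration under stated assumptions, wires the supervised training step of this claim together in PyTorch: both modalities are encoded, a semantic loss is taken between the two coding features, the features are cross-decoded, and a cross-modal reconstruction loss is computed against the feature tags. Mean-squared error is used here merely as a placeholder (it assumes compatible feature shapes); claims 4 and 5 describe how the two loss values may be refined.

import torch
import torch.nn as nn
import torch.nn.functional as F

def supervised_training_losses(image_encoder: nn.Module, text_encoder: nn.Module,
                               image_decoder: nn.Module, text_decoder: nn.Module,
                               target_image: torch.Tensor, target_text: torch.Tensor,
                               image_feature_tag: torch.Tensor,
                               text_feature_tag: torch.Tensor):
    # Encode the target image and target text to obtain the coding features.
    first_coding_feature = image_encoder(target_image)
    second_coding_feature = text_encoder(target_text)

    # Semantic loss between the two coding features (placeholder: direct MSE;
    # claim 4 maps both features into a preset subspace first).
    semantic_loss = F.mse_loss(first_coding_feature, second_coding_feature)

    # Cross-decode: image decoder on the text features, text decoder on the
    # image features, giving the second and first decoding results.
    second_decoding_result = image_decoder(second_coding_feature)
    first_decoding_result = text_decoder(first_coding_feature)

    # Cross-modal reconstruction loss against the feature tags (claim 5
    # decomposes this into two sub-loss values).
    reconstruction_loss = (F.mse_loss(first_decoding_result, image_feature_tag)
                           + F.mse_loss(second_decoding_result, text_feature_tag))
    return semantic_loss, reconstruction_loss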
3. The recognition model training method according to claim 2, wherein the performing supervised training on the initial recognition model by using the target image and the target text to obtain a trained recognition model further comprises:
classifying the first coding feature and the second coding feature respectively to obtain a first classification result of the first coding feature and a second classification result of the second coding feature;
calculating a classification loss value according to the first classification result, the second classification result, the image classification label and the text classification label;
according to a preset weight value, carrying out weighted summation on the cross-modal reconstruction loss value, the semantic loss value and the classification loss value to obtain a total loss value;
and adjusting parameters of the initial image encoder, the initial image decoder, the initial text encoder and the initial text decoder in the initial recognition model according to the total loss value until the total loss value meets a preset condition, so as to obtain a trained recognition model.
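As a hedged sketch of this claim only, the function below performs the weighted summation of the three loss values and one parameter update. The weight values, the optimizer (which would be constructed over the parameters of the initial image encoder, image decoder, text encoder and text decoder, for example with torch.optim.Adam) and the stopping threshold are illustrative assumptions, not values fixed by the claim.

import torch

def training_iteration(optimizer: torch.optim.Optimizer,
                       reconstruction_loss: torch.Tensor,
                       semantic_loss: torch.Tensor,
                       classification_loss: torch.Tensor,
                       weights=(1.0, 0.5, 1.0),
                       stop_threshold: float = 1e-3) -> bool:
    """Weighted summation of the three loss values and one parameter update.
    Returns True when the total loss meets the (assumed) preset condition."""
    w_recon, w_sem, w_cls = weights   # preset weight values (illustrative)
    total_loss = (w_recon * reconstruction_loss
                  + w_sem * semantic_loss
                  + w_cls * classification_loss)

    # Adjust the parameters of the initial image/text encoders and decoders.
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return total_loss.item() < stop_threshold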
4. The recognition model training method according to claim 2, wherein the calculating a semantic loss value according to the first coding feature and the second coding feature comprises:
mapping the first coding feature and the second coding feature into a preset subspace to obtain a first mapping feature of the first coding feature and a second mapping feature of the second coding feature;
and calculating the mean square error of the first mapping feature and the second mapping feature to obtain a semantic loss value.
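For illustration under assumptions, one possible realization of this claim maps both coding features into a preset subspace with linear projection layers (the projections and the subspace dimension are assumed for the example) and takes the mean square error of the two mapping features as the semantic loss value.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticLoss(nn.Module):
    """Sketch of claim 4: project both coding features into a preset
    subspace and compute their mean square error. The linear projections
    and the subspace dimension are illustrative assumptions."""

    def __init__(self, image_dim: int, text_dim: int, subspace_dim: int = 256):
        super().__init__()
        self.image_projection = nn.Linear(image_dim, subspace_dim)
        self.text_projection = nn.Linear(text_dim, subspace_dim)

    def forward(self, first_coding_feature: torch.Tensor,
                second_coding_feature: torch.Tensor) -> torch.Tensor:
        # Map the first/second coding features into the preset subspace.
        first_mapping_feature = self.image_projection(first_coding_feature)
        second_mapping_feature = self.text_projection(second_coding_feature)

        # Mean square error of the two mapping features = semantic loss value.
        return F.mse_loss(first_mapping_feature, second_mapping_feature)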
5. The recognition model training method according to claim 2, wherein the calculating a cross-modal reconstruction loss value according to the first decoding result, the second decoding result, the image feature tag and the text feature tag comprises:
determining an image reconstruction value from the first decoding result;
determining a text reconstruction value from the second decoding result;
calculating a first sub-loss value according to the image reconstruction value and the image feature tag;
calculating a second sub-loss value according to the text reconstruction value and the text feature tag;
and calculating the cross-modal reconstruction loss value according to the first sub-loss value and the second sub-loss value.
6. A micro-expression recognition method based on an offline guest reception service, characterized by comprising:
acquiring an image to be identified;
and inputting the image to be identified into a recognition model trained by the recognition model training method according to any one of claims 1 to 5, and outputting a micro-expression recognition result.
7. A training device for a micro-expression recognition model based on an offline guest reception service, characterized in that the training device comprises:
the system comprises a first acquisition module, an initial recognition module and a second acquisition module, wherein the first acquisition module is used for acquiring a target image and a target text corresponding to the target image, the target image comprises an image feature tag and an image classification tag, the target text comprises a text feature tag and a text classification tag, the initial recognition module comprises an initial image encoder, an initial image decoder, an initial text encoder and an initial text decoder, the initial image decoder is used for decoding text encoding features output by the initial text encoder, and the initial text decoder is used for decoding image encoding features output by the initial image encoder;
and a training module, configured to perform supervised training on the initial recognition model by using the target image and the target text to obtain a trained recognition model.
8. A micro-expression recognition device based on an offline guest reception service, characterized in that the recognition device comprises:
a second acquisition module, configured to acquire an image to be identified;
and a use module, configured to input the image to be identified into a recognition model trained by the recognition model training method according to any one of claims 1 to 5, and output a micro-expression recognition result.
9. A terminal device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the training method for a micro-expression recognition model according to any one of claims 1 to 5 or the micro-expression recognition method according to claim 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method for a micro-expression recognition model according to any one of claims 1 to 5 or the micro-expression recognition method according to claim 6.
CN202311604155.1A 2023-11-27 2023-11-27 Micro-expression recognition method, recognition model training method, device, equipment and medium Pending CN117423154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311604155.1A CN117423154A (en) 2023-11-27 2023-11-27 Micro-expression recognition method, recognition model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311604155.1A CN117423154A (en) 2023-11-27 2023-11-27 Micro-expression recognition method, recognition model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117423154A true CN117423154A (en) 2024-01-19

Family

ID=89532695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311604155.1A Pending CN117423154A (en) 2023-11-27 2023-11-27 Micro-expression recognition method, recognition model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117423154A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974081A (en) * 2024-02-21 2024-05-03 社培科技(广东)有限公司 Simulation interview teaching method and system based on AI large model
CN118886402A (en) * 2024-07-24 2024-11-01 北京邮电大学 A cross-modal text generation method, device, electronic device and storage medium
CN118886402B (en) * 2024-07-24 2025-02-25 北京邮电大学 A cross-modal text generation method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US20190392587A1 (en) System for predicting articulated object feature location
CN113139628B (en) Sample image identification method, device and equipment and readable storage medium
CN113723288B (en) Service data processing method and device based on multi-mode hybrid model
CN115984302B (en) Multimodal remote sensing image processing method based on sparse mixed expert network pre-training
CN111914076A (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN115034315B (en) Service processing method and device based on artificial intelligence, computer equipment and medium
CN113239702A (en) Intention recognition method and device and electronic equipment
CN113781462A (en) Human body disability detection method, device, equipment and storage medium
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN117423154A (en) Micro-expression recognition method, recognition model training method, device, equipment and medium
CN117421631A (en) Financial violation detection method, device, equipment and medium based on multi-mode fusion
CN113378921A (en) Data screening method and device and electronic equipment
CN116824677B (en) Expression recognition method and device, electronic equipment and storage medium
US20240169541A1 (en) Amodal instance segmentation using diffusion models
CN116127925A (en) Text data enhancement method and device based on destruction processing of text
CN115358374A (en) Knowledge distillation-based model training method, device, equipment and storage medium
CN113989569A (en) Image processing method, image processing device, electronic equipment and storage medium
CN117152467B (en) Image recognition method, device, medium and electronic equipment
CN113591983B (en) Image recognition method and device
WO2024174583A1 (en) Model training method and apparatus, and device, storage medium and product
CN119323622A (en) Image generation method and device, electronic device and storage medium
CN117371530A (en) Case relation extraction method, device, equipment and medium based on reinforcement learning
CN119152259A (en) Document image classification method and device, electronic device and storage medium
CN119007297A (en) Sign language recognition method, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination