CN112800998A - Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA - Google Patents
- Publication number
- CN112800998A (application number CN202110159085.8A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- expression
- electroencephalogram
- vector
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06F2218/08—Feature extraction
- G06F2218/12—Classification; Matching
Abstract
The invention discloses a multi-modal emotion recognition method and system that fuse an attention mechanism with discriminative multiset canonical correlation analysis (DMCCA). The method comprises the following steps: extracting electroencephalogram signal features, peripheral physiological signal features and expression features from preprocessed electroencephalogram signals, peripheral physiological signals and facial expression videos, respectively; extracting discriminative electroencephalogram, peripheral physiological and expression emotion features with attention mechanism modules; obtaining an electroencephalogram-peripheral physiological-expression multi-modal emotion feature by applying the DMCCA method to the electroencephalogram, peripheral physiological and expression emotion features; and classifying the multi-modal emotion feature with a classifier. The method uses the attention mechanism to focus selectively on the emotion-discriminative features within each modality and, combined with DMCCA, makes full use of the correlation and complementarity among the emotion features of different modalities, so that the accuracy and robustness of emotion recognition can be effectively improved.
Description
Technical Field
The invention relates to the technical field of emotion recognition and artificial intelligence, and in particular to a multi-modal emotion recognition method and system that fuse an attention mechanism with discriminative multiset canonical correlation analysis (DMCCA).
Background
Human emotion is a psychological and physiological state that accompanies the process of human consciousness and plays an important role in interpersonal communication. With the continuous progress of technologies such as artificial intelligence, people pay increasing attention to more intelligent and humanized human-computer interaction (HCI). Expectations of machine intelligence keep rising: machines are expected to perceive, understand and even express emotion, to realize humanized human-computer interaction and to serve human beings better. Emotion recognition is a branch of affective computing and a fundamental, core technology for realizing human-computer emotional interaction; it has become a research hotspot in computer science, cognitive science, artificial intelligence and related fields, and has attracted wide attention from both academia and industry. For example, in clinical care, if the emotional state of a patient, especially a patient with an expression disorder, can be known, different care measures can be taken to improve the quality of care. In addition, there is growing interest in applications such as psychological and behavioral monitoring of patients with mental disorders and friendly human-machine interaction with emotional robots.
In the past, much emotion recognition research focused on recognizing human emotional states from a single modality, such as speech-based emotion recognition and facial-expression-based emotion recognition. The emotion information carried by speech or expression alone is incomplete and easily disturbed by external factors: facial expression recognition is sensitive to occlusion and illumination changes, while speech-based emotion recognition is affected by environmental noise and by voice differences between subjects. Moreover, people sometimes put on a smile or keep silent to hide their real emotions, so facial expressions and body postures can be deceptive, and speech-based methods fail entirely when a person is not speaking. Single-modality emotion recognition therefore has clear limitations. For these reasons, more and more researchers are turning to emotion recognition based on multi-modal information fusion, in the expectation that the complementarity between different modalities can be exploited to build a robust emotion recognition model and achieve higher recognition accuracy.
Currently, the more common information fusion strategies in multi-modal emotion recognition research are decision-level fusion and feature-level fusion. Decision-level fusion usually starts from the recognition result of each individual modality and then makes a decision according to predefined rules, such as the mean (Mean) rule, the sum (Sum) rule, the maximum (Max) rule or majority voting, to obtain the final recognition result. Decision-level fusion accounts for the different contributions of different modalities to emotion recognition, but it ignores the correlation between them; its performance depends not only on the recognition rate of each single modality but also on the decision-level fusion algorithm. Feature-level fusion combines the emotion features of several modalities into one fused feature vector and thereby exploits the complementarity of different modalities; however, how to weight the emotion features of different modalities so as to reflect their different roles in emotion classification is the key to multi-modal feature fusion and remains an open and challenging problem.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the low accuracy and poor robustness of single-modality emotion recognition and at the shortcomings of existing multi-modal emotion feature fusion methods, the invention provides a multi-modal emotion recognition method and system that fuse an attention mechanism with discriminative multiset canonical correlation analysis (DMCCA).
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
a multi-modal emotion recognition method fusing an attention mechanism and DMCCA comprises the following steps:
(1) extracting electroencephalogram signal feature vectors and expression feature vectors from the preprocessed electroencephalogram signals and facial expression videos by using respective trained neural network models, and extracting peripheral physiological signal feature vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and statistical features thereof;
(2) mapping the electroencephalogram signal feature vector, the peripheral physiological signal feature vector and the expression feature vector into a plurality of groups of feature vectors through linear transformation matrixes respectively, determining importance weights of different feature vector groups by using an attention mechanism module respectively, and forming an electroencephalogram emotion feature vector, a peripheral physiological emotion feature vector and an expression emotion feature vector which have the same dimension and are discriminating through weighting fusion;
(3) determining a projection matrix for each emotion feature vector by applying discriminative multiset canonical correlation analysis (DMCCA) to the electroencephalogram, peripheral physiological and expression emotion feature vectors, maximizing the correlation among the emotion features of different modalities for samples of the same class; projecting each emotion feature vector into a common subspace and adding the projections to obtain the fused electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector;
(4) classifying the multi-modal emotion feature vector with a classifier to obtain the emotion category.
Further, the specific steps of extracting discriminating electroencephalogram emotional characteristics, peripheral physiological emotional characteristics and expression emotional characteristics by using an attention mechanism module in the step (2) comprise:
(2.1) representing the electroencephalogram signal features extracted in step (1) in matrix form as F^{(1)}, and mapping them through a linear transformation matrix W^{(1)} into M_1 groups of feature vectors, 4 ≤ M_1 ≤ 16, where each group feature vector has dimension N, 16 ≤ N ≤ 64; let E^{(1)} = [e_1^{(1)}, e_2^{(1)}, ..., e_{M_1}^{(1)}]^T, then the linear transformation is expressed as:

E^{(1)} = (F^{(1)})^T W^{(1)}

wherein the superscript (1) denotes the electroencephalogram modality and T denotes transposition;

determining the importance weights of the different feature vector groups with a first attention mechanism module, and forming the discriminative electroencephalogram emotion feature vector by weighted fusion, wherein the weight α_r^{(1)} of the r-th group of electroencephalogram signal feature vectors and the electroencephalogram emotion feature vector x^{(1)} are expressed as:

α_r^{(1)} = exp((u^{(1)})^T e_r^{(1)}) / Σ_{k=1}^{M_1} exp((u^{(1)})^T e_k^{(1)}),    x^{(1)} = Σ_{r=1}^{M_1} α_r^{(1)} e_r^{(1)}

wherein r = 1, 2, ..., M_1, e_r^{(1)} denotes the r-th group of electroencephalogram signal feature vectors, u^{(1)} is a trainable linear transformation parameter vector, and exp(·) denotes the exponential function with the natural constant e as its base;
(2.2) representing the peripheral physiological signal features extracted in step (1) in matrix form as F^{(2)}, and mapping them through a linear transformation matrix W^{(2)} into M_2 groups of feature vectors, 4 ≤ M_2 ≤ 16; let E^{(2)} = [e_1^{(2)}, e_2^{(2)}, ..., e_{M_2}^{(2)}]^T, then the linear transformation is expressed as:

E^{(2)} = (F^{(2)})^T W^{(2)}

wherein the superscript (2) denotes the peripheral physiological modality;

determining the importance weights of the different feature vector groups with a second attention mechanism module, and forming the discriminative peripheral physiological emotion feature vector by weighted fusion, wherein the weight α_s^{(2)} of the s-th group of peripheral physiological signal feature vectors and the peripheral physiological emotion feature vector x^{(2)} are expressed as:

α_s^{(2)} = exp((u^{(2)})^T e_s^{(2)}) / Σ_{k=1}^{M_2} exp((u^{(2)})^T e_k^{(2)}),    x^{(2)} = Σ_{s=1}^{M_2} α_s^{(2)} e_s^{(2)}

wherein s = 1, 2, ..., M_2, e_s^{(2)} denotes the s-th group of peripheral physiological signal feature vectors, and u^{(2)} is a trainable linear transformation parameter vector;
(2.3) representing the expression features extracted in step (1) in matrix form as F^{(3)}, and mapping them through a linear transformation matrix W^{(3)} into M_3 groups of feature vectors, 4 ≤ M_3 ≤ 16; let E^{(3)} = [e_1^{(3)}, e_2^{(3)}, ..., e_{M_3}^{(3)}]^T, then the linear transformation is expressed as:

E^{(3)} = (F^{(3)})^T W^{(3)}

wherein the superscript (3) denotes the expression modality;

determining the importance weights of the different feature vector groups with a third attention mechanism module, and forming the discriminative expression emotion feature vector by weighted fusion, wherein the weight α_t^{(3)} of the t-th group of expression feature vectors and the expression emotion feature vector x^{(3)} are expressed as:

α_t^{(3)} = exp((u^{(3)})^T e_t^{(3)}) / Σ_{k=1}^{M_3} exp((u^{(3)})^T e_k^{(3)}),    x^{(3)} = Σ_{t=1}^{M_3} α_t^{(3)} e_t^{(3)}

wherein t = 1, 2, ..., M_3, e_t^{(3)} denotes the t-th group of expression feature vectors, and u^{(3)} is a trainable linear transformation parameter vector.
Further, the step (3) specifically comprises the following sub-steps:
(3.1) acquiring the DMCCA projection matrices Ω, Φ and Ψ obtained through training, corresponding respectively to the electroencephalogram, peripheral physiological and expression emotion features, each of size N × d with 32 ≤ d ≤ 128;
(3.2) using the projection matrices Ω, Φ and Ψ respectively to project the electroencephalogram emotion feature vector x^{(1)}, the peripheral physiological emotion feature vector x^{(2)} and the expression emotion feature vector x^{(3)} extracted in step (2) into a d-dimensional common subspace, where the projection of the electroencephalogram emotion feature vector x^{(1)} into the d-dimensional common subspace is Ω^T x^{(1)}, the projection of the peripheral physiological emotion feature vector x^{(2)} is Φ^T x^{(2)}, and the projection of the expression emotion feature vector x^{(3)} is Ψ^T x^{(3)};

(3.3) fusing Ω^T x^{(1)}, Φ^T x^{(2)} and Ψ^T x^{(3)} by addition to obtain the electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector Ω^T x^{(1)} + Φ^T x^{(2)} + Ψ^T x^{(3)}.
Further, the projection matrices Ω, Φ, and Ψ in step (3.1) are obtained by training in the following steps:
(3.1.1) extracting training samples of every emotion category from the training sample set and generating 3 groups of emotion feature vectors X^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_M^{(i)}], where x_m^{(i)} is an N-dimensional feature vector, M is the number of training samples, i = 1, 2, 3 and m = 1, 2, ..., M; let i = 1 denote the electroencephalogram modality, i = 2 the peripheral physiological modality and i = 3 the expression modality, so that x_m^{(1)} denotes an electroencephalogram emotion feature vector, x_m^{(2)} a peripheral physiological emotion feature vector and x_m^{(3)} an expression emotion feature vector;
(3.1.2) calculating the mean of the column vectors of X^{(i)} and performing a centering operation on X^{(i)};
(3.1.3) based on the idea of discriminative multiset canonical correlation analysis (DMCCA), solving for a group of projection matrices Ω, Φ and Ψ such that the linear correlation of same-class samples in the common projection space is maximized while the inter-class dispersion of the data within each modality is maximized and the intra-class dispersion of the data within each modality is minimized; letting w^{(i)} denote a projection vector of X^{(i)}, i = 1, 2, 3, the objective function of DMCCA is:

ρ = [ Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} ] / [ Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} ]

wherein S_w^{(i)} denotes the intra-class dispersion matrix of X^{(i)}, S_b^{(i)} denotes the inter-class dispersion matrix of X^{(i)}, C^{(ij)} = cov(X^{(i)}, X^{(j)}) denotes the between-set covariance of the same-class samples of X^{(i)} and X^{(j)}, cov(·,·) denotes the covariance, and i, j ∈ {1, 2, 3};

constructing and solving the following optimization model to obtain the projection matrices Ω, Φ and Ψ:

max_{w^{(1)}, w^{(2)}, w^{(3)}}  Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)}
s.t.  Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} = 1
Further, solving the above optimization model of the DMCCA objective function with the Lagrange multiplier method gives the following Lagrange function:

L(w^{(1)}, w^{(2)}, w^{(3)}) = Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} − λ ( Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} − 1 )

wherein λ is the Lagrange multiplier; the partial derivatives of L(w^{(1)}, w^{(2)}, w^{(3)}) with respect to w^{(1)}, w^{(2)} and w^{(3)} are then computed and set to zero, i.e.

∂L/∂w^{(i)} = Σ_{j=1, j≠i}^{3} C^{(ij)} w^{(j)} + S_b^{(i)} w^{(i)} − λ S_w^{(i)} w^{(i)} = 0,  i = 1, 2, 3,

which, after further simplification, gives the following generalized eigenvalue problem:

[ S_b^{(1)}   C^{(12)}   C^{(13)} ] [ w^{(1)} ]       [ S_w^{(1)}     0           0      ] [ w^{(1)} ]
[ C^{(21)}   S_b^{(2)}   C^{(23)} ] [ w^{(2)} ]  = λ  [    0       S_w^{(2)}      0      ] [ w^{(2)} ]
[ C^{(31)}   C^{(32)}   S_b^{(3)} ] [ w^{(3)} ]       [    0          0       S_w^{(3)}  ] [ w^{(3)} ]

By solving this generalized eigenvalue problem and selecting the eigenvectors corresponding to the first d largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d, the projection matrices Ω = [w_1^{(1)}, w_2^{(1)}, ..., w_d^{(1)}], Φ = [w_1^{(2)}, w_2^{(2)}, ..., w_d^{(2)}] and Ψ = [w_1^{(3)}, w_2^{(3)}, ..., w_d^{(3)}] are obtained.
based on the same inventive concept, the multi-modal emotion recognition system integrating the attention mechanism and the DMCCA, provided by the invention, comprises:
the characteristic primary extraction module is used for respectively extracting electroencephalogram signal characteristic vectors and expression characteristic vectors from the preprocessed electroencephalogram signals and facial expression videos by using respective trained neural network models, and extracting peripheral physiological signal characteristic vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and statistical characteristics thereof;
the characteristic identification enhancement module is used for mapping the electroencephalogram signal characteristic vector, the peripheral physiological signal characteristic vector and the expression characteristic vector into a plurality of groups of characteristic vectors through linear transformation matrixes respectively, determining importance weights of different characteristic vector groups respectively by using the attention mechanism module, and forming an electroencephalogram emotion characteristic vector, a peripheral physiological emotion characteristic vector and an expression emotion characteristic vector which have the same dimension and have identification power through weighting fusion;
the projection matrix determining module is used for determining the projection matrix of each emotion feature vector with the discriminative multiset canonical correlation analysis (DMCCA) method, by maximizing the correlation among the emotion features of different modalities for samples of the same class;
the feature fusion module is used for projecting the electroencephalogram emotion feature vector, the peripheral physiological emotion feature vector and the expression emotion feature vector to a public subspace through respective corresponding projection matrixes, and obtaining an electroencephalogram-peripheral physiological-expression multi-mode emotion feature vector after addition and fusion;
and the classification and identification module is used for classifying and identifying the multi-mode emotion feature vectors by using the classifier to obtain the emotion types.
Based on the same inventive concept, the multi-modal emotion recognition system fusing the attention mechanism and the DMCCA provided by the invention comprises at least one computing device, wherein the computing device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and the computer program realizes the multi-modal emotion recognition method fusing the attention mechanism and the DMCCA when being loaded to the processor.
Advantageous effects: compared with the prior art, the invention has the following technical effects:
(1) according to the invention, an attention mechanism is adopted to selectively focus on the significant characteristics playing a key role in emotion recognition in each mode, the characteristics with emotion identification capability are adaptively learned, and the accuracy and robustness of multi-mode emotion recognition can be effectively improved.
(2) The invention adopts discriminative multiset canonical correlation analysis and introduces the class information of the samples. By maximizing the correlation among the emotion features of different modalities for same-class samples while maximizing the inter-class dispersion and minimizing the intra-class dispersion of the emotion features within each modality, it can mine the correlation among different modalities, makes full use of the correlation and complementarity among the electroencephalogram, peripheral physiological and expression emotion features, eliminates some ineffective redundant features, and can effectively improve the discriminative power and robustness of the feature representation.
(3) Compared with a single-mode emotion recognition method, the method comprehensively utilizes various modal information in the emotion expression process, can combine the characteristics of different modes and fully utilize the complementarity of the characteristics to mine multi-mode emotion characteristics, and can effectively improve the accuracy and robustness of emotion recognition.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
fig. 2 is a block diagram of an embodiment of the present invention.
Detailed Description
For a more detailed understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings and specific examples.
As shown in fig. 1 and fig. 2, a multi-modal emotion recognition method combining an attention mechanism and a DMCCA provided by an embodiment of the present invention mainly includes the following steps:
(1) extracting electroencephalogram signal feature vectors and expression feature vectors from the preprocessed electroencephalogram signals and facial expression videos by using the trained neural network models respectively, and extracting peripheral physiological signal feature vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and statistical features thereof.
In this embodiment, the DEAP (Database for Emotion Analysis using Physiological Signals) emotion database is used; in practice, other emotion databases containing electroencephalogram signals, peripheral physiological signals and facial expression videos may also be used. The DEAP database is a published multimodal emotion database collected by Koelstra et al. at Queen Mary University of London. It contains the physiological signals of 32 subjects recorded while they watched 40 one-minute music video clips of different types used as evoked stimuli; for the first 22 subjects, facial expression videos were also recorded while they watched the music video clips. Each subject completed 40 trials and filled in a Self-Assessment Manikin (SAM) questionnaire immediately after each trial, giving 40 self-assessments in total. The SAM questionnaire rates the subject's Arousal, Valence, Dominance and Liking for the video. Arousal describes the degree of excitation of the subject's state, ranging gradually from a calm state to an excited state and rated on a scale of 1 to 9; Valence, also called pleasantness, describes how pleasant the mood is, ranging gradually from a Negative state to a Positive state and likewise rated from 1 to 9; Dominance ranges from submissive ("without control") to dominant ("in control"); Liking indicates the subject's personal preference for the video. The score selected by each subject after each trial represents the emotional state and is used for the subsequent classification and analysis of emotion categories.
In the DEAP database, the physiological signals were recorded at 512 Hz and downsampled to 128 Hz (the preprocessed, downsampled data are provided officially). The physiological signal matrix of each subject is 40 × 40 × 8064 (40 music video trials, 40 physiological signal channels, 8064 sampling points). Of the 40 physiological signal channels, the first 32 channels are electroencephalogram signals and the last 8 channels are peripheral physiological signals. The 8064 sampling points correspond to 63 s at the 128 Hz sampling rate, and each trial segment includes a 3 s baseline recorded before the stimulus.
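As an illustration of this data layout, the following sketch splits one subject's 40 × 40 × 8064 array into electroencephalogram and peripheral channels and removes the 3 s pre-trial baseline; the file name and NumPy array format are assumptions made for the example, not part of the DEAP release or of the invention.

```python
import numpy as np

# Hypothetical file: one subject's preprocessed DEAP recording stored as a
# (trials, channels, samples) = (40, 40, 8064) array.
data = np.load("s01_data.npy")

FS = 128                      # sampling rate after downsampling
BASELINE = 3 * FS             # 3 s pre-trial baseline = 384 samples

eeg = data[:, :32, BASELINE:]         # first 32 channels: electroencephalogram signals
peripheral = data[:, 32:, BASELINE:]  # last 8 channels: peripheral physiological signals

print(eeg.shape, peripheral.shape)    # (40, 32, 7680) (40, 8, 7680), i.e. 60 s per trial
```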
In this embodiment of the invention, 880 samples containing electroencephalogram signals, peripheral physiological signals and facial expressions are used as training samples, and classification is performed separately on the four dimensions of arousal, valence, dominance and liking.
The neural network model for extracting the electroencephalogram signal features can be a Long Short-Term Memory (LSTM) network or a Convolutional Neural Network (CNN), and the neural network model for extracting the expression features can be a 3D convolutional neural network, a CNN-LSTM, or the like. In this embodiment, a trained Convolutional Neural Network (CNN) model performs feature extraction on the preprocessed electroencephalogram signals to obtain a 256-dimensional electroencephalogram signal feature vector; a 128-dimensional peripheral physiological signal feature vector is extracted from the preprocessed peripheral physiological signals, such as electrocardiogram, respiration, electrooculogram and electromyogram signals, by computing Low-Level Descriptors (LLD) of the signal waveforms and their statistics (including mean, standard deviation, power spectrum, median, maximum and minimum); and a 256-dimensional expression feature vector is extracted from the preprocessed facial expression video with a trained CNN-LSTM model.
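A minimal sketch, under stated assumptions, of the kind of waveform-descriptor statistics mentioned above for the peripheral physiological signals; the exact descriptor set and the resulting 128-dimensional layout are not detailed here, so the statistics below are illustrative rather than the invention's precise feature definition.

```python
import numpy as np

def peripheral_features(trial: np.ndarray) -> np.ndarray:
    """trial: (channels, samples) array of peripheral physiological signals.
    Returns per-channel waveform-descriptor statistics as one flat vector."""
    feats = []
    for ch in trial:
        diff = np.diff(ch)                    # first difference as a simple low-level descriptor
        feats += [
            ch.mean(), ch.std(), np.median(ch), ch.max(), ch.min(),
            np.mean(ch ** 2),                 # crude average power
            diff.mean(), diff.std(),          # statistics of the descriptor itself
        ]
    return np.asarray(feats, dtype=np.float32)

x = peripheral_features(np.random.randn(8, 7680))   # 8 channels, 60 s at 128 Hz
print(x.shape)   # (64,); the embodiment's 128-dim vector would use a richer descriptor set
```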
(2) Extracting the discriminative electroencephalogram emotion feature vector, peripheral physiological emotion feature vector and expression emotion feature vector from the electroencephalogram signal feature vector, the peripheral physiological signal feature vector and the expression feature vector, respectively, with attention mechanism modules.
(3) Obtaining the electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector from the electroencephalogram, peripheral physiological and expression emotion feature vectors with the discriminative multiset canonical correlation analysis (DMCCA) method.
(4) Classifying the multi-modal emotion feature vector with a classifier to obtain the emotion category.
Further, the specific steps of extracting discriminating electroencephalogram emotional characteristics, peripheral physiological emotional characteristics and expression emotional characteristics by using an attention mechanism module in the step (2) comprise:
(2.1) representing the electroencephalogram signal features extracted in step (1) in matrix form as F^{(1)}, and mapping them through a linear transformation matrix W^{(1)} into M_1 groups of feature vectors, 4 ≤ M_1 ≤ 16, where each group feature vector has dimension N, 16 ≤ N ≤ 64; let E^{(1)} = [e_1^{(1)}, e_2^{(1)}, ..., e_{M_1}^{(1)}]^T, then the linear transformation is expressed as:

E^{(1)} = (F^{(1)})^T W^{(1)}

wherein the superscript (1) denotes the electroencephalogram modality and T denotes transposition.

The importance weights of the different feature vector groups are determined with a first attention mechanism module, and the discriminative electroencephalogram emotion feature vector is formed by weighted fusion, wherein the weight α_r^{(1)} of the r-th group of electroencephalogram signal feature vectors and the electroencephalogram emotion feature vector x^{(1)} are expressed as:

α_r^{(1)} = exp((u^{(1)})^T e_r^{(1)}) / Σ_{k=1}^{M_1} exp((u^{(1)})^T e_k^{(1)}),    x^{(1)} = Σ_{r=1}^{M_1} α_r^{(1)} e_r^{(1)}

wherein r = 1, 2, ..., M_1, e_r^{(1)} denotes the r-th group of electroencephalogram signal feature vectors, u^{(1)} is a trainable linear transformation parameter vector, and exp(·) denotes the exponential function with the natural constant e as its base. In this embodiment, M_1 = 8 and N = 32.
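With the embodiment's parameters (a 256-dimensional electroencephalogram feature vector, M_1 = 8 groups of dimension N = 32), the attention pooling of step (2.1) can be sketched as below; reshaping the flat feature vector into the matrix F^{(1)} and the shapes assumed for W^{(1)} and u^{(1)} are illustrative choices consistent with 8 × 32 = 256, not dimensions fixed by the text.

```python
import numpy as np

M1, N = 8, 32   # number of feature-vector groups and group dimension in this embodiment

def eeg_attention_pool(f, W, u):
    """f: 256-dim EEG feature vector; W: assumed (32, 32) linear transformation matrix;
    u: assumed 32-dim trainable attention parameter vector."""
    F = f.reshape(-1, M1)            # matrix form F^(1); the reshape layout is an assumption
    E = F.T @ W                      # E^(1) = (F^(1))^T W^(1), one N-dim vector per group
    scores = E @ u                   # one relevance score per group
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()             # softmax importance weights alpha_r^(1)
    return alpha @ E                 # x^(1): weighted fusion of the group vectors

x1 = eeg_attention_pool(np.random.randn(256), np.random.randn(N, N), np.random.randn(N))
print(x1.shape)                      # (32,)
```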
To train the parameters of the linear transformation matrix W^{(1)}, a softmax classifier is connected after the first attention mechanism module: the electroencephalogram emotion feature vector x^{(1)} output by the first attention mechanism module is connected to the C output nodes of the softmax classifier, which, after the softmax function, output a probability distribution vector ŷ^{(1)} over the emotion classes c ∈ [1, C], where C is the number of emotion categories.
Further, the parameters of the linear transformation matrix W^{(1)} are trained with the cross-entropy loss function shown below:

Loss^{(1)} = − (1/M) Σ_{m=1}^{M} Σ_{c=1}^{C} y_{m,c}^{(1)} log ŷ_{m,c}^{(1)}

wherein x^{(1)} is the 32-dimensional electroencephalogram emotion feature vector; ŷ^{(1)} is the probability distribution vector of the emotion classes predicted by the softmax classification model; y_{m,c}^{(1)} denotes the true emotion category label of the m-th electroencephalogram sample: with one-hot coding, y_{m,c}^{(1)} = 1 if the true label of the m-th electroencephalogram sample is c, and y_{m,c}^{(1)} = 0 otherwise; ŷ_{m,c}^{(1)} denotes the probability with which the softmax classification model predicts the m-th electroencephalogram sample as class c; Loss^{(1)} denotes the loss function used to train the linear transformation matrix W^{(1)}; in this embodiment, C = 2 and M = 880.
Iterative training is carried out continuously with the error back-propagation algorithm until the model parameters reach their optimal values; the electroencephalogram emotion feature vector x^{(1)} can then be extracted from the electroencephalogram signal of a newly input test sample.
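The training just described (attention module followed by a softmax classifier, cross-entropy loss, error back-propagation) might be prototyped as in the following PyTorch sketch; the module structure, optimizer and random stand-in data are illustrative assumptions, with layer sizes following the embodiment (M_1 = 8, N = 32, C = 2, M = 880).

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Linear transformation into M groups of N-dim vectors plus attention pooling."""
    def __init__(self, in_dim=256, m=8, n=32, num_classes=2):
        super().__init__()
        self.m, self.n = m, n
        self.W = nn.Linear(in_dim // m, n, bias=False)   # linear transformation matrix W
        self.u = nn.Parameter(torch.randn(n))            # trainable attention vector u
        self.classifier = nn.Linear(n, num_classes)      # softmax classifier head

    def forward(self, f):                                # f: (batch, 256)
        F = f.view(f.size(0), self.m, -1)                # group the flat feature vector
        E = self.W(F)                                    # (batch, M, N) group feature vectors
        alpha = torch.softmax(E @ self.u, dim=1)         # (batch, M) importance weights
        x = (alpha.unsqueeze(-1) * E).sum(dim=1)         # (batch, N) emotion feature vector
        return x, self.classifier(x)                     # feature and class logits

model = AttentionPool()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                          # cross-entropy as in Loss^(1)

feats = torch.randn(880, 256)                            # stand-in for the 880 EEG samples
labels = torch.randint(0, 2, (880,))                     # stand-in binary emotion labels
for _ in range(10):                                      # iterative training by back-propagation
    _, logits = model(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```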
(2.2) representing the peripheral physiological signal features extracted in step (1) in matrix form as F^{(2)}, and mapping them through a linear transformation matrix W^{(2)} into M_2 groups of feature vectors, 4 ≤ M_2 ≤ 16; let E^{(2)} = [e_1^{(2)}, e_2^{(2)}, ..., e_{M_2}^{(2)}]^T, then the linear transformation is expressed as:

E^{(2)} = (F^{(2)})^T W^{(2)}

wherein the superscript (2) denotes the peripheral physiological modality.

The importance weights of the different feature vector groups are determined with a second attention mechanism module, and the discriminative peripheral physiological emotion feature vector is formed by weighted fusion, wherein the weight α_s^{(2)} of the s-th group of peripheral physiological signal feature vectors and the peripheral physiological emotion feature vector x^{(2)} are expressed as:

α_s^{(2)} = exp((u^{(2)})^T e_s^{(2)}) / Σ_{k=1}^{M_2} exp((u^{(2)})^T e_k^{(2)}),    x^{(2)} = Σ_{s=1}^{M_2} α_s^{(2)} e_s^{(2)}

wherein s = 1, 2, ..., M_2, e_s^{(2)} denotes the s-th group of peripheral physiological signal feature vectors, and u^{(2)} is a trainable linear transformation parameter vector. In this embodiment, M_2 = 4.
To train the parameters of the linear transformation matrix W^{(2)}, a softmax classifier is connected after the second attention mechanism module: the peripheral physiological emotion feature vector x^{(2)} output by the second attention mechanism module is connected to the C output nodes of the softmax classifier, which output a probability distribution vector ŷ^{(2)} after the softmax function.

Further, the parameters of the linear transformation matrix W^{(2)} are trained with the cross-entropy loss function shown below:

Loss^{(2)} = − (1/M) Σ_{m=1}^{M} Σ_{c=1}^{C} y_{m,c}^{(2)} log ŷ_{m,c}^{(2)}

wherein x^{(2)} is the 32-dimensional peripheral physiological emotion feature vector; ŷ^{(2)} is the probability distribution vector of the emotion classes predicted by the softmax classification model; y_{m,c}^{(2)} denotes the true emotion category label of the m-th peripheral physiological signal sample: with one-hot coding, y_{m,c}^{(2)} = 1 if the true label of the m-th peripheral physiological signal sample is c, and y_{m,c}^{(2)} = 0 otherwise; ŷ_{m,c}^{(2)} denotes the probability with which the softmax classification model predicts the m-th peripheral physiological signal sample as class c; Loss^{(2)} denotes the loss function used to train the linear transformation matrix W^{(2)}; in this embodiment, C = 2 and M = 880.
Iterative training is carried out continuously with the error back-propagation algorithm until the model parameters reach their optimal values; the peripheral physiological emotion feature vector x^{(2)} can then be extracted from the peripheral physiological signals of a newly input test sample.
(2.3) representing the expression features extracted in step (1) in matrix form as F^{(3)}, and mapping them through a linear transformation matrix W^{(3)} into M_3 groups of feature vectors, 4 ≤ M_3 ≤ 16; let E^{(3)} = [e_1^{(3)}, e_2^{(3)}, ..., e_{M_3}^{(3)}]^T, then the linear transformation is expressed as:

E^{(3)} = (F^{(3)})^T W^{(3)}

wherein the superscript (3) denotes the expression modality.

The importance weights of the different feature vector groups are determined with a third attention mechanism module, and the discriminative expression emotion feature vector is formed by weighted fusion, wherein the weight α_t^{(3)} of the t-th group of expression feature vectors and the expression emotion feature vector x^{(3)} are expressed as:

α_t^{(3)} = exp((u^{(3)})^T e_t^{(3)}) / Σ_{k=1}^{M_3} exp((u^{(3)})^T e_k^{(3)}),    x^{(3)} = Σ_{t=1}^{M_3} α_t^{(3)} e_t^{(3)}

wherein t = 1, 2, ..., M_3, e_t^{(3)} denotes the t-th group of expression feature vectors, and u^{(3)} is a trainable linear transformation parameter vector. In this embodiment, M_3 = 8.
To train the parameters of the linear transformation matrix W^{(3)}, a softmax classifier is connected after the third attention mechanism module: the expression emotion feature vector x^{(3)} output by the third attention mechanism module is connected to the C output nodes of the softmax classifier, which output a probability distribution vector ŷ^{(3)} after the softmax function.

Further, the parameters of the linear transformation matrix W^{(3)} are trained with the cross-entropy loss function shown below:

Loss^{(3)} = − (1/M) Σ_{m=1}^{M} Σ_{c=1}^{C} y_{m,c}^{(3)} log ŷ_{m,c}^{(3)}

wherein x^{(3)} is the 32-dimensional expression emotion feature vector; ŷ^{(3)} is the probability distribution vector of the emotion classes predicted by the softmax classification model; y_{m,c}^{(3)} denotes the true emotion category label of the m-th expression video sample: with one-hot coding, y_{m,c}^{(3)} = 1 if the true label of the m-th expression video sample is c, and y_{m,c}^{(3)} = 0 otherwise; ŷ_{m,c}^{(3)} denotes the probability with which the softmax classification model predicts the m-th expression video sample as class c; Loss^{(3)} denotes the loss function used to train the linear transformation matrix W^{(3)}; in this embodiment, C = 2 and M = 880.
Iterative training is carried out continuously with the error back-propagation algorithm until the model parameters reach their optimal values; the expression emotion feature vector x^{(3)} can then be extracted from the expression video of a newly input test sample.
Further, the step (3) specifically comprises the following sub-steps:
(3.1) acquiring the DMCCA projection matrices Ω, Φ and Ψ obtained through training, corresponding respectively to the electroencephalogram, peripheral physiological and expression emotion features, each of size N × d with 32 ≤ d ≤ 128. In the present embodiment, d = 40.
(3.2) using the projection matrices Ω, Φ and Ψ respectively to project the electroencephalogram emotion feature vector x^{(1)}, the peripheral physiological emotion feature vector x^{(2)} and the expression emotion feature vector x^{(3)} extracted in step (2) into a d-dimensional common subspace, where the projection of the electroencephalogram emotion feature vector x^{(1)} into the d-dimensional common subspace is Ω^T x^{(1)}, the projection of the peripheral physiological emotion feature vector x^{(2)} is Φ^T x^{(2)}, and the projection of the expression emotion feature vector x^{(3)} is Ψ^T x^{(3)}.

(3.3) fusing Ω^T x^{(1)}, Φ^T x^{(2)} and Ψ^T x^{(3)} by addition to obtain the electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector Ω^T x^{(1)} + Φ^T x^{(2)} + Ψ^T x^{(3)}.
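Given trained projection matrices, steps (3.2)-(3.3) applied to a whole sample set, followed by the classifier of step (4), might look like the sketch below; the SVM is only one possible classifier choice, since the text does not fix it, and the random matrices are stand-ins for trained quantities.

```python
import numpy as np
from sklearn.svm import SVC

d, N, M = 40, 32, 880                           # embodiment values: subspace dim, feature dim, samples

# stand-ins for trained projection matrices and per-modality emotion features
Omega, Phi, Psi = (np.random.randn(N, d) for _ in range(3))
X1, X2, X3 = (np.random.randn(N, M) for _ in range(3))   # columns are samples
y = np.random.randint(0, 2, M)                            # stand-in binary emotion labels

# steps (3.2)-(3.3): project every sample into the d-dim common subspace and fuse additively
Z = Omega.T @ X1 + Phi.T @ X2 + Psi.T @ X3                # (d, M) fused multi-modal features

# step (4): train/apply an off-the-shelf classifier on the fused features
clf = SVC(kernel="rbf").fit(Z.T, y)
print(clf.predict(Z[:, :5].T))
```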
Further, the projection matrices Ω, Φ, and Ψ in step (3.1) are obtained by training in the following steps:
(3.1.1) generating 3 groups of emotion feature vectors X^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_M^{(i)}] from the samples of the C emotion categories in the training sample set, where x_m^{(i)} is an N-dimensional feature vector and M is the number of training samples (in this embodiment the data volume of the sample set is not large, so all samples participate in the calculation; for a large sample set, samples of each emotion category can be drawn at random), i = 1, 2, 3, m = 1, 2, ..., M; let i = 1 denote the electroencephalogram modality, i = 2 the peripheral physiological modality and i = 3 the expression modality, so that x_m^{(1)} denotes an electroencephalogram emotion feature vector, x_m^{(2)} a peripheral physiological emotion feature vector and x_m^{(3)} an expression emotion feature vector; in this embodiment, C = 2, M = 880 and N = 32.
(3.1.2) calculating the mean vector of the columns of X^{(i)} and centering X^{(i)} by subtracting this mean from every column; for convenience of description, the centered matrix is still denoted X^{(i)}, i.e. all X^{(i)} are assumed below to have been centered.
(3.1.3) The idea of discriminative multiset canonical correlation analysis (DMCCA) is to find a group of projection matrices Ω, Φ and Ψ that maximize the linear correlation of same-class samples in the common projection space while also maximizing the inter-class dispersion of the data within each modality and minimizing the intra-class dispersion of the data within each modality. Let w^{(i)} denote a projection vector of X^{(i)}, i = 1, 2, 3; the objective function of DMCCA is:

ρ = [ Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} ] / [ Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} ]

wherein S_w^{(i)} denotes the intra-class dispersion matrix of X^{(i)}, S_b^{(i)} denotes the inter-class dispersion matrix of X^{(i)}, C^{(ij)} = cov(X^{(i)}, X^{(j)}) denotes the between-set covariance of the same-class samples of X^{(i)} and X^{(j)}, cov(·,·) denotes the covariance, and i, j ∈ {1, 2, 3}.

The solution of the DMCCA objective function may be represented as the following optimization model:

max_{w^{(1)}, w^{(2)}, w^{(3)}}  Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)}
s.t.  Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} = 1
(3.1.4) solving the optimization model of the DMCCA objective function with the Lagrange multiplier method yields the following Lagrange function:

L(w^{(1)}, w^{(2)}, w^{(3)}) = Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} − λ ( Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} − 1 )

wherein λ is the Lagrange multiplier; the partial derivatives of L(w^{(1)}, w^{(2)}, w^{(3)}) with respect to w^{(1)}, w^{(2)} and w^{(3)} are then computed and set to zero, i.e.

∂L/∂w^{(i)} = Σ_{j=1, j≠i}^{3} C^{(ij)} w^{(j)} + S_b^{(i)} w^{(i)} − λ S_w^{(i)} w^{(i)} = 0,  i = 1, 2, 3,

which, after further simplification, gives the following generalized eigenvalue problem:

[ S_b^{(1)}   C^{(12)}   C^{(13)} ] [ w^{(1)} ]       [ S_w^{(1)}     0           0      ] [ w^{(1)} ]
[ C^{(21)}   S_b^{(2)}   C^{(23)} ] [ w^{(2)} ]  = λ  [    0       S_w^{(2)}      0      ] [ w^{(2)} ]
[ C^{(31)}   C^{(32)}   S_b^{(3)} ] [ w^{(3)} ]       [    0          0       S_w^{(3)}  ] [ w^{(3)} ]

By solving this generalized eigenvalue problem and selecting the eigenvectors corresponding to the first d largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d, the projection matrices Ω = [w_1^{(1)}, w_2^{(1)}, ..., w_d^{(1)}], Φ = [w_1^{(2)}, w_2^{(2)}, ..., w_d^{(2)}] and Ψ = [w_1^{(3)}, w_2^{(3)}, ..., w_d^{(3)}] are obtained. In the present embodiment, d = 40.
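A prototype of the DMCCA training in steps (3.1.1)-(3.1.4) can assemble the block matrices of the generalized eigenvalue problem above and solve it with SciPy; the scatter and covariance definitions below follow the reconstruction given above and simplify the same-class cross-covariance to the cross-covariance of correspondingly ordered, centered samples, so the sketch is illustrative rather than the patent's exact computation.

```python
import numpy as np
from scipy.linalg import eig

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter of a column-sample matrix X."""
    mu = X.mean(axis=1, keepdims=True)
    Sw = np.zeros((X.shape[0], X.shape[0]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - mc) @ (Xc - mc).T
        Sb += Xc.shape[1] * (mc - mu) @ (mc - mu).T
    return Sw, Sb

def dmcca(X_list, y, d=40, reg=1e-6):
    """X_list: three centered (N, M) matrices of EEG, peripheral and expression emotion features."""
    N = X_list[0].shape[0]
    A = np.zeros((3 * N, 3 * N))            # left block matrix: S_b on diagonal, C^(ij) off-diagonal
    B = np.zeros_like(A)                    # right block-diagonal matrix: S_w blocks
    for i, Xi in enumerate(X_list):
        Sw, Sb = scatter_matrices(Xi, y)
        A[i*N:(i+1)*N, i*N:(i+1)*N] = Sb
        B[i*N:(i+1)*N, i*N:(i+1)*N] = Sw + reg * np.eye(N)   # small ridge for numerical stability
        for j, Xj in enumerate(X_list):
            if i != j:                      # simplified same-class between-set covariance C^(ij)
                A[i*N:(i+1)*N, j*N:(j+1)*N] = Xi @ Xj.T
    vals, vecs = eig(A, B)                  # generalized eigenvalue problem A w = lambda B w
    top = np.argsort(-vals.real)[:d]        # eigenvectors of the d largest eigenvalues
    W = vecs[:, top].real
    return W[:N], W[N:2*N], W[2*N:]         # Omega, Phi, Psi, each of shape (N, d)

# toy usage with the embodiment sizes N = 32, M = 880, d = 40
y = np.random.randint(0, 2, 880)
Xs = [np.random.randn(32, 880) for _ in range(3)]
Xs = [X - X.mean(axis=1, keepdims=True) for X in Xs]   # centering, step (3.1.2)
Omega, Phi, Psi = dmcca(Xs, y, d=40)
print(Omega.shape)                                     # (32, 40)
```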
Based on the same inventive concept, the multi-modal emotion recognition system integrating the attention mechanism and the DMCCA provided by the embodiment of the invention comprises:
the characteristic primary extraction module is used for respectively extracting electroencephalogram signal characteristic vectors and expression characteristic vectors from the preprocessed electroencephalogram signals and facial expression videos by using respective trained neural network models, and extracting peripheral physiological signal characteristic vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and statistical characteristics thereof;
the characteristic identification enhancement module is used for mapping the electroencephalogram signal characteristic vector, the peripheral physiological signal characteristic vector and the expression characteristic vector into a plurality of groups of characteristic vectors through linear transformation matrixes respectively, determining importance weights of different characteristic vector groups respectively by using the attention mechanism module, and forming an electroencephalogram emotion characteristic vector, a peripheral physiological emotion characteristic vector and an expression emotion characteristic vector which have the same dimension and have identification power through weighting fusion;
the projection matrix determining module is used for determining a projection matrix of each emotion characteristic vector by maximizing the correlation among different modal emotion characteristics of the same type of samples by using a DMCCA method;
the feature fusion module is used for projecting the electroencephalogram emotion feature vector, the peripheral physiological emotion feature vector and the expression emotion feature vector to a public subspace through respective corresponding projection matrixes, and obtaining the electroencephalogram-peripheral physiological-expression multi-mode emotion feature vector after addition and fusion;
and the classification and identification module is used for classifying and identifying the multi-mode emotion feature vectors by using the classifier to obtain the emotion types.
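The five modules listed above can be wired together by a small orchestrating class, as in the following sketch; all interfaces and names here are hypothetical illustrations of the decomposition, not structures prescribed by the invention.

```python
class MultimodalEmotionRecognizer:
    """Hypothetical wiring of the five modules described above."""

    def __init__(self, extractors, attention_modules, projections, classifier):
        self.extractors = extractors                  # feature primary extraction module
        self.attention = attention_modules            # feature discrimination enhancement module
        self.Omega, self.Phi, self.Psi = projections  # output of the projection matrix determining module
        self.classifier = classifier                  # classification and identification module

    def fuse(self, eeg, peripheral, expression):
        """Feature fusion module: project each modality and add in the common subspace."""
        x1 = self.attention["eeg"](self.extractors["eeg"](eeg))
        x2 = self.attention["peripheral"](self.extractors["peripheral"](peripheral))
        x3 = self.attention["expression"](self.extractors["expression"](expression))
        return self.Omega.T @ x1 + self.Phi.T @ x2 + self.Psi.T @ x3

    def predict(self, eeg, peripheral, expression):
        fused = self.fuse(eeg, peripheral, expression)
        return self.classifier.predict(fused[None, :])[0]
```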
For specific implementation of each module, reference is made to the above method embodiment, and details are not repeated. Those skilled in the art will appreciate that the modules in the embodiments may be adaptively changed and arranged in one or more systems different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components.
Based on the same inventive concept, the multi-modal emotion recognition system combining the attention mechanism and the DMCCA provided by the embodiment of the invention comprises at least one computing device, wherein the computing device comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, and the computer program realizes the multi-modal emotion recognition method combining the attention mechanism and the DMCCA when being loaded into the processor.
The technical scheme disclosed by the invention includes not only the technical methods described in the above embodiments but also the technical schemes formed by arbitrarily combining those methods. Those skilled in the art can make certain improvements and modifications without departing from the principles of the present invention, and such improvements and modifications are to be considered within the scope of the present invention.
Claims (7)
1. The multimode emotion recognition method integrating the attention mechanism and the DMCCA is characterized by comprising the following steps of:
(1) extracting electroencephalogram signal feature vectors and expression feature vectors from the preprocessed electroencephalogram signals and facial expression videos by using respective trained neural network models, and extracting peripheral physiological signal feature vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and statistical features thereof;
(2) mapping the electroencephalogram signal feature vector, the peripheral physiological signal feature vector and the expression feature vector into a plurality of groups of feature vectors through linear transformation matrixes respectively, determining importance weights of different feature vector groups by using an attention mechanism module respectively, and forming an electroencephalogram emotion feature vector, a peripheral physiological emotion feature vector and an expression emotion feature vector which have the same dimension and are discriminating through weighting fusion;
(3) determining a projection matrix for each emotion feature vector by applying a discriminative multiset canonical correlation analysis (DMCCA) method to the electroencephalogram, peripheral physiological and expression emotion feature vectors and maximizing the correlation among the emotion features of different modalities for samples of the same class, projecting each emotion feature vector into a common subspace, and adding the projections to obtain the electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector;
(4) classifying the multi-modal emotion feature vector with a classifier to obtain the emotion category.
2. The multi-modal emotion recognition method combining attention mechanism and DMCCA as recited in claim 1, wherein step (2) comprises the sub-steps of:
(2.1) representing the electroencephalogram signal features extracted in step (1) in matrix form as F^{(1)}, and mapping them through a linear transformation matrix W^{(1)} into M_1 groups of feature vectors, 4 ≤ M_1 ≤ 16, where each group feature vector has dimension N, 16 ≤ N ≤ 64; let E^{(1)} = [e_1^{(1)}, e_2^{(1)}, ..., e_{M_1}^{(1)}]^T, then the linear transformation is expressed as:

E^{(1)} = (F^{(1)})^T W^{(1)}

wherein the superscript (1) denotes the electroencephalogram modality and T denotes transposition;

determining the importance weights of the different feature vector groups with a first attention mechanism module, and forming the discriminative electroencephalogram emotion feature vector by weighted fusion, wherein the weight α_r^{(1)} of the r-th group of electroencephalogram signal feature vectors and the electroencephalogram emotion feature vector x^{(1)} are expressed as:

α_r^{(1)} = exp((u^{(1)})^T e_r^{(1)}) / Σ_{k=1}^{M_1} exp((u^{(1)})^T e_k^{(1)}),    x^{(1)} = Σ_{r=1}^{M_1} α_r^{(1)} e_r^{(1)}

wherein r = 1, 2, ..., M_1, e_r^{(1)} denotes the r-th group of electroencephalogram signal feature vectors, u^{(1)} is a trainable linear transformation parameter vector, and exp(·) denotes the exponential function with the natural constant e as its base;
(2.2) representing the peripheral physiological signal features extracted in step (1) in matrix form as F^{(2)}, and mapping them through a linear transformation matrix W^{(2)} into M_2 groups of feature vectors, 4 ≤ M_2 ≤ 16; let E^{(2)} = [e_1^{(2)}, e_2^{(2)}, ..., e_{M_2}^{(2)}]^T, then the linear transformation is expressed as:

E^{(2)} = (F^{(2)})^T W^{(2)}

wherein the superscript (2) denotes the peripheral physiological modality;

determining the importance weights of the different feature vector groups with a second attention mechanism module, and forming the discriminative peripheral physiological emotion feature vector by weighted fusion, wherein the weight α_s^{(2)} of the s-th group of peripheral physiological signal feature vectors and the peripheral physiological emotion feature vector x^{(2)} are expressed as:

α_s^{(2)} = exp((u^{(2)})^T e_s^{(2)}) / Σ_{k=1}^{M_2} exp((u^{(2)})^T e_k^{(2)}),    x^{(2)} = Σ_{s=1}^{M_2} α_s^{(2)} e_s^{(2)}

wherein s = 1, 2, ..., M_2, e_s^{(2)} denotes the s-th group of peripheral physiological signal feature vectors, and u^{(2)} is a trainable linear transformation parameter vector;
(2.3) representing the expression features extracted in step (1) in matrix form as F^{(3)}, and mapping them through a linear transformation matrix W^{(3)} into M_3 groups of feature vectors, 4 ≤ M_3 ≤ 16; let E^{(3)} = [e_1^{(3)}, e_2^{(3)}, ..., e_{M_3}^{(3)}]^T, then the linear transformation is expressed as:

E^{(3)} = (F^{(3)})^T W^{(3)}

wherein the superscript (3) denotes the expression modality;

determining the importance weights of the different feature vector groups with a third attention mechanism module, and forming the discriminative expression emotion feature vector by weighted fusion, wherein the weight α_t^{(3)} of the t-th group of expression feature vectors and the expression emotion feature vector x^{(3)} are expressed as:

α_t^{(3)} = exp((u^{(3)})^T e_t^{(3)}) / Σ_{k=1}^{M_3} exp((u^{(3)})^T e_k^{(3)}),    x^{(3)} = Σ_{t=1}^{M_3} α_t^{(3)} e_t^{(3)}

wherein t = 1, 2, ..., M_3, e_t^{(3)} denotes the t-th group of expression feature vectors, and u^{(3)} is a trainable linear transformation parameter vector.
3. The multi-modal emotion recognition method combining attention mechanism and DMCCA as recited in claim 2, wherein step (3) comprises the sub-steps of:
(3.1) acquiring the DMCCA projection matrices Ω, Φ and Ψ obtained through training, corresponding respectively to the electroencephalogram, peripheral physiological and expression emotion features, each of size N × d with 32 ≤ d ≤ 128;
(3.2) using the projection matrices Ω, Φ and Ψ respectively to project the electroencephalogram emotion feature vector x^{(1)}, the peripheral physiological emotion feature vector x^{(2)} and the expression emotion feature vector x^{(3)} extracted in step (2) into a d-dimensional common subspace, where the projection of the electroencephalogram emotion feature vector x^{(1)} into the d-dimensional common subspace is Ω^T x^{(1)}, the projection of the peripheral physiological emotion feature vector x^{(2)} is Φ^T x^{(2)}, and the projection of the expression emotion feature vector x^{(3)} is Ψ^T x^{(3)};

(3.3) fusing Ω^T x^{(1)}, Φ^T x^{(2)} and Ψ^T x^{(3)} by addition to obtain the electroencephalogram-peripheral physiological-expression multi-modal emotion feature vector Ω^T x^{(1)} + Φ^T x^{(2)} + Ψ^T x^{(3)}.
4. The multi-modal emotion recognition method integrating an attention mechanism and DMCCA according to claim 3, wherein the projection matrices Ω, Φ and Ψ in step (3.1) are obtained by training:
(3.1.1) respectively extracting training samples of each emotion category from the training sample set to generate 3 groups of emotion feature vectors X^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_M^{(i)}], where x_m^{(i)} is an N-dimensional feature vector, M is the number of training samples, i = 1, 2, 3 and m = 1, 2, ..., M; let i = 1 denote the electroencephalogram modality, i = 2 the peripheral physiological modality and i = 3 the expression modality, so that x_m^{(1)} denotes an electroencephalogram emotion feature vector, x_m^{(2)} a peripheral physiological emotion feature vector and x_m^{(3)} an expression emotion feature vector;
(3.1.2) calculating the mean of the column vectors of X^{(i)} and performing a centering operation on X^{(i)};
(3.1.3) based on the idea of discriminative multiset canonical correlation analysis (DMCCA), solving for a group of projection matrices Ω, Φ and Ψ such that the linear correlation of same-class samples in the common projection space is maximized while the inter-class dispersion of the data within each modality is maximized and the intra-class dispersion of the data within each modality is minimized; letting w^{(i)} denote a projection vector of X^{(i)}, i = 1, 2, 3, the objective function of DMCCA is:

ρ = [ Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} ] / [ Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} ]

wherein S_w^{(i)} denotes the intra-class dispersion matrix of X^{(i)}, S_b^{(i)} denotes the inter-class dispersion matrix of X^{(i)}, C^{(ij)} = cov(X^{(i)}, X^{(j)}) denotes the between-set covariance of the same-class samples of X^{(i)} and X^{(j)}, cov(·,·) denotes the covariance, and i, j ∈ {1, 2, 3}; constructing and solving the following optimization model to obtain the projection matrices Ω, Φ and Ψ:

max_{w^{(1)}, w^{(2)}, w^{(3)}}  Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)}
s.t.  Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} = 1.
5. The multi-modal emotion recognition method integrating the attention mechanism and the DMCCA according to claim 4, wherein the constructed optimization model of the DMCCA objective function is solved with the Lagrange multiplier method, specifically: the optimization model is expressed as the following Lagrange function:

L(w^{(1)}, w^{(2)}, w^{(3)}) = Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} (w^{(i)})^T C^{(ij)} w^{(j)} + Σ_{i=1}^{3} (w^{(i)})^T S_b^{(i)} w^{(i)} − λ ( Σ_{i=1}^{3} (w^{(i)})^T S_w^{(i)} w^{(i)} − 1 )

wherein λ is the Lagrange multiplier; the partial derivatives of L(w^{(1)}, w^{(2)}, w^{(3)}) with respect to w^{(1)}, w^{(2)} and w^{(3)} are then computed and set to zero, i.e.

∂L/∂w^{(i)} = Σ_{j=1, j≠i}^{3} C^{(ij)} w^{(j)} + S_b^{(i)} w^{(i)} − λ S_w^{(i)} w^{(i)} = 0,  i = 1, 2, 3,

which, after further simplification, gives the following generalized eigenvalue problem:

[ S_b^{(1)}   C^{(12)}   C^{(13)} ] [ w^{(1)} ]       [ S_w^{(1)}     0           0      ] [ w^{(1)} ]
[ C^{(21)}   S_b^{(2)}   C^{(23)} ] [ w^{(2)} ]  = λ  [    0       S_w^{(2)}      0      ] [ w^{(2)} ]
[ C^{(31)}   C^{(32)}   S_b^{(3)} ] [ w^{(3)} ]       [    0          0       S_w^{(3)}  ] [ w^{(3)} ]

and the projection matrices Ω, Φ and Ψ are obtained by solving this generalized eigenvalue problem and selecting the eigenvectors corresponding to the first d largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d.
6. A multi-modal emotion recognition system integrating an attention mechanism and DMCCA, characterized by comprising:
a preliminary feature extraction module, configured to extract electroencephalogram feature vectors and expression feature vectors from the preprocessed electroencephalogram signals and facial expression videos using respective trained neural network models, and to extract peripheral physiological signal feature vectors from the preprocessed peripheral physiological signals by extracting signal waveform descriptors and their statistical features;
a feature discrimination enhancement module, configured to map the electroencephalogram signal feature vector, the peripheral physiological signal feature vector and the expression feature vector into multiple groups of feature vectors through respective linear transformation matrices, to determine the importance weights of the different feature vector groups with an attention mechanism module, and to form, by weighted fusion, an electroencephalogram emotion feature vector, a peripheral physiological emotion feature vector and an expression emotion feature vector that have the same dimension and are discriminative (an illustrative sketch of this weighting follows this claim);
a projection matrix determination module, configured to determine a projection matrix for each emotion feature vector by maximizing the correlation among the emotion features of different modalities for same-class samples, using the discriminative multi-set canonical correlation analysis (DMCCA) method;
a feature fusion module, configured to project the electroencephalogram emotion feature vector, the peripheral physiological emotion feature vector and the expression emotion feature vector into a common subspace through their corresponding projection matrices, and to obtain an electroencephalogram-peripheral-physiology-expression multi-modal emotion feature vector after additive fusion;
and a classification and recognition module, configured to classify the multi-modal emotion feature vector with a classifier to obtain the emotion category.
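As a loose illustration of the feature discrimination enhancement module above, the sketch below maps one modality's raw feature vector into several groups through linear transformation matrices, scores each group with a simple dot-product attention, and fuses the groups by their softmax weights. The group count, dimensions, scoring vector and function names (attention_weighted_fusion, score_w) are assumptions for illustration, not the patent's trained configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_weighted_fusion(x, transforms, score_w):
    """Map feature vector x into K groups via linear transformation matrices,
    compute an importance weight per group with a dot-product attention score,
    and return the weighted sum: a d-dimensional emotion feature vector.

    transforms: list of K matrices of shape (p, d); score_w: (d,) scoring vector.
    """
    groups = [T.T @ x for T in transforms]             # K groups of d-dim vectors
    scores = np.array([score_w @ g for g in groups])   # one scalar score per group
    weights = softmax(scores)                          # importance weights
    return sum(w * g for w, g in zip(weights, groups)) # weighted fusion

# Illustrative usage with random parameters
rng = np.random.default_rng(1)
p, d, K = 128, 32, 4
x = rng.normal(size=p)
transforms = [rng.normal(size=(p, d)) for _ in range(K)]
score_w = rng.normal(size=d)
emotion_vec = attention_weighted_fusion(x, transforms, score_w)
print(emotion_vec.shape)  # (32,)
```

In the described system this weighting would be applied separately to the electroencephalogram, peripheral physiological and expression feature vectors, so that the three resulting emotion feature vectors share the common dimension d before DMCCA projection.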
7. A multi-modal emotion recognition system integrating an attention mechanism and DMCCA, comprising at least one computing device that comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the multi-modal emotion recognition method integrating an attention mechanism and DMCCA according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110159085.8A CN112800998B (en) | 2021-02-05 | 2021-02-05 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110159085.8A CN112800998B (en) | 2021-02-05 | 2021-02-05 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112800998A true CN112800998A (en) | 2021-05-14 |
CN112800998B CN112800998B (en) | 2022-07-29 |
Family
ID=75814276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110159085.8A Active CN112800998B (en) | 2021-02-05 | 2021-02-05 | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800998B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113269173A (en) * | 2021-07-20 | 2021-08-17 | 佛山市墨纳森智能科技有限公司 | Method and device for establishing emotion recognition model and recognizing human emotion |
CN113297981A (en) * | 2021-05-27 | 2021-08-24 | 西北工业大学 | End-to-end electroencephalogram emotion recognition method based on attention mechanism |
CN113326781A (en) * | 2021-05-31 | 2021-08-31 | 合肥工业大学 | Non-contact anxiety recognition method and device based on face video |
CN113616209A (en) * | 2021-08-25 | 2021-11-09 | 西南石油大学 | Schizophrenia patient discrimination method based on space-time attention mechanism |
CN113729710A (en) * | 2021-09-26 | 2021-12-03 | 华南师范大学 | Real-time attention assessment method and system integrating multiple physiological modes |
CN113749656A (en) * | 2021-08-20 | 2021-12-07 | 杭州回车电子科技有限公司 | Emotion identification method and device based on multi-dimensional physiological signals |
CN114091599A (en) * | 2021-11-16 | 2022-02-25 | 上海交通大学 | Method for recognizing emotion of intensive interaction deep neural network among modalities |
CN114298189A (en) * | 2021-12-20 | 2022-04-08 | 深圳市海清视讯科技有限公司 | Fatigue driving detection method, device, equipment and storage medium |
CN114947852A (en) * | 2022-06-14 | 2022-08-30 | 华南师范大学 | Multi-mode emotion recognition method, device, equipment and storage medium |
CN117935339A (en) * | 2024-03-19 | 2024-04-26 | 北京长河数智科技有限责任公司 | Micro-expression recognition method based on multi-modal fusion |
CN118332505A (en) * | 2024-06-12 | 2024-07-12 | 临沂大学 | Physiological signal data processing method, system and device based on multi-mode fusion |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510456A (en) * | 2018-03-27 | 2018-09-07 | 华南理工大学 | The sketch of depth convolutional neural networks based on perception loss simplifies method |
CN109145983A (en) * | 2018-08-21 | 2019-01-04 | 电子科技大学 | A kind of real-time scene image, semantic dividing method based on lightweight network |
CN109543502A (en) * | 2018-09-27 | 2019-03-29 | 天津大学 | A kind of semantic segmentation method based on the multiple dimensioned neural network of depth |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510456A (en) * | 2018-03-27 | 2018-09-07 | 华南理工大学 | The sketch of depth convolutional neural networks based on perception loss simplifies method |
CN109145983A (en) * | 2018-08-21 | 2019-01-04 | 电子科技大学 | A kind of real-time scene image, semantic dividing method based on lightweight network |
CN109543502A (en) * | 2018-09-27 | 2019-03-29 | 天津大学 | A kind of semantic segmentation method based on the multiple dimensioned neural network of depth |
Non-Patent Citations (1)
Title |
---|
YUAN QIUZHUANG et al.: "Research on SAR on-satellite target recognition system based on deep learning neural network", Aerospace Shanghai (上海航天) *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113297981A (en) * | 2021-05-27 | 2021-08-24 | 西北工业大学 | End-to-end electroencephalogram emotion recognition method based on attention mechanism |
CN113297981B (en) * | 2021-05-27 | 2023-04-07 | 西北工业大学 | End-to-end electroencephalogram emotion recognition method based on attention mechanism |
CN113326781B (en) * | 2021-05-31 | 2022-09-02 | 合肥工业大学 | Non-contact anxiety recognition method and device based on face video |
CN113326781A (en) * | 2021-05-31 | 2021-08-31 | 合肥工业大学 | Non-contact anxiety recognition method and device based on face video |
CN113269173A (en) * | 2021-07-20 | 2021-08-17 | 佛山市墨纳森智能科技有限公司 | Method and device for establishing emotion recognition model and recognizing human emotion |
CN113749656A (en) * | 2021-08-20 | 2021-12-07 | 杭州回车电子科技有限公司 | Emotion identification method and device based on multi-dimensional physiological signals |
CN113749656B (en) * | 2021-08-20 | 2023-12-26 | 杭州回车电子科技有限公司 | Emotion recognition method and device based on multidimensional physiological signals |
CN113616209A (en) * | 2021-08-25 | 2021-11-09 | 西南石油大学 | Schizophrenia patient discrimination method based on space-time attention mechanism |
CN113616209B (en) * | 2021-08-25 | 2023-08-04 | 西南石油大学 | Method for screening schizophrenic patients based on space-time attention mechanism |
CN113729710A (en) * | 2021-09-26 | 2021-12-03 | 华南师范大学 | Real-time attention assessment method and system integrating multiple physiological modes |
CN114091599A (en) * | 2021-11-16 | 2022-02-25 | 上海交通大学 | Method for recognizing emotion of intensive interaction deep neural network among modalities |
CN114298189A (en) * | 2021-12-20 | 2022-04-08 | 深圳市海清视讯科技有限公司 | Fatigue driving detection method, device, equipment and storage medium |
CN114947852B (en) * | 2022-06-14 | 2023-01-10 | 华南师范大学 | Multi-mode emotion recognition method, device, equipment and storage medium |
CN114947852A (en) * | 2022-06-14 | 2022-08-30 | 华南师范大学 | Multi-mode emotion recognition method, device, equipment and storage medium |
CN117935339A (en) * | 2024-03-19 | 2024-04-26 | 北京长河数智科技有限责任公司 | Micro-expression recognition method based on multi-modal fusion |
CN118332505A (en) * | 2024-06-12 | 2024-07-12 | 临沂大学 | Physiological signal data processing method, system and device based on multi-mode fusion |
CN118332505B (en) * | 2024-06-12 | 2024-08-20 | 临沂大学 | Physiological signal data processing method, system and device based on multi-mode fusion |
Also Published As
Publication number | Publication date |
---|---|
CN112800998B (en) | 2022-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112800998B (en) | Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA | |
Abdullah et al. | Multimodal emotion recognition using deep learning | |
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108877801B (en) | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system | |
CN108899050B (en) | Voice signal analysis subsystem based on multi-modal emotion recognition system | |
CN108805088B (en) | Physiological signal analysis subsystem based on multi-modal emotion recognition system | |
CN106886792B (en) | Electroencephalogram emotion recognition method for constructing multi-classifier fusion model based on layering mechanism | |
CN112784798A (en) | Multi-modal emotion recognition method based on feature-time attention mechanism | |
CN111134666A (en) | Emotion recognition method of multi-channel electroencephalogram data and electronic device | |
CN107766898A (en) | The three classification mood probabilistic determination methods based on SVM | |
Jinliang et al. | EEG emotion recognition based on granger causality and capsnet neural network | |
Schels et al. | Multi-modal classifier-fusion for the recognition of emotions | |
CN117198468B (en) | Intervention scheme intelligent management system based on behavior recognition and data analysis | |
Rayatdoost et al. | Subject-invariant EEG representation learning for emotion recognition | |
Chen et al. | Patient emotion recognition in human computer interaction system based on machine learning method and interactive design theory | |
Lu et al. | Speech depression recognition based on attentional residual network | |
CN117935339A (en) | Micro-expression recognition method based on multi-modal fusion | |
Peng | Research on Emotion Recognition Based on Deep Learning for Mental Health | |
Zhao et al. | Multiscale Global Prompt Transformer for EEG-Based Driver Fatigue Recognition | |
CN117609863A (en) | Long-time electroencephalogram emotion recognition method based on electroencephalogram micro state | |
Nakisa | Emotion classification using advanced machine learning techniques applied to wearable physiological signals data | |
Zhang et al. | Evolutionary Ensemble Learning for EEG-based Cross-Subject Emotion Recognition | |
CN111709314B (en) | Emotion distribution identification method based on facial surface myoelectricity | |
Akalya devi et al. | Multimodal emotion recognition framework using a decision-level fusion and feature-level fusion approach | |
Udurume et al. | Real-time Multimodal Emotion Recognition Based on Multithreaded Weighted Average Fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |