CN114724222B - AI digital human emotion analysis method based on multiple modes - Google Patents
AI digital human emotion analysis method based on multiple modes
- Publication number
- CN114724222B (granted publication of application CN202210394800.0A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- face
- recognition model
- voice
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000008451 emotion Effects 0.000 title claims abstract description 148
- 238000004458 analytical method Methods 0.000 title claims abstract description 32
- 230000008921 facial expression Effects 0.000 claims abstract description 20
- 238000000034 method Methods 0.000 claims abstract description 20
- 230000008909 emotion recognition Effects 0.000 claims abstract description 10
- 238000001514 detection method Methods 0.000 claims description 52
- 238000012549 training Methods 0.000 claims description 27
- 238000013527 convolutional neural network Methods 0.000 claims description 14
- 230000007935 neutral effect Effects 0.000 claims description 12
- 238000010586 diagram Methods 0.000 claims description 9
- 238000012360 testing method Methods 0.000 claims description 9
- 238000012795 verification Methods 0.000 claims description 9
- 238000011156 evaluation Methods 0.000 claims description 4
- 230000008859 change Effects 0.000 claims description 3
- 230000001815 facial effect Effects 0.000 claims description 3
- 238000001228 spectrum Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 2
- 230000003993 interaction Effects 0.000 description 5
- 238000000605 extraction Methods 0.000 description 3
- 230000035945 sensitivity Effects 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 230000001419 dependent effect Effects 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 238000011426 transformation method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Psychiatry (AREA)
- Acoustics & Sound (AREA)
- Hospice & Palliative Care (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Child & Adolescent Psychology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal AI digital human emotion analysis method, which comprises the following steps: S1, facial expression emotion recognition and judgment, where the output result a is input into the multi-modal emotion analysis module; S2, voice emotion recognition and judgment, outputting result e to the multi-modal emotion analysis module; S3, text emotion recognition and judgment, outputting result f to the multi-modal emotion analysis module; S4, in the multi-modal emotion analysis module, the results a, e and f are subjected to random-combination emotion judgment, the average probability value of the random combination case is taken as the final emotion judgment result g, and g is output to the AI digital person. By judging the user's emotional state across multiple modalities, the method grasps the emotional state and the meaning expressed by the user comprehensively and accurately; it is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields such as medical care, education and services.
Description
Technical Field
The invention relates to the technical field of AI digital persons, in particular to an AI digital person emotion analysis method based on multiple modes.
Background
An AI digital person system generally consists of five modules: character image, voice generation, animation generation, audio-video synthesis and display, and interaction. The interaction module gives the AI digital person its interactive capability: the user's intention is identified through intelligent technologies such as speech and semantic recognition, the digital person's subsequent speech and actions are determined according to the user's current intention, and the character is driven into the next round of interaction. During this process, the AI digital person needs to judge the client's emotion accurately in order to provide accurate service. This can be done by judging the emotional tendency of the text after semantic understanding, or by capturing the client's facial expression with a camera and providing the digital person with emotion analysis through expression recognition.
Firstly, facial expression recognition relies on face detection. Traditional face detection methods often miss faces and lack robustness; faces in profile or in poorly lit environments are frequently not detected, which degrades the emotion analysis result;
Secondly, for specific scenarios such as the finance, medical and education industries, an AI digital person generally needs the ability to understand the sentiment of the client's text semantics and, in combination with the business scenario, to judge correctly whether the client's semantics are positive (yes), negative (no) or neutral in emotion. However, this text semantic understanding capability requires a large data corpus or a manually built dictionary, so it depends heavily on data resources and human effort, and in broader scenarios judging the client's emotion from text semantics alone is insufficient;
Finally, some existing AI digital persons judge the user's emotional state from voice features. One approach first recognizes the speech as text and then judges emotion from the text, which depends heavily on the accuracy of speech recognition; the other judges emotion directly from the voice, but feature extraction for judging emotion from raw speech is still immature, so the accuracy of the resulting emotion judgment is low;
In summary, single-modality emotion recognition is less accurate at judging a client's emotional state than comprehensive multi-modal recognition over images, voice, text and so on; the invention therefore provides a multi-modal AI digital human emotion analysis method.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a multi-mode-based AI digital human emotion analysis method is provided.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multimode-based AI digital human emotion analysis method comprises the following steps:
s1, facial expression recognition emotion judgment:
s11, acquiring an image through a camera module to serve as an original image A to be detected;
S12, converting the original image A to be detected to size (640, 640, 3) to obtain the original image B;
s13, inputting the original image B into a trained retinaface face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
s15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
s16, outputting the emotion type corresponding to the maximum probability value, and outputting a result a to be input into the multi-mode emotion analysis module;
S2, voice emotion recognition judgment:
s21, collecting voice E through a voice collecting module;
s22, inputting the voice E into a voice emotion judging model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
s23, outputting emotion categories corresponding to the maximum probability values, and outputting a result e to the multi-modal emotion analysis module;
S3, identifying and judging text emotion:
S31, collecting the voice E through a voice acquisition module, and converting the voice E into a text F;
S32, inputting the text F into a text emotion recognition model, performing emotion scoring on the text F, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
s33, outputting emotion categories corresponding to the maximum probability values, and outputting a result f to the multi-modal emotion analysis module;
S4, in the multi-mode emotion analysis module, carrying out emotion random combination judgment on the results a, e and f, taking an average probability value of random emotion combination conditions as a final emotion judgment result g, and outputting the final emotion judgment result g to the AI digital person.
As a further description of the above technical solution:
The training method of the retinaface face detection and recognition model is as follows: the widerface dataset, which comprises at least 32203 pictures, is selected and divided into a training set, a verification set and a test set at a ratio of 4:1:5; the training set is used to train the retinaface face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and resizing.
As a further description of the above technical solution:
the retinaface face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the retinaface face detection and recognition model the global loss function L is as follows:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
wherein L_cls is the softmax loss of the face/non-face two-class classifier, L_box is the smoothL1 box regression loss, L_pts is the face key point regression loss, L_pixel is the dense regression loss, p_i is the probability that the i-th anchor is a face, p_i* is the true category of the i-th anchor (1 represents a face, 0 a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} represents the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} represents the real box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} represents the predicted face key point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} represents the real face key point coordinates.
As a further description of the above technical solution:
The training method of the facial expression recognition model is as follows: a dataset of seven types of face images, namely 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral', comprising at least 35887 pictures with at least 3000 pictures per category, is selected and divided into a training set, a verification set and a test set at a ratio of 8:1:1, which are used to train the facial expression recognition model.
As a further description of the above technical solution:
The facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AVGPool, and the stream module uses a depthwise convolution DWConv layer with a stride greater than 1 to downsample and output a 1-dimensional feature vector.
As a further description of the above technical solution:
The emotion random combination condition comprises the following:
Emotion random combination case one: combined output of results e and f: if the AI digital person does not contain a camera module, image information cannot be obtained in real time, so the emotion state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is the final emotion judgment result g1;
Emotion random combination case two: combined output of results a and e: if the AI digital person does not contain a text emotion recognition model, the text emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is the final emotion judgment result g2;
Emotion random combination case three: combined output of results a and f: if the AI digital person does not contain a voice emotion judging model, the voice emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is the final emotion judgment result g3;
Emotion random combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is the final emotion judgment result g4.
As a further description of the above technical solution:
On the premise that the IOU between the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated in the standard way by precision and recall:
precision = TP / (TP + FP);
recall = TP / (TP + FN);
TP = GT ∩ PRED: correct prediction, a true positive, i.e. the model predicts a positive example that is actually a positive example;
FP = PRED - (GT ∩ PRED): wrong prediction, a false positive, i.e. the model predicts a positive example that is actually a negative example;
FN = GT - (GT ∩ PRED): wrong prediction, a false negative, i.e. the model predicts a negative example that is actually a positive example;
Specifically, GT denotes the ground-truth annotation of a picture input to the retinaface face detection and recognition model, PRED denotes the prediction samples output by the retinaface face detection and recognition model, and IOU denotes the ratio of the intersection to the union between the prediction result for a picture from the widerface dataset input to the retinaface face detection and recognition model and the ground truth of that picture.
In summary, owing to the adoption of the above technical scheme, the beneficial effects of the invention are as follows: the method is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields, such as medical care, education and services.
Drawings
Fig. 1 shows a schematic flow chart of a facial expression recognition emotion judgment method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a network structure of a retinaface face detection and recognition model according to an embodiment of the present invention;
fig. 3 shows a schematic diagram of a CNN backbone network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a CNN backbone network structure module according to an embodiment of the present invention;
FIG. 5 shows a schematic diagram of an SSH context module architecture provided in accordance with an embodiment of the present invention;
FIG. 6 illustrates a schematic diagram of an IOU provided in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1-6, the present invention provides a technical solution: a multimode-based AI digital human emotion analysis method comprises the following steps:
s1, facial expression recognition emotion judgment:
s11, acquiring an image through a camera module to serve as an original image A to be detected;
S12, converting the original image A to be detected to size (640, 640, 3) to obtain the original image B;
s13, inputting the original image B into a trained retinaface face detection and recognition model, and outputting a face detection frame C;
specifically, the training method of the retinaface face detection and recognition model is as follows: the widerface dataset, which comprises at least 32203 pictures, is selected and divided into a training set, a verification set and a test set at a ratio of 4:1:5 for training the retinaface face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and resizing;
Traditional face detection methods suffer from missed detections and low robustness; by augmenting the training data and training the retinaface face detection and recognition model, a deep detection model with high accuracy is obtained;
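As an illustration of the augmentation step above, the following is a minimal sketch assuming torchvision is used; the patent names the transforms but not a library, and the parameter values are illustrative. In a detection setting the face boxes and key points must be transformed consistently with the image, which this image-only sketch omits.

```python
# Minimal augmentation sketch (assumed library: torchvision; values illustrative).
# Covers brightness/saturation/hue change, random cropping, mirror flipping and resizing.
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1),   # brightness, saturation, hue change
    T.RandomResizedCrop(size=640, scale=(0.6, 1.0)),          # random cropping + resizing to 640x640
    T.RandomHorizontalFlip(p=0.5),                            # mirror flipping
])
# usage: augmented = augment(pil_image)  # boxes/landmarks would need matching transforms
```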
Further, as shown in fig. 2, the retinaface face detection and recognition model includes 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module; the SSH context module is added on the 5 pyramid feature maps to improve the detection precision of small faces, the deformable convolution network DCN module is introduced to improve precision, and 5 face key points are added to improve the precision of the detection algorithm on the hard subset of the widerface dataset;
Specifically, the SSH context module improves the detection of small faces by introducing context into the feature maps. In a two-stage detector, context is typically integrated by enlarging the window around the candidate proposals; this strategy is mimicked here with simple convolution layers, and fig. 5 shows the context layer integrated into the detection module. Since the anchors are classified and regressed in a convolutional manner, a larger filter (a larger convolution kernel) is adopted, which is similar to increasing the window size around the proposals in a two-stage detector; for this purpose the invention uses 5×5 and 7×7 filters (convolution kernels) in the SSH context module, which enlarges the receptive field in proportion to the stride of the corresponding layer and thus increases the target scale handled by each detection module. To reduce the number of parameters, a series of 3×3 convolution kernels is used in place of the larger kernels, achieving an equally large receptive field with fewer parameters, and the SSH module thereby improves the average detection precision on the widerface dataset;
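A minimal PyTorch sketch of an SSH-style context module is given below, assuming the 5×5 and 7×7 receptive fields are emulated with stacked 3×3 convolutions as described above; the channel split and activation choices are assumptions, not taken from the patent.

```python
# Sketch of an SSH-style context module: one direct 3x3 branch plus two context
# branches built from stacked 3x3 convolutions (two 3x3 ~ 5x5, three 3x3 ~ 7x7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSHContext(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        assert out_ch % 4 == 0
        self.conv3x3 = nn.Conv2d(in_ch, out_ch // 2, 3, padding=1)        # direct branch
        self.ctx_reduce = nn.Conv2d(in_ch, out_ch // 4, 3, padding=1)     # shared context reduction
        self.ctx_5x5 = nn.Conv2d(out_ch // 4, out_ch // 4, 3, padding=1)  # second 3x3 -> ~5x5 field
        self.ctx_7x7a = nn.Conv2d(out_ch // 4, out_ch // 4, 3, padding=1)
        self.ctx_7x7b = nn.Conv2d(out_ch // 4, out_ch // 4, 3, padding=1) # third 3x3 -> ~7x7 field

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direct = self.conv3x3(x)
        ctx = F.relu(self.ctx_reduce(x))
        ctx5 = self.ctx_5x5(ctx)
        ctx7 = self.ctx_7x7b(F.relu(self.ctx_7x7a(ctx)))
        return F.relu(torch.cat([direct, ctx5, ctx7], dim=1))             # concatenated context features
```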
The loss function comprises the classification loss, the box regression loss, the face key point regression loss and the dense regression loss; introducing the face key points improves the regression accuracy of the box. In the retinaface face detection and recognition model, the global loss function L is as follows:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
wherein L_cls is the softmax loss of the face/non-face two-class classifier, L_box is the smoothL1 box regression loss, L_pts is the face key point regression loss, L_pixel is the dense regression loss, p_i is the probability that the i-th anchor is a face, p_i* is the true category of the i-th anchor (1 represents a face, 0 a non-face), λ1 = 0.25, λ2 = 0.1 and λ3 = 0.01 balance the weights of the different loss types, t_i = {t_x, t_y, t_w, t_h} represents the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} represents the real box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} represents the predicted face key point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} represents the real face key point coordinates;
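A hedged sketch of this multi-task loss is shown below. The classification, box and key-point terms follow the definitions above; the dense regression term is passed in as a precomputed value because the patent does not detail its computation, and all tensor shapes are assumptions.

```python
# Sketch of the global loss L: softmax classification + smoothL1 box regression
# + 5-point landmark regression over positive anchors + weighted dense term.
import torch
import torch.nn.functional as F

def global_loss(cls_logits, cls_target, box_pred, box_gt, lmk_pred, lmk_gt,
                dense_loss, lambda1=0.25, lambda2=0.1, lambda3=0.01):
    # cls_logits: (N, 2), cls_target: (N,) with 1 = face, 0 = non-face
    pos = cls_target == 1                                  # anchors whose true category is "face"
    l_cls = F.cross_entropy(cls_logits, cls_target)        # softmax two-class loss
    l_box = F.smooth_l1_loss(box_pred[pos], box_gt[pos])   # smoothL1 on {tx, ty, tw, th}
    l_pts = F.smooth_l1_loss(lmk_pred[pos], lmk_gt[pos])   # 5 key points (10 coordinates)
    return l_cls + lambda1 * l_box + lambda2 * l_pts + lambda3 * dense_loss
```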
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
s15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
Specifically, the training method of the facial expression recognition model is as follows: a dataset of seven types of face images, namely 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral', comprising at least 35887 pictures with at least 3000 pictures per category, is selected and divided into a training set, a verification set and a test set at a ratio of 8:1:1, which are used to train the facial expression recognition model;
Further, as shown in fig. 3 and fig. 4, the facial expression recognition model includes a CNN backbone network and a stream module. The CNN backbone network includes an inverted residual block BlockA and a downsampling module BlockB; the left auxiliary branch of the downsampling module BlockB uses AVGPool, because it can embed multi-scale information and aggregate features from different receptive fields, which improves performance. In the stream module, downsampling is performed with a depthwise convolution DWConv layer with a stride greater than 1, and a 1-dimensional feature vector is output;
In facial expression recognition, the input image is resized to 224 × 224 and fed into the CNN backbone network, where convolution extracts features and produces a 7 × 7 feature map. After the CNN backbone, in order to extract the feature map information better, the stream module is used: a depthwise convolution DWConv layer with a stride greater than 1 downsamples the feature map and a 1-dimensional feature vector (1 × 7) is output, which reduces the overfitting risk that a fully connected layer would introduce; the loss is then computed on this feature vector for prediction. A fast downsampling strategy is adopted at the initial stage of the CNN backbone network, so that the feature map size shrinks quickly with few parameters, avoiding the weak feature embedding capability and long processing time that a slow downsampling process would cause under limited computing power;
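The following is a minimal sketch of such a stream-module head, assuming a PyTorch implementation; the input channel count (512) and the average-pooling step after the strided depthwise convolution are assumptions used only to produce the 1-dimensional, 7-class output described above.

```python
# Sketch of a stream-module head: strided depthwise conv downsamples the 7x7
# feature map, then a 1x1 conv maps to 7 emotion classes without a fully
# connected layer; the pooling step and channel count are assumptions.
import torch
import torch.nn as nn

class StreamHead(nn.Module):
    def __init__(self, channels: int = 512, num_classes: int = 7):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                                padding=1, groups=channels)   # depthwise, stride > 1: 7x7 -> 4x4
        self.pool = nn.AdaptiveAvgPool2d(1)                   # 4x4 -> 1x1
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.dwconv(feat))
        return self.classifier(x).flatten(1)                  # 1-dimensional 7-class vector

# usage sketch: seven-category emotion probabilities from a 7x7 backbone feature map
probs = torch.softmax(StreamHead()(torch.randn(1, 512, 7, 7)), dim=1)
```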
s16, outputting the emotion type corresponding to the maximum probability value, and outputting a result a to be input into the multi-mode emotion analysis module;
S2, voice emotion recognition judgment:
s21, collecting voice E through a voice collecting module;
s22, inputting the voice E into a voice emotion judging model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio (a feature-extraction sketch follows step S23 below), and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
s23, outputting emotion categories corresponding to the maximum probability values, and outputting a result e to the multi-modal emotion analysis module;
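As an illustration of the feature extraction in step S22, the following is a minimal sketch assuming librosa is used; the patent names the acoustic features but not a library, and the pooling into a fixed-length vector for the downstream emotion classifier is an assumption.

```python
# Sketch of the acoustic features named in step S22 (assumed library: librosa).
import numpy as np
import librosa

def voice_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    zcr = librosa.feature.zero_crossing_rate(y)               # zero-crossing rate
    rms = librosa.feature.rms(y=y)                            # amplitude (RMS energy)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # Mel-frequency cepstral coefficients
    # average each feature over time to obtain one fixed-length vector per utterance
    return np.concatenate([f.mean(axis=1) for f in (zcr, rms, centroid, mfcc)])
```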
S3, identifying and judging text emotion:
S31, collecting the voice E through a voice acquisition module, and converting the voice E into a text F;
S32, inputting the text F into a text emotion recognition model, performing emotion scoring on the text F, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
s33, outputting emotion categories corresponding to the maximum probability values, and outputting a result f to the multi-modal emotion analysis module;
S4, in the multi-mode emotion analysis module, carrying out emotion random combination judgment on the results a, e and f, taking an average probability value of random emotion combination conditions as a final emotion judgment result g, and outputting the final emotion judgment result g to an AI digital person;
Specifically, the emotion random combination condition includes the following:
Emotion random combination case one: combined output of results e and f: if the AI digital person does not contain a camera module, image information cannot be obtained in real time, so the emotion state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is the final emotion judgment result g1;
Emotion random combination case two: combined output of results a and e: if the AI digital person does not contain a text emotion recognition model, the text emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is the final emotion judgment result g2;
Emotion random combination case three: combined output of results a and f: if the AI digital person does not contain a voice emotion judging model, the voice emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is the final emotion judgment result g3;
Emotion random combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is the final emotion judgment result g4;
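A minimal sketch of this random-combination fusion is given below: whichever of the three seven-category probability vectors a, e and f are available are averaged, and the emotion with the highest mean probability is taken as the final result g. The probability values in the usage example are illustrative only.

```python
# Sketch of step S4: average the available modality probability vectors and
# take the emotion category with the highest mean probability as result g.
import numpy as np

EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprise", "neutral"]

def fuse(a=None, e=None, f=None):
    available = [np.asarray(p, dtype=float) for p in (a, e, f) if p is not None]
    if not available:
        raise ValueError("at least one modality result is required")
    mean_probs = np.mean(available, axis=0)            # average probability value (cases one to four)
    return EMOTIONS[int(mean_probs.argmax())], mean_probs

# usage (case two: no text model, fuse face result a and voice result e)
g2, _ = fuse(a=[0.10, 0.05, 0.05, 0.60, 0.05, 0.05, 0.10],
             e=[0.20, 0.05, 0.05, 0.50, 0.05, 0.05, 0.10])
```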
The method is suitable for chat robots in financial scenes, and can be used as chat robots in other vertical fields, such as medical, educational, service and the like.
Referring to fig. 6, on the premise that the IOU between the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated in the standard way by precision and recall:
precision = TP / (TP + FP);
recall = TP / (TP + FN);
TP = GT ∩ PRED: correct prediction, a true positive, i.e. the model predicts a positive example that is actually a positive example;
FP = PRED - (GT ∩ PRED): wrong prediction, a false positive, i.e. the model predicts a positive example that is actually a negative example;
FN = GT - (GT ∩ PRED): wrong prediction, a false negative, i.e. the model predicts a negative example that is actually a positive example;
Specifically, GT denotes the ground-truth annotation of a picture input to the retinaface face detection and recognition model, PRED denotes the prediction samples output by the retinaface face detection and recognition model, and IOU denotes the ratio of the intersection to the union between the prediction result for a picture from the widerface dataset input to the retinaface face detection and recognition model and the ground truth of that picture;
The precision indicates how accurate the predictions are among the positive results, while the recall is used to evaluate missed detections: the higher the recall, the larger the proportion of actual positive samples that are detected. By analyzing the missed and false detections, it can be determined whether the generalization of the retinaface face detection and recognition model is insufficient for certain special scenarios, and optimization measures such as image enhancement can then be applied.
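The following is a hedged sketch of this evaluation, assuming boxes are given as [x1, y1, x2, y2] lists: a predicted box counts as a true positive when its IOU with a not-yet-matched ground-truth box exceeds 0.5; the greedy matching order is an assumption.

```python
# Sketch of precision/recall evaluation with an IOU > 0.5 matching criterion.
import numpy as np

def iou(box_a, box_b):
    box_a, box_b = np.asarray(box_a, float), np.asarray(box_b, float)
    x1, y1 = np.maximum(box_a[:2], box_b[:2])
    x2, y2 = np.minimum(box_a[2:], box_b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(pred_boxes, gt_boxes, thr=0.5):
    matched, tp = set(), 0
    for p in pred_boxes:
        hits = [i for i, g in enumerate(gt_boxes) if i not in matched and iou(p, g) > thr]
        if hits:                                   # TP: prediction overlaps an unmatched GT box
            matched.add(hits[0])
            tp += 1
    fp = len(pred_boxes) - tp                      # FP = PRED - (GT ∩ PRED)
    fn = len(gt_boxes) - tp                        # FN = GT - (GT ∩ PRED)
    precision = tp / (tp + fp) if len(pred_boxes) else 0.0
    recall = tp / (tp + fn) if len(gt_boxes) else 0.0
    return precision, recall
```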
The method is suitable for the chat robots in financial scenes, and can be used as chat robots in other vertical fields, such as medical, education, service and the like.
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto; any equivalent substitution or modification made, within the technical scope disclosed by the present invention, by a person skilled in the art according to the technical scheme of the present invention and its inventive concept shall be covered by the protection scope of the present invention.
Claims (5)
1. The multi-mode-based AI digital human emotion analysis method is characterized by comprising the following steps: s1, facial expression recognition emotion judgment:
s11, acquiring an image through a camera module to serve as an original image A to be detected;
S12, converting the original image A to be detected to size (640, 640, 3) to obtain the original image B;
s13, inputting the original image B into a trained retinaface face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
s15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
s16, outputting the emotion type corresponding to the maximum probability value, and outputting a result a to be input into the multi-mode emotion analysis module;
S2, voice emotion recognition judgment:
s21, collecting voice E through a voice collecting module;
s22, inputting the voice E into a voice emotion judging model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for seven emotion categories: 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral';
s23, outputting emotion categories corresponding to the maximum probability values, and outputting a result e to the multi-modal emotion analysis module;
S3, identifying and judging text emotion:
S31, collecting the voice E through a voice acquisition module, and converting the voice E into a text F;
S32, inputting the text F into a text emotion recognition model, performing emotion scoring on the text F, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
s33, outputting emotion categories corresponding to the maximum probability values, and outputting a result f to the multi-modal emotion analysis module;
S4, in the multi-mode emotion analysis module, carrying out emotion random combination judgment on the results a, e and f, taking an average probability value of random emotion combination conditions as a final emotion judgment result g, and outputting the final emotion judgment result g to an AI digital person;
the retinaface face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the retinaface face detection and recognition model the global loss function L is as follows:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
wherein L_cls is the softmax loss of the face/non-face two-class classifier, L_box is the smoothL1 box regression loss, L_pts is the face key point regression loss, L_pixel is the dense regression loss, p_i is the probability that the i-th anchor is a face, p_i* is the true category of the i-th anchor (1 represents a face, 0 a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} represents the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} represents the real box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} represents the predicted face key point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} represents the real face key point coordinates;
The facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AVGPool, and the stream module performs downsampling with a depthwise convolution DWConv layer with a stride greater than 1 and outputs a 1-dimensional feature vector.
2. The multi-modal-based AI digital human emotion analysis method of claim 1, wherein the training method of the retinaface face detection and recognition model is as follows: the widerface dataset, which comprises at least 32203 pictures, is selected and divided into a training set, a verification set and a test set at a ratio of 4:1:5; the training set is used to train the retinaface face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and resizing.
3. The multi-modal-based AI digital human emotion analysis method of claim 1, wherein the training method of the facial expression recognition model is as follows: a dataset of seven types of face images, namely 'angry', 'disgusted', 'fearful', 'happy', 'sad', 'surprise' and 'neutral', comprising at least 35887 pictures with at least 3000 pictures per category, is selected and divided into a training set, a verification set and a test set at a ratio of 8:1:1, which are used to train the facial expression recognition model.
4. The multi-modal based AI digital human emotion analysis method of claim 1, wherein the emotion random combination condition comprises the following:
Emotion random combination case one: combined output of results e and f: if the AI digital person does not contain a camera module, image information cannot be obtained in real time, so the emotion state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is the final emotion judgment result g1;
Emotion random combination case two: combined output of results a and e: if the AI digital person does not contain a text emotion recognition model, the text emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is the final emotion judgment result g2;
Emotion random combination case three: combined output of results a and f: if the AI digital person does not contain a voice emotion judging model, the voice emotion judgment result cannot be output in real time, so the emotion state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is the final emotion judgment result g3;
Emotion random combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is the final emotion judgment result g4.
5. The multi-modal-based AI digital human emotion analysis method of claim 1, wherein, on the premise that the IOU between the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated in the standard way by precision and recall:
precision = TP / (TP + FP);
recall = TP / (TP + FN);
TP = GT ∩ PRED: correct prediction, a true positive, i.e. the model predicts a positive example that is actually a positive example;
FP = PRED - (GT ∩ PRED): wrong prediction, a false positive, i.e. the model predicts a positive example that is actually a negative example;
FN = GT - (GT ∩ PRED): wrong prediction, a false negative, i.e. the model predicts a negative example that is actually a positive example;
Specifically, GT denotes the ground-truth annotation of a picture input to the retinaface face detection and recognition model, PRED denotes the prediction samples output by the retinaface face detection and recognition model, and IOU denotes the ratio of the intersection to the union between the prediction result for a picture from the widerface dataset input to the retinaface face detection and recognition model and the ground truth of that picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210394800.0A CN114724222B (en) | 2022-04-14 | 2022-04-14 | AI digital human emotion analysis method based on multiple modes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210394800.0A CN114724222B (en) | 2022-04-14 | 2022-04-14 | AI digital human emotion analysis method based on multiple modes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114724222A CN114724222A (en) | 2022-07-08 |
CN114724222B true CN114724222B (en) | 2024-04-19 |
Family
ID=82244023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210394800.0A Active CN114724222B (en) | 2022-04-14 | 2022-04-14 | AI digital human emotion analysis method based on multiple modes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114724222B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115641837A (en) * | 2022-12-22 | 2023-01-24 | 北京资采信息技术有限公司 | Intelligent robot conversation intention recognition method and system |
CN116520980A (en) * | 2023-04-03 | 2023-08-01 | 湖北大学 | Interaction method, system and terminal for emotion analysis of intelligent shopping guide robot in mall |
CN117234369B (en) * | 2023-08-21 | 2024-06-21 | 华院计算技术(上海)股份有限公司 | Digital human interaction method and system, computer readable storage medium and digital human equipment |
CN117576279B (en) * | 2023-11-28 | 2024-04-19 | 世优(北京)科技有限公司 | Digital person driving method and system based on multi-mode data |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609572A (en) * | 2017-08-15 | 2018-01-19 | 中国科学院自动化研究所 | Multi-modal emotion identification method, system based on neutral net and transfer learning |
CN109558935A (en) * | 2018-11-28 | 2019-04-02 | 黄欢 | Emotion recognition and exchange method and system based on deep learning |
KR20190119863A (en) * | 2018-04-13 | 2019-10-23 | 인하대학교 산학협력단 | Video-based human emotion recognition using semi-supervised learning and multimodal networks |
CN111564164A (en) * | 2020-04-01 | 2020-08-21 | 中国电力科学研究院有限公司 | Multi-mode emotion recognition method and device |
CN112348075A (en) * | 2020-11-02 | 2021-02-09 | 大连理工大学 | Multi-mode emotion recognition method based on contextual attention neural network |
CN112766173A (en) * | 2021-01-21 | 2021-05-07 | 福建天泉教育科技有限公司 | Multi-mode emotion analysis method and system based on AI deep learning |
CN113158828A (en) * | 2021-03-30 | 2021-07-23 | 华南理工大学 | Facial emotion calibration method and system based on deep learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815785A (en) * | 2018-12-05 | 2019-05-28 | 四川大学 | A kind of face Emotion identification method based on double-current convolutional neural networks |
-
2022
- 2022-04-14 CN CN202210394800.0A patent/CN114724222B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609572A (en) * | 2017-08-15 | 2018-01-19 | 中国科学院自动化研究所 | Multi-modal emotion identification method, system based on neutral net and transfer learning |
KR20190119863A (en) * | 2018-04-13 | 2019-10-23 | 인하대학교 산학협력단 | Video-based human emotion recognition using semi-supervised learning and multimodal networks |
CN109558935A (en) * | 2018-11-28 | 2019-04-02 | 黄欢 | Emotion recognition and exchange method and system based on deep learning |
CN111564164A (en) * | 2020-04-01 | 2020-08-21 | 中国电力科学研究院有限公司 | Multi-mode emotion recognition method and device |
CN112348075A (en) * | 2020-11-02 | 2021-02-09 | 大连理工大学 | Multi-mode emotion recognition method based on contextual attention neural network |
CN112766173A (en) * | 2021-01-21 | 2021-05-07 | 福建天泉教育科技有限公司 | Multi-mode emotion analysis method and system based on AI deep learning |
CN113158828A (en) * | 2021-03-30 | 2021-07-23 | 华南理工大学 | Facial emotion calibration method and system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN114724222A (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114724222B (en) | AI digital human emotion analysis method based on multiple modes | |
CN112906485B (en) | Visual impairment person auxiliary obstacle perception method based on improved YOLO model | |
CN111523462B (en) | Video sequence expression recognition system and method based on self-attention enhanced CNN | |
CN106599800A (en) | Face micro-expression recognition method based on deep learning | |
CN113011357A (en) | Depth fake face video positioning method based on space-time fusion | |
CN109255289B (en) | Cross-aging face recognition method based on unified generation model | |
CN111582397A (en) | CNN-RNN image emotion analysis method based on attention mechanism | |
CN112183334B (en) | Video depth relation analysis method based on multi-mode feature fusion | |
CN110738160A (en) | human face quality evaluation method combining with human face detection | |
CN111666845A (en) | Small sample deep learning multi-mode sign language recognition method based on key frame sampling | |
CN110232564A (en) | A kind of traffic accident law automatic decision method based on multi-modal data | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN112668638A (en) | Image aesthetic quality evaluation and semantic recognition combined classification method and system | |
CN111652307A (en) | Intelligent nondestructive identification method and device for redwood furniture based on convolutional neural network | |
CN113378949A (en) | Dual-generation confrontation learning method based on capsule network and mixed attention | |
CN117149944B (en) | Multi-mode situation emotion recognition method and system based on wide time range | |
CN110222636A (en) | The pedestrian's attribute recognition approach inhibited based on background | |
CN113627550A (en) | Image-text emotion analysis method based on multi-mode fusion | |
CN115797827A (en) | ViT human body behavior identification method based on double-current network architecture | |
CN112560668B (en) | Human behavior recognition method based on scene priori knowledge | |
KR20210011707A (en) | A CNN-based Scene classifier with attention model for scene recognition in video | |
Le Cornu et al. | Voicing classification of visual speech using convolutional neural networks | |
CN111860601A (en) | Method and device for predicting large fungus species | |
CN117591752A (en) | Multi-mode false information detection method, system and storage medium | |
CN116935438A (en) | Pedestrian image re-recognition method based on autonomous evolution of model structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Country or region after: China
Address after: No. 2-206, No. 1399 Liangmu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100
Applicant after: Kangxu Technology Co.,Ltd.
Address before: 310000 2-206, 1399 liangmu Road, Cangqian street, Yuhang District, Hangzhou City, Zhejiang Province
Applicant before: Zhejiang kangxu Technology Co.,Ltd.
Country or region before: China
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |