CN111339913A - Method and device for recognizing emotion of character in video
- Publication number: CN111339913A (application CN202010111614.2A)
- Authority: CN (China)
- Prior art keywords: feature vector, sound, text, feature, face image
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The invention provides a method and a device for recognizing the emotion of a character in a video. A face image in the video, together with the sound spectrogram and subtitle text corresponding to that face image, is extracted, and an image feature vector, a sound feature vector and a text feature vector are derived from them. The three vectors are fused into a multi-modal joint feature vector, which adds sound and text features to the face image features and thereby improves the richness of the character emotion features. The joint feature vector is input into a character emotion recognition model, obtained in advance by training at least one machine learning model with a joint feature vector training data set, so that a more accurate character emotion recognition result can be obtained.
Description
Technical Field
The invention relates to the technical field of video data analysis, in particular to a method and a device for recognizing emotion of a person in a video.
Background
Human emotion recognition is an important component of human-computer interaction and affective computing research. Common and important categories of emotion in video include happiness, anger, disgust, fear, sadness, surprise, etc. Emotion is an important component of video content; by recognizing it, the emotion expressed by a video segment can be analyzed, which in turn enables emotion-related video applications.
Most existing emotion recognition technologies for video rely on facial visual features: faces are detected and located, and the face-region images are analyzed to classify emotion according to their visual features. Although the visual features of face-region images most directly reflect facial emotion, face images in video are subject to many interference factors, such as blurring, poor illumination conditions and angle deviation, so person emotion recognition based on visual features alone has low accuracy.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for recognizing emotion of a person in a video, so as to improve the accuracy of recognizing emotion of the person.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
a method for recognizing emotion of a person in a video comprises the following steps:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the extracting the face image in the video, and the sound spectrogram and the subtitle text corresponding to the face image includes:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the extracting an image feature vector from the face image includes:
the face image is input into an image feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the image feature extraction model is determined as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the extracting the acoustic feature vector from the acoustic spectrogram includes:
the sound spectrogram is input into a sound feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the sound feature extraction model is determined as the sound feature vector; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the extracting a text feature vector from the subtitle text includes:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the vector matrix is input into a pre-trained text feature extraction model and processed, and the feature vector output by the full connection layer in the text feature extraction model is determined as the text feature vector; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the performing feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector includes:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, training the character emotion recognition model includes:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
An emotion recognition apparatus for a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, which is configured to input the face image into an image feature extraction model obtained by pre-training for processing, and to determine the feature vector output by the full connection layer in the image feature extraction model as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, which is configured to input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and to determine the feature vector output by the full connection layer in the sound feature extraction model as the sound feature vector; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the vector matrix is input into a pre-trained text feature extraction model and processed, and the feature vector output by the full connection layer in the text feature extraction model is determined as the text feature vector; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a method for recognizing the emotion of a character in a video, which extracts a face image in the video, a sound frequency spectrogram and a subtitle text corresponding to the face image, on the basis of extracting and obtaining the image characteristic vector, the sound characteristic vector and the text characteristic vector, the image feature vector, the sound feature vector and the text feature vector are subjected to feature fusion to obtain a multi-modal combined feature vector, the multi-modal combined feature vector increases the sound feature vector and the text feature vector relative to the face image features, so that the diversity and the richness of the character emotion features are improved, the combined feature vector is input into a character emotion recognition model obtained by training at least one machine learning model in advance by using a combined feature vector training data set for recognition, and a more accurate character emotion recognition result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for recognizing emotion of a person in a video according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a training method of an emotion recognition model disclosed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion as disclosed in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for recognizing emotion of a person in a video according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a method for recognizing the emotion of a person in a video, which is used for recognizing a multi-modal combined feature vector comprising a face image, sound and a subtitle text extracted from the video and can obtain a more accurate emotion recognition result of the person.
Specifically, referring to fig. 1, the method for recognizing emotion of a person in a video disclosed in this embodiment includes the following steps:
s101: extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
referring to fig. 2, an alternative method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image includes the following steps:
s201: splitting a video into a plurality of video frames, and recording the time of each video frame;
specifically, the video can be read by opencv, and the video is split into a plurality of video frames.
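As a non-authoritative illustration of this step, a minimal Python/OpenCV sketch is given below; the file name "input.mp4" and the use of CAP_PROP_POS_MSEC for the frame time are assumptions of the sketch, not details fixed by the text.

```python
import cv2

def split_video(path):
    # Split a video into frames and record the time of each frame (step S201).
    cap = cv2.VideoCapture(path)
    frames, timestamps = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        timestamps.append(cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0)  # frame time in seconds
        frames.append(frame)
    cap.release()
    return frames, timestamps

frames, times = split_video("input.mp4")  # "input.mp4" is a placeholder path
```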
S202: sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
the face recognition model can be obtained by training a machine learning model by using a face image training sample, and can also be obtained by sequentially recognizing a plurality of video frames by using the existing face recognition model, such as a face classifier carried by an opencv.
Preferably, the video frame can be converted into a gray scale image to improve the speed of face recognition.
The face region of each video frame identified as containing a face is cropped in a 128 × 128 format to obtain the face image.
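A possible realization of this step with OpenCV's bundled Haar face classifier is sketched below; the detector parameters are illustrative assumptions, only the grayscale conversion and the 128 × 128 crop follow the description.

```python
import cv2

# OpenCV's bundled frontal-face Haar classifier (one "face classifier carried by opencv")
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frames, timestamps):
    face_images, face_times = [], []
    for frame, t in zip(frames, timestamps):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale speeds up detection
        boxes = detector.detectMultiScale(gray, 1.1, 5)  # illustrative scaleFactor / minNeighbors
        for (x, y, w, h) in boxes:
            crop = cv2.resize(frame[y:y + h, x:x + w], (128, 128))  # 128 x 128 face image
            face_images.append(crop)
            face_times.append(t)   # time of the video frame containing the face
    return face_images, face_times
```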
S203: intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
it can be understood that the video frames containing the face images in the video are a plurality of continuous video frames, each video frame corresponds to a time, the plurality of continuous video frames correspond to a time period, and then the sound segments in the time period in the video can be intercepted.
S204: carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
Spectrum analysis is performed on the sound segment: the spectrum is quantized into 128 frequency bands, every 128 sampling points form a sampling group, so with 0.02 seconds per sample each sampling segment lasts 0.02 s × 128 = 2.56 s, and a 128-dimensional spectral response image, namely the sound spectrogram, is formed.
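The text does not name a tool for this spectrum analysis; as one hedged example, librosa can produce a 128-band (mel) spectrogram for a 2.56 s sound segment. The sampling rate and the use of a mel scale are assumptions of this sketch.

```python
import numpy as np
import librosa

def sound_spectrogram(wav_path, sr=16000, duration=2.56):
    # Load a 2.56 s sound segment and quantize its spectrum into 128 frequency bands.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(spec, ref=np.max)  # log-scaled 128-band spectrogram
```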
S205: and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
The pre-constructed caption detection model can be any one of the existing caption detection models.
S102: extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
Specifically, the face image is input into an image feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the image feature extraction model is determined as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, which comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Likewise, the sound spectrogram is input into a sound feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the sound feature extraction model is determined as the sound feature vector; the sound feature extraction model is obtained by training a preset deep convolutional neural network model with the same structure: a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model and processing the words, wherein each word is represented by a K-dimensional vector, so that N words yield an N × K-dimensional vector matrix;
the vector matrix is input into a pre-trained text feature extraction model and processed, and the feature vector output by the full connection layer in the text feature extraction model is determined as the text feature vector; the text feature extraction model is obtained by training a preset text convolutional neural network model, which comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
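A sketch of this text branch is given below, assuming gensim for word2vec and Keras for the text convolutional network; the embedding size K, caption length N, model file name and layer sizes are assumptions, only the single full connection layer, the dropout layer in front of it and the softmax layer behind it follow the description.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

K, N = 100, 50                          # assumed embedding size and max words per caption
w2v = Word2Vec.load("word2vec.model")   # hypothetical pre-trained word2vec model (vector_size == K)

def caption_matrix(words):
    # Map the segmented caption words to an N x K vector matrix (pad / truncate to N rows).
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    mat = np.zeros((N, K), dtype="float32")
    if vecs:
        mat[:min(N, len(vecs))] = np.asarray(vecs)[:N]
    return mat

text_cnn = keras.Sequential([
    keras.layers.Conv1D(256, 3, activation="relu", input_shape=(N, K)),
    keras.layers.GlobalMaxPooling1D(),                      # pooling layer
    keras.layers.Dropout(0.5),                              # dropout layer before the FC layer
    keras.layers.Dense(512, activation="relu", name="fc"),  # single full connection layer
    keras.layers.Dense(6, activation="softmax"),            # softmax layer over emotion labels
])
# After training on the text emotion data set, the 512-dim "fc" output is the text feature vector.
text_feature_model = keras.Model(text_cnn.inputs, text_cnn.get_layer("fc").output)
```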
In the feature extraction models of this embodiment, the multiple fully connected layers of a traditional convolutional neural network are replaced by a single full connection layer, a softmax layer is added directly behind it, and a hybrid model combining a residual structure and an Inception structure is used. The input data are processed with Batch Normalization, the pooling layer uses global average pooling, and a Dropout layer is added in front of the full connection layer. The Dropout layer effectively mitigates overfitting and, to a certain extent, acts as regularization: because any two neurons do not necessarily appear together in the same thinned Dropout network, weight updates no longer depend on the joint action of hidden nodes with fixed relationships. This prevents certain features from being effective only in the presence of other specific features, forces the network to learn more robust features, and increases the robustness of the model.
The image feature vector, the sound feature vector and the text feature vector are each 512-dimensional feature vectors output by the full connection layer of the corresponding model.
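As an illustrative sketch rather than the patented architecture itself, a Keras model with global average pooling, a dropout layer, a single 512-unit full connection layer and a softmax head can serve as such a feature extractor for the face images and sound spectrograms; ResNet50 stands in here for the residual/Inception hybrid backbone, and the input shapes and class count are assumptions.

```python
from tensorflow import keras

def build_extractor(input_shape, num_classes=6):
    # ResNet50 is only a stand-in backbone; weights are trained from scratch on the emotion data.
    backbone = keras.applications.ResNet50(include_top=False, weights=None,
                                           input_shape=input_shape)
    x = keras.layers.GlobalAveragePooling2D()(backbone.output)      # global average pooling
    x = keras.layers.Dropout(0.5)(x)                                # dropout before the FC layer
    fc = keras.layers.Dense(512, activation="relu", name="fc")(x)   # single full connection layer
    out = keras.layers.Dense(num_classes, activation="softmax")(fc) # softmax layer
    classifier = keras.Model(backbone.input, out)  # trained on the labelled emotion data set
    extractor = keras.Model(backbone.input, fc)    # outputs the 512-dim feature vector
    return classifier, extractor

image_classifier, image_extractor = build_extractor((128, 128, 3))  # 128 x 128 face images
sound_classifier, sound_extractor = build_extractor((128, 128, 3))  # assumed spectrogram shape
```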
S103: performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
The image feature vector, the sound feature vector and the text feature vector are fused to obtain a 512 × 3 = 1536-dimensional feature vector.
Preferably, in order to reduce the amount of data processed by the character emotion recognition model, the PCA method packaged in the sklearn (scikit-learn) tool library can be used with the parameter n_components set to 768 to perform dimension reduction on the fused feature vector, yielding a 768-dimensional feature vector; the feature vector obtained after dimension reduction is then normalized to obtain the three-channel joint feature vector.
The normalization here is max-min normalization, a linear transformation of the raw data: with min_A and max_A denoting the minimum and maximum values of attribute A, a raw value x is mapped to a value x' in the interval [0, 1] by the formula x' = (x − min_A) / (max_A − min_A).
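A minimal scikit-learn sketch of the fusion, PCA dimension reduction and max-min normalization steps follows; the final three-channel reshape is only one possible reading of the "three-channel" joint feature vector and is an assumption of the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def fuse_features(img_feats, snd_feats, txt_feats):
    # img_feats / snd_feats / txt_feats: arrays of shape (n_samples, 512)
    joint = np.concatenate([img_feats, snd_feats, txt_feats], axis=1)  # (n, 1536)
    joint = PCA(n_components=768).fit_transform(joint)  # in practice, fit PCA on the training set
    joint = MinMaxScaler().fit_transform(joint)         # max-min normalization to [0, 1]
    return joint.reshape(-1, 3, 256)  # assumed layout of the "three-channel" joint feature vector
```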
s104: and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result.
The character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a human face image.
On the basis, calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector, wherein the method specifically comprises the following steps: respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models; and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
For example, the character emotion recognition model includes 3 sub-recognition models C1, C2, and C3, where the recognition result of the sub-recognition model C1 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C2 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C3 on the joint feature vector is emotion label L2, and the final character emotion recognition result output by the character emotion recognition model is emotion label L1.
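A minimal sketch of this majority-voting step, assuming the sub-recognition models expose a scikit-learn-style predict method and that the joint feature vector is passed in flattened form:

```python
from collections import Counter

def ensemble_predict(sub_models, joint_vector):
    # joint_vector: the (flattened) joint feature vector; each sub-model returns an emotion label.
    labels = [model.predict([joint_vector])[0] for model in sub_models]
    return Counter(labels).most_common(1)[0][0]  # the most frequent emotion label wins
```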
Further, referring to fig. 3, the embodiment also discloses a method for training a character emotion recognition model, which specifically includes the following steps:
s301: acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
the open source emotion data set of the face image can be an Asian face database (KFDB) and comprises 1000 artificial 52000 multi-pose, multi-illumination and multi-expression face images, wherein the images with changed poses and illumination are acquired under the condition of strict control.
The open source emotion data set of sound can be the CASIA Chinese emotional speech corpus, recorded by professional speakers and covering six emotions — anger, happiness, fear, sadness, surprise and neutral — with 9,600 differently pronounced sound segments.
The open source emotion data set of text can be a Chinese conversation emotion data set covering the emotion words of more than 40,000 Chinese instances.
S302: carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
Image transformations used for data enhancement include, but are not limited to, scaling, rotation, flipping, warping, erasing, shearing, perspective, blurring, or combinations thereof; the enhancement process augments the amount of data.
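A few of these transformations could be realized with OpenCV as below; the concrete parameters (rotation angle, blur kernel, erased region) are illustrative assumptions, not values given in the text.

```python
import cv2

def augment(img):
    # Generate a handful of augmented variants of one face image.
    h, w = img.shape[:2]
    out = [img, cv2.flip(img, 1)]                          # original + horizontal flip
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.1)   # rotate 10 degrees and scale 1.1x
    out.append(cv2.warpAffine(img, M, (w, h)))
    out.append(cv2.GaussianBlur(img, (5, 5), 0))           # blurring
    erased = img.copy()
    erased[h // 4:h // 2, w // 4:w // 2] = 0               # erasing-style mask over one region
    out.append(erased)
    return out
```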
S303: carrying out spectrum analysis on sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
s304: extracting an image feature vector set from a face image data set, extracting a sound feature vector set from a sound spectrogram data set, and extracting a text feature vector set from an open source emotion data set of a text;
s305: respectively carrying out feature fusion on feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
referring to fig. 4, fig. 4 is a schematic diagram of multi-modal feature fusion including face images, sounds and text.
S306: and respectively training at least one machine learning model by utilizing the joint feature vector training data set to obtain a character emotion recognition model.
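The text does not fix which machine learning models are used; purely as an illustration, the following sketch trains several scikit-learn classifiers on the joint feature vector training set to serve as the sub-recognition models.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def train_sub_models(X_joint, y_emotion):
    # X_joint: joint feature vector training set, e.g. shape (n_samples, 768)
    # y_emotion: emotion labels aligned with X_joint
    sub_models = [SVC(), RandomForestClassifier(), LogisticRegression(max_iter=1000)]
    for model in sub_models:
        model.fit(X_joint, y_emotion)
    return sub_models  # combined at inference time by majority voting (see S104 above)
```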
In this embodiment, automatically generating the training data set saves a large amount of manual labeling cost; the method is flexibly extensible, and face transformations such as scaling, masking, rotation and shearing can be added conveniently.
Based on the method for recognizing the emotion of a person in a video disclosed in the above embodiments, this embodiment correspondingly discloses a device for recognizing the emotion of a person in a video, please refer to fig. 5, and the device includes:
the multi-modal data extraction unit 501 is configured to extract a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image;
a multi-modal feature extraction unit 502, configured to extract image feature vectors from the face image, extract sound feature vectors from the sound spectrogram, and extract text feature vectors from the subtitle text;
a feature fusion unit 503, configured to perform feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector;
and an emotion recognition unit 504, configured to invoke a pre-trained character emotion recognition model, and process the joint feature vector to obtain a character emotion recognition result, where the character emotion recognition model is obtained by training at least one machine learning model using a joint feature vector training data set including an image feature vector, a sound feature vector, and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, which is configured to input the face image into an image feature extraction model obtained by pre-training for processing, and to determine the feature vector output by the full connection layer in the image feature extraction model as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, which is configured to input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and to determine the feature vector output by the full connection layer in the sound feature extraction model as the sound feature vector; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the vector matrix is input into a pre-trained text feature extraction model and processed, and the feature vector output by the full connection layer in the text feature extraction model is determined as the text feature vector; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
The device for recognizing the emotion of a person in a video disclosed in this embodiment extracts a face image in the video together with the sound spectrogram and subtitle text corresponding to that face image, derives an image feature vector, a sound feature vector and a text feature vector from them, and fuses the three into a multi-modal joint feature vector. Compared with face image features alone, the joint feature vector adds sound and text features, improving the diversity and richness of the character emotion features. The joint feature vector is input into a character emotion recognition model, obtained in advance by training at least one machine learning model with a joint feature vector training data set, so that a more accurate character emotion recognition result can be obtained.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for recognizing emotion of a person in a video, the method comprising:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
2. The method of claim 1, wherein the extracting of the face image and the audio spectrogram and subtitle text corresponding to the face image in the video comprises:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
3. The method of claim 1, wherein the extracting image feature vectors from the face image comprises:
the face image is input into an image feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the image feature extraction model is determined as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
4. The method of claim 1, wherein the extracting of the acoustic feature vector from the acoustic spectrogram comprises:
the sound spectrogram is input into a sound feature extraction model obtained by pre-training and processed, and the feature vector output by the full connection layer in the sound feature extraction model is determined as the sound feature vector; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
5. The method of claim 1, wherein extracting text feature vectors from the subtitle text comprises:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the vector matrix is input into a pre-trained text feature extraction model and processed, and the feature vector output by the full connection layer in the text feature extraction model is determined as the text feature vector; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
6. The method according to claim 1, wherein the feature fusing the image feature vector, the sound feature vector and the text feature vector to obtain a joint feature vector comprises:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
7. The method of claim 1, wherein training the character emotion recognition model comprises:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
8. The method of claim 1, wherein the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
9. An apparatus for recognizing emotion of a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
10. The apparatus according to claim 9, wherein the multimodal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010111614.2A CN111339913A (en) | 2020-02-24 | 2020-02-24 | Method and device for recognizing emotion of character in video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111339913A (en) | 2020-06-26 |
Family
ID=71185495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010111614.2A (Pending) | Method and device for recognizing emotion of character in video | 2020-02-24 | 2020-02-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111339913A (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010133661A1 (en) * | 2009-05-20 | 2010-11-25 | Tessera Technologies Ireland Limited | Identifying facial expressions in acquired digital images |
CN105740758A (en) * | 2015-12-31 | 2016-07-06 | 上海极链网络科技有限公司 | Internet video face recognition method based on deep learning |
US20180330152A1 (en) * | 2017-05-11 | 2018-11-15 | Kodak Alaris Inc. | Method for identifying, ordering, and presenting images according to expressions |
CN108255307A (en) * | 2018-02-08 | 2018-07-06 | 竹间智能科技(上海)有限公司 | Man-machine interaction method, system based on multi-modal mood and face's Attribute Recognition |
CN108764010A (en) * | 2018-03-23 | 2018-11-06 | 姜涵予 | Emotional state determines method and device |
CN109308466A (en) * | 2018-09-18 | 2019-02-05 | 宁波众鑫网络科技股份有限公司 | The method that a kind of pair of interactive language carries out Emotion identification |
CN109614895A (en) * | 2018-10-29 | 2019-04-12 | 山东大学 | A method of the multi-modal emotion recognition based on attention Fusion Features |
CN109902632A (en) * | 2019-03-02 | 2019-06-18 | 西安电子科技大学 | A kind of video analysis device and video analysis method towards old man's exception |
CN110084266A (en) * | 2019-03-11 | 2019-08-02 | 中国地质大学(武汉) | A kind of dynamic emotion identification method based on audiovisual features depth integration |
CN110414323A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Mood detection method, device, electronic equipment and storage medium |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111798849A (en) * | 2020-07-06 | 2020-10-20 | 广东工业大学 | Robot instruction identification method and device, electronic equipment and storage medium |
CN112183334A (en) * | 2020-09-28 | 2021-01-05 | 南京大学 | Video depth relation analysis method based on multi-modal feature fusion |
CN112183334B (en) * | 2020-09-28 | 2024-03-22 | 南京大学 | Video depth relation analysis method based on multi-mode feature fusion |
CN112233698A (en) * | 2020-10-09 | 2021-01-15 | 中国平安人寿保险股份有限公司 | Character emotion recognition method and device, terminal device and storage medium |
CN112233698B (en) * | 2020-10-09 | 2023-07-25 | 中国平安人寿保险股份有限公司 | Character emotion recognition method, device, terminal equipment and storage medium |
CN112487937B (en) * | 2020-11-26 | 2022-12-06 | 北京有竹居网络技术有限公司 | Video identification method and device, storage medium and electronic equipment |
CN112487937A (en) * | 2020-11-26 | 2021-03-12 | 北京有竹居网络技术有限公司 | Video identification method and device, storage medium and electronic equipment |
CN112364829A (en) * | 2020-11-30 | 2021-02-12 | 北京有竹居网络技术有限公司 | Face recognition method, device, equipment and storage medium |
CN112464958A (en) * | 2020-12-11 | 2021-03-09 | 沈阳芯魂科技有限公司 | Multi-modal neural network information processing method and device, electronic equipment and medium |
CN112669876A (en) * | 2020-12-18 | 2021-04-16 | 平安科技(深圳)有限公司 | Emotion recognition method and device, computer equipment and storage medium |
CN112699774B (en) * | 2020-12-28 | 2024-05-24 | 深延科技(北京)有限公司 | Emotion recognition method and device for characters in video, computer equipment and medium |
CN112699774A (en) * | 2020-12-28 | 2021-04-23 | 深延科技(北京)有限公司 | Method and device for recognizing emotion of person in video, computer equipment and medium |
CN113158727A (en) * | 2020-12-31 | 2021-07-23 | 长春理工大学 | Bimodal fusion emotion recognition method based on video and voice information |
CN112861949B (en) * | 2021-01-29 | 2023-08-04 | 成都视海芯图微电子有限公司 | Emotion prediction method and system based on face and sound |
CN112861949A (en) * | 2021-01-29 | 2021-05-28 | 成都视海芯图微电子有限公司 | Face and voice-based emotion prediction method and system |
CN113033647A (en) * | 2021-03-18 | 2021-06-25 | 网易传媒科技(北京)有限公司 | Multimodal feature fusion method, device, computing equipment and medium |
CN113139525B (en) * | 2021-05-21 | 2022-03-01 | 国家康复辅具研究中心 | Multi-source information fusion-based emotion recognition method and man-machine interaction system |
CN113139525A (en) * | 2021-05-21 | 2021-07-20 | 国家康复辅具研究中心 | Multi-source information fusion-based emotion recognition method and man-machine interaction system |
CN113536999A (en) * | 2021-07-01 | 2021-10-22 | 汇纳科技股份有限公司 | Character emotion recognition method, system, medium and electronic device |
CN113420733B (en) * | 2021-08-23 | 2021-12-31 | 北京黑马企服科技有限公司 | Efficient distributed big data acquisition implementation method and system |
CN113420733A (en) * | 2021-08-23 | 2021-09-21 | 北京黑马企服科技有限公司 | Efficient distributed big data acquisition implementation method and system |
CN113837072A (en) * | 2021-09-24 | 2021-12-24 | 厦门大学 | Method for sensing emotion of speaker by fusing multidimensional information |
WO2024000867A1 (en) * | 2022-06-30 | 2024-01-04 | 浪潮电子信息产业股份有限公司 | Emotion recognition method and apparatus, device, and storage medium |
CN118380020A (en) * | 2024-06-21 | 2024-07-23 | 吉林大学 | Method for identifying emotion change of interrogation object based on multiple modes |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111339913A (en) | Method and device for recognizing emotion of character in video | |
Borde et al. | Recognition of isolated words using Zernike and MFCC features for audio visual speech recognition | |
US7515770B2 (en) | Information processing method and apparatus | |
Dabre et al. | Machine learning model for sign language interpretation using webcam images | |
Sahoo et al. | Emotion recognition from audio-visual data using rule based decision level fusion | |
CN108648760B (en) | Real-time voiceprint identification system and method | |
CN112784696A (en) | Lip language identification method, device, equipment and storage medium based on image identification | |
CN111950497A (en) | AI face-changing video detection method based on multitask learning model | |
CN113223560A (en) | Emotion recognition method, device, equipment and storage medium | |
Jachimski et al. | A comparative study of English viseme recognition methods and algorithms | |
CN114639150A (en) | Emotion recognition method and device, computer equipment and storage medium | |
CN113128284A (en) | Multi-mode emotion recognition method and device | |
Jaratrotkamjorn et al. | Bimodal emotion recognition using deep belief network | |
Tsai et al. | Sentiment analysis of pets using deep learning technologies in artificial intelligence of things system | |
Kuang et al. | Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks | |
CN111462762A (en) | Speaker vector regularization method and device, electronic equipment and storage medium | |
Birara et al. | Augmenting machine learning for Amharic speech recognition: a paradigm of patient’s lips motion detection | |
CN117172853A (en) | Automatic advertisement generation system, method and computer storage medium | |
Abel et al. | Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system | |
Bhukhya et al. | Virtual Assistant and Navigation for Visually Impaired using Deep Neural Network and Image Processing | |
Xu et al. | Gabor based lipreading with a new audiovisual mandarin corpus | |
CN113642446A (en) | Detection method and device based on face dynamic emotion recognition | |
Brahme et al. | Marathi digit recognition using lip geometric shape features and dynamic time warping | |
Mattos et al. | Towards view-independent viseme recognition based on CNNs and synthetic data | |
Upadhyaya et al. | Block energy based visual features using histogram of oriented gradient for bimodal hindi speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200626 |