
WO2020125386A1 - Expression recognition method, apparatus, computer device and storage medium - Google Patents

Expression recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2020125386A1
Authority
WO
WIPO (PCT)
Prior art keywords
expression
recognition result
classifier
expression recognition
facial
Prior art date
Application number
PCT/CN2019/122313
Other languages
English (en)
French (fr)
Inventor
郑子奇
徐国强
邱寒
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020125386A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • The present application relates to an expression recognition method, an apparatus, a computer device, and a storage medium.
  • Traditional expression recognition methods determine the user's expression type mainly from recorded images and video, and therefore place certain requirements on picture quality and the shooting scene.
  • The inventor realized that when the picture quality or the shooting scene does not meet these requirements, and when the user's facial movements are not expressive enough, expressions are easily missed, resulting in low expression recognition accuracy.
  • According to various embodiments disclosed in the present application, an expression recognition method, an apparatus, a computer device, and a storage medium are provided.
  • An expression recognition method includes:
  • acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;
  • inputting the audio features, the text information, and the facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and
  • selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • An expression recognition device includes:
  • a data acquisition module, configured to acquire video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;
  • An expression acquisition module configured to input the audio features, text information, and facial features into corresponding expression classifiers, respectively, to obtain expression recognition results output by each of the expression classifiers and weights corresponding to the expression recognition results;
  • the expression screening module is configured to screen out the expression recognition result with the largest weight from the output expression recognition results as the expression category of the user corresponding to the user identification.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • When the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;
  • inputting the audio features, the text information, and the facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and
  • selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;
  • inputting the audio features, the text information, and the facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and
  • selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • FIG. 1 is an application scene diagram of an expression recognition method according to one or more embodiments;
  • FIG. 2 is a schematic flowchart of an expression recognition method according to one or more embodiments;
  • FIG. 3 is a schematic flowchart of the steps of obtaining the expression recognition results output by the expression classifiers according to an embodiment;
  • FIG. 4 is a schematic flowchart of an expression recognition method according to another embodiment;
  • FIG. 5 is a block diagram of an expression recognition device according to one or more embodiments; and
  • FIG. 6 is a block diagram of a computer device according to one or more embodiments.
  • the expression recognition method provided by this application can be applied to the application environment shown in FIG. 1.
  • the terminal 110 communicates with the server 120 via the network.
  • a video recording system is installed in the terminal 110, and the video recording system can upload the recorded video data and audio data to the server 120; the video data and audio data carry the same user ID.
  • The server 120 extracts audio features and text information from the audio data, and extracts facial features corresponding to the user's face image from the video data; it inputs the extracted audio features, text information, and facial features into the corresponding expression classifiers respectively, obtains the expression recognition result output by each expression classifier and the weight corresponding to each result, and selects, from the output results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • the terminal 110 may be, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers.
  • the server 120 may be implemented by an independent server or a server cluster composed of multiple servers.
  • an expression recognition method is provided. Taking the method applied to the server in FIG. 1 as an example for illustration, it includes the following steps:
  • Step S201 Obtain to-be-processed video data and audio data.
  • the video data and audio data both carry the same user identifier.
  • the audio data includes audio features and text information.
  • the video data includes facial features corresponding to the user's face image.
  • Video data and audio data refer to the data recorded of the user through the video recording system in the terminal during sessions such as video reviews and interviews.
  • The user identifier is used to label the video data and audio data so that the server can tell them apart; the user identifier can be the user's name, ID card number, and so on.
  • Audio features refer to feature information used to analyze the user's expression category, such as volume and frequency response; text information refers to information converted from the audio by speech recognition technology; facial features are extracted from the face images in the video data and are feature information used to analyze the user's expression category, such as eye features and mouth features. There are many types of user expressions, such as sadness, joy, and disgust.
  • the terminal (such as a smart phone) is installed with a video recording system.
  • the video recording system can upload video data and audio data recorded in user interviews, interviews, etc. as pending video data and audio data to the server.
  • The server extracts audio features and text information from the audio data to be processed, extracts the user's face images from the video data to be processed, and extracts facial features from the extracted face images, so that the extracted audio features, text information, and facial features can subsequently be fed into the corresponding expression classifiers and the user's expression category can be judged comprehensively from multiple angles, which further improves the accuracy of expression recognition.
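Purely as an illustrative sketch (not part of the patent), the three inputs derived from one recording could be organized as below; `transcribe_speech` and `extract_face_features` are stand-in placeholders for whichever speech-recognition and face-analysis components are actually used, and the feature names are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractedInputs:
    """Three inputs derived from one recording, keyed by the shared user identifier."""
    user_id: str
    audio_features: Dict[str, float]          # e.g. volume, frequency response, pitch
    text_information: str                     # transcript produced by speech recognition
    facial_features: List[Dict[str, float]]   # one feature dict per extracted face image


def transcribe_speech(audio_samples: List[float]) -> str:
    # Placeholder for a real speech-recognition call.
    return ""


def extract_face_features(video_frames: List[object]) -> List[Dict[str, float]]:
    # Placeholder for face detection plus feature extraction on each frame.
    return [{"mouth_corner_lift": 0.0, "eyelid_contraction": 0.0} for _ in video_frames]


def preprocess(user_id: str, audio_samples: List[float], video_frames: List[object]) -> ExtractedInputs:
    audio_features = {
        "mean_volume": sum(abs(s) for s in audio_samples) / max(len(audio_samples), 1),
        "mean_pitch_hz": 0.0,  # placeholder; a real system would estimate pitch here
    }
    return ExtractedInputs(
        user_id=user_id,
        audio_features=audio_features,
        text_information=transcribe_speech(audio_samples),
        facial_features=extract_face_features(video_frames),
    )
```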
  • Step S202 Input audio features, text information, and facial features into corresponding expression classifiers respectively, and obtain expression recognition results output by each expression classifier and weights corresponding to the expression recognition results.
  • the expression classifier can output the corresponding expression recognition results based on the input information.
  • For example, the expression classifier based on speech recognition can output the user's expression category from the input audio features; the expression classifier based on text recognition can output the user's expression category from the input text information; and the expression classifier based on face recognition can output the user's expression category from the input facial features.
  • Weights measure the importance of the expression recognition results; different expression recognition results carry different weights. Note that, for the same expression classifier, the weight of the expression recognition result output each time differs and depends on the input information.
  • The server inputs the audio features, text information, and facial features into the corresponding expression classifiers respectively and obtains the expression recognition result output by each classifier together with its weight. This makes it possible to judge the user's expression category comprehensively from the three perspectives of speech, text, and vision, provides more diverse expression recognition results from which the user's expression category can be analyzed, further improves the accuracy of expression recognition, and avoids the defect of traditional methods, which judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
  • Step S203 From the output facial expression recognition results, the facial expression recognition result with the largest weight is selected as the facial expression category of the user corresponding to the user identification.
  • The server selects, from the expression recognition results output by the classifiers, the result with the largest weight as the expression category of the user corresponding to the user identifier. This achieves the goal of judging the user's expression category comprehensively from multiple angles, avoids missed judgments, and further improves the accuracy and stability of expression recognition. Note that if the weights of the individual expression recognition results are equal, the result output by the face-recognition-based expression classifier prevails. A minimal sketch of this selection step is given below.
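The following illustration (not code from the patent) shows the weight-based selection of steps S202-S203, including the tie-break in favor of the face-recognition classifier; the classifier names and the (category, weight) tuple format are assumptions made for the example.

```python
from typing import Dict, Tuple

# Each classifier is assumed to return (expression_category, weight).
ClassifierResult = Tuple[str, float]


def select_expression(results: Dict[str, ClassifierResult]) -> str:
    """Pick the expression category whose recognition result carries the largest weight.

    `results` maps a classifier name ("audio", "text", "face") to its (category, weight)
    output. If all weights are equal, the face-recognition result prevails, as described above.
    """
    weights = [weight for _, weight in results.values()]
    if len(set(weights)) == 1 and "face" in results:
        return results["face"][0]
    best_classifier = max(results, key=lambda name: results[name][1])
    return results[best_classifier][0]


# Example: the face classifier carries the largest weight, so its category is returned.
outputs = {"audio": ("fear", 0.4), "text": ("neutral", 0.3), "face": ("fear", 0.8)}
assert select_expression(outputs) == "fear"
```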
  • In the above expression recognition method, the server obtains video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image; the audio features, text information, and facial features are input into the corresponding expression classifiers respectively to obtain the expression recognition result output by each classifier and the weight corresponding to each result; and, from the output results, the expression recognition result with the largest weight is selected as the expression category of the user corresponding to the user identifier. This achieves the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision, provides more diverse expression recognition results from which the user's expression category can be analyzed comprehensively, avoids missed judgments, further improves the accuracy and stability of expression recognition, and overcomes the defect of traditional methods that judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
  • Considering that different expression classifiers take different input information, in order to distinguish them the server may divide the expression classifiers into a first expression classifier, a second expression classifier, and a third expression classifier.
  • In one of the embodiments, the step of inputting the audio features, text information, and facial features into the corresponding expression classifiers and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each result specifically includes:
  • Step S301 Input audio features into the first expression classifier to obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • Step S302 input text information into the second expression classifier, and obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • Step S303 Input facial features into the third expression classifier to obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • the first expression classifier is an expression classifier based on speech recognition, and can output an expression category corresponding to the audio feature as the user's expression category according to the input audio feature.
  • the second expression classifier is an expression classifier based on character recognition, and can output an expression type corresponding to the text information as the user's expression type according to the input text information.
  • the third expression classifier is an expression classifier based on face recognition, and can output an expression category corresponding to the facial feature as the user's expression category according to the input facial features.
  • The server inputs the audio features, text information, and facial features into the corresponding expression classifiers so as to judge the user's expression category comprehensively from the three perspectives of speech, text, and vision, which avoids missed judgments and further improves the accuracy of expression recognition; at the same time, judging the user's expression category from multiple angles improves the stability of expression recognition. A minimal sketch of a common interface the three classifiers could share is given below.
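The patent describes the behaviour of the three classifiers but not an API; the interface below is therefore an assumption, sketched only to make the later examples concrete.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class ExpressionClassifier(ABC):
    """Assumed common shape of the first, second, and third expression classifiers:
    each takes its own modality of input and returns (expression_category, weight)."""

    @abstractmethod
    def classify(self, inputs) -> Tuple[str, float]:
        ...


class AudioExpressionClassifier(ExpressionClassifier):   # first classifier: audio features
    def classify(self, audio_features: dict) -> Tuple[str, float]:
        raise NotImplementedError  # see the pitch-lookup sketch further below


class TextExpressionClassifier(ExpressionClassifier):    # second classifier: text information
    def classify(self, text_information: str) -> Tuple[str, float]:
        raise NotImplementedError  # see the keyword/negation sketch further below


class FaceExpressionClassifier(ExpressionClassifier):    # third classifier: facial features
    def classify(self, facial_features: list) -> Tuple[str, float]:
        raise NotImplementedError  # see the multi-frame sketch further below
```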
  • In one of the embodiments, the step of inputting the audio features into the first expression classifier and obtaining the expression recognition result of the first expression classifier and the first weight corresponding to that result includes: inputting the audio features into the first expression classifier, where the first expression classifier is used to extract a target feature from the audio features, query the first database according to the target feature, obtain the expression category corresponding to the target feature as the expression recognition result, and determine the first weight corresponding to the expression recognition result; and obtaining the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • the target feature refers to a feature that matches the set audio feature (such as tone) among the input audio features.
  • the first database stores a plurality of expression categories corresponding to audio features.
  • Based on big data, the server collects in advance a number of different audio features and the expression categories corresponding to those audio features, and extracts target features from the audio features; the expression category corresponding to an audio feature is taken as the expression category corresponding to the target feature extracted from that audio feature, so that multiple expression categories corresponding to target features are obtained and stored in the first database, which makes it convenient to subsequently obtain the expression category corresponding to a target feature by querying the first database.
  • For example, fear is often accompanied by screaming, so the average pitch is high; therefore, the first expression classifier extracts pitch as the target feature from the audio features, and when a high pitch is recognized it can be judged that the user's expression category is fear.
  • Through the first expression classifier, the expression category corresponding to the input audio features can be obtained; combined with speech recognition technology, the user's current expression category is analyzed from the audio features of the user's voice, which further improves the accuracy of expression recognition and avoids the low accuracy caused by missed judgments. A sketch of such a pitch-based lookup is shown below.
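Purely as an illustration of the database-lookup behaviour described above: the pitch thresholds, categories, and weights below are invented for the example and are not values from the patent.

```python
from typing import Dict, Tuple

# "First database" stand-in: a toy mapping from a discretized target feature (pitch band)
# to an expression category and the weight that result would carry.
FIRST_DATABASE: Dict[str, Tuple[str, float]] = {
    "high_pitch": ("fear", 0.7),
    "low_pitch": ("sadness", 0.5),
    "normal_pitch": ("neutral", 0.3),
}


def classify_audio(audio_features: Dict[str, float]) -> Tuple[str, float]:
    """First expression classifier sketch: extract the target feature (pitch),
    look up the matching expression category, and return it with its weight."""
    pitch = audio_features.get("mean_pitch_hz", 0.0)
    if pitch > 300.0:          # invented threshold for "high average pitch"
        band = "high_pitch"
    elif pitch < 120.0:
        band = "low_pitch"
    else:
        band = "normal_pitch"
    return FIRST_DATABASE[band]


# A scream-like, high-pitched sample maps to fear in this toy database.
assert classify_audio({"mean_pitch_hz": 420.0}) == ("fear", 0.7)
```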
  • the first expression classifier may be trained multiple times.
  • In one of the embodiments, the first expression classifier is obtained as follows: acquire multiple sample audio features and the corresponding expression categories; recognize the sample audio features with the first expression classifier to be trained to obtain the expression recognition result of the first expression classifier; compare the expression recognition result with the corresponding actual expression category to obtain a recognition error; and, when the recognition error is greater than or equal to a preset first threshold, train the first expression classifier according to the recognition error until the recognition error obtained with the trained first expression classifier is less than the preset first threshold, at which point training ends.
  • For example, when the recognition error is greater than or equal to the preset first threshold, the server adjusts the parameters of the first expression classifier according to the recognition error; it then recognizes the sample audio features again with the adjusted first expression classifier, obtains the recognition error between the resulting expression recognition result and the corresponding actual expression category, and readjusts the parameters according to that error, retraining the first expression classifier until the recognition error obtained with it is less than the preset first threshold, at which point training ends.
  • By training the first expression classifier multiple times according to the recognition error, the server makes it easier for the first expression classifier to output more accurate expression recognition results and to avoid missed judgments, thereby further improving its recognition accuracy. A minimal sketch of such a training loop is given below.
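The error-threshold training loop described above could be sketched as follows; the classifier object, its `recognize`/`adjust` methods, and the error measure are placeholders chosen for illustration, not the patent's actual implementation.

```python
from typing import Callable, List, Tuple

Sample = Tuple[dict, str]  # (sample audio features, actual expression category)


def train_until_error_below_threshold(
    classifier,                       # object exposing recognize(features) and adjust(error)
    samples: List[Sample],
    error_fn: Callable[[List[str], List[str]], float],
    first_threshold: float,
    max_rounds: int = 100,
) -> None:
    """Repeat: recognize the samples, compare with the actual categories to get a
    recognition error, adjust the classifier parameters, and stop once the error
    drops below the preset first threshold (or a safety cap on rounds is hit)."""
    for _ in range(max_rounds):
        predictions = [classifier.recognize(features) for features, _ in samples]
        actual = [category for _, category in samples]
        recognition_error = error_fn(predictions, actual)
        if recognition_error < first_threshold:
            break                      # trained classifier is accurate enough
        classifier.adjust(recognition_error)


def misclassification_rate(predicted: List[str], actual: List[str]) -> float:
    # One simple choice of recognition error: fraction of samples classified wrongly.
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / max(len(actual), 1)
```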
  • In one of the embodiments, the step of inputting the text information into the second expression classifier and obtaining the expression recognition result of the second expression classifier and the second weight corresponding to that result includes: inputting the text information into the second expression classifier, where the second expression classifier is used to extract target information from the text information, query the second database according to the target information, obtain the expression category corresponding to the target information as the expression recognition result, and determine the second weight corresponding to the expression recognition result; and obtaining the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • Target information refers to information containing emotions extracted from input text information, such as happiness and anger.
  • the second database stores a plurality of expression categories corresponding to the text information.
  • Based on big data, the server collects in advance the voice information of multiple different users, converts the voice information into text information, extracts target information from the text information, and determines the expression category corresponding to the target information; the multiple expression categories corresponding to target information are stored in the second database, which makes it convenient to subsequently obtain the expression category corresponding to a piece of target information by querying the second database.
  • the text information "happy” usually represents happiness, so the second expression classifier extracts the target information from the text information; when the target information is identified as "happy", the user's expression category can be determined to be happy.
  • Through the second expression classifier, the expression category corresponding to the input text information can be obtained to determine the user's current expression category, which further improves the accuracy of expression recognition and avoids the low accuracy caused by missed judgments.
  • Further, the second expression classifier can also extract the target information from the text information and determine the context associated with the target information within the text; according to the target information and its context, it determines the actual meaning of the target information, queries the second database according to that actual meaning, obtains the corresponding expression category as the expression recognition result, and determines the second weight corresponding to the result. For example, the target information "happy" is extracted from the text "Do you think I can be happy when something like this happens", and, combined with the context of "happy", its actual meaning is determined to be the negative emotion "unhappy". With this method, multiple expression categories corresponding to target information can be acquired and stored in the second database. Taking the extracted context of the target information into account further reduces the recognition error of the second expression classifier and thus improves the accuracy of expression recognition. A minimal keyword-and-negation sketch is shown below.
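A toy sketch of the keyword-plus-context idea described above; the keyword list, context cues, and weights are invented for illustration, and a real second classifier would be built from the collected text described in the patent rather than hard-coded rules.

```python
from typing import Tuple

# "Second database" stand-in: toy mapping from target information to an expression category.
SECOND_DATABASE = {"happy": ("joy", 0.6), "angry": ("anger", 0.6)}

# Context cues that flip the meaning of the target information to a negative emotion.
NEGATION_CUES = ("do you think", "how could", "not", "can't")


def classify_text(text_information: str) -> Tuple[str, float]:
    """Second expression classifier sketch: find emotion-bearing target information,
    then use the surrounding context to decide its actual meaning before the lookup."""
    lowered = text_information.lower()
    for target, (category, weight) in SECOND_DATABASE.items():
        if target in lowered:
            if any(cue in lowered for cue in NEGATION_CUES):
                return ("unhappiness", weight)   # context flips "happy" to a negative emotion
            return (category, weight)
    return ("neutral", 0.2)


# Context changes the reading of "happy" in a rhetorical question.
assert classify_text("Do you think I can be happy when something like this happens?")[0] == "unhappiness"
```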
  • In one of the embodiments, the step of inputting the facial features into the third expression classifier and obtaining the expression recognition result of the third expression classifier and the third weight corresponding to that result includes: inputting the facial features into the third expression classifier, where the third expression classifier is used to query the third database according to the facial features, obtain the expression category corresponding to the facial features as the expression recognition result, and determine the third weight corresponding to the expression recognition result; and obtaining the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • the third database stores a plurality of expression categories corresponding to facial features.
  • Based on big data, the server collects in advance multiple different facial features and the expression categories corresponding to those facial features, and stores the multiple expression categories corresponding to facial features in the third database, which makes it convenient to subsequently obtain the expression category corresponding to a facial feature by querying the third database.
  • For example, when people are happy, the facial features are raised mouth corners, wrinkled and lifted cheeks, contracted eyelids, and "crow's feet" forming at the corners of the eyes; when sad, the facial features are narrowed eyes, tightened eyebrows, mouth corners pulled down, and a raised or tightened chin.
  • Therefore, through the third expression classifier, when the facial features are recognized as raised mouth corners, lifted cheeks, contracted eyelids, and "crow's feet" forming at the corners of the eyes, it can be judged that the user's expression category is happy.
  • To further improve the accuracy of expression recognition, the server can also extract multiple face images from the video data to be processed, extract facial features from each face image, and input those facial features into the third expression classifier to obtain multiple expression recognition results and the corresponding multiple third weights; from these, the expression recognition result with the largest third weight is selected as the final expression recognition result output by the third expression classifier. Extracting multiple face images from the video data to be processed and analyzing them with the third expression classifier avoids judging the user's expression category from the facial features of a single face image, which would lead to low recognition accuracy, and thus improves the accuracy of expression recognition. A minimal multi-frame sketch is given below.
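The multi-frame variant of the third classifier could be sketched as below; the per-image lookup is reduced to a placeholder scoring function, and the facial-feature keys, thresholds, and weights are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def classify_single_face(facial_features: Dict[str, float]) -> Tuple[str, float]:
    """Placeholder per-image lookup against the 'third database': raised mouth corners
    and contracted eyelids ("crow's feet") are read as happiness, as in the example above."""
    if facial_features.get("mouth_corner_lift", 0.0) > 0.5 and facial_features.get("eyelid_contraction", 0.0) > 0.5:
        return ("happiness", 0.8)
    if facial_features.get("brow_tightening", 0.0) > 0.5:
        return ("sadness", 0.6)
    return ("neutral", 0.3)


def classify_face_sequence(frames: List[Dict[str, float]]) -> Tuple[str, float]:
    """Third expression classifier over multiple face images: classify every extracted
    face image, then keep the result with the largest third weight as the final output."""
    results = [classify_single_face(features) for features in frames]
    return max(results, key=lambda result: result[1], default=("neutral", 0.0))


frames = [
    {"mouth_corner_lift": 0.1, "eyelid_contraction": 0.2},   # ambiguous frame
    {"mouth_corner_lift": 0.9, "eyelid_contraction": 0.7},   # clearly happy frame
]
assert classify_face_sequence(frames) == ("happiness", 0.8)
```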
  • the server may also perform multiple trainings on the third facial expression classifier.
  • In one of the embodiments, the third expression classifier is obtained as follows: acquire multiple sample facial features and the corresponding expression categories; recognize the sample facial features with the third expression classifier to be trained to obtain the expression recognition result of the third expression classifier; obtain the similarity between the expression recognition result and the corresponding actual expression category; and, when the similarity is less than a preset second threshold, train the third expression classifier according to the similarity until the similarity between the expression recognition result obtained with the trained third expression classifier and the corresponding actual expression category is greater than or equal to the preset second threshold, at which point training ends.
  • For example, when the similarity is less than the preset second threshold, the server adjusts the parameters of the third expression classifier according to the similarity; it then recognizes the sample facial features again with the adjusted third expression classifier, obtains the similarity between the resulting expression recognition result and the corresponding actual expression category, and adjusts the parameters again according to that similarity, retraining the third expression classifier until the similarity between the expression recognition result it produces and the corresponding actual expression category is greater than or equal to the preset second threshold, at which point training ends.
  • By training the third expression classifier multiple times according to the similarity, the server makes it easier for the third expression classifier to output more accurate expression recognition results and to avoid missed judgments, thereby further improving its recognition accuracy. A similarity-driven variant of the earlier training loop is sketched below.
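The similarity-driven variant differs from the error-driven loop mainly in its stopping condition; it is sketched below under the same placeholder assumptions (the classifier's `recognize`/`adjust` methods and the similarity measure are not taken from the patent).

```python
from typing import Callable, List, Tuple

FaceSample = Tuple[dict, str]  # (sample facial features, actual expression category)


def train_until_similarity_reached(
    classifier,                        # object exposing recognize(features) and adjust(similarity)
    samples: List[FaceSample],
    similarity_fn: Callable[[List[str], List[str]], float],
    second_threshold: float,
    max_rounds: int = 100,
) -> None:
    """Keep adjusting the third expression classifier until the similarity between its
    results and the actual categories reaches the preset second threshold."""
    for _ in range(max_rounds):
        predictions = [classifier.recognize(features) for features, _ in samples]
        actual = [category for _, category in samples]
        similarity = similarity_fn(predictions, actual)
        if similarity >= second_threshold:
            break                       # similar enough: training ends
        classifier.adjust(similarity)


def agreement_rate(predicted: List[str], actual: List[str]) -> float:
    # One simple similarity measure: fraction of samples whose category matches exactly.
    return sum(p == a for p, a in zip(predicted, actual)) / max(len(actual), 1)
```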
  • In one of the embodiments, as shown in FIG. 4, another expression recognition method is provided, which includes the following steps:
  • Step S401 Obtain to-be-processed video data and audio data.
  • the video data and audio data both carry the same user identifier.
  • the audio data includes audio features and text information.
  • the video data includes facial features corresponding to the user's face image.
  • Step S402: Input the audio features into the first expression classifier, where the first expression classifier is used to extract target features from the audio features, query the first database according to the target features, obtain the expression category corresponding to the target features as the expression recognition result, and determine the first weight corresponding to the expression recognition result; obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • Step S403: Input the text information into the second expression classifier, where the second expression classifier is used to extract target information from the text information, query the second database according to the target information, obtain the expression category corresponding to the target information as the expression recognition result, and determine the second weight corresponding to the expression recognition result; obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • Step S404: Input the facial features into the third expression classifier, where the third expression classifier is used to query the third database according to the facial features, obtain the expression category corresponding to the facial features as the expression recognition result, and determine the third weight corresponding to the expression recognition result; obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • Step S405 From the output facial expression recognition results, the facial expression recognition result with the largest weight is selected as the facial expression category of the user corresponding to the user identification.
  • With the above expression recognition method, the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision is achieved; more diverse expression recognition results can be provided, from which the user's expression category can be analyzed comprehensively, avoiding missed judgments, further improving the accuracy and stability of expression recognition, and overcoming the defect of traditional methods that judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
  • Although the steps in the flowcharts of FIGS. 2-4 are displayed in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with at least part of the other steps or of the sub-steps or stages of other steps.
  • In one of the embodiments, as shown in FIG. 5, an expression recognition device is provided, including: a data acquisition module 510, an expression acquisition module 520, and an expression screening module 530, where:
  • the data obtaining module 510 is used to obtain to-be-processed video data and audio data. Both the video data and the audio data carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image.
  • the expression acquisition module 520 is used to input audio features, text information, and facial features into corresponding expression classifiers respectively, to obtain expression recognition results output by each expression classifier and weights corresponding to the expression recognition results.
  • the expression screening module 530 is used to filter out the expression recognition result with the largest weight from the output expression recognition results as the user's expression category corresponding to the user identification.
  • In one of the embodiments, the expression acquisition module is further configured to input the audio features into the first expression classifier to acquire the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result; input the text information into the second expression classifier to acquire the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result; and input the facial features into the third expression classifier to acquire the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • In one of the embodiments, the expression acquisition module is also configured to input the audio features into the first expression classifier, where the first expression classifier is used to extract target features from the audio features, query the first database according to the target features, obtain the expression category corresponding to the target features as the expression recognition result, and determine the first weight corresponding to the expression recognition result; and to obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • In one of the embodiments, the expression acquisition module is also configured to input the text information into the second expression classifier, where the second expression classifier is used to extract target information from the text information, query the second database according to the target information, obtain the expression category corresponding to the target information as the expression recognition result, and determine the second weight corresponding to the expression recognition result; and to obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • In one of the embodiments, the expression acquisition module is further configured to input the facial features into the third expression classifier, where the third expression classifier is used to query the third database according to the facial features, obtain the expression category corresponding to the facial features as the expression recognition result, and determine the third weight corresponding to the expression recognition result; and to obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • In one of the embodiments, the expression recognition device further includes a first training module, configured to acquire multiple sample audio features and the corresponding expression categories; recognize the sample audio features with the first expression classifier to be trained to obtain the expression recognition result of the first expression classifier; compare the expression recognition result with the corresponding actual expression category to obtain a recognition error; and, when the recognition error is greater than or equal to a preset first threshold, train the first expression classifier according to the recognition error until the recognition error obtained with the trained first expression classifier is less than the preset first threshold, at which point training ends.
  • In one of the embodiments, the expression recognition device further includes a second training module, configured to acquire multiple sample facial features and the corresponding expression categories; recognize the sample facial features with the third expression classifier to be trained to obtain the expression recognition result of the third expression classifier; obtain the similarity between the expression recognition result and the corresponding actual expression category; and, when the similarity is less than a preset second threshold, train the third expression classifier according to the similarity until the similarity between the expression recognition result obtained with the trained third expression classifier and the corresponding actual expression category is greater than or equal to the preset second threshold, at which point training ends.
  • The above expression recognition device achieves the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision, and can provide more diverse expression recognition results from which the user's expression category can be analyzed comprehensively, avoiding missed judgments, further improving the accuracy and stability of expression recognition, and overcoming the defect of traditional methods that judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
  • Each module in the above facial expression recognition device may be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded, in hardware form, in or independently of the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke them and execute the operations corresponding to the above modules.
  • In one of the embodiments, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 6.
  • the computer device includes a processor, memory, network interface, and database connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store expression categories.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer-readable instructions are executed by the processor to implement an expression recognition method.
  • FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; the specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • When the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps: acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image; inputting the audio features, the text information, and the facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: inputting the audio features into the first expression classifier to obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result; inputting the text information into the second expression classifier to obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result; and inputting the facial features into the third expression classifier to obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: inputting the audio features into the first expression classifier, where the first expression classifier is used to extract target features from the audio features, query the first database according to the target features, obtain the expression category corresponding to the target features as the expression recognition result, and determine the first weight corresponding to the expression recognition result; and obtaining the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: inputting the text information into the second expression classifier, where the second expression classifier is used to extract target information from the text information, query the second database according to the target information, obtain the expression category corresponding to the target information as the expression recognition result, and determine the second weight corresponding to the expression recognition result; and obtaining the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: inputting the facial features into the third expression classifier, where the third expression classifier is used to query the third database according to the facial features, obtain the expression category corresponding to the facial features as the expression recognition result, and determine the third weight corresponding to the expression recognition result; and obtaining the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: acquiring multiple sample audio features and the corresponding expression categories; recognizing the sample audio features with the first expression classifier to be trained to obtain the expression recognition result of the first expression classifier; comparing the expression recognition result with the corresponding actual expression category to obtain a recognition error; and, when the recognition error is greater than or equal to a preset first threshold, training the first expression classifier according to the recognition error until the recognition error obtained with the trained first expression classifier is less than the preset first threshold, at which point training ends.
  • When the processor executes the computer-readable instructions, the following steps are further implemented: acquiring multiple sample facial features and the corresponding expression categories; recognizing the sample facial features with the third expression classifier to be trained to obtain the expression recognition result of the third expression classifier; obtaining the similarity between the expression recognition result and the corresponding actual expression category; and, when the similarity is less than a preset second threshold, training the third expression classifier according to the similarity until the similarity between the expression recognition result obtained with the trained third expression classifier and the corresponding actual expression category is greater than or equal to the preset second threshold, at which point training ends.
  • Through the computer-readable instructions running on the processor, the computer device achieves the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision, and can provide more diverse expression recognition results from which the user's expression category can be analyzed comprehensively, avoiding missed judgments, further improving the accuracy and stability of expression recognition, and overcoming the defect of traditional methods that judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions are provided.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image; inputting the audio features, the text information, and the facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: inputting the audio features into the first expression classifier to obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result; inputting the text information into the second expression classifier to obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result; and inputting the facial features into the third expression classifier to obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: inputting the audio features into the first expression classifier, where the first expression classifier is used to extract target features from the audio features, query the first database according to the target features, obtain the expression category corresponding to the target features as the expression recognition result, and determine the first weight corresponding to the expression recognition result; and obtaining the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: inputting the text information into the second expression classifier, where the second expression classifier is used to extract target information from the text information, query the second database according to the target information, obtain the expression category corresponding to the target information as the expression recognition result, and determine the second weight corresponding to the expression recognition result; and obtaining the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: inputting the facial features into the third expression classifier, where the third expression classifier is used to query the third database according to the facial features, obtain the expression category corresponding to the facial features as the expression recognition result, and determine the third weight corresponding to the expression recognition result; and obtaining the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: acquiring multiple sample audio features and the corresponding expression categories; recognizing the sample audio features with the first expression classifier to be trained to obtain the expression recognition result of the first expression classifier; comparing the expression recognition result with the corresponding actual expression category to obtain a recognition error; and, when the recognition error is greater than or equal to a preset first threshold, training the first expression classifier according to the recognition error until the recognition error obtained with the trained first expression classifier is less than the preset first threshold, at which point training ends.
  • When the computer-readable instructions are executed by the processors, the following steps are further implemented: acquiring multiple sample facial features and the corresponding expression categories; recognizing the sample facial features with the third expression classifier to be trained to obtain the expression recognition result of the third expression classifier; obtaining the similarity between the expression recognition result and the corresponding actual expression category; and, when the similarity is less than a preset second threshold, training the third expression classifier according to the similarity until the similarity between the expression recognition result obtained with the trained third expression classifier and the corresponding actual expression category is greater than or equal to the preset second threshold, at which point training ends.
  • Through the computer-readable instructions stored therein, the computer-readable storage medium achieves the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision, and can provide more diverse expression recognition results from which the user's expression category can be analyzed comprehensively, avoiding missed judgments, further improving the accuracy and stability of expression recognition, and overcoming the low recognition accuracy of traditional expression recognition methods.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

An expression recognition method, including: acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image; inputting the audio features, text information, and facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.

Description

Expression recognition method, apparatus, computer device and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 18, 2018, with application number 201811553986.X and titled "Expression recognition method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to an expression recognition method, an apparatus, a computer device, and a storage medium.

Background

Human expressions carry rich emotional content, and the information they convey is very important. In scenarios that involve both video and audio recording, such as video interviews and video reviews, the user's expressions therefore often need to be analyzed to understand the user's true emotions, and the accuracy of expression recognition is becoming increasingly important.

However, traditional expression recognition methods determine the user's expression type mainly from recorded images and video, and therefore place certain requirements on picture quality and the shooting scene. The inventor realized, however, that when the picture quality or the shooting scene does not meet these requirements, and when the user's facial movements are not expressive enough, expressions are easily missed, resulting in low expression recognition accuracy.

Summary

According to various embodiments disclosed in the present application, an expression recognition method, an apparatus, a computer device, and a storage medium are provided.
An expression recognition method includes:

acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;

inputting the audio features, text information, and facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and

selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.

An expression recognition apparatus includes:

a data acquisition module, configured to acquire video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;

an expression acquisition module, configured to input the audio features, text information, and facial features into corresponding expression classifiers respectively, and to obtain the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and

an expression screening module, configured to select, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.

A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processors, cause the one or more processors to perform the following steps:

acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;

inputting the audio features, text information, and facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and

selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.

One or more non-volatile computer-readable storage media storing computer-readable instructions are provided, the computer-readable instructions, when executed by one or more processors, causing the one or more processors to perform the following steps:

acquiring video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image;

inputting the audio features, text information, and facial features into corresponding expression classifiers respectively, and obtaining the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result; and

selecting, from the output expression recognition results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an application scene diagram of an expression recognition method according to one or more embodiments;

FIG. 2 is a schematic flowchart of an expression recognition method according to one or more embodiments;

FIG. 3 is a schematic flowchart of the steps of obtaining the expression recognition results output by the expression classifiers according to an embodiment;

FIG. 4 is a schematic flowchart of an expression recognition method according to another embodiment;

FIG. 5 is a block diagram of an expression recognition device according to one or more embodiments;

FIG. 6 is a block diagram of a computer device according to one or more embodiments.

Detailed Description

In order to make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
The expression recognition method provided by the present application can be applied to the application environment shown in FIG. 1. The terminal 110 communicates with the server 120 via a network. A video recording system is installed in the terminal 110 and can upload the recorded video data and audio data to the server 120; the video data and audio data carry the same user identifier. The server 120 extracts audio features and text information from the audio data and extracts facial features corresponding to the user's face image from the video data; it inputs the extracted audio features, text information, and facial features into the corresponding expression classifiers respectively, obtains the expression recognition result output by each expression classifier and the weight corresponding to each result, and selects, from the output results, the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier. The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers; the server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.

In one of the embodiments, as shown in FIG. 2, an expression recognition method is provided. Taking the method applied to the server in FIG. 1 as an example, it includes the following steps:

Step S201: Acquire video data and audio data to be processed, where the video data and audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image.

Video data and audio data refer to the data recorded of the user through the video recording system in the terminal during sessions such as video reviews and interviews. The user identifier is used to label the video data and audio data so that the server can tell them apart; it can be the user's name, ID card number, and so on. Audio features refer to feature information used to analyze the user's expression category, such as volume and frequency response; text information refers to information converted from the audio information in the audio features by speech recognition technology; facial features are extracted from the face images in the video data and are feature information used to analyze the user's expression category, such as eye features and mouth features. The user's expression categories can be divided into many types, such as sadness, joy, and disgust.

The terminal (for example, a smart phone) is installed with a video recording system, which can upload the video data and audio data recorded during sessions such as user video reviews and interviews to the server as the video data and audio data to be processed. The server extracts audio features and text information from the audio data to be processed, extracts the user's face images from the video data to be processed, and extracts facial features from the extracted face images, so that the extracted audio features, text information, and facial features can subsequently be fed into the corresponding expression classifiers and the user's expression category can be judged comprehensively from multiple angles, further improving the accuracy of expression recognition.

Step S202: Input the audio features, text information, and facial features into the corresponding expression classifiers respectively, and obtain the expression recognition result output by each expression classifier and the weight corresponding to each expression recognition result.

An expression classifier can output a corresponding expression recognition result from the input information. For example, an expression classifier based on speech recognition can output the user's expression category from the input audio features; an expression classifier based on text recognition can output the user's expression category from the input text information; and an expression classifier based on face recognition can output the user's expression category from the input facial features.

Weights measure the importance of the expression recognition results; different expression recognition results carry different weights. Note that, for the same expression classifier, the weight of the expression recognition result output each time differs and depends on the input information.

The server inputs the audio features, text information, and facial features into the corresponding expression classifiers respectively and obtains the expression recognition result output by each classifier together with its weight, which makes it possible to judge the user's expression category comprehensively from the three perspectives of speech, text, and vision, provides more diverse expression recognition results from which the user's expression category can be analyzed, further improves the accuracy of expression recognition, and avoids the defect of traditional methods, which judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.

Step S203: From the output expression recognition results, select the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.

The server obtains, from the expression recognition results output by the classifiers, the result with the largest weight as the expression category of the user corresponding to the user identifier, which achieves the goal of judging the user's expression category comprehensively from multiple angles, avoids missed judgments, and further improves the accuracy and stability of expression recognition. Note that if the weights of the individual expression recognition results are equal, the result output by the face-recognition-based expression classifier prevails.

In the above expression recognition method, the server acquires video data and audio data to be processed, where both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image; the audio features, text information, and facial features are input into the corresponding expression classifiers respectively to obtain the expression recognition result output by each classifier and the weight corresponding to each result; and, from the output results, the expression recognition result with the largest weight is selected as the expression category of the user corresponding to the user identifier. This achieves the goal of judging the user's expression category comprehensively from the three perspectives of speech, text, and vision, provides more diverse expression recognition results from which the user's expression category can be analyzed comprehensively, avoids missed judgments, further improves the accuracy and stability of expression recognition, and overcomes the defect of traditional methods that judge the user's expression category only from the visual perspective and therefore achieve low recognition accuracy.
考虑到不同的表情分类器,对应的输入信息不同,为了区分不同的表情分类器,服务器可以将表情分类器分为第一表情分类器、第二表情分类器和第三表情分类器。
In one embodiment, as shown in FIG. 3, the step of inputting the audio features, text information and facial features into the corresponding expression classifiers and obtaining the expression recognition result output by each classifier and the corresponding weight specifically includes:
Step S301: input the audio features into the first expression classifier, and obtain the expression recognition result of the first expression classifier and the first weight corresponding to the expression recognition result.
Step S302: input the text information into the second expression classifier, and obtain the expression recognition result of the second expression classifier and the second weight corresponding to the expression recognition result.
Step S303: input the facial features into the third expression classifier, and obtain the expression recognition result of the third expression classifier and the third weight corresponding to the expression recognition result.
The first expression classifier is based on speech recognition and outputs, according to the input audio features, the expression category corresponding to those audio features as the user's expression category. The second expression classifier is based on text recognition and outputs, according to the input text information, the expression category corresponding to that text information as the user's expression category. The third expression classifier is based on face recognition and outputs, according to the input facial features, the expression category corresponding to those facial features as the user's expression category.
The server inputs the audio features, text information and facial features into the corresponding expression classifiers, so that the user's expression category is judged comprehensively from the three angles of speech, text and vision, missed judgments are avoided, and the accuracy of expression recognition is further improved; judging the expression category from multiple angles also improves the stability of expression recognition.
In one embodiment, step S301 of inputting the audio features into the first expression classifier and obtaining its expression recognition result and the corresponding first weight includes: inputting the audio features into the first expression classifier, where the first expression classifier extracts a target feature from the audio features, queries a first database according to the target feature, obtains the expression category corresponding to the target feature as the expression recognition result, and determines the first weight corresponding to the result; and obtaining the expression recognition result of the first expression classifier and the corresponding first weight. The target feature is the part of the input audio features that matches a preset audio feature (such as pitch). The first database stores multiple expression categories corresponding to audio features.
Based on big data, the server collects in advance multiple different audio features and their corresponding expression categories and extracts target features from them; the expression category corresponding to an audio feature is taken as the expression category corresponding to the target feature extracted from that audio feature, yielding multiple expression categories corresponding to target features, which are stored in the first database for later lookup. For example, fear is often accompanied by screaming, so the average pitch is relatively high; the first expression classifier therefore extracts pitch as the target feature from the audio features, and when a relatively high pitch is recognized, the user's expression category can be judged to be fear. Through the first expression classifier, the expression category corresponding to the input audio features can be obtained; combined with speech recognition technology, the user's current expression category is analyzed from the audio features of the user's speech, which further improves the accuracy of expression recognition and avoids the low accuracy caused by missed judgments.
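A minimal sketch of such a lookup-based first classifier is given below; the pitch thresholds and the contents of the first database are invented solely for illustration:

```python
# Illustrative first classifier: map a target feature (mean pitch) to an
# expression category via a pre-collected table. Thresholds are made up.
FIRST_DATABASE = [
    # (predicate over the target feature, expression category, weight)
    (lambda pitch: pitch >= 300.0, "fear",     0.8),  # screams -> high pitch
    (lambda pitch: pitch >= 220.0, "surprise", 0.6),
    (lambda pitch: True,           "neutral",  0.3),  # fallback entry
]

class FirstExpressionClassifier:
    def predict(self, audio_features):
        pitch = audio_features["mean_pitch"]   # the extracted target feature
        for matches, category, weight in FIRST_DATABASE:
            if matches(pitch):
                return category, weight
```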
To further improve the recognition accuracy of the first expression classifier, it can be trained repeatedly. In one embodiment, the first expression classifier is obtained as follows: acquire multiple sample audio features and the corresponding expression categories; recognize the sample audio features with the first expression classifier to be trained, obtaining its expression recognition results; compare the expression recognition results with the corresponding actual expression categories to obtain a recognition error; and when the recognition error is greater than or equal to a preset first threshold, train the first expression classifier according to the recognition error until the recognition error obtained with the trained first expression classifier is smaller than the preset first threshold, at which point training ends.
For example, when the recognition error is greater than or equal to the preset first threshold, the server adjusts the parameters of the first expression classifier according to the recognition error; the adjusted first expression classifier recognizes the sample audio features again, the recognition error between the new expression recognition results and the corresponding actual expression categories is obtained, and the parameters are adjusted again according to this error, so that the first expression classifier is trained repeatedly until the recognition error obtained with the trained classifier is smaller than the preset first threshold, at which point training ends. By training the first expression classifier repeatedly according to the recognition error, the server enables it to output more accurate expression recognition results and avoid missed judgments, further improving its recognition accuracy.
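In outline, this retraining loop might read as follows; the update step and the error measure are stand-ins for whatever the deployed classifier actually uses:

```python
# Sketch of the iterative training described above: keep adjusting the
# classifier until its recognition error drops below the first threshold.
def train_first_classifier(classifier, samples, labels,
                           first_threshold=0.10, max_rounds=100):
    for _ in range(max_rounds):
        predictions = [classifier.predict(s)[0] for s in samples]
        error = sum(p != y for p, y in zip(predictions, labels)) / len(labels)
        if error < first_threshold:          # error small enough, stop
            break
        classifier.update(samples, labels, error)  # assumed parameter update
    return classifier
```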
In one embodiment, step S302 of inputting the text information into the second expression classifier and obtaining its expression recognition result and the corresponding second weight includes: inputting the text information into the second expression classifier, where the second expression classifier extracts target information from the text information, queries a second database according to the target information, obtains the expression category corresponding to the target information as the expression recognition result, and determines the second weight corresponding to the result; and obtaining the expression recognition result of the second expression classifier and the corresponding second weight. The target information is the emotion-bearing information extracted from the input text, such as "happy" or "angry". The second database stores multiple expression categories corresponding to text information.
Based on big data, the server collects in advance speech information from multiple different users, converts the speech into text, extracts the target information from the text, determines the expression category corresponding to the target information, and stores the multiple expression categories corresponding to target information in the second database for later lookup. For example, the word "happy" usually indicates joy; the second expression classifier therefore extracts the target information from the text, and when the target information "happy" is recognized, the user's expression category can be judged to be joy. Through the second expression classifier, the expression category corresponding to the input text information can be obtained, determining the user's current expression category, which further improves the accuracy of expression recognition and avoids the low accuracy caused by missed judgments.
Further, the second expression classifier may also extract the target information from the text and determine, from the text, the context associated with the target information; determine the actual meaning of the target information according to the target information and its associated context; and query the second database according to that actual meaning, obtain the corresponding expression category as the expression recognition result, and determine the corresponding second weight. For example, from the text "How could I possibly be happy when something like this has happened", the target information "happy" is extracted, and combined with its context the actual meaning of "happy" is determined to be the negative emotion "unhappy". In this way, multiple expression categories corresponding to target information can be obtained and stored in the second database. Taking the context of the extracted target information into account further reduces the recognition error of the second expression classifier and thus improves the accuracy of expression recognition.
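The keyword-plus-context behaviour can be sketched as below; the keyword table, negation cues and weights are illustrative assumptions only:

```python
# Illustrative second classifier: emotion keywords plus a crude context check
# so that negated phrases such as "how could I be happy" flip the polarity.
SECOND_DATABASE = {"happy": "joy", "angry": "anger", "sad": "sadness"}
NEGATION_CUES = {"not", "hardly", "never", "how could"}

class SecondExpressionClassifier:
    def predict(self, text, window=4):
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            category = SECOND_DATABASE.get(token)
            if category is None:
                continue
            # Look at the words around the keyword to decide its real meaning.
            context = " ".join(tokens[max(0, i - window): i + window + 1])
            negated = any(cue in context for cue in NEGATION_CUES)
            weight = 0.6 if negated else 0.7
            if negated and category == "joy":
                category = "displeasure"       # negative actual meaning
            return category, weight
        return "neutral", 0.2   # no emotion-bearing keyword found
```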
In one embodiment, step S303 of inputting the facial features into the third expression classifier and obtaining its expression recognition result and the corresponding third weight includes: inputting the facial features into the third expression classifier, where the third expression classifier queries a third database according to the facial features, obtains the expression category corresponding to the facial features as the expression recognition result, and determines the third weight corresponding to the result; and obtaining the expression recognition result of the third expression classifier and the corresponding third weight. The third database stores multiple expression categories corresponding to facial features.
Based on big data, the server collects in advance multiple different facial features and their corresponding expression categories, and stores the multiple expression categories corresponding to facial features in the third database for later lookup. For example, when people are happy, the corners of the mouth turn up, the cheeks rise and wrinkle, the eyelids contract, and crow's feet form at the outer corners of the eyes; when people are sad, the eyes narrow, the brows tighten, the corners of the mouth turn down, and the chin is raised or tightened. So when the third expression classifier recognizes upturned mouth corners, raised and wrinkled cheeks, contracted eyelids and crow's feet at the corners of the eyes, the user's expression category can be judged to be happiness. Through the third expression classifier, the expression category corresponding to the input facial features can be obtained; combined with face recognition technology, the user's current expression category is analyzed from the facial features, which further improves the accuracy of expression recognition and avoids the low accuracy caused by missed judgments.
To further improve the accuracy of expression recognition, the server may also extract multiple face images from the video data to be processed, extract facial features from each face image, input them into the third expression classifier, and obtain multiple expression recognition results and the corresponding multiple third weights; from these results, the one with the largest third weight is selected as the final expression recognition result output by the third expression classifier. Extracting multiple face images from the video data and analyzing them with the third expression classifier avoids judging the user's expression category from the facial features of a single face image only, which would lead to low recognition accuracy, and thereby improves the accuracy of expression recognition.
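A compact sketch of this multi-frame strategy, assuming the third classifier returns a (category, weight) pair per face crop:

```python
# Sketch of the multi-frame strategy above: classify every extracted face crop
# and keep the prediction the third classifier is most confident about.
def classify_video_faces(face_crops, face_clf):
    results = [face_clf.predict(crop) for crop in face_crops]  # (category, w3)
    if not results:
        return None
    return max(results, key=lambda r: r[1])  # largest third weight wins

# Usage: category, weight = classify_video_faces(extract_face_frames(path), face_clf)
```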
In addition, to further improve the accuracy of expression recognition, the server may also train the third expression classifier repeatedly. In one embodiment, the third expression classifier is obtained as follows: acquire multiple sample facial features and the corresponding expression categories; recognize the sample facial features with the third expression classifier to be trained, obtaining its expression recognition results; obtain the similarity between the expression recognition results and the corresponding actual expression categories; and when the similarity is smaller than a preset second threshold, train the third expression classifier according to the similarity until the similarity between the expression recognition results obtained with the trained third expression classifier and the corresponding actual expression categories is greater than or equal to the preset second threshold, at which point training ends.
For example, when the similarity is smaller than the preset second threshold, the server adjusts the parameters of the third expression classifier according to the similarity; the adjusted third expression classifier recognizes the sample facial features again, the similarity between the new expression recognition results and the corresponding actual expression categories is obtained, and the parameters are adjusted again according to this similarity, so that the third expression classifier is trained repeatedly until the similarity between the expression recognition results obtained with the trained classifier and the corresponding actual expression categories is greater than or equal to the preset second threshold, at which point training ends. By training the third expression classifier repeatedly according to the similarity, the server enables it to output more accurate expression recognition results and avoid missed judgments, further improving its recognition accuracy.
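This similarity-driven loop mirrors the earlier one for the first classifier and might be outlined as follows; again the update step is an assumed placeholder:

```python
# Sketch of the similarity-driven training of the third classifier: stop once
# agreement with the labelled expression categories reaches the threshold.
def train_third_classifier(classifier, face_samples, labels,
                           second_threshold=0.90, max_rounds=100):
    for _ in range(max_rounds):
        predictions = [classifier.predict(f)[0] for f in face_samples]
        similarity = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if similarity >= second_threshold:
            break
        classifier.update(face_samples, labels, similarity)  # assumed update step
    return classifier
```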
In one embodiment, as shown in FIG. 4, another expression recognition method is provided, including the following steps:
Step S401: acquire video data and audio data to be processed, where the video data and the audio data both carry the same user identifier, the audio data includes audio features and text information, and the video data includes facial features corresponding to the user's face image.
Step S402: input the audio features into the first expression classifier, where the first expression classifier extracts a target feature from the audio features, queries the first database according to the target feature, obtains the expression category corresponding to the target feature as the expression recognition result, and determines the first weight corresponding to the result; obtain the expression recognition result of the first expression classifier and the corresponding first weight.
Step S403: input the text information into the second expression classifier, where the second expression classifier extracts target information from the text information, queries the second database according to the target information, obtains the expression category corresponding to the target information as the expression recognition result, and determines the second weight corresponding to the result; obtain the expression recognition result of the second expression classifier and the corresponding second weight.
Step S404: input the facial features into the third expression classifier, where the third expression classifier queries the third database according to the facial features, obtains the expression category corresponding to the facial features as the expression recognition result, and determines the third weight corresponding to the result; obtain the expression recognition result of the third expression classifier and the corresponding third weight.
Step S405: from the output expression recognition results, select the expression recognition result with the largest weight as the expression category of the user corresponding to the user identifier.
The above expression recognition method judges the user's expression category comprehensively from the three angles of speech, text and vision, provides more diverse expression recognition results for comprehensive analysis, avoids missed judgments, further improves the accuracy and stability of expression recognition, and overcomes the drawback of traditional methods that judge the expression category from the visual angle alone and therefore achieve low recognition accuracy.
It should be understood that although the steps in the flowcharts of FIGS. 2 to 4 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2 to 4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
在其中一个实施例中,如图5所示,提供了一种表情识别装置,包括:数据获取模块510、表情获取模块520和表情筛选模块530,其中:
数据获取模块510,用于获取待处理的视频数据和音频数据,视频数据和音频数据均携带同一用户标识,音频数据包括音频特征和文字信息,视频数据包括用户人脸图像对应的面部特征。
表情获取模块520,用于将音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个表情分类器输出的表情识别结果以及与表情识别结果对应的权重。
表情筛选模块530,用于从输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与用户标识对应的用户的表情类别。
在其中一个实施例中,表情获取模块还用于将音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重;将文字信息输入 至第二表情分类器中,获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重;将面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,表情获取模块还用于将音频特征输入至第一表情分类器中,第一表情分类器用于从音频特征中提取出目标特征,根据目标特征查询第一数据库,获取与目标特征对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第一权重;获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重。
在其中一个实施例中,表情获取模块还用于将文字信息输入至第二表情分类器中,第二表情分类器用于从文字信息中提取出目标信息,根据目标信息查询第二数据库,获取与目标信息对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第二权重;获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重。
在其中一个实施例中,表情获取模块还用于将面部特征输入至第三表情分类器中,第三表情分类器用于根据面部特征查询第三数据库,获取与面部特征对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第三权重;获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,表情识别装置还包括第一训练模块,用于获取多个样本音频特征及对应的表情类别;通过待训练的第一表情分类器对样本音频特征进行识别,得到第一表情分类器的表情识别结果;将表情识别结果与对应的实际表情类别进行比较,得到识别误差;当识别误差大于或等于预设第一阈值时,根据识别误差对第一表情分类器进行训练,直到根据训练后的第一表情分类器得到的识别误差小于预设第一阈值,结束训练。
在其中一个实施例中,表情识别装置还包括第二训练模块,用于获取多个样本面部特征及对应的表情类别;通过待训练的第三表情分类器对样本面部特征进行识别,得到第三表情分类器的表情识别结果;获取表情识别结果与对应的实际表情类别之间的相似度;当相似度小于预设第二阈值时,根据相似度对第三表情分类器进行训练,直到根据训练后的第三表情分类器得到的表情识别结果与对应的实际表情类别之间的相似度大于或等于预设第二阈值,结束训练。
上述各个实施例,表情识别装置实现了从语音、文字和视觉这三个角度对用户的表情类别进行综合判断的目的,能够提供更为多样的表情识别结果,方便从多样的表情识别结果中对用户的表情类别进行综合分析,避免漏判,进一步提高了表情识别的准确率和稳定性,克服了传统方法中仅从视觉角度对用户表情类别进行判断,导致表情识别的准确率低的缺陷。
关于表情识别装置的具体限定可以参见上文中对于表情识别方法的限定,在此不再赘述。上述表情识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在其中一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储表情类别。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种表情识别方法。
本领域技术人员可以理解,图6中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
一种计算机设备,包括存储器和一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被处理器执行时,使得一个或多个处理器执行以下步骤:
获取待处理的视频数据和音频数据,视频数据和音频数据均携带同一用户标识,音频数据包括音频特征和文字信息,视频数据包括用户人脸图像对应的面部特征;
将音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个表情分类器输出的表情识别结果以及与表情识别结果对应的权重;及
从输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与用户标识对应的用户的表情类别。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:将音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重;将文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重;及将面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:将音频特征输入至第一表情分类器中,第一表情分类器用于从音频特征中提取出目标特征,根据目标特征查询第一数据库,获取与目标特征对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第一权重;及获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:将文字信息输入至第二表情分类器中,第二表情分类器用于从文字信息中提取出目标信息,根据目标信息查询第二数据库,获取与目标信息对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第二权重;及获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:将面部特征输入至第三表情分类器中,第三表情分类器用于根据面部特征查询第三数据库,获取与面部 特征对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第三权重;及获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:获取多个样本音频特征及对应的表情类别;通过待训练的第一表情分类器对样本音频特征进行识别,得到第一表情分类器的表情识别结果;将表情识别结果与对应的实际表情类别进行比较,得到识别误差;及当识别误差大于或等于预设第一阈值时,根据识别误差对第一表情分类器进行训练,直到根据训练后的第一表情分类器得到的识别误差小于预设第一阈值,结束训练。
在其中一个实施例中,处理器执行计算机可读指令时还实现以下步骤:获取多个样本面部特征及对应的表情类别;通过待训练的第三表情分类器对样本面部特征进行识别,得到第三表情分类器的表情识别结果;获取表情识别结果与对应的实际表情类别之间的相似度;及当相似度小于预设第二阈值时,根据相似度对第三表情分类器进行训练,直到根据训练后的第三表情分类器得到的表情识别结果与对应的实际表情类别之间的相似度大于或等于预设第二阈值,结束训练。
上述各个实施例,计算机设备通过处理器上运行的计算机可读指令,实现了从语音、文字和视觉这三个角度对用户的表情类别进行综合判断的目的,能够提供更为多样的表情识别结果,方便从多样的表情识别结果中对用户的表情类别进行综合分析,避免漏判,进一步提高了表情识别的准确率和稳定性,克服了传统方法中仅从视觉角度对用户表情类别进行判断,导致表情识别的准确率低的缺陷。
一个或多个存储有计算机可读指令的非易失性存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:
获取待处理的视频数据和音频数据,视频数据和音频数据均携带同一用户标识,音频数据包括音频特征和文字信息,视频数据包括用户人脸图像对应的面部特征;
将音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个表情分类器输出的表情识别结果以及与表情识别结果对应的权重;及
从输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与用户标识对应的用户的表情类别。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:将音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重;将文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重;及将面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:将音频特征输入至第一表情分类器中,第一表情分类器用于从音频特征中提取出目标特征,根据目标特征查询第一数据库,获取与目标特征对应的表情类别,作为表情识别结果,并确定与表 情识别结果对应的第一权重;及获取第一表情分类器的表情识别结果以及与表情识别结果对应的第一权重。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:将文字信息输入至第二表情分类器中,第二表情分类器用于从文字信息中提取出目标信息,根据目标信息查询第二数据库,获取与目标信息对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第二权重;及获取第二表情分类器的表情识别结果以及与表情识别结果对应的第二权重。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:将面部特征输入至第三表情分类器中,第三表情分类器用于根据面部特征查询第三数据库,获取与面部特征对应的表情类别,作为表情识别结果,并确定与表情识别结果对应的第三权重;及获取第三表情分类器的表情识别结果以及与表情识别结果对应的第三权重。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:获取多个样本音频特征及对应的表情类别;通过待训练的第一表情分类器对样本音频特征进行识别,得到第一表情分类器的表情识别结果;将表情识别结果与对应的实际表情类别进行比较,得到识别误差;及当识别误差大于或等于预设第一阈值时,根据识别误差对第一表情分类器进行训练,直到根据训练后的第一表情分类器得到的识别误差小于预设第一阈值,结束训练。
在其中一个实施例中,计算机可读指令被处理器执行时还实现以下步骤:获取多个样本面部特征及对应的表情类别;通过待训练的第三表情分类器对样本面部特征进行识别,得到第三表情分类器的表情识别结果;获取表情识别结果与对应的实际表情类别之间的相似度;及当相似度小于预设第二阈值时,根据相似度对第三表情分类器进行训练,直到根据训练后的第三表情分类器得到的表情识别结果与对应的实际表情类别之间的相似度大于或等于预设第二阈值,结束训练。
上述各个实施例,计算机可读存储介质通过其存储的计算机可读指令,实现了从语音、文字和视觉这三个角度对用户的表情类别进行综合判断的目的,能够提供更为多样的表情识别结果,方便从多样的表情识别结果中对用户的表情类别进行综合分析,避免漏判,进一步提高了表情识别的准确率和稳定性,克服了传统方法表情识别的准确率低的缺陷。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、 同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种表情识别方法,包括:
    获取待处理的视频数据和音频数据,所述视频数据和所述音频数据均携带同一用户标识,所述音频数据包括音频特征和文字信息,所述视频数据包括用户人脸图像对应的面部特征;
    将所述音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个所述表情分类器输出的表情识别结果以及与所述表情识别结果对应的权重;及
    从所述输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与所述用户标识对应的用户的表情类别。
  2. 根据权利要求1所述的方法,其特征在于,所述表情分类器包括第一表情分类器、第二表情分类器和第三表情分类器;
    所述将所述音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个所述表情分类器输出的表情识别结果以及与所述表情识别结果对应的权重,包括:
    将所述音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重;
    将所述文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重;及
    将所述面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重,包括:
    将所述音频特征输入至第一表情分类器中,所述第一表情分类器用于从所述音频特征中提取出目标特征,根据所述目标特征查询第一数据库,获取与所述目标特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第一权重;及
    获取所述第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重。
  4. 根据权利要求2所述的方法,其特征在于,所述将所述文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重,包括:
    将所述文字信息输入至第二表情分类器中,所述第二表情分类器用于从所述文字信息中提取出目标信息,根据所述目标信息查询第二数据库,获取与所述目标信息对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第二权重;及
    获取所述第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重。
  5. 根据权利要求2所述的方法,其特征在于,所述将所述面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重, 包括:
    将所述面部特征输入至第三表情分类器中,所述第三表情分类器用于根据所述面部特征查询第三数据库,获取与所述面部特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第三权重;及
    获取所述第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  6. 根据权利要求1至5任意一项所述的方法,其特征在于,所述第一表情分类器通过下述方法得到:
    获取多个样本音频特征及对应的表情类别;
    通过待训练的第一表情分类器对所述样本音频特征进行识别,得到第一表情分类器的表情识别结果;
    将所述表情识别结果与对应的实际表情类别进行比较,得到识别误差;及
    当所述识别误差大于或等于预设第一阈值时,根据所述识别误差对所述第一表情分类器进行训练,直到根据训练后的第一表情分类器得到的识别误差小于所述预设第一阈值,结束训练。
  7. 根据权利要求6所述的方法,其特征在于,所述第三表情分类器通过下述方法得到:
    获取多个样本面部特征及对应的表情类别;
    通过待训练的第三表情分类器对所述样本面部特征进行识别,得到第三表情分类器的表情识别结果;
    获取所述表情识别结果与对应的实际表情类别之间的相似度;及
    当所述相似度小于预设第二阈值时,根据所述相似度对所述第三表情分类器进行训练,直到根据训练后的第三表情分类器得到的表情识别结果与对应的实际表情类别之间的相似度大于或等于所述预设第二阈值,结束训练。
  8. 一种表情识别装置,包括:
    数据获取模块,用于获取待处理的视频数据和音频数据,所述视频数据和所述音频数据均携带同一用户标识,所述音频数据包括音频特征和文字信息,所述视频数据包括用户人脸图像对应的面部特征;
    表情获取模块,用于将所述音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个所述表情分类器输出的表情识别结果以及与所述表情识别结果对应的权重;及
    表情筛选模块,用于从所述输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与所述用户标识对应的用户的表情类别。
  9. 根据权利要求8所述的装置,其特征在于,所述表情分类器包括第一表情分类器、第二表情分类器和第三表情分类器;
    所述表情获取模块还用于将所述音频特征输入至第一表情分类器中,获取第一表情分 类器的表情识别结果以及与所述表情识别结果对应的第一权重;将所述文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重;将所述面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  10. 根据权利要求9所述的装置,其特征在于,所述表情获取模块还用于将所述音频特征输入至第一表情分类器中,所述第一表情分类器用于从所述音频特征中提取出目标特征,根据所述目标特征查询第一数据库,获取与所述目标特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第一权重;获取所述第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重。
  11. 一种计算机设备,包括存储器及一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    获取待处理的视频数据和音频数据,所述视频数据和所述音频数据均携带同一用户标识,所述音频数据包括音频特征和文字信息,所述视频数据包括用户人脸图像对应的面部特征;
    将所述音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个所述表情分类器输出的表情识别结果以及与所述表情识别结果对应的权重;及
    从所述输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与所述用户标识对应的用户的表情类别。
  12. 根据权利要求11所述的计算机设备,其特征在于,所述表情分类器包括第一表情分类器、第二表情分类器和第三表情分类器;
    所述处理器执行所述计算机可读指令时还执行以下步骤:
    将所述音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重;
    将所述文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重;及
    将所述面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  13. 根据权利要求12所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    将所述音频特征输入至第一表情分类器中,所述第一表情分类器用于从所述音频特征中提取出目标特征,根据所述目标特征查询第一数据库,获取与所述目标特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第一权重;及
    获取所述第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重。
  14. 根据权利要求12所述的计算机设备,其特征在于,所述处理器执行所述计算机 可读指令时还执行以下步骤:
    将所述文字信息输入至第二表情分类器中,所述第二表情分类器用于从所述文字信息中提取出目标信息,根据所述目标信息查询第二数据库,获取与所述目标信息对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第二权重;及
    获取所述第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重。
  15. 根据权利要求12所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    将所述面部特征输入至第三表情分类器中,所述第三表情分类器用于根据所述面部特征查询第三数据库,获取与所述面部特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第三权重;及
    获取所述第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  16. 一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    获取待处理的视频数据和音频数据,所述视频数据和所述音频数据均携带同一用户标识,所述音频数据包括音频特征和文字信息,所述视频数据包括用户人脸图像对应的面部特征;
    将所述音频特征、文字信息和面部特征分别输入至对应的表情分类器中,获取各个所述表情分类器输出的表情识别结果以及与所述表情识别结果对应的权重;及
    从所述输出的表情识别结果中,筛选出权重最大的表情识别结果,作为与所述用户标识对应的用户的表情类别。
  17. 根据权利要求16所述的存储介质,其特征在于,所述表情分类器包括第一表情分类器、第二表情分类器和第三表情分类器;
    所述计算机可读指令被所述处理器执行时还执行以下步骤:
    将所述音频特征输入至第一表情分类器中,获取第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重;
    将所述文字信息输入至第二表情分类器中,获取第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重;及
    将所述面部特征输入至第三表情分类器中,获取第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。
  18. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    将所述音频特征输入至第一表情分类器中,所述第一表情分类器用于从所述音频特征中提取出目标特征,根据所述目标特征查询第一数据库,获取与所述目标特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第一权重;及
    获取所述第一表情分类器的表情识别结果以及与所述表情识别结果对应的第一权重。
  19. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    将所述文字信息输入至第二表情分类器中,所述第二表情分类器用于从所述文字信息中提取出目标信息,根据所述目标信息查询第二数据库,获取与所述目标信息对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第二权重;及
    获取所述第二表情分类器的表情识别结果以及与所述表情识别结果对应的第二权重。
  20. 根据权利要求17所述的存储介质,其特征在于,所述计算机可读指令被所述处理器执行时还执行以下步骤:
    将所述面部特征输入至第三表情分类器中,所述第三表情分类器用于根据所述面部特征查询第三数据库,获取与所述面部特征对应的表情类别,作为表情识别结果,并确定与所述表情识别结果对应的第三权重;及
    获取所述第三表情分类器的表情识别结果以及与所述表情识别结果对应的第三权重。

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811553986.XA CN109829363A (zh) 2018-12-18 2018-12-18 表情识别方法、装置、计算机设备和存储介质
CN201811553986.X 2018-12-18
