CN106373569B - Voice interaction device and method - Google Patents
Voice interaction device and method
- Publication number
- CN106373569B (application CN201610806384.5A)
- Authority
- CN
- China
- Prior art keywords
- expression
- semantic
- confidence
- confidence level
- response information
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The application relates to a voice interaction device and method. In an exemplary embodiment, a voice interaction method may include: receiving a first voice input from a human user and a first expression image input associated with the first voice input; recognizing a first semantic of the first voice input; recognizing a first expression from the first expression image input; determining a first confidence associated with the first semantic based on the first semantic and the first expression; and generating first response information based on the first semantic and the first confidence. By using both the expression and the semantic to generate the response information, the method can improve the experience of a human user during human-computer voice interaction.
Description
Technical Field
The present invention relates generally to the field of human-computer interaction, and more particularly, to a voice interaction apparatus and method capable of improving accuracy of voice recognition and generating a more appropriate voice response, thereby achieving a more intelligent and personified human-computer interaction process.
Background
Language is the most convenient and effective means of communication between people, so voice communication is a natural fit for the field of human-computer interaction, replacing traditional interaction modes such as the keyboard and mouse. Human-machine natural language dialog requires that a machine can "understand" human spoken language, which is the task of speech recognition technology.
Language is an art that has evolved over thousands of years and carries rich information far beyond the literal meaning, while the humans who use language are intelligent creatures with many emotions, so language communication that is simple and quick between people can be highly complex for machines. Although many techniques have been proposed to improve the accuracy of speech recognition, these existing techniques are basically pattern matching processes: the pattern of the received speech is recognized and compared one by one with reference patterns of known speech to determine a recognition result. In these existing techniques, little use is made of the information contained in and related to the speech itself, so speech recognition sometimes cannot effectively recognize the true meaning of a human user. For example, interpersonal speech communication may involve whispering, casual talk, an uncertain tone, and the like, which are beyond the recognition capability of existing speech recognition technology. Existing voice recognition technology can carry out the voice interaction process only in a mechanical way, which hinders the development of machine devices toward more intelligent and more anthropomorphic behavior.
Therefore, there is a need for an improved human-computer language interaction apparatus and method that enable a machine device to understand the real intention of a human user more accurately, thereby improving the degree of intelligence and personification of the machine device, simulating interpersonal language communication more effectively, and improving the interaction experience of the human user.
Disclosure of Invention
One aspect of the present invention is to enable a machine device to more accurately understand the real intention of a human user by using more information during human-computer voice interaction.
An exemplary embodiment of the present invention provides a voice interaction method, which may include: receiving a first voice input from a human user and a first expression image input associated with the first voice input; recognizing a first semantic of the first voice input; recognizing a first expression from the first expression image input; determining a first confidence associated with the first semantic based on the first semantic and the first expression; and generating first response information based on the first semantic and the first confidence.
In an example, determining a first confidence level associated with the first semantic may include: assigning a default confidence to the first semantic; and adjusting the default confidence based on the first expression.
In an example, determining a first confidence level associated with the first semantic may further include: adjusting the default confidence level based on a context of a voice interaction.
In an example, adjusting the default confidence based on the first expression may include: increasing the default confidence when the first expression is a positive expression; decreasing the default confidence when the first expression is a negative expression; and maintaining the default confidence unchanged when the first expression is a neutral expression that is neither positive nor negative.
In an example, the positive expressions may include happiness, surprise, eagerness, and seriousness, and the negative expressions may include anger, disgust, contempt, fear, sadness, hesitation, alarm, and suspicion.
In an example, determining a first confidence associated with the first semantic may further include: determining whether the first semantic contains an emotion keyword; if the first semantic does not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression; if the first semantic contains an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence if the emotion keyword matches the first expression; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
In an example, determining a first confidence associated with the first semantic may further include: determining the semantic type of the first semantic; increasing the default confidence if the semantic type of the first semantic is a question; and if the semantic type of the first semantic is a statement or a requirement, performing the step of adjusting the default confidence based on the first expression.
In an example, determining a first confidence associated with the first semantic may further include: determining the semantic type of the first semantic; increasing the default confidence if the semantic type of the first semantic is a question; if the semantic type of the first semantic is a statement or a requirement, determining whether the first semantic contains an emotion keyword; if the first semantic does not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression; if the first semantic contains an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence if the emotion keyword matches the first expression; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
In an example, generating first response information based on the first semantic and the first confidence may include: generating first response information comprising content directly associated with the first semantic when the first confidence is above a predetermined threshold; and generating first response information requesting the human user to confirm the first semantic when the first confidence is below the predetermined threshold.
In an example, when the first confidence is below the predetermined threshold, the generated first response information may further include content indirectly associated with the first semantic.
In an example, generating first response information based on the first semantic and the first confidence may include: when the first confidence is above a predetermined threshold, generating first response information comprising content directly associated with the first semantic; when the first confidence is below the predetermined threshold, comparing the first confidence with a second confidence, the second confidence being the confidence associated with the speech input of the human user that immediately precedes the first speech input; if the first confidence is above the second confidence, generating first response information requesting the human user to confirm the first semantic; and if the first confidence is below the second confidence, generating first response information that requests the human user to confirm the first semantic and includes content indirectly associated with the first semantic.
In an example, the method may further include synthesizing the first response information into a voice according to a mood corresponding to the first expression to be played to the human user.
Another exemplary embodiment of the present invention provides a voice interaction apparatus, which may include: a speech recognition module configured to recognize a first semantic of a first voice input from a human user; an image recognition module configured to recognize a first expression of a first expression image input from the human user associated with the first voice input; a confidence unit configured to determine a first confidence associated with the first semantic based on the first semantic and the first expression; and a response generation module configured to generate first response information based on the first semantic and the first confidence.
In an example, the confidence unit may be configured to determine a first confidence associated with the first semantic by performing the steps of: assigning a default confidence to the first semantic; and adjusting the default confidence based on the first expression.
In an example, the confidence unit may be further configured to determine a first confidence associated with the first semantic by performing the steps of: adjusting the default confidence level based on a context of a voice interaction.
In an example, the confidence unit may be configured to adjust the default confidence based on the first expression by performing the steps of: increasing the default confidence when the first expression is a positive expression; decreasing the default confidence when the first expression is a negative expression; and maintaining the default confidence unchanged when the first expression is a neutral expression that is neither positive nor negative.
In an example, the positive expressions may include happiness, surprise, eagerness, and seriousness, and the negative expressions may include anger, disgust, contempt, fear, sadness, hesitation, alarm, and suspicion.
In an example, the confidence unit may be further configured to determine a first confidence associated with the first semantic by performing the steps of: determining whether the first semantic contains an emotion keyword; if the first semantic does not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression; if the first semantic contains an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence if the emotion keyword matches the first expression; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
In an example, the confidence unit may be further configured to determine a first confidence associated with the first semantic by performing the steps of: determining the semantic type of the first semantic; increasing the default confidence if the semantic type of the first semantic is a question; and if the semantic type of the first semantic is a statement or a requirement, performing the step of adjusting the default confidence based on the first expression.
In an example, the response generation module may be configured to generate the first response information by performing the steps of: generating first response information comprising content directly associated with the first semantic when the first confidence is above a predetermined threshold; and generating first response information requesting the human user to confirm the first semantic when the first confidence is below the predetermined threshold.
In an example, when the first confidence is below the predetermined threshold, the first response information generated by the response generation module may further include content indirectly associated with the first semantic.
In an example, the response generation module may be configured to generate the first response information by performing the steps of: when the first confidence is above a predetermined threshold, generating first response information comprising content directly associated with the first semantic; when the first confidence is below the predetermined threshold, comparing the first confidence with a second confidence, the second confidence being the confidence associated with the speech input of the human user that immediately precedes the first speech input; if the first confidence is above the second confidence, generating first response information requesting the human user to confirm the first semantic; and if the first confidence is below the second confidence, generating first response information that requests the human user to confirm the first semantic and includes content indirectly associated with the first semantic.
In an example, the apparatus may further include a speech synthesis module configured to synthesize the first response information into speech, with a mood corresponding to the first expression, to be played to the human user.
Another exemplary embodiment of the present invention provides an electronic device, which may include: a voice receiving unit; an image receiving unit; a memory; and a processor connected to the voice receiving unit, the image receiving unit and the memory via a bus system, the processor being configured to execute instructions stored on the memory to perform any of the methods described above.
Another exemplary embodiment of the invention provides a computer program product, which may comprise computer program instructions, which, when executed by a processor, may cause the processor to perform any of the methods described above.
Another exemplary embodiment of the present invention provides a computer-readable storage medium, on which computer program instructions may be stored, which, when executed by a processor, may cause the processor to perform any of the methods described above.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a voice interaction method according to an exemplary embodiment of the present invention.
FIG. 2 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to an exemplary embodiment of the present invention.
Fig. 3 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to another exemplary embodiment of the present invention.
Fig. 4 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to another exemplary embodiment of the present invention.
Fig. 5 is a flowchart illustrating a process of generating response information based on semantics and confidence according to an exemplary embodiment of the present invention.
Fig. 6 is a block diagram illustrating a voice interaction apparatus according to an exemplary embodiment of the present invention.
Fig. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
FIG. 1 illustrates a general flowchart of a human-computer voice interaction method 100 according to an exemplary embodiment of the invention. Here, "human" may refer to a human user, and "machine" may refer to any type of electronic device having a human-computer interaction function, including, but not limited to, mobile electronic devices such as smart phones, tablets, notebook computers, robots, personal digital assistants, and in-vehicle electronic devices, and non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, intelligent home appliances, and intelligent customer service devices. All of these devices may utilize the voice interaction apparatus and methods described herein. Furthermore, it should be understood that the voice interaction apparatus and method described herein can also be applied to electronic devices with voice interaction functions developed in the future.
Referring to FIG. 1, the voice interaction method 100 may begin with steps S110 and S112. In step S110, an electronic device performing voice interaction may receive a first voice input from a human user, and in step S112, the electronic device may receive a first expression image input associated with the first voice input from the human user. For example, the electronic device may use a microphone or a microphone array to capture the speech uttered by the human user and, at the same time, use a camera to capture an expression image of the human user. In most cases, a human user is positioned directly in front of the electronic device when performing human-computer interaction, so the electronic device may by default treat the facial expression captured directly in front of it as the expression of the user performing the voice interaction. In other embodiments, the electronic device may also detect and track the human user who is engaged in the voice interaction. For example, the electronic device may detect the orientation of the speaking human user through a sound source localization technique using the microphone array, and then rotate the camera toward that orientation, thereby obtaining an expression image of the human user. Sound source localization is known to those skilled in the art, and its basic principle is not described in detail here. Technical solutions for detecting and tracking users using sound source localization are also described in the applicant's Chinese patent applications 201610341566.X and 201610596000.1, the disclosures of which are hereby incorporated by reference.
It will be appreciated that both the audio signal captured by the microphone or microphone array and the video or image signal captured by the camera may be pre-processed with time stamps. In this way, the electronic device can associate the voice input (audio signal) and the expression image input (video or image signal) based on time. For example, when the electronic device detects that there is a voice input, an expression image input captured at the same or approximately the same time as the voice input may be extracted.
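A minimal sketch of this timestamp-based association, assuming a simple TimedInput structure and a 0.5 s matching window; none of these names or values come from the patent itself:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TimedInput:
    timestamp: float  # seconds on a clock shared by microphone and camera
    payload: bytes    # raw audio segment or image frame

def find_associated_expression(voice: TimedInput,
                               expression_frames: List[TimedInput],
                               tolerance_s: float = 0.5) -> Optional[TimedInput]:
    """Pick the expression image whose timestamp is closest to the voice input,
    provided it falls within the tolerance window."""
    if not expression_frames:
        return None
    closest = min(expression_frames,
                  key=lambda frame: abs(frame.timestamp - voice.timestamp))
    if abs(closest.timestamp - voice.timestamp) <= tolerance_s:
        return closest
    return None
```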
Next, in step S114, speech recognition may be performed on the received first voice input to determine its first semantic. Here, the first semantic may be the literal semantic, i.e., a textual representation, of the first voice input, which can already be recognized with very high accuracy by various existing speech recognition techniques. For example, when a human user says "book a plane ticket to Shanghai for tomorrow", the text string "book a plane ticket to Shanghai for tomorrow" can be recognized as the first semantic by speech recognition.
Further, in step S116, image recognition may be performed on the received first expression image input to determine the first expression of the human user. For example, it may be recognized that the first expression of the user is happy, anxious, hesitant, etc., or that the first expression of the user is a neutral, expressionless face.
It should be understood that in steps S114 and S116, the present invention may use any existing voice recognition technology and image recognition technology. For example, available speech recognition techniques include methods based on vocal tract models and speech knowledge, pattern matching methods, which have been studied extensively, and artificial neural network methods; the pattern matching methods include, for example, dynamic time warping (DTW), hidden Markov models (HMM), and vector quantization (VQ). Artificial neural network methods, which have become popular in recent years, can generally be used in conjunction with existing pattern matching methods. The available image recognition techniques may be techniques dedicated to facial expression recognition, which can generally be classified into three categories: global recognition and local recognition; deformation extraction and motion extraction; geometric feature methods and individual feature methods. Taking the commonly used global and local recognition methods as examples, global recognition methods include, for example, eigenface-based principal component analysis (PCA), independent component analysis (ICA), Fisher's linear discriminants, local feature analysis (LFA), Fisher actions, hidden Markov models (HMM), and cluster analysis; local recognition methods include, for example, the facial action coding system, facial motion parameter methods, local principal component analysis (local PCA), Gabor wavelet methods, and neural network methods. It should also be understood that the present invention is not limited to the examples given here, and that other speech recognition techniques and facial expression recognition techniques, including those developed in the future, may also be used.
Next, at step S118, a first confidence associated with the first semantic may be determined based on the recognized first semantic and the recognized first expression. In the present invention, the confidence may be defined as a quantity indicating whether the first semantic is the real intention of the human user. For example, it may take values in a numerical range, where a larger value indicates greater certainty that the first semantic is the user's real intention, and a lower value indicates greater uncertainty that the first semantic is the real meaning the user wants to express; for example, the user himself may not be fully satisfied with how he phrased it, or may still be hesitating and undecided.
Conventional speech recognition aims only at accuracy, striving to recognize exactly the words spoken by a human user; the speech recognition process is therefore "mechanical", and the resulting human-computer interaction is also mechanical, quite different from human-to-human communication. When people communicate with each other, they not only recognize the surface meaning of the words but also read the other person's face: by observing the other person's expression, they judge his mood or attitude, and thus whether his words express his real meaning. The general principle of the present invention is that, during human-computer interaction, the expression of the human user is recognized in order to judge whether the speech recognition result reflects the real intention of the human user, thereby realizing a human-computer interaction process that is more like communication between people.
Specifically, in step S118, a default confidence may first be assigned to the first semantic. For example, the confidence may range from 1 to 10, where 10 represents the high-confidence end and 1 represents the low-confidence end, and the default confidence may be set in the middle of the range, e.g., 4-6. In an example, the default confidence may be set to, for example, 5.
The assigned default confidence may then be adjusted according to the recognized first expression. Expressions can be roughly classified into three categories: positive, negative, and neutral. A positive expression indicates that the confidence of the words spoken by the user is high, i.e., they are a true expression of his meaning. For example, the confidence may be considered high when the user shows a happy, pleased, or surprised expression. Likewise, when the user shows a focused, serious expression, the confidence of his words may also be considered high. Thus, when the recognized first expression is one of these expressions, the default confidence may be increased. On the other hand, when the user shows a negative expression such as anger, disgust, contempt, fear, sadness, hesitation, alarm, or suspicion, the confidence of his words may be considered low, and the assigned default confidence is therefore decreased. For example, when the user says "book a plane ticket to Shanghai for tomorrow" with a happy or serious expression, the user is probably very certain of this intention, so "book a plane ticket to Shanghai for tomorrow" is a true expression of the user's meaning; when the user says "book a plane ticket to Shanghai for tomorrow" with a hesitant, sad, depressed, or angry expression, it is likely that the user has not yet decided whether to fly to Shanghai, or is not satisfied with the schedule of flying to Shanghai, so "book a plane ticket to Shanghai for tomorrow" may not be a true expression of the user's intention, and the assigned default confidence value should be decreased. When the user's expression is neutral, e.g., shows no special expression, the assigned default confidence value may be maintained.
It should be understood that the principles of the present invention are not limited to the specific examples of expressions given here; more expressions may be used, and even different classification rules may be used, i.e., different rules for deciding whether a particular expression counts as positive, negative, or neutral.
In some embodiments, each positive and negative expression may be further divided into different degrees or levels. For happiness or pleasure, for example, a smile may represent a lower degree, a grin a medium degree, and open-mouthed laughter a higher degree. The adjustment to the default confidence value may then differ according to the degree or level of the expression. For example, a lower-degree positive expression may raise the confidence value by 1, a medium-degree positive expression by 2, and a higher-degree positive expression by 3. It is of course understood that negative expressions may likewise be divided into different degrees or levels.
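The expression-based adjustment described above might look roughly like the following sketch. The 1-10 range, the default of 5, and the +1/+2/+3 steps for increasingly strong positive expressions follow the text; the symmetric negative steps, the expression labels, and the clamping are illustrative assumptions:

```python
POSITIVE = {"happy", "surprised", "eager", "serious"}
NEGATIVE = {"angry", "disgusted", "contemptuous", "afraid",
            "sad", "hesitant", "alarmed", "suspicious"}

def adjust_confidence(expression: str, level: int = 1, default: float = 5.0,
                      lo: float = 1.0, hi: float = 10.0) -> float:
    """Raise the default confidence for positive expressions, lower it for negative
    ones, and keep it unchanged for neutral expressions. `level` (1-3) encodes the
    intensity of the expression."""
    if expression in POSITIVE:
        value = default + level
    elif expression in NEGATIVE:
        value = default - level
    else:
        value = default  # neutral expression: keep the default
    return max(lo, min(hi, value))  # clamp to the 1-10 confidence range
```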
In some embodiments, the assigned default confidence may also be adjusted based on the context of the voice interaction. For example, if earlier interaction content indicates that tomorrow's weather in Shanghai is a rainstorm, the confidence of the user's utterance "book a plane ticket to Shanghai for tomorrow" is lower; on the other hand, if an earlier interaction or the user's calendar indicates that the user has a meeting scheduled in Shanghai tomorrow, the confidence of that utterance is higher. Thus, the assigned default confidence value may be adjusted according to the context, thereby enabling a more intelligent confidence determination process.
Next, in step S120, first response information may be generated based on the first semantic and the determined first confidence, and the response information may be generated according to different criteria depending on the first confidence. When the determined first confidence is high, e.g., above a predetermined threshold, the response information may be generated according to a first criterion, e.g., first response information including content directly associated with the first semantic may be generated, as in conventional voice interaction. Here, "directly associated" content is content that directly serves the intention expressed by the first semantic; for the first semantic "book a plane ticket to Shanghai for tomorrow", for example, it may be information about tomorrow's flights to Shanghai. When the first confidence is low, e.g., below the predetermined threshold, the response information may be generated according to a second criterion, e.g., first response information requesting the human user to confirm the first semantic may be generated, and the first response information may further include content indirectly associated with the first semantic for the user to consider and choose from. In some embodiments, when the first confidence is below the predetermined threshold, the first confidence may further be compared with a second confidence associated with the speech input immediately preceding the first voice input, and the content of the first response information may be determined according to the comparison result, as described further below with reference to FIG. 5.
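A minimal sketch of the threshold rule of step S120, with placeholder helper functions standing in for whatever knowledge-base lookups the device actually performs; the threshold value of 7.0 is an assumption:

```python
def direct_answer(semantic: str) -> str:
    # Placeholder for a knowledge-base lookup returning content
    # directly associated with the recognized semantic.
    return f"Here is the information for: {semantic}"

def confirmation_prompt(semantic: str) -> str:
    return f"Just to confirm, did you mean: {semantic}?"

def generate_response(first_semantic: str, confidence: float,
                      threshold: float = 7.0) -> str:
    """Apply the first criterion above the threshold, the second criterion below it."""
    if confidence >= threshold:
        return direct_answer(first_semantic)
    return confirmation_prompt(first_semantic)
```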
Then, in step S122, the generated first response information may be synthesized into speech through a speech synthesis (TTS) technique to be played to a human user through a speaker and/or a display, thereby completing one round of the voice interaction process. Also, the present invention may be used with any existing or future developed speech synthesis techniques, which are not described in detail herein.
In some embodiments, the first response information may be synthesized into speech with a mood corresponding to the first expression. For example, when the first expression of the user is a happy, pleased, or excited expression, step S122 may synthesize the speech using a cheerful mood; when the user appears sad, depressed, or scared, step S122 may synthesize the speech using a comforting mood; when the user appears angry, disgusted, or contemptuous, step S122 may synthesize the speech using a cautious, conciliatory tone. In this way, the voice response played to the user is more easily accepted, the user's mood is improved, and the user's interaction experience is enhanced. Of course, the correspondence between the mood of the synthesized speech and the expression is not limited to the examples given here and may be defined differently depending on the application scenario.
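As a rough illustration, the expression-to-mood mapping could be a simple lookup table such as the one below; the mood labels, the fallback to a neutral mood, and the stub synthesize call are assumptions rather than the patent's own definitions:

```python
EXPRESSION_TO_MOOD = {
    "happy": "cheerful",
    "excited": "cheerful",
    "sad": "comforting",
    "depressed": "comforting",
    "afraid": "comforting",
    "angry": "conciliatory",
    "disgusted": "conciliatory",
}

def synthesize(text: str, mood: str) -> bytes:
    # Stand-in for a real TTS engine; here the text is just tagged with the mood.
    return f"[{mood}] {text}".encode("utf-8")

def synthesize_with_mood(text: str, expression: str) -> bytes:
    mood = EXPRESSION_TO_MOOD.get(expression, "neutral")
    return synthesize(text, mood=mood)
```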
In conventional emotional speech synthesis, the machine generally has to analyze the semantics of the text to determine the emotion or mood with which the speech should be synthesized. In the present invention, the recognized first expression can be used directly, and the corresponding tone or emotion adopted for synthesizing the speech, so the step of analyzing the text to determine the tone can be omitted; the procedure is simpler, and the tone of the synthesized speech more accurately matches the user's current mood or emotion, making the human-computer interaction process richer in human feeling and avoiding a cold, mechanical impression.
Some exemplary embodiments of the present invention have been described above with reference to FIG. 1; they are applicable to many common voice communication scenarios. However, interpersonal voice communication is complex, and a variety of special situations may be encountered. Some human-computer voice interaction methods capable of handling such special scenarios are described below with reference to the accompanying drawings.
FIG. 2 shows a flowchart of a process 200 of determining the first confidence based on the first semantic and the first expression according to another exemplary embodiment of the invention. In step S118 described above with reference to FIG. 1, the first confidence is determined by adjusting the assigned default confidence based on the first expression: when the first expression is a positive expression, the default confidence is increased; when the first expression is a negative expression, the default confidence is decreased; and when the first expression is a neutral expression, the default confidence is maintained. However, given the complexity of voice communication, this manner of adjustment may be disadvantageous in some situations. For example, when a human user says something sad with a very sad expression, or something frightening with a very frightened expression, it can generally be concluded that the confidence of his words is high, and the confidence should not be lowered. Therefore, in the embodiment shown in FIG. 2, first, in step S210, the first semantic is searched to determine whether it contains an emotion keyword. An emotion keyword is a word that can be associated with a specific expression or emotion, such as "disaster" or "accident", which are associated with sadness and fear, or "travel" and "shopping", which are associated with joy. If no emotion keyword is retrieved in step S210, the previously described step of adjusting the assigned default confidence based on the first expression is performed in step S212. If an emotion keyword is retrieved in step S210, it is determined in step S214 whether the retrieved emotion keyword matches the first expression. In some embodiments, a plurality of emotion keywords may be retrieved in step S210, and each of them may be compared with the first expression in step S214; if any one emotion keyword matches the first expression, the result is judged to be a match, and only when none of the emotion keywords matches the first expression is the result judged to be a mismatch.
If the determination result in step S214 is a mismatch, the previously described step of adjusting the assigned default confidence based on the first expression may be performed in step S216. If the determination in step S214 is a match, indicating that the expression of the human user matches the content of his speech, the confidence of the first semantic may be considered very high; the assigned default confidence may then be directly increased in step S218, and the increased confidence may be output as the first confidence associated with the first semantic for the subsequent operations described in step S120.
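A sketch of the flow of FIG. 2, reusing the adjust_confidence sketch given earlier; the keyword table, the matching rule, and the size of the increase are illustrative assumptions:

```python
EMOTION_KEYWORDS = {
    "disaster": {"sad", "afraid"},
    "accident": {"sad", "afraid"},
    "travel": {"happy"},
    "shopping": {"happy"},
}

def confidence_with_keywords(semantic_text: str, expression: str,
                             default: float = 5.0) -> float:
    """Steps S210/S214: raise the confidence directly when an emotion keyword in the
    recognized text matches the recognized expression; otherwise fall back to the
    expression-based adjustment (adjust_confidence from the earlier sketch)."""
    keywords = [w for w in EMOTION_KEYWORDS if w in semantic_text.lower()]
    if keywords and any(expression in EMOTION_KEYWORDS[w] for w in keywords):
        return min(10.0, default + 1.0)  # match found: increase the default confidence
    # No emotion keyword, or no keyword matches the expression (steps S212/S216).
    return adjust_confidence(expression, default=default)
```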
The above describes the case of judging whether the first semantic matches the first expression based on the content of the first semantic. In other cases, the type of the first semantic may also be considered in voice interaction. FIG. 3 shows a flowchart of a process 300 for determining the first confidence based on the first semantic and the first expression according to another embodiment of the invention. As shown in FIG. 3, in step S310, the semantic type of the first semantic may first be determined. Linguistically, semantic types are generally divided into three categories: statements, questions, and requirements, i.e., declarative sentences, interrogative sentences, and imperative sentences, and different semantic types generally correspond to different degrees of confidence. For example, when a user asks a question, it generally indicates that he wants to know an answer, so the confidence is generally high; when the user utters a declarative or imperative sentence, it is generally difficult to judge the confidence based on the semantic type alone.
Therefore, if the semantic type of the first semantic is determined in step S310 to be a question, the assigned default confidence may be directly increased in step S312, and the increased confidence may be output as the first confidence associated with the first semantic for the subsequent operations described in step S120. On the other hand, if the semantic type of the first semantic is determined in step S310 to be a statement or a requirement, or any other semantic type than a question, the aforementioned step of adjusting the assigned default confidence based on the first expression may be performed in step S314.
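A sketch of the flow of FIG. 3, again reusing the earlier adjust_confidence sketch; the naive punctuation and keyword test standing in for a real sentence-type classifier is purely an assumption, since the patent does not specify how the semantic type is determined:

```python
def classify_semantic_type(semantic_text: str) -> str:
    """Very rough stand-in for a sentence-type classifier (assumption)."""
    text = semantic_text.strip().lower()
    if text.endswith("?") or text.startswith(("what", "when", "where", "who", "how", "why")):
        return "question"
    if text.startswith(("please", "book", "order", "play", "open")):
        return "requirement"
    return "statement"

def confidence_with_semantic_type(semantic_text: str, expression: str,
                                  default: float = 5.0) -> float:
    """Step S310: questions are treated as high confidence; statements and
    requirements fall back to the expression-based adjustment (step S314)."""
    if classify_semantic_type(semantic_text) == "question":
        return min(10.0, default + 1.0)
    return adjust_confidence(expression, default=default)
```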
FIG. 4 shows a process 400 in which both of the above factors, emotion keywords and semantic type, are considered. Referring to FIG. 4, the semantic type of the first semantic may first be determined in step S410. If the semantic type of the first semantic is a question, the assigned default confidence is increased in step S412, and the increased confidence may be output as the first confidence associated with the first semantic for the subsequent operations described in step S120. If the semantic type of the first semantic is a statement or a requirement, or any other semantic type than a question, the process may proceed to step S414.
In step S414, it may then be determined whether the first semantic contains an emotion keyword. If the first semantic does not contain an emotion keyword, the step of adjusting the default confidence based on the first expression described above is performed in step S416; if the first semantic contains an emotion keyword, it is then determined in step S418 whether the emotion keyword matches the first expression. If so, the assigned default confidence is directly increased in step S420, and the increased confidence is output as the first confidence associated with the first semantic for the subsequent operations described in step S120; if not, the step of adjusting the default confidence based on the first expression described above is performed in step S422.
FIG. 5 illustrates a flowchart of another embodiment 500 of generating the first response information based on the recognized first semantic and the determined first confidence. First, in step S510, it may be determined whether the first confidence value is above a predetermined threshold. As mentioned above, the predetermined threshold may be a predetermined confidence criterion: when the first confidence value is above the predetermined threshold, the confidence may be considered high; when the first confidence is below the predetermined threshold, the confidence may be considered low.
When the first confidence is above the predetermined threshold, first response information including content directly associated with the first semantic may be generated in step S512. When the first confidence is below the predetermined threshold, the first confidence may then be compared in step S514 with the confidence value of the immediately preceding voice input (referred to here as the second confidence for convenience of description). The comparison between the first confidence and the preceding second confidence may reflect an emotional change of the human user during the voice interaction. For example, if the first confidence is above the second confidence, it indicates that, although the absolute confidence is still low (the first confidence is below the threshold), the relative confidence has increased (the first confidence is above the second confidence), so the interaction may be progressing in a better direction. In this case, first response information requesting the human user to confirm the first semantic may be generated in step S516. On the other hand, if it is determined in step S514 that the first confidence is lower than the preceding second confidence, this indicates that not only is the absolute confidence low, but the relative confidence is also decreasing, and the interaction may be progressing in a bad direction. In this case, the first response information generated in step S518 may include not only content requesting the human user to confirm the first semantic, but also content indirectly associated with the first semantic for the user to consider and select.
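A sketch of the flow of FIG. 5, reusing the placeholder helpers from the earlier response sketch; related_suggestions stands in for indirectly associated content retrieved from a knowledge base, and the threshold is again an assumption:

```python
def related_suggestions(semantic: str) -> str:
    # Placeholder for indirectly associated content from a knowledge base.
    return f"You might also consider alternatives related to: {semantic}"

def generate_response_with_trend(first_semantic: str, first_conf: float,
                                 second_conf: float, threshold: float = 7.0) -> str:
    if first_conf >= threshold:
        return direct_answer(first_semantic)        # step S512
    if first_conf > second_conf:
        # Confidence is low in absolute terms but improving: just ask for confirmation.
        return confirmation_prompt(first_semantic)  # step S516
    # Confidence is low and falling: confirm and also offer indirect alternatives.
    return confirmation_prompt(first_semantic) + " " + related_suggestions(first_semantic)  # step S518
```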
Hereinafter, a voice interaction apparatus according to an exemplary embodiment of the present invention will be described with reference to FIG. 6. As described above, the voice interaction apparatus of the present invention can be applied to any type of electronic device having a human-computer interaction function, including but not limited to mobile electronic devices such as smart phones, tablets, notebook computers, robots, personal digital assistants, and in-vehicle electronic devices, and non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, intelligent home appliances, and intelligent customer service devices. All of these devices may utilize the voice interaction apparatus and methods described herein. Furthermore, it should be understood that the voice interaction apparatus described herein may also be applied to electronic devices with voice interaction functions developed in the future.
As shown in FIG. 6, the voice interaction device 600 may include a speech recognition module 610, an image recognition module 620, a confidence module 630, a response generation module 640, and a speech synthesis module 650. The speech recognition module 610 may be configured to recognize a first semantic of a first voice input from a human user. It is to be appreciated that the speech recognition module 610 may utilize any existing, e.g., commercially available, speech recognition engine, or a speech recognition engine developed in the future. The image recognition module 620 may be configured to recognize a first expression of a first expression image input from the human user associated with the first voice input. It will also be appreciated that the image recognition module 620 may utilize any existing, e.g., commercially available, expression image recognition engine, or an expression image recognition engine developed in the future. The confidence module 630 may determine a first confidence associated with the first semantic based on the first semantic recognized by the speech recognition module 610 and the first expression recognized by the image recognition module 620. For example, the confidence module 630 may first assign a default confidence to the first semantic and then adjust the assigned default confidence based on the first expression to obtain the final first confidence. Specifically, when the first expression is a positive expression, the default confidence is increased; when the first expression is a negative expression, the default confidence is decreased; and when the first expression is neither a positive expression nor a negative expression, e.g., a neutral expression, the assigned default confidence is maintained.
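A minimal structural sketch of how the modules of FIG. 6 might be wired together, assuming each module exposes a single method; the method names and signatures are illustrative, not taken from the patent:

```python
class VoiceInteractionDevice:
    """Wiring of the five modules of FIG. 6 (method names are assumptions)."""

    def __init__(self, speech_recognizer, image_recognizer,
                 confidence_module, response_generator, speech_synthesizer):
        self.speech_recognizer = speech_recognizer
        self.image_recognizer = image_recognizer
        self.confidence_module = confidence_module
        self.response_generator = response_generator
        self.speech_synthesizer = speech_synthesizer

    def interact(self, voice_input: bytes, expression_image: bytes) -> bytes:
        semantic = self.speech_recognizer.recognize(voice_input)         # first semantic
        expression = self.image_recognizer.recognize(expression_image)   # first expression
        confidence = self.confidence_module.determine(semantic, expression)
        response = self.response_generator.generate(semantic, confidence)
        return self.speech_synthesizer.synthesize(response, expression)  # speech to play back
```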
In some embodiments, the confidence module 630 may also determine whether the first semantic contains an emotion keyword and compare the contained emotion keyword to the first expression. If the emotion keyword contained in the first semantic matches the first expression, it indicates that the confidence of the user speaking is high, and therefore the assigned default confidence is directly increased. If the first semantics do not include an emotion keyword, or the included emotion keyword does not match the first expression, the previously described operation of adjusting the assigned default confidence level based on the first expression may be performed.
In some embodiments, the confidence module 630 may also determine the semantic type of the first semantic. If the semantic type of the first semantic is a question, the confidence of the user's words is considered high, so the assigned default confidence value is directly increased; if it is another semantic type than a question, such as a statement or a requirement, the previously described operation of adjusting the assigned default confidence based on the first expression may be performed.
In some embodiments, the confidence module 630 may also adjust the assigned default confidence based on the context. For example, if the first semantic is consistent with the context of the voice interaction, its confidence is high, thus increasing the assigned default confidence; conversely, if not, the assigned default confidence level is decreased.
With continued reference to fig. 6, the response generation module 640 of the voice interaction device 600 may generate the first response information using the first semantic from the speech recognition module 610 and the first confidence from the confidence module 630. The response generation module 640 may generate the first response information with different criteria according to the first confidence level. In some embodiments, when the first confidence is above a predetermined threshold, then generating first response information based on a first criterion, e.g., generating first response information comprising content directly associated with the first semantic; when the first confidence is below a predetermined threshold, then first response information is generated based on a second criterion, such as generating first response information requesting that the human user confirm the first semantics, or such as generating first response information further comprising content indirectly associated with the first semantics.
The process of generating response information may involve the use of knowledge base 660. The knowledge base 660 may be a local knowledge base that may be included as part of the speech recognition device 600, or, as shown in fig. 6, may be a cloud knowledge base 660, and the speech recognition device 600 is connected to the cloud knowledge base 660 through a network, such as a wide area network or a local area network. The knowledge base 660 may include a variety of knowledge data, such as weather data, flight data, hotel data, movie data, music data, dining data, stock data, travel data, map data, government data, industry knowledge, historical knowledge, natural science knowledge, social science knowledge, and so forth. The response generation module 640 may obtain knowledge directly or indirectly related to the first semantics from the knowledge base 660 for generating the first response information.
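As a rough illustration of this retrieval step, a knowledge base query might distinguish directly and indirectly related content as in the sketch below; the dictionary layout, the topics, and the keyword matching are all assumptions:

```python
KNOWLEDGE_BASE = {
    "flight": ["Flight options for the requested route"],
    "weather": ["Weather forecast for the destination"],
    "hotel": ["Hotel options near the destination"],
}

def retrieve(semantic_text: str, related_topics=("weather", "hotel")):
    """Return (direct, indirect) content for a recognized semantic."""
    text = semantic_text.lower()
    direct = [item for topic, items in KNOWLEDGE_BASE.items()
              if topic in text for item in items]
    indirect = [item for topic in related_topics
                for item in KNOWLEDGE_BASE.get(topic, [])]
    return direct, indirect
```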
In some embodiments, when the first confidence is above a predetermined threshold, the response generation module 640 generates first response information comprising content directly associated with the first semantic; when the first confidence level is below a predetermined threshold, then the response generation module 640 also compares the first confidence level to a second confidence level, the second confidence level being the confidence level associated with a speech input of the human user that immediately precedes the first speech input. If the first confidence level is above the second confidence level, the response generation module 640 may generate first response information requesting the human user to confirm the first semantics; if the first confidence level is lower than the second confidence level, the response generation module 640 can generate first response information requesting the human user to confirm the first semantic and including content indirectly associated with the first semantic.
Then, the speech synthesis module 650 may synthesize the first response information generated by the response generation module 640 into speech to be played to the human user through a speaker (not shown), thereby completing one round of the voice interaction process. In some embodiments, the speech synthesis module 650 may also utilize the first expression from the image recognition module 620 for speech synthesis. Specifically, the speech synthesis module 650 may synthesize the first response information into speech with a mood corresponding to the first expression. For example, when the first expression of the user is a happy, pleased, or excited expression, the speech synthesis module 650 may synthesize the speech using a cheerful mood; when the user appears sad, depressed, or scared, the speech synthesis module 650 may synthesize the speech using a comforting mood; when the user appears angry, disgusted, or contemptuous, the speech synthesis module 650 may synthesize the speech using a cautious, conciliatory tone. In this way, the voice response played to the user is more easily accepted, the user's mood is improved, and the user's interaction experience is enhanced. Of course, the speech synthesis module 650 may perform speech synthesis according to other correspondences between expressions and moods, and is not limited to the examples given here.
Fig. 7 illustrates a block diagram of an electronic device that may utilize the voice interaction apparatus and method described above according to an exemplary embodiment of the present invention. As shown in fig. 7, the electronic device 700 may include a voice receiving unit 710 and an image receiving unit 720. The voice receiving unit 710 may be, for example, a microphone or a microphone array, which may capture the voice of the user. The image receiving unit 720 may be, for example, a monocular camera, a binocular camera, or a multi-lens camera, which may capture images of the user, in particular face images; the image receiving unit 720 may thus have a face recognition function so as to clearly and accurately capture an expression image of the user.
As shown in fig. 7, the electronic device 700 may further include one or more processors 730 and a memory 740, which are connected to each other and to the voice receiving unit 710 and the image receiving unit 720 through a bus system 750. The processor 730 may be a Central Processing Unit (CPU) or another form of processing unit, processing core, or controller having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 700 to perform desired functions. The memory 740 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 730 to implement the voice interaction methods of the embodiments of the application described above and/or other desired functions. Various applications and various data, such as user data and knowledge databases, may also be stored in the computer-readable storage medium.
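For illustration only, the composition of such an electronic device could be mirrored in software roughly as follows; the class, the field names, and the trivial processing step are all invented for this sketch and do not correspond to a specific product.

```python
from dataclasses import dataclass, field


@dataclass
class ElectronicDevice:
    """Toy counterpart of the block diagram: receiving units, memory, output."""
    microphone: str = "microphone array"          # voice receiving unit 710
    camera: str = "binocular camera"              # image receiving unit 720
    memory: dict = field(default_factory=dict)    # stands in for RAM/ROM/flash
    speaker: str = "speaker"                      # output unit 760

    def run_interaction(self, audio: bytes, image: bytes) -> str:
        # A real processor would execute the stored program instructions that
        # implement the voice interaction method; here we only record the inputs.
        self.memory["last_audio"] = audio
        self.memory["last_image"] = image
        return f"response played via {self.speaker}"


device = ElectronicDevice()
print(device.run_interaction(b"\x00\x01", b"\xff\xfe"))
```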
Furthermore, the electronic device 700 may further include an output unit 760. The output unit 760 may be, for example, a speaker to perform voice interaction with a user. In other embodiments, output unit 760 may also be an output device such as a display, printer, or the like.
In addition to the above-described methods, apparatuses and devices, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to embodiments of the present application described in the present specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the electronic device, partly on the electronic device, as a stand-alone software package, partly on the user electronic device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present application described herein.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting; they should not be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are provided only for the purpose of illustration and ease of understanding; the disclosure is not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, "such as but not limited to."
It should also be noted that, in the apparatus and methods of the present application, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations are to be regarded as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (15)
1. A voice interaction method, comprising:
receiving a first voice input from a human user and a first expression image input associated with the first voice input;
identifying a first semantic of the first voice input;
recognizing a first expression from the first expression image input;
determining a first confidence level associated with the first semantic based on the first semantic and the first expression; and
generating first response information based on the first semantics and the first confidence level,
wherein determining a first confidence level associated with the first semantic comprises:
assigning a default confidence to the first semantic; and
adjusting the default confidence based on the first expression.
2. The method of claim 1, wherein adjusting the default confidence based on the first expression comprises:
increasing the default confidence level when the first expression is a positive expression;
decreasing the default confidence level when the first expression is a negative expression; and
maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive expression and the negative expression.
3. The method of claim 1, wherein determining a first confidence level associated with the first semantic further comprises:
determining whether the first semantic contains an emotion keyword;
if the first semantic does not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression;
if the first semantic contains an emotion keyword, determining whether the emotion keyword matches the first expression;
increasing the default confidence level if the emotion keyword matches the first expression; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
4. The method of claim 1, wherein determining a first confidence level associated with the first semantic further comprises:
determining the semantic type of the first semantic;
increasing the default confidence level if the semantic type of the first semantic is a question; and
if the semantic type of the first semantic is a statement or a requirement, performing the step of adjusting the default confidence based on the first expression.
5. The method of claim 1, wherein generating first response information based on the first semantics and the first confidence level comprises:
when the first confidence is above a predetermined threshold, then generating first response information comprising content directly associated with the first semantic;
when the first confidence is below the predetermined threshold, then generating first response information requesting the human user to confirm the first semantic.
6. The method of claim 5, wherein the generated first response information further comprises content indirectly associated with the first semantic when the first confidence level is below the predetermined threshold.
7. The method of claim 1, wherein generating first response information based on the first semantics and the first confidence level comprises:
when the first confidence is above a predetermined threshold, then generating first response information comprising content directly associated with the first semantic;
when the first confidence level is below the predetermined threshold, then comparing the first confidence level to a second confidence level, the second confidence level being the confidence level associated with a speech input of the human user that immediately precedes the first speech input;
generating first response information requesting the human user to confirm the first semantics if the first confidence is above the second confidence; and
generating first response information requesting the human user to confirm the first semantic and including content indirectly associated with the first semantic, if the first confidence level is lower than the second confidence level.
8. The method of claim 1, further comprising synthesizing the first response information into speech according to a mood corresponding to the first expression for playing to the human user.
9. A voice interaction device, comprising:
a speech recognition module configured to recognize a first semantic of a first speech input from a human user;
an image recognition module configured to recognize a first expression from a first expression image input from the human user, the first expression image input being associated with the first speech input;
a confidence unit configured to determine a first confidence associated with the first semantic based on the first semantic and the first expression; and
a response generation unit configured to generate first response information based on the first semantics and the first confidence degree,
wherein the confidence unit is configured to determine a first confidence associated with the first semantic by performing the steps of:
assigning a default confidence to the first semantic; and
adjusting the default confidence based on the first expression.
10. The apparatus of claim 9, wherein adjusting the default confidence based on the first expression comprises:
increasing the default confidence level when the first expression is a positive expression;
decreasing the default confidence level when the first expression is a negative expression; and
maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive expression and the negative expression.
11. The apparatus of claim 10, wherein the confidence unit is further configured to determine a first confidence associated with the first semantic by performing the steps of:
determining the semantic type of the first semantic;
increasing the default confidence level if the semantic type of the first semantic is a question;
if the semantic type of the first semantic is a statement or a requirement, determining whether the first semantic contains an emotion keyword;
if the first semantic does not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression;
if the first semantic contains an emotion keyword, determining whether the emotion keyword matches the first expression;
increasing the default confidence level if the emotion keyword matches the first expression; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
12. The apparatus of claim 9, wherein the response generation unit is configured to generate the first response information by performing the steps of:
when the first confidence is above a predetermined threshold, then generating first response information comprising content directly associated with the first semantic;
when the first confidence is below the predetermined threshold, then generating first response information requesting the human user to confirm the first semantic.
13. The apparatus of claim 12, wherein the first response information generated by the response generation unit further includes content indirectly associated with the first semantic when the first confidence level is below the predetermined threshold.
14. An electronic device, comprising:
a voice receiving unit;
an image receiving unit;
a memory; and
a processor connected to the voice receiving unit, the image receiving unit, and the memory through a bus system, the processor being configured to execute instructions stored on the memory to perform the method of any one of claims 1-8.
15. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610806384.5A CN106373569B (en) | 2016-09-06 | 2016-09-06 | Voice interaction device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106373569A CN106373569A (en) | 2017-02-01 |
CN106373569B true CN106373569B (en) | 2019-12-20 |
Family
ID=57900064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610806384.5A Active CN106373569B (en) | 2016-09-06 | 2016-09-06 | Voice interaction device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106373569B (en) |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102363794B1 (en) * | 2017-03-31 | 2022-02-16 | 삼성전자주식회사 | Information providing method and electronic device supporting the same |
CN106910514A (en) * | 2017-04-30 | 2017-06-30 | 上海爱优威软件开发有限公司 | Method of speech processing and system |
CN109005304A (en) * | 2017-06-07 | 2018-12-14 | 中兴通讯股份有限公司 | A kind of queuing strategy and device, computer readable storage medium |
CN107199572B (en) * | 2017-06-16 | 2020-02-14 | 山东大学 | Robot system and method based on intelligent sound source positioning and voice control |
CN107240398B (en) * | 2017-07-04 | 2020-11-17 | 科大讯飞股份有限公司 | Intelligent voice interaction method and device |
CN108320738B (en) * | 2017-12-18 | 2021-03-02 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium and electronic equipment |
US11922934B2 (en) | 2018-04-19 | 2024-03-05 | Microsoft Technology Licensing, Llc | Generating response in conversation |
CN108564943B (en) * | 2018-04-27 | 2021-02-12 | 京东方科技集团股份有限公司 | Voice interaction method and system |
CN108833721B (en) * | 2018-05-08 | 2021-03-12 | 广东小天才科技有限公司 | Emotion analysis method based on call, user terminal and system |
US10872604B2 (en) * | 2018-05-17 | 2020-12-22 | Qualcomm Incorporated | User experience evaluation |
CN108833941A (en) * | 2018-06-29 | 2018-11-16 | 北京百度网讯科技有限公司 | Man-machine dialogue system method, apparatus, user terminal, processing server and system |
CN109240488A (en) * | 2018-07-27 | 2019-01-18 | 重庆柚瓣家科技有限公司 | A kind of implementation method of AI scene engine of positioning |
CN109741738A (en) * | 2018-12-10 | 2019-05-10 | 平安科技(深圳)有限公司 | Sound control method, device, computer equipment and storage medium |
CN111383631B (en) * | 2018-12-11 | 2024-01-23 | 阿里巴巴集团控股有限公司 | Voice interaction method, device and system |
CN109783669A (en) * | 2019-01-21 | 2019-05-21 | 美的集团武汉制冷设备有限公司 | Screen methods of exhibiting, robot and computer readable storage medium |
CN109979462A (en) * | 2019-03-21 | 2019-07-05 | 广东小天才科技有限公司 | Method and system for obtaining intention by combining context |
CN113823282B (en) * | 2019-06-26 | 2024-08-30 | 百度在线网络技术(北京)有限公司 | Voice processing method, system and device |
CN112307816B (en) * | 2019-07-29 | 2024-08-20 | 北京地平线机器人技术研发有限公司 | In-vehicle image acquisition method and device, electronic equipment and storage medium |
CN110491383B (en) * | 2019-09-25 | 2022-02-18 | 北京声智科技有限公司 | Voice interaction method, device and system, storage medium and processor |
CN112804440B (en) * | 2019-11-13 | 2022-06-24 | 北京小米移动软件有限公司 | Method, device and medium for processing image |
CN110931006A (en) * | 2019-11-26 | 2020-03-27 | 深圳壹账通智能科技有限公司 | Intelligent question-answering method based on emotion analysis and related equipment |
CN111210818B (en) * | 2019-12-31 | 2021-10-01 | 北京三快在线科技有限公司 | Word acquisition method and device matched with emotion polarity and electronic equipment |
CN111428017B (en) * | 2020-03-24 | 2022-12-02 | 科大讯飞股份有限公司 | Human-computer interaction optimization method and related device |
CN111710326B (en) * | 2020-06-12 | 2024-01-23 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111883127A (en) * | 2020-07-29 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech |
CN112235180A (en) * | 2020-08-29 | 2021-01-15 | 上海量明科技发展有限公司 | Voice message processing method and device and instant messaging client |
CN112687260A (en) * | 2020-11-17 | 2021-04-20 | 珠海格力电器股份有限公司 | Facial-recognition-based expression judgment voice recognition method, server and air conditioner |
CN113435338B (en) * | 2021-06-28 | 2024-07-19 | 平安科技(深圳)有限公司 | Voting classification method, voting classification device, electronic equipment and readable storage medium |
CN114842842A (en) * | 2022-03-25 | 2022-08-02 | 青岛海尔科技有限公司 | Voice interaction method and device of intelligent equipment and storage medium |
CN115497474A (en) * | 2022-09-13 | 2022-12-20 | 广东浩博特科技股份有限公司 | Control method based on voice recognition |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101423258B1 (en) * | 2012-11-27 | 2014-07-24 | 포항공과대학교 산학협력단 | Method for supplying consulting communication and apparatus using the method |
CN104038836A (en) * | 2014-06-03 | 2014-09-10 | 四川长虹电器股份有限公司 | Television program intelligent pushing method |
CN105244023A (en) * | 2015-11-09 | 2016-01-13 | 上海语知义信息技术有限公司 | System and method for reminding teacher emotion in classroom teaching |
CN105334743A (en) * | 2015-11-18 | 2016-02-17 | 深圳创维-Rgb电子有限公司 | Intelligent home control method and system based on emotion recognition |
CN105389309A (en) * | 2014-09-03 | 2016-03-09 | 曲阜师范大学 | Music regulation system driven by emotional semantic recognition based on cloud fusion |
CN105895101A (en) * | 2016-06-08 | 2016-08-24 | 国网上海市电力公司 | Speech processing equipment and processing method for power intelligent auxiliary service system |
2016-09-06 CN CN201610806384.5A patent/CN106373569B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106373569A (en) | 2017-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106373569B (en) | Voice interaction device and method | |
US11908468B2 (en) | Dialog management for multiple users | |
US20200279553A1 (en) | Linguistic style matching agent | |
US11514886B2 (en) | Emotion classification information-based text-to-speech (TTS) method and apparatus | |
Triantafyllopoulos et al. | An overview of affective speech synthesis and conversion in the deep learning era | |
US20200349943A1 (en) | Contact resolution for communications systems | |
CN108701453B (en) | Modular deep learning model | |
Metallinou et al. | Context-sensitive learning for enhanced audiovisual emotion classification | |
EP3553773A1 (en) | Training and testing utterance-based frameworks | |
KR100586767B1 (en) | System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input | |
US20240153489A1 (en) | Data driven dialog management | |
US11574637B1 (en) | Spoken language understanding models | |
KR20200113105A (en) | Electronic device providing a response and method of operating the same | |
KR20210070213A (en) | Voice user interface | |
KR20210155401A (en) | Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof | |
US20210158812A1 (en) | Automatic turn delineation in multi-turn dialogue | |
US20220375469A1 (en) | Intelligent voice recognition method and apparatus | |
CN115088033A (en) | Synthetic speech audio data generated on behalf of human participants in a conversation | |
CN117882131A (en) | Multiple wake word detection | |
Guha et al. | Desco: Detecting emotions from smart commands | |
CN110232911B (en) | Singing following recognition method and device, storage medium and electronic equipment | |
Li et al. | A multi-feature multi-classifier system for speech emotion recognition | |
Schuller et al. | Speech communication and multimodal interfaces | |
US11792365B1 (en) | Message data analysis for response recommendations | |
CN117453932B (en) | Virtual person driving parameter generation method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||