
CN106373569B - Voice interaction device and method - Google Patents

Voice interaction device and method

Info

Publication number
CN106373569B
CN106373569B (application CN201610806384.5A)
Authority
CN
China
Prior art keywords
expression
semantic
confidence
confidence level
response information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610806384.5A
Other languages
Chinese (zh)
Other versions
CN106373569A (en)
Inventor
曹立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201610806384.5A priority Critical patent/CN106373569B/en
Publication of CN106373569A publication Critical patent/CN106373569A/en
Application granted granted Critical
Publication of CN106373569B publication Critical patent/CN106373569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/174 Facial expression recognition
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a voice interaction device and method. In an exemplary embodiment, a voice interaction method may include: receiving a first voice input from a human user and a first expression image input associated with the first voice input; identifying first semantics of the first voice input; recognizing a first expression from the first expression image input; determining a first confidence level associated with the first semantics based on the first semantics and the first expression; and generating first response information based on the first semantics and the first confidence level. By using both the expression and the semantics to generate the response information, the method can improve the experience of a human user during human-computer voice interaction.

Description

Voice interaction device and method
Technical Field
The present invention relates generally to the field of human-computer interaction, and more particularly to a voice interaction apparatus and method capable of improving the accuracy of speech recognition and generating a more appropriate voice response, thereby achieving a more intelligent and humanlike human-computer interaction process.
Background
Language is the most convenient and effective means of communication between people, so it is natural to apply voice communication to human-computer interaction in place of traditional input modes such as the keyboard and mouse. Natural-language dialog between human and machine means that the machine can "understand" human spoken language, which is the task of speech recognition technology.
Language is an art that has evolved over thousands of years and carries rich information far beyond its literal content, and the humans who use it are intelligent beings with many emotions, so language communication that is simple and quick between people can be highly complex for a machine. Although many techniques have been proposed to improve the accuracy of speech recognition, these existing techniques are essentially pattern-matching processes: the pattern of the received speech is compared one by one with reference patterns of known speech to determine a recognition result. Such techniques make relatively little use of the information contained in and surrounding the speech, so speech recognition sometimes fails to identify the true meaning of a human user. For example, interpersonal speech communication may involve whispering, implied meanings, an uncertain tone, and the like, which are beyond the recognition capability of existing speech recognition technology. Existing speech recognition technology can only carry out the voice interaction process mechanically, which hinders the development of machine devices in a more intelligent and more humanlike direction.
Therefore, there is a need for an improved human-computer language interaction apparatus and method that enable a machine device to understand the real intention of a human user more accurately, thereby improving the intelligence and humanlike quality of the machine device, simulating interpersonal language communication more effectively, and improving the interaction experience of the human user.
Disclosure of Invention
One aspect of the present invention is to enable a machine device to more accurately understand the real intention of a human user by using more information during human-computer voice interaction.
An exemplary embodiment of the present invention provides a voice interaction method, which may include: receiving a first voice input from a human user and a first expression image input associated with the first voice input; identifying first semantics of the first voice input; recognizing a first expression from the first expression image input; determining a first confidence level associated with the first semantics based on the first semantics and the first expression; and generating first response information based on the first semantics and the first confidence level.
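For orientation only, the following is a minimal Python sketch of this five-step flow. The function bodies are hypothetical placeholders standing in for the speech recognizer, expression recognizer, and response generator; they are not part of the disclosed implementation, and the threshold and confidence values are illustrative assumptions.

```python
# Hypothetical sketch of the claimed five-step flow; all helpers are illustrative stand-ins.
def recognize_semantics(audio: bytes) -> str:
    return "book a plane ticket to Shanghai for tomorrow"   # placeholder ASR result

def recognize_expression(image: bytes) -> str:
    return "happy"                                          # placeholder expression label

def determine_confidence(semantics: str, expression: str) -> int:
    confidence = 5                                          # assumed default confidence
    if expression in {"happy", "surprise", "serious"}:
        confidence += 1                                     # positive expression
    elif expression in {"anger", "sadness", "hesitation"}:
        confidence -= 1                                     # negative expression
    return confidence                                       # neutral expression: unchanged

def generate_response(semantics: str, confidence: int, threshold: int = 6) -> str:
    if confidence >= threshold:
        return f"OK: {semantics}"                           # content directly associated with the semantics
    return f"Did you mean: {semantics}?"                    # ask the user to confirm

def voice_interaction_round(audio: bytes, image: bytes) -> str:
    semantics = recognize_semantics(audio)
    expression = recognize_expression(image)
    confidence = determine_confidence(semantics, expression)
    return generate_response(semantics, confidence)
```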
In an example, determining a first confidence level associated with the first semantic may include: assigning a default confidence to the first semantic; and adjusting the default confidence based on the first expression.
In an example, determining a first confidence level associated with the first semantic may further include: adjusting the default confidence level based on a context of a voice interaction.
In an example, adjusting the default confidence level based on the first expression may include: increasing the default confidence level when the first expression is a positive expression; decreasing the default confidence level when the first expression is a negative expression; and maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive and negative expressions.
In an example, the positive expressions may include happiness, surprise, eagerness, and seriousness, and the negative expressions may include anger, disgust, disdain, fear, sadness, hesitation, shock, and suspicion.
In an example, determining the first confidence level associated with the first semantics may further include: determining whether the first semantics contain an emotion keyword; if the first semantics do not contain an emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if the first semantics contain an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence level if the emotion keyword matches the first expression; and performing the step of adjusting the default confidence level based on the first expression if the emotion keyword does not match the first expression.
In an example, determining the first confidence level associated with the first semantics may further include: determining the semantic type of the first semantics; increasing the default confidence level if the semantic type of the first semantics is a question; and performing the step of adjusting the default confidence level based on the first expression if the semantic type of the first semantics is a statement or a requirement.
In an example, determining the first confidence level associated with the first semantics may further include: determining the semantic type of the first semantics; increasing the default confidence level if the semantic type of the first semantics is a question; if the semantic type of the first semantics is a statement or a requirement, determining whether the first semantics contain an emotion keyword; if the first semantics do not contain an emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if the first semantics contain an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence level if the emotion keyword matches the first expression; and performing the step of adjusting the default confidence level based on the first expression if the emotion keyword does not match the first expression.
In an example, generating the first response information based on the first semantics and the first confidence level may include: generating first response information comprising content directly associated with the first semantics when the first confidence level is above a predetermined threshold; and generating first response information requesting the human user to confirm the first semantics when the first confidence level is below the predetermined threshold.
In an example, when the first confidence level is below the predetermined threshold, the generated first response information may further include content indirectly associated with the first semantics.
In an example, generating the first response information based on the first semantics and the first confidence level may include: generating first response information comprising content directly associated with the first semantics when the first confidence level is above a predetermined threshold; when the first confidence level is below the predetermined threshold, comparing the first confidence level with a second confidence level, the second confidence level being the confidence level associated with the voice input of the human user that immediately precedes the first voice input; generating first response information requesting the human user to confirm the first semantics if the first confidence level is above the second confidence level; and generating first response information requesting the human user to confirm the first semantics and including content indirectly associated with the first semantics if the first confidence level is below the second confidence level.
In an example, the method may further include synthesizing the first response information into a voice according to a mood corresponding to the first expression to be played to the human user.
Another exemplary embodiment of the present invention provides a voice interaction apparatus, which may include: a speech recognition module configured to recognize a first semantic of a first speech input from a human user; an image recognition module configured to recognize a first expression of a first expression image input associated with the first voice input from the human user; a confidence unit configured to determine a first confidence associated with the first semantic based on the first semantic and the first expression; and a response generation unit configured to generate first response information based on the first semantics and the first confidence.
In an example, the confidence unit may be configured to determine a first confidence associated with the first semantic by performing the steps of: assigning a default confidence to the first semantic; and adjusting the default confidence based on the first expression.
In an example, the confidence unit may be further configured to determine a first confidence associated with the first semantic by performing the steps of: adjusting the default confidence level based on a context of a voice interaction.
In an example, the confidence unit may be configured to adjust the default confidence level based on the first expression by performing the steps of: increasing the default confidence level when the first expression is a positive expression; decreasing the default confidence level when the first expression is a negative expression; and maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive and negative expressions.
In an example, the positive expressions may include happiness, surprise, eagerness, and seriousness, and the negative expressions may include anger, disgust, disdain, fear, sadness, hesitation, shock, and suspicion.
In an example, the confidence unit may be further configured to determine the first confidence level associated with the first semantics by performing the steps of: determining whether the first semantics contain an emotion keyword; if the first semantics do not contain an emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if the first semantics contain an emotion keyword, determining whether the emotion keyword matches the first expression; increasing the default confidence level if the emotion keyword matches the first expression; and performing the step of adjusting the default confidence level based on the first expression if the emotion keyword does not match the first expression.
In an example, the confidence unit may be further configured to determine the first confidence level associated with the first semantics by performing the steps of: determining the semantic type of the first semantics; increasing the default confidence level if the semantic type of the first semantics is a question; and performing the step of adjusting the default confidence level based on the first expression if the semantic type of the first semantics is a statement or a requirement.
In an example, the response generation module may be configured to generate the first response information by performing the steps of: generating first response information comprising content directly associated with the first semantics when the first confidence level is above a predetermined threshold; and generating first response information requesting the human user to confirm the first semantics when the first confidence level is below the predetermined threshold.
In an example, when the first confidence level is below the predetermined threshold, the first response information generated by the response generation module may further include content indirectly associated with the first semantics.
In an example, the response generation module may be configured to generate the first response information by performing the steps of: generating first response information comprising content directly associated with the first semantics when the first confidence level is above a predetermined threshold; when the first confidence level is below the predetermined threshold, comparing the first confidence level with a second confidence level, the second confidence level being the confidence level associated with the voice input of the human user that immediately precedes the first voice input; generating first response information requesting the human user to confirm the first semantics if the first confidence level is above the second confidence level; and generating first response information requesting the human user to confirm the first semantics and including content indirectly associated with the first semantics if the first confidence level is below the second confidence level.
In an example, the apparatus may further include a speech synthesis module configured to synthesize the first response information into speech, using a tone corresponding to the first expression, to be played to the human user.
Another exemplary embodiment of the present invention provides an electronic device, which may include: a voice receiving unit; an image receiving unit; a memory; and a processor connected to the voice receiving unit, the image receiving unit and the memory via a bus system, the processor being configured to execute instructions stored on the memory to perform any of the methods described above.
Another exemplary embodiment of the invention provides a computer program product, which may comprise computer program instructions, which, when executed by a processor, may cause the processor to perform any of the methods described above.
Another exemplary embodiment of the present invention provides a computer-readable storage medium, on which computer program instructions may be stored, which, when executed by a processor, may cause the processor to perform any of the methods described above.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a voice interaction method according to an exemplary embodiment of the present invention.
Fig. 2 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to an exemplary embodiment of the present invention.
Fig. 3 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to another exemplary embodiment of the present invention.
Fig. 4 is a flowchart illustrating a process of determining confidence based on semantics and expressions according to another exemplary embodiment of the present invention.
Fig. 5 is a flowchart illustrating a process of generating response information based on semantics and confidence according to an exemplary embodiment of the present invention.
Fig. 6 is a block diagram illustrating a voice interaction apparatus according to an exemplary embodiment of the present invention.
Fig. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Fig. 1 illustrates an overall flowchart of a human-machine voice interaction method 100 according to an exemplary embodiment of the invention. Here, "human" may represent a human user, and "machine" may represent any type of electronic device having a human-computer interaction function, including, but not limited to, mobile electronic devices such as smart phones, tablets, notebook computers, robots, personal digital assistants, and in-vehicle electronic devices, as well as non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, intelligent home appliances, and intelligent customer service devices. All of these devices may utilize the voice interaction apparatus and methods described herein. Furthermore, it should be understood that the voice interaction apparatus and method described herein can also be applied to electronic devices with voice interaction functions developed in the future.
Referring to Fig. 1, the voice interaction method 100 may begin with steps S110 and S112. In step S110, an electronic device performing voice interaction may receive a first voice input from a human user, and in step S112, the electronic device may receive a first expression image input associated with the first voice input from the human user. For example, the electronic device may use a microphone or a microphone array to capture the speech uttered by the human user and, at the same time, use a camera to capture an expression image of the human user. In most cases, a human user is positioned directly in front of the electronic device during human-computer interaction, so the electronic device may by default treat the facial expression captured directly in front of it as the expression of the user performing the voice interaction. In still other embodiments, the electronic device may also detect and track the human user who is engaged in voice interaction. For example, the electronic device may detect the orientation of the human user who is speaking through a sound source localization technique using a microphone array, and then rotate a camera toward that orientation, thereby obtaining an expression image of the human user. Sound source localization is known to the person skilled in the art, and its basic principle is not described in detail here. Technical solutions for detecting and tracking users using sound source localization are also described in the applicant's Chinese patent applications 201610341566.X and 201610596000.1, the disclosures of which are hereby incorporated by reference.
It will be appreciated that both the audio signal captured by the microphone or microphone array and the video or image signal captured by the camera may be pre-processed with time stamps. In this way, the electronic device can associate the voice input (audio signal) and the expression image input (video or image signal) based on time. For example, when the electronic device detects that there is a voice input, an expression image input captured at the same or approximately the same time as the voice input may be extracted.
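As one possible illustration of this time-based association, the following sketch pairs a detected speech segment with the nearest camera frame; the Frame structure, the midpoint heuristic, and the 0.5-second tolerance are assumptions for illustration, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    timestamp: float   # seconds since capture started
    image: bytes       # raw expression image data

def find_expression_frame(frames: List[Frame], speech_start: float,
                          speech_end: float, tolerance: float = 0.5) -> Optional[Frame]:
    """Return the frame closest to the middle of the detected speech segment, if any."""
    target = (speech_start + speech_end) / 2.0
    candidates = [f for f in frames if abs(f.timestamp - target) <= tolerance]
    return min(candidates, key=lambda f: abs(f.timestamp - target), default=None)
```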
Next, in step S114, speech recognition may be performed on the received first voice input to determine its first semantics. Here, the first semantics may be the literal semantics, i.e., a textual representation, of the first voice input, which can already be recognized with very high accuracy using various existing speech recognition techniques. For example, when a human user says "book a plane ticket to Shanghai for tomorrow", the text string "book a plane ticket to Shanghai for tomorrow" can be recognized as the first semantics by a speech recognition technique.
Further, in step S116, image recognition may be performed on the received first expression image input to determine the first expression of the human user. For example, it may be recognized that the first expression of the user is happy, anxious, hesitant, and so on, or that the first expression of the user is neutral, i.e., an expressionless face.
It should be understood that, in steps S114 and S116, the present invention may use any existing speech recognition technology and image recognition technology. For example, available speech recognition techniques include methods based on vocal tract models and speech knowledge, pattern matching methods, which have been the most intensively studied, and artificial neural network methods; the pattern matching methods include, for example, dynamic time warping (DTW), hidden Markov models (HMM), and vector quantization (VQ). Artificial neural network methods, which have become popular in recent years, can generally be used in conjunction with existing pattern matching methods. The available image recognition techniques may be techniques dedicated to facial expression recognition, which can generally be classified into three categories: global versus local recognition, deformation extraction versus motion extraction, and geometric feature methods versus holistic feature methods. Taking the commonly used global and local recognition methods as examples, global recognition methods may include, for example, eigenface-based principal component analysis (PCA), independent component analysis (ICA), Fisher's linear discriminants, local feature analysis, Fisherfaces, hidden Markov models (HMM), and cluster analysis; local recognition methods may include the facial action coding system (FACS), facial animation parameter methods, local principal component analysis (local PCA), Gabor wavelet methods, neural network methods, and the like. It should also be understood that the present invention is not limited to the examples given here, and that other and future developed speech recognition and facial expression recognition techniques may also be used.
Next, at step S118, a first confidence level associated with the first semantics may be determined based on the identified first semantics and the first expression. In the present invention, the confidence level may be defined as a quantity indicating whether the first semantics represent the real intention of the human user. For example, it may take a value within a numerical range: the larger the value, the more certain it is that the first semantics are the real intention of the user; the lower the value, the less certain it is that the first semantics are the meaning the user really wants to express, for example because the user is not fully satisfied with his own wording or is still hesitating.
Conventional speech recognition aims only at accuracy, striving to recognize exactly the words spoken by a human user; the recognition process is therefore "mechanical", and the resulting human-computer interaction is also mechanical and quite different from communication between people. When people communicate with each other, they not only understand the literal meaning of the words but also watch the other party's face: by observing the other party's expression, one can judge his or her mood or attitude, and thus whether the words express that person's real meaning. The general principle of the present invention is that, during human-computer interaction, the expression of the human user is recognized in order to judge whether the speech recognition result reflects the user's real intention, thereby achieving a human-computer interaction process that is more like communication between people.
Specifically, in step S118, a default confidence level may first be assigned to the first semantics. For example, the confidence level may range from 1 to 10, where 10 represents the high-confidence end and 1 the low-confidence end, and the default confidence level may be set near the middle of the range, e.g., 4 to 6. In an example, the default confidence level may be set to 5.
The assigned default confidence level may then be adjusted according to the identified first expression. Expressions can be roughly classified into three categories: positive, negative, and neutral. A positive expression indicates that the confidence level of the user's words is high, i.e., they are a true expression of the user's meaning. For example, the confidence level may be considered high when the user shows a happy, pleased, or surprised expression. Likewise, when the user shows a focused, serious expression, the confidence level of the utterance may also be considered high. Thus, when the identified first expression is one of these expressions, the default confidence level may be increased. On the other hand, when the user shows a negative expression such as anger, disgust, disdain or contempt, fear, sadness, hesitation, shock, or suspicion, the confidence level of the utterance may be considered low, and the assigned default confidence level is therefore decreased. For example, when the user says "book a plane ticket to Shanghai for tomorrow" with a happy or serious expression, the user is probably very certain of that intention, so "book a plane ticket to Shanghai for tomorrow" is a true expression of the user's meaning; when the user says the same sentence with a hesitant, sad, depressed, or angry expression, it is likely that the user has not decided whether to fly to Shanghai, or is not satisfied with the schedule of the flight, so the sentence may not be a true expression of the user's intention, and the assigned default confidence level should be decreased. When the user's expression is neutral, e.g., no particular expression, the assigned default confidence level may be maintained.
It should be understood that the principles of the present invention are not limited to the specific examples of expressions given herein, but that more expressions may be used, and even different expression classification rules may be used, i.e., whether a particular expression is classified as a positive, negative or neutral expression.
In some embodiments, each positive and negative expression may be further divided into different degrees or levels. For example, for happiness or pleasure, a smile may represent a lower degree of pleasure, a grin a medium degree, and laughing out loud a higher degree. The adjustment to the default confidence level may also differ depending on the degree or level of each expression. For example, a lower-degree positive expression may raise the confidence level by 1, a medium-degree positive expression by 2, and a higher-degree positive expression by 3. It is of course understood that negative expressions may likewise be divided into varying degrees or levels.
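As a rough illustration of degree-based adjustment combined with the 1-10 scale described above, the following sketch maps each expression label to a signed delta; the particular labels, delta magnitudes, and clamping are assumptions for illustration only.

```python
# Assumed expression-to-delta mapping; labels and magnitudes are illustrative, not from the patent.
POSITIVE_DELTAS = {"smile": 1, "grin": 2, "laugh": 3,      # increasing degrees of pleasure
                   "surprise": 1, "eager": 1, "serious": 1}
NEGATIVE_DELTAS = {"hesitation": -1, "suspicion": -1, "sadness": -2,
                   "anger": -2, "disgust": -2, "fear": -2}

def adjust_confidence(expression: str, default: int = 5, low: int = 1, high: int = 10) -> int:
    """Adjust the default confidence by expression polarity and degree, clamped to [low, high]."""
    delta = POSITIVE_DELTAS.get(expression, NEGATIVE_DELTAS.get(expression, 0))  # neutral -> 0
    return max(low, min(high, default + delta))
```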
In some embodiments, the assigned default confidence level may also be adjusted based on the context of the voice interaction. For example, if the previous voice interaction showed that the weather in Shanghai tomorrow will be a rainstorm, the confidence level of the user's utterance "book a plane ticket to Shanghai for tomorrow" is low; on the other hand, if the previous voice interaction or the user's calendar indicated that the user has a meeting scheduled in Shanghai tomorrow, the confidence level of that utterance is high. Thus, the assigned default confidence level may be adjusted according to the context, thereby enabling a more intelligent determination of the confidence value.
In some embodiments, in step S120, when the determined first confidence level is high, e.g., above a predetermined threshold, the first response information may be generated based on a first criterion, for example by generating content directly associated with the first semantics, as in conventional voice interaction. Here, "directly associated" may be understood to mean content that directly fulfills the recognized intention, e.g., flight options for booking a ticket to Shanghai for tomorrow in response to the first semantics "book a plane ticket to Shanghai for tomorrow". When the first confidence level is low, e.g., below the predetermined threshold, the first response information may be generated based on a second criterion, for example by generating content that requests the human user to confirm the first semantics, and optionally also content indirectly associated with the first semantics for the user to consider and select. In some embodiments, the first confidence level may also be compared with a second confidence level associated with the immediately preceding voice input, as described further below with reference to Fig. 5.
Then, in step S122, the generated first response information may be synthesized into speech through a speech synthesis (TTS) technique to be played to a human user through a speaker and/or a display, thereby completing one round of the voice interaction process. Also, the present invention may be used with any existing or future developed speech synthesis techniques, which are not described in detail herein.
In some embodiments, the first response information may be synthesized into speech using a tone corresponding to the first expression. For example, when the first expression of the user is a happy, pleased, or excited expression, step S122 may synthesize the speech using a cheerful tone; when the user shows a sad, depressed, or frightened expression, step S122 may synthesize the speech using a comforting tone; and when the user shows an angry, disgusted, or disdainful expression, step S122 may synthesize the speech using a cautious, placating tone. In this way, the voice response played to the user is more easily accepted, the user's mood is improved, and the interaction experience is better. Of course, the correspondence between the tone of the synthesized speech and the expression is not limited to the examples given here and may be defined differently depending on the application scenario.
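The tone selection described above can be pictured as a simple lookup table; the mapping below and the returned request dictionary are assumptions, and no real text-to-speech API is being referenced.

```python
# Assumed expression-to-tone mapping for speech synthesis; purely illustrative.
TONE_FOR_EXPRESSION = {
    "happy": "cheerful", "pleased": "cheerful", "excited": "cheerful",
    "sad": "comforting", "depressed": "comforting", "frightened": "comforting",
    "angry": "placating", "disgusted": "placating", "disdainful": "placating",
}

def build_tts_request(text: str, expression: str) -> dict:
    """Package the response text with a tone selected from the recognized expression."""
    return {"text": text, "tone": TONE_FOR_EXPRESSION.get(expression, "neutral")}
```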
In conventional emotional speech synthesis, it is generally necessary to analyze the semantics of the text so that the machine can determine the emotion or tone required for the synthesized speech. In the present invention, the recognized first expression can be used directly to select the corresponding tone or emotion for synthesis, so the step of analyzing the text to determine the tone can be omitted; the procedure is simpler, and the tone of the synthesized speech matches the user's current mood or emotion more accurately, making the human-computer interaction process feel more emotionally human and less cold and mechanical.
Some exemplary embodiments of the present invention are described above with reference to fig. 1, which are applicable to many general voice communication scenarios. However, interpersonal voice communication is complex and may encounter a variety of special situations. Some man-machine voice interaction methods capable of dealing with similar special scenes are described below with reference to the accompanying drawings.
Fig. 2 shows a flowchart of a process 200 of determining a first confidence level based on a first semantics and a first expression according to another exemplary embodiment of the invention. In step S118 described above with reference to Fig. 1, the first confidence level is determined by adjusting the assigned default confidence level based on the first expression: when the first expression is a positive expression, the default confidence level is increased; when the first expression is a negative expression, the default confidence level is decreased; and when the first expression is a neutral expression, the default confidence level is maintained. However, given the complexity of voice communication, this manner of adjustment may be inappropriate in some situations. For example, when a human user says something sad with a very sad expression, or something frightening with a very frightened expression, it can generally be concluded that the confidence level of the utterance is high, and the confidence level should not be lowered. Therefore, in the embodiment shown in Fig. 2, it is first determined in step S210 whether the first semantics contain an emotion keyword. An emotion keyword is a word that can be associated with a specific expression or emotion, such as "disaster" or "accident" associated with sadness or fear, and "travel" or "shopping" associated with joy. If no emotion keyword is found in step S210, the previously described step of adjusting the assigned default confidence level based on the first expression is performed in step S212. If an emotion keyword is found in step S210, it is determined in step S214 whether the retrieved emotion keyword matches the first expression. In some embodiments, a plurality of emotion keywords may be retrieved in step S210 and each of them compared with the first expression in step S214; if any one emotion keyword matches the first expression, the result is a match, and only when none of the emotion keywords match the first expression is the result a mismatch.
If the determination in step S214 is a mismatch, the previously described step of adjusting the assigned default confidence level based on the first expression may be performed in step S216; if the determination in step S214 is a match, indicating that the expression of the human user matches the content of the speech, the confidence level of the first semantics may be considered very high, so the assigned default confidence level may be directly increased in step S218, and the increased confidence level may be output as the first confidence level associated with the first semantics for the subsequent operations described in step S120.
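The branch logic of Fig. 2 might look roughly like the following sketch; the keyword lexicon, the label names, and the helper adjust_by_expression are illustrative assumptions (the helper simply applies the polarity adjustment described earlier).

```python
# Assumed emotion-keyword lexicon mapping keywords to expressions they are consistent with.
EMOTION_KEYWORDS = {
    "disaster": {"sad", "frightened"},
    "accident": {"sad", "frightened"},
    "travel":   {"happy"},
    "shopping": {"happy"},
}

def adjust_by_expression(expression: str, default: int = 5) -> int:
    """Polarity-based adjustment (steps S212/S216): +1 positive, -1 negative, unchanged neutral."""
    if expression in {"happy", "surprised", "serious"}:
        return default + 1
    if expression in {"angry", "sad", "frightened", "hesitant", "suspicious"}:
        return default - 1
    return default

def confidence_with_keywords(semantics: str, expression: str, default: int = 5) -> int:
    keywords = [w for w in EMOTION_KEYWORDS if w in semantics.lower()]   # S210
    if not keywords:
        return adjust_by_expression(expression, default)                 # S212
    if any(expression in EMOTION_KEYWORDS[w] for w in keywords):         # S214: match?
        return default + 1                                               # S218
    return adjust_by_expression(expression, default)                     # S216
```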
The above describes judging whether the first semantics match the first expression based on the content of the first semantics. In other cases, the type of the first semantics may also be taken into account in voice interaction. Fig. 3 shows a flowchart of a process 300 for determining a first confidence level based on a first semantics and a first expression according to another embodiment of the invention. As shown in Fig. 3, in step S310, the semantic type of the first semantics may first be determined. Linguistically, semantic types are generally divided into three categories: statements, questions, and requirements, i.e., declarative, interrogative, and imperative sentences, and different semantic types generally correspond to different degrees of confidence. For example, when a user asks a question, it generally indicates that he wants to know a certain answer, so the confidence level is generally higher; when the user utters a declarative or imperative sentence, it is generally difficult to judge the confidence level from the semantic type alone.
Therefore, if the semantic type of the first semantics is determined to be a question in step S310, the assigned default confidence level may be directly increased in step S312, and the increased confidence level may be output as the first confidence level associated with the first semantics for the subsequent operations described in step S120. On the other hand, if it is determined in step S310 that the semantic type of the first semantics is a statement or a requirement, or any semantic type other than a question, the aforementioned step of adjusting the assigned default confidence level based on the first expression may be performed in step S314.
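A sketch of this branch follows; the question test is a crude stand-in (a real system would classify declarative, interrogative, and imperative sentences with a proper NLU component), and adjust_by_expression is the hypothetical helper from the previous sketch.

```python
def looks_like_question(semantics: str) -> bool:
    """Crude placeholder for semantic-type classification; not a real interrogative detector."""
    s = semantics.strip().lower()
    return s.endswith("?") or s.startswith(("what", "when", "where", "who", "how", "why"))

def confidence_with_semantic_type(semantics: str, expression: str, default: int = 5) -> int:
    if looks_like_question(semantics):                      # S310 -> S312: question
        return default + 1
    return adjust_by_expression(expression, default)        # S314: statement or requirement
```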
Fig. 4 shows a process 400 in which both of the factors described above, the emotion keyword and the semantic type, are considered. Referring to Fig. 4, the semantic type of the first semantics may first be determined in step S410. If the semantic type of the first semantics is a question, the assigned default confidence level is increased in step S412, and the increased confidence level may be output as the first confidence level associated with the first semantics for the subsequent operations described in step S120. If the semantic type of the first semantics is a statement or a requirement, or any semantic type other than a question, the process may proceed to step S414.
In step S414, it may be continuously determined whether the first semantic meaning contains an emotion keyword. If the first semantic meaning does not contain an emotion keyword, the step of adjusting the default confidence based on the first expression described above is performed in step S416; if the first semantic meaning contains an emotion keyword, it is continuously determined whether the emotion keyword matches the first expression in step S418. If so, directly increasing the assigned default confidence level in step S420, and outputting the increased confidence level as a first confidence level associated with the first semantic for subsequent operations as described in step S120; if not, the step of adjusting the default confidence based on the first expression described above is performed in step S422.
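Combining the two previous sketches gives a rough picture of the flow in Fig. 4, again relying on the hypothetical helpers defined above rather than on any actual implementation.

```python
def confidence_combined(semantics: str, expression: str, default: int = 5) -> int:
    if looks_like_question(semantics):                                # S410 -> S412
        return default + 1
    return confidence_with_keywords(semantics, expression, default)   # S414 - S422
```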
Fig. 5 illustrates a flow diagram of another embodiment 500 of generating first response information based on the identified first semantics and the determined first confidence level. First, in step S510, it may be determined whether the first confidence value is above a predetermined threshold. As previously mentioned, the predetermined threshold may be a predetermined confidence criterion, and when the first confidence value is above the predetermined threshold, the confidence may be considered high; the confidence may be considered low if the first confidence is below a predetermined threshold.
When the first confidence is above the predetermined threshold, then first response information including content directly associated with the first semantic may be generated in step S512. When the first confidence level is lower than the predetermined threshold, the first confidence level may be continuously compared with the confidence value (which may be referred to as a second confidence level herein for convenience of description) of the previous voice input in step S514. The comparison between the first confidence level and the previous second confidence level may reflect an emotional change of the human user during the voice interaction. For example, if the first confidence is above the second confidence, it indicates that while the absolute confidence is still low (the first confidence is below the threshold), the relative confidence is increased (the first confidence is above the second confidence), so the interaction process may progress in a better direction. At this time, in step S516, first response information requesting the human user to confirm the first semantics may be generated. On the other hand, if it is determined in step S514 that the first confidence is lower than the previous second confidence, it indicates that not only the absolute confidence is low, but also the relative confidence is decreasing, and the interaction process may progress in a bad direction. At this time, the first response information generated in step S518 may include not only content requesting the human user to confirm the first semantics, but also content indirectly associated with the first semantics for consideration and selection by the user.
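A compact sketch of the response policy in Fig. 5 is given below; the threshold value and the wording of the responses are illustrative assumptions.

```python
def generate_response_with_history(semantics: str, first_confidence: int,
                                   second_confidence: int, threshold: int = 6) -> str:
    if first_confidence >= threshold:                                # S510 -> S512
        return f"OK: {semantics}."                                   # directly associated content
    if first_confidence > second_confidence:                         # S514 -> S516
        return f"Just to confirm, did you mean: {semantics}?"
    # S518: ask for confirmation and also offer indirectly associated content
    return f"Did you mean: {semantics}? If not, here are some related options to consider."
```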
Hereinafter, a voice interaction apparatus according to an exemplary embodiment of the present invention will be described with reference to Fig. 6. As described above, the voice interaction apparatus of the present invention can be applied to any type of electronic device having a human-computer interaction function, including but not limited to mobile electronic devices such as smart phones, tablets, notebook computers, robots, personal digital assistants, and in-vehicle electronic devices, and non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, intelligent home appliances, and intelligent customer service devices. All of these devices may utilize the voice interaction apparatus and methods described herein. Furthermore, it should be understood that the voice interaction apparatus described herein may also be applied to electronic devices with voice interaction functions developed in the future.
As shown in Fig. 6, the voice interaction device 600 may include a speech recognition module 610, an image recognition module 620, a confidence module 630, a response generation module 640, and a speech synthesis module 650. The speech recognition module 610 may be configured to recognize first semantics of a first voice input from a human user. It is to be appreciated that the speech recognition module 610 may utilize any existing, e.g., commercially available, speech recognition engine, or a speech recognition engine developed in the future. The image recognition module 620 may be configured to recognize a first expression of a first expression image input from the human user associated with the first voice input. It will also be appreciated that the image recognition module 620 may utilize any existing, e.g., commercially available, expression recognition engine, or an expression recognition engine developed in the future. The confidence module 630 may determine a first confidence level associated with the first semantics based on the first semantics identified by the speech recognition module 610 and the first expression identified by the image recognition module 620. For example, the confidence module 630 may first assign a default confidence level to the first semantics and then adjust the assigned default confidence level based on the first expression to obtain the final first confidence level. Specifically, when the first expression is a positive expression, the default confidence level is increased; when the first expression is a negative expression, the default confidence level is decreased; and when the first expression is neither a positive nor a negative expression, e.g., a neutral expression, the assigned default confidence level is maintained.
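The module composition of Fig. 6 might be sketched as follows; the injected engine objects and the reuse of the hypothetical helpers from the earlier sketches are assumptions for illustration, not a description of the actual device 600.

```python
class VoiceInteractionDevice:
    """Illustrative composition of modules 610-660; not the patented implementation."""

    def __init__(self, asr_engine, expression_engine, tts_engine, knowledge_base=None):
        self.asr = asr_engine                 # speech recognition module 610
        self.expr = expression_engine         # image recognition module 620
        self.tts = tts_engine                 # speech synthesis module 650
        self.kb = knowledge_base              # optional local or cloud knowledge base 660
        self._previous_confidence = 5         # second confidence level of the preceding input

    def interact(self, audio: bytes, face_image: bytes):
        semantics = self.asr.recognize(audio)
        expression = self.expr.classify(face_image)
        confidence = confidence_combined(semantics, expression)           # confidence module 630
        text = generate_response_with_history(semantics, confidence,
                                              self._previous_confidence)  # response module 640
        self._previous_confidence = confidence
        return self.tts.synthesize(text, tone=TONE_FOR_EXPRESSION.get(expression, "neutral"))
```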
In some embodiments, the confidence module 630 may also determine whether the first semantic contains an emotion keyword and compare the contained emotion keyword to the first expression. If the emotion keyword contained in the first semantic matches the first expression, it indicates that the confidence of the user speaking is high, and therefore the assigned default confidence is directly increased. If the first semantics do not include an emotion keyword, or the included emotion keyword does not match the first expression, the previously described operation of adjusting the assigned default confidence level based on the first expression may be performed.
In some embodiments, the confidence module 630 may also determine the semantic type of the first semantics. If the semantic type of the first semantics is a question, the confidence level of the user's utterance is considered high, so the assigned default confidence level is directly increased; if the semantic type is other than a question, such as a statement or a requirement, the previously described operation of adjusting the assigned default confidence level based on the first expression may be performed.
In some embodiments, the confidence module 630 may also adjust the assigned default confidence based on the context. For example, if the first semantic is consistent with the context of the voice interaction, its confidence is high, thus increasing the assigned default confidence; conversely, if not, the assigned default confidence level is decreased.
With continued reference to fig. 6, the response generation module 640 of the voice interaction device 600 may generate the first response information using the first semantic from the speech recognition module 610 and the first confidence from the confidence module 630. The response generation module 640 may generate the first response information with different criteria according to the first confidence level. In some embodiments, when the first confidence is above a predetermined threshold, then generating first response information based on a first criterion, e.g., generating first response information comprising content directly associated with the first semantic; when the first confidence is below a predetermined threshold, then first response information is generated based on a second criterion, such as generating first response information requesting that the human user confirm the first semantics, or such as generating first response information further comprising content indirectly associated with the first semantics.
The process of generating response information may involve the use of knowledge base 660. The knowledge base 660 may be a local knowledge base that may be included as part of the speech recognition device 600, or, as shown in fig. 6, may be a cloud knowledge base 660, and the speech recognition device 600 is connected to the cloud knowledge base 660 through a network, such as a wide area network or a local area network. The knowledge base 660 may include a variety of knowledge data, such as weather data, flight data, hotel data, movie data, music data, dining data, stock data, travel data, map data, government data, industry knowledge, historical knowledge, natural science knowledge, social science knowledge, and so forth. The response generation module 640 may obtain knowledge directly or indirectly related to the first semantics from the knowledge base 660 for generating the first response information.
In some embodiments, when the first confidence is above a predetermined threshold, the response generation module 640 generates first response information comprising content directly associated with the first semantic; when the first confidence level is below a predetermined threshold, then the response generation module 640 also compares the first confidence level to a second confidence level, the second confidence level being the confidence level associated with a speech input of the human user that immediately precedes the first speech input. If the first confidence level is above the second confidence level, the response generation module 640 may generate first response information requesting the human user to confirm the first semantics; if the first confidence level is lower than the second confidence level, the response generation module 640 can generate first response information requesting the human user to confirm the first semantic and including content indirectly associated with the first semantic.
Then, the speech synthesis module 650 may synthesize the first response information generated by the response generation module 640 into speech to be played to the human user through a speaker (not shown), thereby completing one round of the voice interaction process. In some embodiments, the speech synthesis module 650 may also utilize the first expression from the image recognition module 620 for speech synthesis. Specifically, the speech synthesis module 650 may synthesize the first response information into speech using a tone corresponding to the first expression. For example, when the first expression of the user is a happy, pleased, or excited expression, the speech synthesis module 650 may synthesize the speech using a cheerful tone; when the user shows a sad, depressed, or frightened expression, the speech synthesis module 650 may synthesize the speech using a comforting tone; and when the user shows an angry, disgusted, or disdainful expression, the speech synthesis module 650 may synthesize the speech using a cautious, placating tone. In this way, the voice response played to the user is more easily accepted, the user's mood is improved, and the interaction experience is better. Of course, the speech synthesis module 650 may perform speech synthesis according to other correspondences between expressions and tones, and is not limited to the examples given here.
Fig. 7 illustrates a block diagram of an electronic device that may utilize the voice interaction apparatus and method described above, according to an exemplary embodiment of the present invention. As shown in Fig. 7, the electronic device 700 may include a voice receiving unit 710 and an image receiving unit 720. The voice receiving unit 710 may be, for example, a microphone or a microphone array, which can capture the voice of the user. The image receiving unit 720 may be, for example, a monocular, binocular, or multi-view camera, which can capture an image of the user, particularly a face image; the image receiving unit 720 may therefore have a face recognition function so that the expression image of the user can be captured clearly and accurately.
As shown in Fig. 7, the electronic device 700 may further include one or more processors 730 and a memory 740, which are connected to the voice receiving unit 710 and the image receiving unit 720 through a bus system 750. The processor 730 may be a central processing unit (CPU) or another form of processing unit, processing core, or controller having data processing capability and/or instruction execution capability, and may control other components in the electronic device 700 to perform desired functions. The memory 740 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 730 to implement the voice interaction methods of the embodiments of the application described above and/or other desired functions. Various applications and various data, such as user data and knowledge databases, may also be stored in the computer-readable storage medium.
Furthermore, the electronic device 700 may include an output unit 760. The output unit 760 may be, for example, a speaker for voice interaction with the user. In other embodiments, the output unit 760 may also be an output device such as a display or a printer.
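To tie the units together, the following self-contained sketch walks through one hypothetical interaction round on such a device. Every function is an illustrative stub standing in for the recognition, confidence, response, and synthesis processing described above; none of the names, labels, or threshold values come from the patent text.

```python
# Self-contained sketch of one interaction round: capture -> recognize ->
# adjust confidence by expression -> generate response. All values assumed.

def recognize_semantics(audio: bytes) -> tuple:
    # Speech recognition stand-in: returns (first semantics, default confidence).
    return "play some music", 0.5


def recognize_expression(image: bytes) -> str:
    # Image recognition stand-in: returns the first expression label.
    return "happy"


def adjust_confidence(confidence: float, expression: str) -> float:
    # Raise for positive expressions, lower for negative, keep for neutral.
    if expression in ("happy", "excited"):
        return min(1.0, confidence + 0.2)
    if expression in ("sad", "angry", "disgusted", "fearful"):
        return max(0.0, confidence - 0.2)
    return confidence


def respond(semantic: str, confidence: float, threshold: float = 0.7) -> str:
    # Direct content above the (assumed) threshold, confirmation below it.
    return (f"Answer for: {semantic}" if confidence >= threshold
            else f"Did you mean: {semantic}?")


def interaction_round(audio: bytes, image: bytes) -> str:
    semantic, confidence = recognize_semantics(audio)
    expression = recognize_expression(image)
    confidence = adjust_confidence(confidence, expression)
    return respond(semantic, confidence)


print(interaction_round(b"", b""))  # -> "Answer for: play some music"
```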
In addition to the above-described methods, apparatuses and devices, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to embodiments of the present application described in the present specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the electronic device, partly on the electronic device as a stand-alone software package, partly on the electronic device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present application described herein.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting; they should not be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are provided for purposes of illustration and description only; they are not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to."
It should also be noted that in the apparatus and methods of the present application, the components or steps may be disassembled and/or reassembled. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (15)

1. A voice interaction method, comprising:
receiving a first voice input from a human user and a first expression image input associated with the first voice input;
identifying a first semantic meaning of the first speech input;
recognizing a first expression from the first expression image input;
determining a first confidence level associated with the first semantic based on the first semantic and the first expression; and
generating first response information based on the first semantics and the first confidence level,
wherein determining a first confidence level associated with the first semantic comprises:
assigning a default confidence to the first semantic; and
adjusting the default confidence based on the first expression.
2. The method of claim 1, wherein
adjusting the default confidence based on the first expression comprises:
increasing the default confidence level when the first expression is a positive expression;
when the first expression is a negative expression, reducing the default confidence level; and
maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive expression and the negative expression.
3. The method of claim 1, wherein determining a first confidence level associated with the first semantic further comprises:
judging whether the first semantics contain an emotion keyword;
if the first semantics do not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression;
if the first semantics contain an emotion keyword, judging whether the emotion keyword matches the first expression;
increasing the default confidence level if the emotion keyword matches the first expression; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
4. The method of claim 1, wherein determining a first confidence level associated with the first semantic further comprises:
judging the semantic type of the first semantic;
increasing the default confidence level if the semantic type of the first semantic is a question; and
if the semantic type of the first semantic is a statement or a requirement, performing the step of adjusting the default confidence based on the first expression.
5. The method of claim 1, wherein generating first response information based on the first semantics and the first confidence level comprises:
when the first confidence is above a predetermined threshold, generating first response information comprising content directly associated with the first semantic;
when the first confidence is below the predetermined threshold, generating first response information requesting the human user to confirm the first semantics.
6. The method of claim 5, wherein the generated first response information further comprises content indirectly associated with the first semantic when the first confidence level is below the predetermined threshold.
7. The method of claim 1, wherein generating first response information based on the first semantics and the first confidence level comprises:
when the first confidence is above a predetermined threshold, generating first response information comprising content directly associated with the first semantic;
when the first confidence level is below the predetermined threshold, comparing the first confidence level to a second confidence level, the second confidence level being the confidence level associated with a speech input of the human user that immediately precedes the first speech input;
generating first response information requesting the human user to confirm the first semantics if the first confidence is above the second confidence; and
generating first response information requesting the human user to confirm the first semantics and including content indirectly associated with the first semantic if the first confidence level is lower than the second confidence level.
8. The method of claim 1, further comprising synthesizing the first response information into speech according to a mood corresponding to the first expression for playing to the human user.
9. A voice interaction device, comprising:
a speech recognition module configured to recognize a first semantic of a first speech input from a human user;
an image recognition module configured to recognize a first expression from a first expression image input associated with the first voice input from the human user;
a confidence unit configured to determine a first confidence associated with the first semantic based on the first semantic and the first expression; and
a response generation unit configured to generate first response information based on the first semantics and the first confidence level,
wherein the confidence unit is configured to determine a first confidence associated with the first semantic by performing the steps of:
assigning a default confidence to the first semantic; and
adjusting the default confidence based on the first expression.
10. The apparatus of claim 9, wherein
adjusting the default confidence based on the first expression comprises:
increasing the default confidence level when the first expression is a positive expression;
when the first expression is a negative expression, reducing the default confidence level; and
maintaining the default confidence level unchanged when the first expression is a neutral expression other than the positive expression and the negative expression.
11. The apparatus of claim 10, wherein the confidence unit is further configured to determine a first confidence associated with the first semantic by performing the steps of:
judging the semantic type of the first semantic;
increasing the default confidence level if the semantic type of the first semantic is a question;
if the semantic type of the first semantic is a statement or a requirement, judging whether the first semantics contain an emotion keyword;
if the first semantics do not contain an emotion keyword, performing the step of adjusting the default confidence based on the first expression;
if the first semantics contain an emotion keyword, judging whether the emotion keyword matches the first expression;
increasing the default confidence level if the emotion keyword matches the first expression; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence based on the first expression.
12. The apparatus of claim 9, wherein the response generation unit is configured to generate the first response information by performing the steps of:
when the first confidence is above a predetermined threshold, generating first response information comprising content directly associated with the first semantic;
when the first confidence is below the predetermined threshold, generating first response information requesting the human user to confirm the first semantics.
13. The apparatus of claim 12, wherein the first response information generated by the response generation unit further includes content indirectly associated with the first semantic when the first confidence level is below the predetermined threshold.
14. An electronic device, comprising:
a voice receiving unit;
an image receiving unit;
a memory; and
a processor connected to the voice receiving unit, the image receiving unit, and the memory through a bus system, the processor being configured to execute instructions stored on the memory to perform the method of any one of claims 1-8.
15. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
CN201610806384.5A 2016-09-06 2016-09-06 Voice interaction device and method Active CN106373569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610806384.5A CN106373569B (en) 2016-09-06 2016-09-06 Voice interaction device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610806384.5A CN106373569B (en) 2016-09-06 2016-09-06 Voice interaction device and method

Publications (2)

Publication Number Publication Date
CN106373569A CN106373569A (en) 2017-02-01
CN106373569B true CN106373569B (en) 2019-12-20

Family

ID=57900064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610806384.5A Active CN106373569B (en) 2016-09-06 2016-09-06 Voice interaction device and method

Country Status (1)

Country Link
CN (1) CN106373569B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102363794B1 (en) * 2017-03-31 2022-02-16 삼성전자주식회사 Information providing method and electronic device supporting the same
CN106910514A (en) * 2017-04-30 2017-06-30 上海爱优威软件开发有限公司 Method of speech processing and system
CN109005304A (en) * 2017-06-07 2018-12-14 中兴通讯股份有限公司 A kind of queuing strategy and device, computer readable storage medium
CN107199572B (en) * 2017-06-16 2020-02-14 山东大学 Robot system and method based on intelligent sound source positioning and voice control
CN107240398B (en) * 2017-07-04 2020-11-17 科大讯飞股份有限公司 Intelligent voice interaction method and device
CN108320738B (en) * 2017-12-18 2021-03-02 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium and electronic equipment
US11922934B2 (en) 2018-04-19 2024-03-05 Microsoft Technology Licensing, Llc Generating response in conversation
CN108564943B (en) * 2018-04-27 2021-02-12 京东方科技集团股份有限公司 Voice interaction method and system
CN108833721B (en) * 2018-05-08 2021-03-12 广东小天才科技有限公司 Emotion analysis method based on call, user terminal and system
US10872604B2 (en) * 2018-05-17 2020-12-22 Qualcomm Incorporated User experience evaluation
CN108833941A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN109240488A (en) * 2018-07-27 2019-01-18 重庆柚瓣家科技有限公司 A kind of implementation method of AI scene engine of positioning
CN109741738A (en) * 2018-12-10 2019-05-10 平安科技(深圳)有限公司 Sound control method, device, computer equipment and storage medium
CN111383631B (en) * 2018-12-11 2024-01-23 阿里巴巴集团控股有限公司 Voice interaction method, device and system
CN109783669A (en) * 2019-01-21 2019-05-21 美的集团武汉制冷设备有限公司 Screen methods of exhibiting, robot and computer readable storage medium
CN109979462A (en) * 2019-03-21 2019-07-05 广东小天才科技有限公司 Method and system for obtaining intention by combining context
CN113823282B (en) * 2019-06-26 2024-08-30 百度在线网络技术(北京)有限公司 Voice processing method, system and device
CN112307816B (en) * 2019-07-29 2024-08-20 北京地平线机器人技术研发有限公司 In-vehicle image acquisition method and device, electronic equipment and storage medium
CN110491383B (en) * 2019-09-25 2022-02-18 北京声智科技有限公司 Voice interaction method, device and system, storage medium and processor
CN112804440B (en) * 2019-11-13 2022-06-24 北京小米移动软件有限公司 Method, device and medium for processing image
CN110931006A (en) * 2019-11-26 2020-03-27 深圳壹账通智能科技有限公司 Intelligent question-answering method based on emotion analysis and related equipment
CN111210818B (en) * 2019-12-31 2021-10-01 北京三快在线科技有限公司 Word acquisition method and device matched with emotion polarity and electronic equipment
CN111428017B (en) * 2020-03-24 2022-12-02 科大讯飞股份有限公司 Human-computer interaction optimization method and related device
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111883127A (en) * 2020-07-29 2020-11-03 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech
CN112235180A (en) * 2020-08-29 2021-01-15 上海量明科技发展有限公司 Voice message processing method and device and instant messaging client
CN112687260A (en) * 2020-11-17 2021-04-20 珠海格力电器股份有限公司 Facial-recognition-based expression judgment voice recognition method, server and air conditioner
CN113435338B (en) * 2021-06-28 2024-07-19 平安科技(深圳)有限公司 Voting classification method, voting classification device, electronic equipment and readable storage medium
CN114842842A (en) * 2022-03-25 2022-08-02 青岛海尔科技有限公司 Voice interaction method and device of intelligent equipment and storage medium
CN115497474A (en) * 2022-09-13 2022-12-20 广东浩博特科技股份有限公司 Control method based on voice recognition

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101423258B1 (en) * 2012-11-27 2014-07-24 포항공과대학교 산학협력단 Method for supplying consulting communication and apparatus using the method
CN104038836A (en) * 2014-06-03 2014-09-10 四川长虹电器股份有限公司 Television program intelligent pushing method
CN105244023A (en) * 2015-11-09 2016-01-13 上海语知义信息技术有限公司 System and method for reminding teacher emotion in classroom teaching
CN105334743A (en) * 2015-11-18 2016-02-17 深圳创维-Rgb电子有限公司 Intelligent home control method and system based on emotion recognition
CN105389309A (en) * 2014-09-03 2016-03-09 曲阜师范大学 Music regulation system driven by emotional semantic recognition based on cloud fusion
CN105895101A (en) * 2016-06-08 2016-08-24 国网上海市电力公司 Speech processing equipment and processing method for power intelligent auxiliary service system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101423258B1 (en) * 2012-11-27 2014-07-24 포항공과대학교 산학협력단 Method for supplying consulting communication and apparatus using the method
CN104038836A (en) * 2014-06-03 2014-09-10 四川长虹电器股份有限公司 Television program intelligent pushing method
CN105389309A (en) * 2014-09-03 2016-03-09 曲阜师范大学 Music regulation system driven by emotional semantic recognition based on cloud fusion
CN105244023A (en) * 2015-11-09 2016-01-13 上海语知义信息技术有限公司 System and method for reminding teacher emotion in classroom teaching
CN105334743A (en) * 2015-11-18 2016-02-17 深圳创维-Rgb电子有限公司 Intelligent home control method and system based on emotion recognition
CN105895101A (en) * 2016-06-08 2016-08-24 国网上海市电力公司 Speech processing equipment and processing method for power intelligent auxiliary service system

Also Published As

Publication number Publication date
CN106373569A (en) 2017-02-01

Similar Documents

Publication Publication Date Title
CN106373569B (en) Voice interaction device and method
US11908468B2 (en) Dialog management for multiple users
US20200279553A1 (en) Linguistic style matching agent
US11514886B2 (en) Emotion classification information-based text-to-speech (TTS) method and apparatus
Triantafyllopoulos et al. An overview of affective speech synthesis and conversion in the deep learning era
US20200349943A1 (en) Contact resolution for communications systems
CN108701453B (en) Modular deep learning model
Metallinou et al. Context-sensitive learning for enhanced audiovisual emotion classification
EP3553773A1 (en) Training and testing utterance-based frameworks
KR100586767B1 (en) System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US20240153489A1 (en) Data driven dialog management
US11574637B1 (en) Spoken language understanding models
KR20200113105A (en) Electronic device providing a response and method of operating the same
KR20210070213A (en) Voice user interface
KR20210155401A (en) Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof
US20210158812A1 (en) Automatic turn delineation in multi-turn dialogue
US20220375469A1 (en) Intelligent voice recognition method and apparatus
CN115088033A (en) Synthetic speech audio data generated on behalf of human participants in a conversation
CN117882131A (en) Multiple wake word detection
Guha et al. Desco: Detecting emotions from smart commands
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment
Li et al. A multi-feature multi-classifier system for speech emotion recognition
Schuller et al. Speech communication and multimodal interfaces
US11792365B1 (en) Message data analysis for response recommendations
CN117453932B (en) Virtual person driving parameter generation method, device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant