
CN108334583B - Emotion interaction method and device, computer readable storage medium and computer equipment - Google Patents


Info

Publication number
CN108334583B
Authority
CN
China
Prior art keywords
intention
emotion
emotional
user
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810077175.0A
Other languages
Chinese (zh)
Other versions
CN108334583A (en)
Inventor
王宏安
王慧
陈辉
王豫宁
李志浩
朱频频
姚乃明
朱嘉奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Institute of Software of CAS
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS, Shanghai Xiaoi Robot Technology Co Ltd filed Critical Institute of Software of CAS
Priority to CN201810077175.0A priority Critical patent/CN108334583B/en
Priority to US16/080,301 priority patent/US11226673B2/en
Priority to JP2020562804A priority patent/JP7199451B2/en
Priority to PCT/CN2018/088389 priority patent/WO2019144542A1/en
Publication of CN108334583A publication Critical patent/CN108334583A/en
Application granted granted Critical
Publication of CN108334583B publication Critical patent/CN108334583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An emotion interaction method and device, a computer readable storage medium, and computer equipment are provided. The emotion interaction method includes the following steps: acquiring user data; performing emotion recognition on the user data to obtain the emotional state of the user; determining intention information at least from the user data; and controlling the interaction with the user according to the emotional state and the intention information. According to the technical solution of the invention, the interaction with the user can carry emotion, which improves the user experience during the interaction.

Description

Emotion interaction method and device, computer readable storage medium and computer equipment
Technical Field
The invention relates to the technical field of communication, in particular to an emotion interaction method and device, a computer readable storage medium and computer equipment.
Background
In the field of human-computer interaction, the technology has become increasingly mature and the interaction modes increasingly diverse, providing convenience for users.
In the prior art, during interaction the user inputs data such as voice and text, and a terminal performs a series of processing steps on the input, such as speech recognition and semantic recognition, and finally determines an answer and feeds it back to the user.
However, the answer fed back to the user by the terminal is usually purely objective. The user may express emotion during the interaction, and prior-art human-computer interaction cannot respond to that emotion, which degrades the user experience.
Disclosure of Invention
The technical problem solved by the invention is how to make interaction with the user emotionally aware and thereby improve the user experience during the interaction.
In order to solve the above technical problem, an embodiment of the present invention provides an emotion interaction method, where the emotion interaction method includes: acquiring user data; performing emotion recognition on the user data to obtain the emotion state of the user; determining intent information from at least the user data; and controlling the interaction with the user according to the emotional state and the intention information.
Optionally, the intention information includes an emotional intention corresponding to the emotional state, and the emotional intention includes an emotional requirement of the emotional state.
Optionally, context interaction data is determined, wherein the context interaction data comprises a contextual emotional state and/or contextual intention information; and the emotional intention is determined according to the user data, the emotional state, and the context interaction data, the intention information including the emotional intention.
Optionally, the determining the emotional intent according to the user data, the emotional state, and the context interaction data includes: acquiring the time sequence of the user data; determining the emotional intent based at least on the timing, the emotional state, and the contextual interaction data.
Optionally, the determining the emotional intent according to at least the timing, the emotional state, and the context interaction data includes: extracting focus content corresponding to each time sequence in the user data based on the time sequence of the user data; for each time sequence, matching the focus content corresponding to the time sequence with the content in an emotion type library, and determining the emotion type corresponding to the matched content as the focus emotion type corresponding to the time sequence; and according to the time sequence, determining the emotion intention by using the focus emotion type corresponding to the time sequence, the emotion state corresponding to the time sequence and the context interaction data corresponding to the time sequence.
Optionally, the determining the emotional intent according to the user data, the emotional state, and the context interaction data includes: determining the emotional intent using a Bayesian network based on the user data, the emotional state, and the contextual interaction data; or matching the user data, the emotional state and the context interaction data with preset emotional intentions in an emotional semantic library to obtain the emotional intentions; or searching in a preset intention space by using the user data, the emotional state and the context interaction data to determine the emotional intention, wherein the preset intention space comprises a plurality of emotional intentions.
Optionally, the intention information includes the emotional intention and a basic intention, and an association relationship between the emotional state and the basic intention, the emotional intention includes an emotional requirement of the emotional state, and the basic intention is one or more of preset transaction intention categories.
Optionally, the association between the emotional state and the basic intention is preset, or the association between the emotional state and the basic intention is obtained based on a preset training model.
Optionally, the determining intent information at least according to the user data includes: obtaining semantics of the user data; determining contextual intent information; determining a basic intention according to the semantics of the user data and the context intention information, wherein the intention information comprises the basic intention, and the basic intention of the user is one or more of preset transaction intention categories.
Optionally, the determining the basic intention according to the semantics of the user data and the contextual intention information includes: acquiring the time sequence of the user data and the semantics of the user data of each time sequence; and determining the basic intention at least according to the time sequence, the semantics of the user data of each time sequence and the context intention information corresponding to the time sequence.
Optionally, the determining a basic intention according to the semantics of the user data and the contextual intention information, the intention information including the basic intention includes: extracting focus content corresponding to each time sequence in the user data based on the time sequence of the user data; determining a current interaction environment; determining context intention information corresponding to the time sequence; for each time sequence, determining the basic intention of the user by using the related information corresponding to the time sequence, wherein the related information comprises: the focused content, the current interaction environment, the contextual intent information, the timing, and the semantics.
Optionally, for each time sequence, determining the basic intention of the user by using the relevant information corresponding to the time sequence includes: for each time sequence, determining the basic intention by utilizing a Bayesian network based on the related information corresponding to the time sequence; or, aiming at each time sequence, matching relevant information corresponding to the time sequence with a preset basic intention in a semantic library to obtain the basic intention; or searching the related information corresponding to the time sequence in a preset intention space to determine the basic intention, wherein the preset intention space comprises a plurality of basic intents.
Optionally, the contextual interaction data includes interaction data in previous interaction sessions and/or other interaction data in the current interaction session.
Optionally, the determining intent information according to at least the user data further includes: and acquiring a basic intention corresponding to the user data through calling, and adding the basic intention into the intention information, wherein the basic intention of the user is one or more of preset transaction intention categories.
Optionally, the intent information includes a user intent, the user intent is determined based on the emotional intent and a basic intent, the basic intent is one or more of preset transaction intent categories, and the determining intent information at least according to the user data further includes: and determining the user intention according to the emotional intention, the basic intention and user personalized information corresponding to the user data, wherein the user preference and a source user ID of the user data have an association relationship.
Optionally, the controlling the interaction with the user according to the emotional state and the intention information includes: and determining executable instructions according to the emotional state and the intention information so as to be used for performing emotional feedback on the user.
Optionally, the executable instruction includes at least one emotion modality and at least one output emotion type;
after the executable instruction is determined according to the emotional state and the intention information, the method further comprises the following steps: and performing emotional presentation of one or more output emotional types of the at least one output emotional type according to each emotional mode of the at least one emotional mode.
Optionally, the determining executable instructions according to the emotional state and the intention information includes: after the last round of emotion interaction is finished to generate an executable instruction, determining the executable instruction according to the emotion state and the intention information in the current round of interaction, or if the emotion state is dynamically changed and the variation of the emotion state exceeds a preset threshold value, determining the executable instruction at least according to the emotion intention corresponding to the changed emotion state; or if the emotional state is dynamically changed, determining the corresponding executable instruction according to the dynamically changed emotional state within a set time interval.
Optionally, the executable instruction includes an emotional modality and an output emotional state, or the executable instruction includes the emotional modality, the output emotional state and the emotional intensity.
Optionally, the emotional modality is determined according to at least one modality of the user data.
Optionally, the emotional modality is the same as at least one modality of the user data.
Optionally, the emotion interaction method further includes: when the executable instruction comprises an emotional modality and an output emotional state, executing the executable instruction, and presenting the output emotional state to the user by using the emotional modality; when the executable instruction comprises an emotional modality, an output emotional state and emotional intensity, executing the executable instruction, and presenting the output emotional state to the user according to the emotional modality and the emotional intensity.
Optionally, the determining executable instructions according to the emotional state and the intention information includes: and matching the emotional state and the intention information with a preset instruction library to obtain the executable instruction through matching.
Optionally, the intention information includes a basic intention of the user, the executable instruction includes content matched with the basic intention, and the basic intention of the user is one or more of preset transaction intention categories; the method for acquiring the basic intention comprises the following steps: determining a current interaction environment; determining contextual intent information; determining a basic intention of the user according to the user data, the current interaction environment and the context intention information; or: and acquiring a basic intention corresponding to the user data through calling.
Optionally, the user data of the at least one modality is selected from: touch click data, voice data, facial expression data, body gesture data, physiological signals, input text data.
Optionally, the emotional state of the user is expressed as an emotion classification; or the emotional state of the user is represented as a preset multi-dimensional emotional coordinate point.
The embodiment of the invention also discloses an emotion interaction device, which comprises: the user data acquisition module is used for acquiring user data; the emotion recognition module is used for carrying out emotion recognition on the user data to obtain the emotion state of the user; an intent information determination module to determine intent information based at least on the user data; and the interaction module is used for controlling the interaction with the user according to the emotional state and the intention information.
The embodiment of the invention also discloses a computer readable storage medium, wherein computer instructions are stored on the computer readable storage medium, and the computer instructions execute the steps of the emotion interaction method when running.
The embodiment of the invention also discloses computer equipment which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the steps of the emotion interaction method when running the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
in the technical solution of the invention, user data is acquired; emotion recognition is performed on the user data to obtain the emotional state of the user; intention information is determined at least from the user data; and the interaction with the user is controlled according to the emotional state and the intention information. Because the emotional state is obtained by recognizing user data of at least one modality, the accuracy of emotion recognition can be improved; in addition, the emotional state is combined with the intention information to control the interaction with the user, so the feedback on the user data can carry emotional content, which further improves the accuracy of the interaction and the user experience during the interaction.
Further, the intention information includes an emotional intention corresponding to the emotional state, the emotional intention including an emotional need of the emotional state. In the technical solution of the invention, emotional needs corresponding to the emotional state can also be derived from the user data of at least one modality; that is, the intention information includes the emotional needs of the user. For example, when the emotional state of the user is sadness, the emotional intention may include the user's emotional need for "comfort". Using emotional intentions in the interaction makes the interaction process more humanized and improves the user experience.
Further, the intention information includes the emotional intention, a basic intention, and an association relationship between the emotional state and the basic intention; the emotional intention includes the emotional need of the emotional state, and the basic intention is one or more of preset transaction intention categories. In the technical solution of the invention, since the intention information includes both the user's emotional needs and the preset transaction intention category, when the intention information is used to control the interaction with the user, the user's emotional needs can be satisfied while the user's question is answered, further improving the user experience. In addition, the intention information also includes the association relationship between the emotional state and the basic intention, through which the user's current real intention can be judged; therefore, during interaction with the user, the final feedback information or operation can be determined using this association relationship, improving the accuracy of the interaction process.
Further, the controlling the interaction with the user according to the emotional state and the intention information includes: determining executable instructions according to the emotional state and the intention information, so as to provide emotional feedback to the user. In the technical solution of the invention, the executable instructions can be executed by the computer device, and since they are determined based on both the emotional state and the intention information, the feedback of the computer device can satisfy both the emotional needs and the objective needs of the user.
Further, the executable instructions include an emotional modality and an output emotional state, or the executable instructions include the emotional modality, the output emotional state, and an emotional intensity. In the technical solution of the invention, the executable instructions are computer instructions executable by the computer device, and they can specify the form of the data output by the device: the emotional modality and the output emotional state. That is, the data ultimately presented to the user is the output emotional state rendered in the emotional modality, thereby enabling emotional interaction with the user. In addition, the executable instructions may also include an emotional intensity, which represents the intensity of the output emotional state; using the emotional intensity, emotional interaction with the user can be realized even better.
Further, the emotional modality is determined according to at least one modality of the user data. In the technical solution of the invention, to ensure the fluency of the interaction, the emotional modality in which the computer device feeds back the output emotional state may be consistent with the modality of the user data; in other words, the emotional modality may be selected from the at least one modality of the user data.
Drawings
FIG. 1 is a flow chart of a method of emotion interaction according to an embodiment of the present invention;
FIG. 2 is a diagram of an emotional interaction scenario according to an embodiment of the invention;
FIG. 3 is a schematic diagram of one implementation of step S102 shown in FIG. 1;
FIG. 4 is a flowchart of one implementation of step S103 shown in FIG. 1;
FIG. 5 is a flow chart of another implementation of step S103 shown in FIG. 1;
FIG. 6 is a flowchart of an embodiment of a method for emotion interaction;
FIG. 7 is a flow diagram of another embodiment of a method for emotion interaction according to the present invention;
FIG. 8 is a flowchart of a further embodiment of a method for emotion interaction;
FIGS. 9-11 are schematic diagrams of the emotion interaction method in a specific application scenario;
FIG. 12 is a partial flow diagram of a method of emotion interaction according to an embodiment of the present invention;
FIG. 13 is a partial flow diagram of another emotion interaction method according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an emotion interaction apparatus according to an embodiment of the present invention;
FIGS. 15 and 16 are schematic diagrams showing specific structures of the intention information determination module 803 shown in FIG. 14;
FIG. 17 is a diagram illustrating an exemplary structure of the interaction module 804 shown in FIG. 14;
FIG. 18 is a schematic structural diagram of another emotion interaction device according to an embodiment of the present invention.
Detailed Description
As described in the background, the answer fed back to the user by the terminal is generally purely objective. The user may express emotion during the interaction, and prior-art human-computer interaction cannot respond to that emotion, which degrades the user experience.
According to the technical solution of the invention, the emotional state of the user is obtained by recognizing user data of at least one modality, which can improve the accuracy of emotion recognition; in addition, the emotional state is combined with the intention information to control the interaction with the user, so the feedback on the user data can carry emotional content, further improving the accuracy of the interaction and the user experience during the interaction.
The effect of the technical solution of the present invention is described below with reference to a specific application scenario. A robot acquires multimodal user data through input devices such as its camera, microphone, touch screen, or keyboard in order to perform emotion recognition. Intention information is determined through intention analysis, executable instructions are generated, and emotional feedback such as joy, sadness, or surprise is delivered through the robot's display screen, loudspeaker, mechanical actuators, and the like.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
FIG. 1 is a flowchart of an emotion interaction method according to an embodiment of the present invention.
The emotion interaction method shown in FIG. 1 may include the following steps:
step S101: acquiring user data;
step S102: performing emotion recognition on the user data to obtain the emotion state of the user;
step S103: determining intent information from at least the user data;
step S104: and controlling the interaction with the user according to the emotional state and the intention information.
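As an illustrative, non-limiting sketch, the four steps could be orchestrated as follows; all helper functions here are hypothetical placeholders standing in for the emotion recognition, intention analysis, and feedback modules described below.
```python
# Minimal sketch of the S101-S104 pipeline. All helper functions are
# hypothetical placeholders for the modules described in this disclosure.

def recognize_emotion(user_data: dict) -> str:
    # Placeholder: map user data of one or more modalities to an emotional state.
    return "worried" if "lost" in user_data.get("text", "") else "neutral"

def determine_intent(user_data: dict, emotional_state: str) -> dict:
    # Placeholder: derive a basic intention and an emotional intention (emotional need).
    return {"basic_intent": "report_lost_credit_card",
            "emotional_intent": "comfort" if emotional_state == "worried" else None}

def generate_response(emotional_state: str, intent_info: dict) -> dict:
    # Placeholder: build an executable instruction (modality, output emotion, content).
    return {"modality": "speech",
            "output_emotion": "soothing",
            "content": f"Steps for {intent_info['basic_intent']}"}

if __name__ == "__main__":
    user_data = {"text": "How do I report my lost credit card?"}   # S101
    emotional_state = recognize_emotion(user_data)                  # S102
    intent_info = determine_intent(user_data, emotional_state)      # S103
    print(generate_response(emotional_state, intent_info))          # S104
```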
Referring also to FIG. 2, the emotion interaction method illustrated in FIG. 1 may be used in the computer device 102. The computer device 102 may perform steps S101 to S104. Further, the computer device 102 may include a memory and a processor, the memory having stored thereon computer instructions executable on the processor; when executing the computer instructions, the processor performs steps S101 to S104. The computer device 102 may include, but is not limited to, a computer, a notebook, a tablet, a robot, a smart wearable device, and the like.
It can be understood that the emotion interaction method of the embodiment of the present invention can be applied to various application scenarios, such as customer service, family companion nursing, virtual intelligent personal assistant, and the like.
In an implementation of step S101, the computer device 102 may obtain user data of the user 103, and the user data may have at least one modality. Further, the user data of the at least one modality is selected from: touch and click data, voice data, facial expression data, body posture data, physiological signals, and input text data.
Specifically, as shown in FIG. 2, a text input device 101a, such as a touch screen, an inertial sensor, or a keyboard, is integrated in the computer device 102, and the user 103 may use the text input device 101a to input text data. A voice capture device 101b, such as a microphone, is integrated in the computer device 102 and can capture voice data of the user 103. An image capture device 101c, such as a camera, an infrared device, or a motion-sensing device, is integrated in the computer device 102 and can capture facial expression data and body posture data of the user 103. A physiological signal collection device 101n, such as a heart rate meter, a blood pressure meter, an electrocardiograph, or an electroencephalograph, is integrated in the computer device 102 and can collect physiological signals of the user 103. The physiological signal can be selected from body temperature, heart rate, electroencephalogram, electrocardiogram, electromyogram, galvanic skin response, and the like.
It should be noted that, besides the above listed devices, the computer device 102 may also be integrated with any other devices or sensors capable of collecting data, and the embodiment of the present invention is not limited thereto. In addition, the text input device 101a, the voice capturing device 101b, the image capturing device 101c, and the physiological signal capturing device 101n can also be externally coupled to the computer device 102.
More specifically, the computer device 102 may acquire data of multiple modalities simultaneously.
With continued reference to fig. 1 and 2, after step S101 and before step S102, the source user of the user data may also be identified and verified.
Specifically, whether the user's identity is consistent with the stored user ID may be confirmed by means of a user password or an instruction, or by a voiceprint password. Input and speech that pass identity authentication can be accumulated as long-term user data to build a personalized model of the user and to address user-adaptive optimization, for example by optimizing acoustic models and personalized language models.
And identity recognition and verification can be performed through face recognition. The face image of the user is obtained in advance through image acquisition equipment, face features (such as pixel features, geometric features and the like) are extracted, and the face features are recorded and stored. When the user subsequently starts the image acquisition equipment to acquire the real-time face image, the real-time acquired image can be matched with the pre-stored face features.
Identity recognition and verification can also be performed through biometrics. For example, a user's fingerprint, iris, etc. may be utilized. Identification and verification may also be performed in conjunction with biometrics and other means (e.g., passwords, etc.). The biometric features that pass the authentication are accumulated as long-term user data for building a personalized model of the user, such as the user's normal heart rate level, blood pressure level, etc.
Specifically, after the user data is acquired and before emotion recognition is performed on the user data, preprocessing may be performed on the user data. For example, for an acquired image, the image may be preprocessed to convert it into a set size, channel, or color space that can be directly processed; the acquired voice data can be further subjected to operations such as awakening, audio coding and decoding, endpoint detection, noise reduction, dereverberation, echo cancellation and the like.
With continued reference to fig. 1, in a specific implementation of step S102, the emotional state of the user may be obtained based on the obtained user data. For user data of different modalities, emotion recognition can be performed in different ways. If the user data of multiple modes is acquired, the user data of multiple modes can be combined for emotion recognition, so that the accuracy of emotion recognition is improved.
Referring to FIGS. 2 and 3 together, for user data of at least one modality, namely one or more of touch and click data, voice data, facial expression data, body posture data, physiological signals, and input text data, the computer device 102 may employ different modules for emotion recognition. Specifically, the expression-based emotion recognition module 301 can perform emotion recognition on the facial expression data to obtain the emotional state corresponding to the facial expression data. By analogy, the gesture-based emotion recognition module 302 can perform emotion recognition on the body posture data to obtain the emotional state corresponding to the body posture data. The speech-based emotion recognition module 303 can perform emotion recognition on the voice data to obtain the emotional state corresponding to the voice data. The text-based emotion recognition module 304 can perform emotion recognition on the input text data to obtain the emotional state corresponding to the input text data. The physiological-signal-based emotion recognition module 305 can perform emotion recognition on the physiological signals to obtain the emotional state corresponding to the physiological signals.
Different emotion recognition modules can adopt different emotion recognition algorithms. The text-based emotion recognition module 304 may determine the emotional state using a learning model, natural language processing, or a combination of the two. Specifically, when a learning model is used, it needs to be trained in advance. First, the classification of output emotional states for the application domain is determined, such as an emotion classification model or a dimensional model, together with the dimensional coordinates and value ranges. The training corpus is labeled according to these requirements; the corpus may include input text and labeled emotional states (i.e., the expected output emotion classes or dimension values). Text is then input into the trained learning model, which can output the emotional state. When natural language processing is used, an emotion expression lexicon and an emotion semantic database need to be constructed in advance. The emotion expression lexicon can include a number of emotion word collocations, and the emotion semantic database can include linguistic symbols. Specifically, a single word may carry no emotional component by itself, but a combination of several words, referred to as an emotion vocabulary collocation, can convey emotional information. Such collocations can be obtained through a preset emotion semantic database or an external open-source interface. The emotion semantic database also serves to disambiguate emotionally ambiguous words according to the current user data or the context (such as historical user data), so as to clarify which emotion type an ambiguous word expresses before the next step of emotion recognition. After the obtained text has undergone word segmentation, part-of-speech tagging, and syntactic analysis, the emotional state of the text is judged by combining the emotion lexicon and the emotion semantic database.
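As a non-limiting sketch of the lexicon-based branch described above, the following assumes a tiny hypothetical emotion expression lexicon; a trained learning model could equally well replace the lookup.
```python
# Sketch of lexicon-based text emotion scoring. The lexicon entries below are
# hypothetical and only illustrate the lookup-and-vote idea.
EMOTION_LEXICON = {
    "lost": "worried", "stolen": "worried",
    "thank": "happy", "great": "happy",
    "refuse": "angry", "waited": "angry",
}

def text_emotion(tokens):
    """Return the majority emotion label found among the tokens, else 'neutral'."""
    hits = [EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON]
    return max(set(hits), key=hits.count) if hits else "neutral"

print(text_emotion("my credit card was lost or stolen".split()))  # -> 'worried'
```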
The speech data contains both acoustic features and linguistic features, and the speech-based emotion recognition module 303 can use either kind of feature alone or combine the two to recognize emotion in the speech data. The acoustic features can include energy features, voiced-frame-count features, fundamental frequency features, formant features, harmonics-to-noise ratio features, Mel-frequency cepstral coefficient (MFCC) features, and the like, expressed as ratios, means, maxima, medians, standard deviations, and so on; the linguistic features can be obtained through natural language processing (similar to the text modality) after speech-to-text conversion. When acoustic features are used for emotion recognition, the output emotional state types are determined, the audio data is labeled according to the output requirements, a classification model (such as a Gaussian mixture model) is trained, and the principal acoustic features and their representations are selected through optimization during training. Acoustic feature vectors of the speech audio stream to be recognized are then extracted according to the optimized model and feature set, and emotion classification or regression is performed. When both acoustic features and linguistic features are used, the speech data is passed through the two models separately to obtain their outputs, which are then combined according to confidence or a preference (favoring the text-based judgment or the audio-based judgment).
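The following is a minimal sketch, under the assumption that the speech signal is available as a NumPy array, of how frame-level acoustic statistics such as short-time energy could be summarized into a feature vector for a downstream classifier such as a Gaussian mixture model.
```python
# Sketch of acoustic-feature extraction for speech emotion recognition.
# `signal` is assumed to be a mono waveform as a NumPy array sampled at `sr` Hz.
import numpy as np

def frame_features(signal, sr, frame_ms=25, hop_ms=10):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])                    # short-time energy
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])  # zero-crossing rate
    # Summary statistics (mean / max / std), as the description suggests.
    return np.array([energy.mean(), energy.max(), energy.std(),
                     zcr.mean(), zcr.max(), zcr.std()])

sr = 16000
demo = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s synthetic tone as a stand-in
print(frame_features(demo, sr))
# Such vectors would then feed a classifier, e.g. a Gaussian mixture model.
```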
The expression-based emotion recognition module 301 can extract expression features from images and determine the expression class. Expression feature extraction can be divided into static-image feature extraction and image-sequence feature extraction. From a static image, the deformation features of the expression, i.e., its momentary characteristics, are extracted. For an image sequence, both the expression deformation features of each frame and the motion features of the continuous sequence need to be extracted. Deformation feature extraction relies on a neutral expression or model, comparing the produced expression against the neutral one, whereas motion feature extraction depends directly on the facial changes produced by the expression. The criteria for feature selection are: carrying as much facial expression information as possible, i.e., being information-rich; being as easy as possible to extract; and being relatively stable and insensitive to external factors such as illumination changes. In particular, template-matching methods, probabilistic-model-based methods, and support-vector-machine-based methods may be used. The expression-based emotion recognition module 301 can also perform facial expression recognition based on deep learning. For example, a 3D Morphable Model (3DMM) may be used, in which the preprocessed image is reconstructed by a parameterizable 3DMM and the correspondence between the original image and the three-dimensional head model is preserved. The three-dimensional model includes information such as the texture, depth, and landmark points of the head. The features obtained by convolving the image are then concatenated with the texture of the three-dimensional model to obtain new texture information, and, using depth patches and the concatenated features of the neighborhoods around the landmark points, the features are fed into two branches for information separation, yielding the user's expression information and identity information respectively. In summary, this approach establishes the correspondence between the image and the three-dimensional head model by embedding a parameterizable 3DMM; uses global appearance information from the combination of image, texture, and depth maps; uses local geometric information in the neighborhoods around the landmark points; and establishes a multi-task adversarial relationship between identity recognition and expression recognition to purify the expression features.
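As a simplified, non-limiting sketch of the general "extract image features, then classify the expression" idea (not of the 3DMM-based multi-task approach itself), a small convolutional classifier over preprocessed face crops might look as follows.
```python
# Minimal sketch of a CNN expression classifier over preprocessed face crops.
# It only illustrates feature extraction followed by classification; the
# 3DMM-based multi-task approach described above is considerably more involved.
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, num_classes=6):  # e.g. the six basic emotions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Linear(32 * 12 * 12, num_classes)

    def forward(self, x):                  # x: (batch, 1, 48, 48) grayscale faces
        h = self.features(x)
        return self.classifier(h.flatten(1))

logits = ExpressionNet()(torch.randn(2, 1, 48, 48))
print(logits.shape)  # torch.Size([2, 6])
```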
The physiological-signal-based emotion recognition module 305 performs emotion recognition according to the characteristics of different physiological signals. Specifically, preprocessing operations such as down-sampling, filtering, and noise reduction are performed on the physiological signal, and a certain number of statistical features are extracted (i.e., feature selection), such as the energy spectrum of a Fourier transform. Feature selection may employ genetic algorithms, wavelet transforms, independent component analysis, common spatial patterns, Sequential Floating Forward Selection (SFFS), analysis of variance, and the like. Finally, according to the signal features, the signal is classified into a corresponding emotion class or mapped into a continuous dimensional space, which can be realized with algorithms such as support vector machines, the k-Nearest Neighbor classification algorithm, linear discriminant analysis, and neural networks.
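A minimal sketch of this branch, assuming synthetic heart-rate-like data and using a support vector machine (one of the classifiers listed above), is given below.
```python
# Sketch of physiological-signal emotion classification: extract simple
# statistics, then classify with an SVM (scikit-learn). The data is synthetic
# and the labels are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def physio_features(signal):
    # Simple statistical features standing in for the spectral/ANOVA-selected ones.
    return [signal.mean(), signal.std(), signal.min(), signal.max()]

rng = np.random.default_rng(0)
# Two pseudo "heart-rate" patterns labelled calm / stressed.
X = [physio_features(60 + 5 * rng.standard_normal(100)) for _ in range(20)] + \
    [physio_features(95 + 15 * rng.standard_normal(100)) for _ in range(20)]
y = ["calm"] * 20 + ["stressed"] * 20

clf = SVC().fit(X, y)
print(clf.predict([physio_features(100 + 10 * rng.standard_normal(100))]))
```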
The emotion recognition principle of other modules can refer to the prior art, and is not described in detail herein.
Further, in the actual interaction, emotion recognition needs to be performed on user data of multiple modalities, that is, emotion recognition based on multi-modality fusion. For example, the user may have gestures, expressions, and the like while talking, and the pictures may also contain characters and the like. Multimodal fusion can cover various modal data such as text, voice, expressions, gestures, physiological signals, and the like.
Multimodal fusion may include data-level fusion, feature-level fusion, model-level fusion, and decision-level fusion. Wherein, the data level fusion requires isomorphism of the multi-modal data. The feature level fusion needs to extract emotional features from multiple modalities and construct a combined feature vector to determine emotional states, for example, a section of video contains facial expression and voice data, firstly, audio and video data need to be synchronized, the facial expression features and the voice features in the voice data are respectively extracted to jointly form the combined feature vector, and overall judgment is performed. The model level fusion is to establish a model for uniformly processing data of each mode, for example, data such as video and voice can adopt a hidden Markov model; the connection and complementarity between different modal data are established according to different application requirements, such as identifying the emotion change of a user when watching a movie, and combining film video and subtitles. In performing model-level fusion, model training is also required based on data extraction features for each modality. Decision-level fusion is to respectively establish models for data of each mode, each mode model independently judges a recognition result, and then uniformly outputs the recognition results in the final decision, for example, operations such as weight superposition of voice recognition, face recognition and physiological signals are performed, and the results are output; decision-level fusion can also be achieved by using a neural network multi-layer perceptron and the like. Preferably, the emotional state of the user is represented as an emotion classification; or the emotional state of the user is represented as a preset multi-dimensional emotional coordinate point.
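As a non-limiting sketch of decision-level fusion, each modality below contributes an independently recognized emotion with a confidence, and the weighted scores are combined; the weights are illustrative assumptions.
```python
# Sketch of decision-level fusion: each modality votes with a confidence, and
# the weighted scores are accumulated per emotion label.
def fuse_decisions(modality_results, weights):
    scores = {}
    for modality, (emotion, confidence) in modality_results.items():
        scores[emotion] = scores.get(emotion, 0.0) + weights[modality] * confidence
    return max(scores, key=scores.get)

results = {"text": ("worried", 0.6),
           "speech": ("worried", 0.8),
           "face": ("neutral", 0.5)}
weights = {"text": 0.3, "speech": 0.4, "face": 0.3}  # illustrative weights
print(fuse_decisions(results, weights))  # -> 'worried'
```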
Alternatively, the emotional state of the user includes a static emotional state and/or a dynamic emotional state. The static emotional state can be represented by a discrete emotion model or a dimensional emotion model without a time attribute, to represent the emotional state of the current interaction; the dynamic emotional state can be represented by a discrete emotion model with a time attribute, a dimensional emotion model with a time attribute, or another model with a time attribute, to represent the emotional state at a certain time point or over a certain time period. More specifically, the static emotional state may be represented as an emotion classification or with a dimensional emotion model. The dimensional emotion model can be an emotion space formed by several dimensions, each emotional state corresponding to one point in the emotion space and each dimension being one factor describing emotion; for example, the two-dimensional space (activation-pleasure) or the three-dimensional space (activation-pleasure-dominance). The discrete emotion model represents emotional states as discrete labels, for example the six basic emotions: happiness, anger, sadness, surprise, fear, and disgust.
In specific implementation, the emotional state can be expressed by adopting different emotional models, specifically, a classified emotional model and a multidimensional emotional model.
And if the classified emotion model is adopted, the emotion state of the user is expressed as emotion classification. And if the multi-dimensional emotion model is adopted, the emotion state of the user is expressed as a multi-dimensional emotion coordinate point.
In particular implementations, the static emotional state may represent an emotional expression of the user at a certain time. The dynamic emotional state can represent the continuous emotional expression of the user in a certain time period, and the dynamic emotional state can reflect the dynamic process of the emotional change of the user. For static emotional states, it can be expressed by categorizing the emotional models and the multidimensional emotional models.
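A minimal sketch of the two representations, a discrete label (optionally time-stamped for dynamic states) and a point in a dimensional emotion space, might look as follows.
```python
# Sketch of the two emotional-state representations mentioned above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscreteEmotion:
    label: str                          # e.g. one of the six basic emotions
    timestamp: Optional[float] = None   # set for dynamic (time-stamped) states

@dataclass
class DimensionalEmotion:
    valence: float                      # pleasure axis, e.g. in [-1, 1]
    arousal: float                      # activation axis, e.g. in [-1, 1]
    timestamp: Optional[float] = None

print(DiscreteEmotion("sadness"))
print(DimensionalEmotion(valence=-0.6, arousal=0.3))
```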
With continued reference to FIG. 1, in a specific implementation of step S103, intention information may be determined from the user data alone, or from both the emotional state and the user data.
In one embodiment of the invention, the intent information comprises a basic intent when determining the intent information from the user data. The basic intention may represent a service that the user needs to obtain, for example, that the user needs to perform some operation, or obtain an answer to a question, etc. The basic intents are one or more of preset transaction intention categories. In a specific implementation, the basic intention of the user can be determined by matching the user data with a preset transaction intention category. Specifically, the preset transaction intention category may be stored in the local server or the cloud server in advance. The local server can directly match the user data by using a semantic library, a search mode and the like, and the cloud server can match the user data by using an interface through a parameter calling mode. More specifically, there are various ways of matching, such as by pre-defining a transaction intention category in a semantic library, and matching by calculating the similarity between the user data and the pre-set transaction intention category; matching can also be performed through a search algorithm; classification by deep learning, etc. is also possible.
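As an illustrative sketch of matching user data against preset transaction intention categories, the following uses a simple token-overlap similarity; the category names and keywords are assumptions, and a semantic library or search service could replace the scoring in practice.
```python
# Sketch of matching user text against hypothetical preset transaction-intent
# categories by token-overlap similarity.
PRESET_INTENTS = {
    "report_lost_credit_card": "credit card lost stolen report freeze",
    "check_balance": "balance account check remaining",
    "transfer_money": "transfer send money account payee",
}

def match_intent(text):
    tokens = set(text.lower().split())
    def overlap(keywords):
        kw = set(keywords.split())
        return len(tokens & kw) / len(kw)
    return max(PRESET_INTENTS, key=lambda k: overlap(PRESET_INTENTS[k]))

print(match_intent("How do I report my lost credit card"))  # -> 'report_lost_credit_card'
```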
In another embodiment of the invention, the intention information may be determined based on the emotional state and the user data. In this case, the intention information includes the emotional intention, the basic intention, and an association relationship between the emotional state and the basic intention, wherein the emotional intention corresponds to the emotional state and includes an emotional need of the emotional state.
Further, the association relationship between the emotional state and the basic intention may be preset. Specifically, when an association relationship exists between the emotional state and the basic intention, it is usually a predetermined relationship, and it can affect the data ultimately fed back to the user. For example, suppose the basic intention is to control an exercise machine and the emotional state associated with this basic intention is excitement; if the user's basic intention is to increase the operating speed of the exercise machine, the computer device's final feedback to the user may be a prompt that the operation could be dangerous, out of consideration for the user's safety.
Alternatively, the association relationship between the emotional state and the basic intention may be obtained from a preset trained model, for example by determining the association between the emotional state and the basic intention with a trained end-to-end model. The preset model can be a fixed deep network model that takes the emotional state and the current interaction environment as input, or it can be continuously updated through online learning (for example, a reinforcement learning model in which an objective function and a reward function are set, and which keeps updating and evolving as the number of human-computer interactions grows).
In a specific application scenario in the field of bank customer service, a user says to the customer service robot by voice: "What should I do if my credit card is lost?". The customer service robot captures the user's voice and facial image through its microphone and camera. By analyzing the feature information of the voice and the facial expression, the robot obtains the emotional state of the user, determining that the emotional state of concern in this domain is "worried", which can be expressed with a categorical emotion model. The customer service robot can therefore determine that the user's emotional intention is "comfort". Meanwhile, the voice input is converted into text, and the basic intention of the customer, "report a lost credit card", is obtained through natural language processing and other steps.
With continued reference to fig. 1, after determining the intention information of the user, in a specific implementation of step S104, content feedback may be performed on the user according to the intention information, and furthermore, emotional feedback may be performed on the user according to an emotional state.
In specific implementation, when the computer device performs emotion feedback according to an emotion state, the computer device can meet the user requirements by controlling the characteristic parameters of the output data. For example, when the output data of the computer device is voice, feedback can be performed for different emotional states by adjusting the speech rate and the intonation of the voice; when the output data of the computer equipment is a text, the feedback can be carried out aiming at different emotional states by adjusting the semantics of the output text.
For example, in the field of bank customer service, the customer service robot determines that the emotional state of the user is "worried" and the intention information is "report a lost credit card". The customer service robot may present the emotional need "comfort" while outputting the "credit card loss-report steps". Specifically, the customer service robot can display the "credit card loss-report steps" on the screen while conveying the emotion "comfort" through its voice announcement. The emotion presented by the customer service robot can be adjusted through voice parameters such as the pitch and speed of the speech output. The output to the user may be a voice broadcast in a light tone and at a moderate speed while the loss-report steps are displayed on the screen, along the lines of: "Please don't worry. If your credit card has been lost or stolen, it will be frozen immediately after you report the loss, so your property and credit will not be harmed …". This not only addresses the emotional need but also presents and explains the reasoning about the user's emotional state and its cause, namely that the relationship between the basic intention and the emotion is determined to be "the credit card was lost or stolen", so the user can be better understood and can receive more accurate comfort and more accurate information.
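A non-limiting sketch of mapping an output emotion to feedback parameters such as speech rate, intonation, and wording follows; the profile values and field names are illustrative assumptions.
```python
# Sketch of mapping an output emotion to feedback parameters (speech rate,
# pitch, or text wording). All profile values are illustrative assumptions.
OUTPUT_PROFILES = {
    "comfort": {"speech_rate": 0.9, "pitch": "soft", "prefix": "Please don't worry. "},
    "cheerful": {"speech_rate": 1.1, "pitch": "bright", "prefix": "Great! "},
    "neutral": {"speech_rate": 1.0, "pitch": "flat", "prefix": ""},
}

def render_feedback(output_emotion, content):
    profile = OUTPUT_PROFILES.get(output_emotion, OUTPUT_PROFILES["neutral"])
    return {"text": profile["prefix"] + content,
            "tts_params": {"rate": profile["speech_rate"], "pitch": profile["pitch"]}}

print(render_feedback("comfort", "Here are the steps to report your lost credit card."))
```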
In one embodiment of the invention, referring to both FIGS. 1 and 4, a computer device may determine emotional intent in conjunction with contextual interaction data and user data generated during historical interactions.
The context interaction data may include, among other things, context emotional state and/or context intent information. Further, the contextual interaction data may be Null (Null) when the user makes a first round of interaction.
Step S103 may include the steps of:
step S401: determining context interaction data, wherein the context interaction data comprises context emotional state and/or context intention information;
step S402: determining the emotional intent according to the user data, the emotional state, and the contextual interaction data, the intent information including the emotional intent.
In this embodiment, in order to determine the emotional intention of the user, i.e., the user's emotional need, the contextual emotional state and/or contextual intention information in the contextual interaction data may be taken into account. Especially when the emotional state of the user is ambiguous, the user's potential emotional needs, such as the cause of the user's emotional state, can be inferred from the contextual interaction data, which helps provide more accurate feedback to the user later. Specifically, an ambiguous emotional state means that the user's emotional state cannot be determined from the current interaction alone. For example, the emotional state may not be judged with high confidence from the user's current sentence, while the user's emotion in the previous round of interaction was, say, excited; when the emotional state in the previous round is clear, it is used as a reference, which avoids failing to obtain the user's emotional state in the current round because the emotion judgment failed.
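As a minimal sketch of this fallback, assuming each emotion estimate comes with a confidence score:
```python
# Sketch of using contextual interaction data when the current emotional state
# is ambiguous (low confidence): fall back to the previous turn's emotional state.
def resolve_emotion(current, context, threshold=0.5):
    """current/context: (label, confidence) tuples; context may be (None, 0.0) on turn 1."""
    label, confidence = current
    if confidence >= threshold or context[0] is None:
        return label
    return context[0]

print(resolve_emotion(("neutral", 0.3), ("excited", 0.9)))  # -> 'excited'
```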
Further, the contextual interaction data may include interaction data in previous interaction sessions and/or other interaction data in the current interaction session.
In this embodiment, the interactive data in the previous interactive dialog refers to intention information and emotional state in the previous interactive dialog; the other interactive data in the interactive dialog refers to other intention information and other emotional states in the interactive dialog.
In a specific implementation, the other interaction data may be a context of the user data in the current interaction session. For example, when a user speaks a session or the data acquisition device acquires a continuous stream of data, the session may be divided into several sessions and processed in a context with each other, and a continuous stream of data may be acquired at multiple time points and mutually in a context with each other.
The interaction data may be the context of multiple interactions. For example, a user has made multiple rounds of conversations with a machine, the content of each round of conversation being contextual to each other.
In an embodiment of the present invention, step S402 may further include the following steps: acquiring the time sequence of the user data; determining the emotional intent based at least on the timing, the emotional state, and the contextual interaction data.
Specifically, acquiring the timing of the user data means that when there are a plurality of operations or a plurality of intentions in the user data, timing information of the plurality of operations included in the user data needs to be determined. The timing of each operation may affect subsequent intent information.
In this embodiment, the time sequence of the user data may be obtained according to a preset time sequence rule; the time sequence of the user data can also be determined according to the time sequence of acquiring the user data; in this case, the time sequence for acquiring the user data may be directly called.
Further, determining the emotional intent based at least on the timing, the emotional state, and the contextual interaction data may comprise: extracting focus content corresponding to each time sequence in the user data based on the time sequence of the user data; for each time sequence, matching the focus content corresponding to the time sequence with the content in an emotion type library, and determining the emotion type corresponding to the matched content as the focus emotion type corresponding to the time sequence; and according to the time sequence, determining the emotion intention by using the focus emotion type corresponding to the time sequence, the emotion state corresponding to the time sequence and the context interaction data corresponding to the time sequence.
In a specific embodiment, the focus content may be a content focused by the user, such as a picture or a text.
The focus content may include a text focus, a speech focus, and a semantic focus. When the text focus is extracted, each word in the text is given a different weight during processing, and the weight of a word is determined through an attention mechanism. More specifically, the text or vocabulary currently attended to can be extracted by means of parts of speech, an attention word list and similar resources; the focus model may also be implemented within a unified encoder-decoder model combined with semantic understanding or intention understanding. When the speech focus is extracted, in addition to the word weights and the focus model applied to the transcribed text data, acoustic prosodic features are also captured, including pitch, stress, pauses and intonation. These features help eliminate ambiguity and increase the salience of keywords.
The focus content may also include an image focus or a video focus. When the image (or video) focus is extracted, since images and videos usually contain relatively salient parts, computer vision techniques can be used to examine the pixel distribution of the image after preprocessing (such as binarization) to obtain the objects in the image; if a human region is present in the image, the image focus can also be obtained from the attention point of the person's gaze direction or from the pointing direction of a limb movement or gesture. After the image focus is obtained, the entities in the image or video can be converted into text or symbols through semantic conversion and processed as focus content in the next step.
The extraction of the focus content can be implemented in any practicable manner in the prior art, and is not limited herein.
In this embodiment, the focus content, the focus emotion type, the emotional state and the contextual interaction data each correspond to a time sequence. The contextual interaction data corresponding to a time sequence is the emotional state and intention information of the time sequence preceding the current one.
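To make the above per-time-sequence processing concrete, the following Python sketch shows one possible way to match focus content against an emotion type library and combine the result with the emotional state and contextual interaction data; the library entries, rule table and function names are illustrative assumptions rather than part of the disclosed method.

# Illustrative sketch: per-time-sequence emotional intention determination.
# The keyword library and rule table below are assumptions for demonstration.
EMOTION_TYPE_LIBRARY = {
    "headache": "physical discomfort",
    "overtime": "fatigue",
    "sleepy": "fatigue",
}
EMOTION_INTENT_RULES = {
    "physical discomfort": "soothing",
    "fatigue": "excitement",
}

def match_focus_emotion_type(focus_content):
    # Match the focus content of one time sequence against the emotion type library.
    for keyword, emotion_type in EMOTION_TYPE_LIBRARY.items():
        if keyword in focus_content:
            return emotion_type
    return None

def determine_emotional_intents(focus_contents, emotional_states, context_data):
    # focus_contents / emotional_states / context_data are aligned by time sequence;
    # context_data[t] holds the emotional state and intention info of time sequence t.
    intents = []
    for t, focus_content in enumerate(focus_contents):
        focus_emotion_type = match_focus_emotion_type(focus_content)
        intent = EMOTION_INTENT_RULES.get(focus_emotion_type)
        if intent is None and t > 0:
            # Ambiguous case: fall back to the contextual interaction data of the
            # previous time sequence, as described above.
            intent = context_data[t - 1].get("emotional_intent")
        intents.append({
            "time_sequence": t,
            "focus_emotion_type": focus_emotion_type,
            "emotional_state": emotional_states[t],
            "emotional_intent": intent,
        })
    return intents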
In another embodiment of the present invention, referring to fig. 1 and 5, step S103 shown in fig. 1 may include the following steps:
step S501: obtaining semantics of the user data;
step S502: determining contextual intent information;
step S503: determining a basic intention according to the semantics of the user data and the context intention information, wherein the intention information comprises the basic intention, and the basic intention of the user is one or more of preset transaction intention categories.
Further, step S503 may include the steps of: acquiring the time sequence of the user data and the semantics of the user data of each time sequence; and determining the basic intention at least according to the time sequence, the semantics of the user data of each time sequence and the context intention information corresponding to the time sequence.
The time sequence of acquiring the user data means that when a plurality of operations or a plurality of intentions exist in the user data, time sequence information of the plurality of operations included in the user data needs to be determined. The timing of each operation may affect subsequent intent information.
The specific manner of obtaining the semantics of the user data of each time sequence may be determined according to the modality of the user data. When the user data is a text, the semantics of the text can be directly determined through semantic analysis; when the user data is voice, the voice can be converted into text, and then semantic analysis is performed to determine the semantics. The user data can also be data after multi-mode data fusion, and can be combined with a specific application scene for semantic extraction. For example, when the user data is a picture without any text, the semantic meaning can be obtained by an image understanding technology.
Specifically, the semantics can be obtained through a natural language processing and semantic library matching process.
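As an illustration of this modality-dependent semantic extraction, the sketch below dispatches to placeholder components; the helper functions stand in for the speech-to-text, semantic-analysis and image-understanding modules the embodiment refers to, and their names and return values are assumptions.

def speech_to_text(audio):
    # Placeholder for the speech-to-text conversion mentioned above.
    return "transcribed text"

def semantic_parse(text):
    # Placeholder for semantic analysis against a semantic library.
    return {"tokens": text.split(), "slots": {}}

def image_understanding(image):
    # Placeholder for image understanding of a picture without any text.
    return {"tokens": ["picture"], "slots": {}}

def extract_semantics(user_data, modality):
    # Dispatch semantic extraction according to the modality of the user data.
    if modality == "text":
        return semantic_parse(user_data)
    if modality == "speech":
        return semantic_parse(speech_to_text(user_data))
    if modality == "image":
        return image_understanding(user_data)
    # Fused multi-modal data: extract per modality and combine for the scene.
    return [extract_semantics(data, m) for m, data in user_data.items()]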
Further, the computer device may determine a basic intent in conjunction with the current interaction environment, the contextual interaction data, and the user data.
Step S503 may further include the steps of:
extracting focus content corresponding to each time sequence in the user data;
determining a current interaction environment;
determining context intention information corresponding to the time sequence;
for each time sequence, determining the basic intention of the user by using the related information corresponding to the time sequence, wherein the related information comprises: the focused content, the current interaction environment, the contextual intent information, the timing, and the semantics.
In this embodiment, the contextual intention information includes intention information in previous interactive dialogs and/or other intention information in the current interactive dialog.
To determine the user's basic intention more accurately, the focus content, the current interaction environment and the contextual intention information in the contextual interaction data may be combined. Especially when the user's basic intention is ambiguous, the basic intention, such as the service the user needs to obtain, can be inferred more accurately from the current interaction environment and the contextual interaction data, which facilitates more accurate feedback to the user in subsequent steps.
In a specific implementation, the current interaction environment can be determined from the application scene of the emotional interaction, such as the interaction location, the surrounding environment, and dynamically changing status updates of the computer device.
More specifically, the current interaction environment may include a preset current interaction environment and a current interaction environment extracted at run time. The preset current interaction environment can be a long-term scene setting, which directly influences the application's logic rule design, semantic library, knowledge base and so on. The extracted current interaction environment can be obtained from the current interaction information, that is, derived from the user data and/or the contextual interaction data. For example, if the user files a report through a public service assistant, the preset current interaction environment may prompt the user to choose a reporting channel such as "telephone, web page, mobile phone photo, GPS"; if the user is already on site, the current interaction environment can be updated directly, and the more convenient channel of mobile phone photo plus GPS is recommended immediately. The current interaction environment can improve the accuracy of intention understanding.
Further, contextual interaction data may be recorded in the computer device and may be invoked during the current interaction.
In the process of extracting the semantics, the user data is preferentially used, and if the user data has content missing or cannot locate the user intention, the context intention information in the context interaction data can be referred to.
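A minimal sketch of this resolution order is given below: the current user data is used first, and the contextual intention information is consulted only when content is missing or the intention cannot be located; the keyword table standing in for the domain knowledge base is an assumption for illustration.

def determine_basic_intent(semantics, focus_content, environment, context_intent):
    # Toy keyword table standing in for a query into the domain knowledge base.
    INTENT_KEYWORDS = {"play": "play_song", "alarm": "set_alarm", "report": "file_report"}
    text = " ".join(semantics.get("tokens", [])) + " " + focus_content
    # Prefer the current user data.
    for keyword, intent in INTENT_KEYWORDS.items():
        if keyword in text:
            return intent
    # Content missing or intention not located: refer to contextual intention info.
    if context_intent is not None:
        return context_intent
    # Otherwise fall back to a default suggested by the current interaction environment.
    return environment.get("default_intent", "unknown")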
In the embodiment shown in fig. 6, the process first proceeds to step S1001, where the interaction flow starts. In step S1002, data acquisition is performed to obtain user data. The acquired data may cover a plurality of modalities, including static data such as text and images, as well as dynamic data such as voice, video and physiological signals.
The collected data is sent to steps S1003, S1004 and S1005 for processing. In step S1003, the user data is analyzed; specifically, steps S1006, S1007 and S1008 may be executed. Step S1006 may identify the user's identity in the user data, which is used for the personalized modeling in step S1007. Specifically, after the user's basic profile is obtained for the first time, a personal personalized model is generated; the user's feedback on or preference for the service is recorded during emotional interaction, and the initial personalized model is continuously corrected. In step S1008, emotion recognition may be performed on the user data to obtain the user's emotional state.
In step S1004, the contextual interaction data of the user data is acquired and stored as historical data, to be recalled whenever contextual interaction data is subsequently needed.
In step S1005, the scene data in the user data is analyzed to obtain the current interaction environment.
The emotional state, the personalized information, the contextual interaction data and the current interaction environment obtained in the above steps participate in the intention understanding process of step S1009 to obtain the user's intention information. It is to be understood that the semantic library, the domain knowledge base A and the general knowledge base B may also be used in the intention understanding process.
It is understood that the general knowledge base B may include general knowledge, that is, knowledge not limited to a particular application field or scene, such as encyclopedia knowledge, news comments and the like. General knowledge can guide the judgment of the emotional intention; for example, a piece of general knowledge may be: when the user shows a negative emotion, positive encouragement is required. General knowledge can be obtained through traditional knowledge representation methods such as semantic networks, ontologies, frames and Bayesian networks, as well as through newer artificial intelligence techniques such as event logic graphs and deep learning. The domain knowledge base A may include knowledge for a specific application domain, such as terminology specific to finance, education and so on.
In step S1010, an emotion decision is made according to the intention information to obtain an emotion command. Further, in step S1011, the emotion command is executed to perform emotion feedback. In step S1012, it is determined whether the current interaction is finished, and if yes, the process is finished; otherwise, the process continues to step S1002 for data acquisition.
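One possible control flow for the process of fig. 6 is sketched below. The component callables are supplied by the application; their names and the dictionary keys are illustrative assumptions, and the step numbers in the comments refer to the steps just described.

def interaction_loop(components):
    # components: dict of callables implementing the individual steps of fig. 6.
    context_store = []                                                 # S1004: history
    while True:
        user_data = components["acquire_data"]()                       # S1002
        profile = components["personalize"](user_data)                 # S1006 / S1007
        emotional_state = components["recognize_emotion"](user_data)   # S1008
        environment = components["analyze_scene"](user_data)           # S1005
        intent = components["understand_intent"](                      # S1009
            user_data, emotional_state, profile, context_store, environment)
        command = components["emotion_decision"](intent, emotional_state)  # S1010
        components["execute"](command)                                  # S1011: emotional feedback
        context_store.append((intent, emotional_state))
        if components["is_finished"]():                                 # S1012
            break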
Fig. 7 is a specific embodiment of step S1009 shown in fig. 6.
The input information is contextual interaction data 1101, user data 1102 and a current interaction environment 1103. The data proceeds to step S1104, step S1105, and step S1106, respectively.
In step S1104, the time sequence of the user data is analyzed to obtain the transition of the interaction state, for example, the time sequence of the current interaction, and whether there is a preceding interaction and a following interaction. In step S1105, focus extraction may be performed on the user data to obtain focus content. In step S1106, text semantics extraction may be performed on the text corresponding to the user data to obtain semantics. In the semantic extraction process, natural language processing can be carried out on user data, and semantic analysis can be carried out by combining a semantic library and the current interaction environment.
With the interaction state transition, focus content, semantics, personalized information, and emotional state as input information, intention inference is performed in step S1107 to obtain intention information 1108. Specifically, in the intention inference process, the domain knowledge base 1109 and the general knowledge base 1110 may be combined.
Fig. 8 is a specific embodiment of step S1107 shown in fig. 7.
In this embodiment, intent inference may be performed using a rule-based bayesian network.
Matching is performed using the emotion general knowledge library 1203 and the focused content 1201 to obtain a focused emotion type 1202. The focused emotion types 1202 and emotional state sequences 1210 are used as input to reason using emotion intent reasoner 1205 to obtain emotion intent probability combinations 1206.
In particular, the emotional intent reasoner 1205 may be implemented using a Bayesian network. The joint probability distribution matrix in the Bayesian network is initialized from the emotion intention rule base 1204; afterwards, machine active learning can be performed according to decision feedback information, or human-machine cooperative optimization can be performed using empirical knowledge 1207. The emotion intention rule base can provide the joint probability distribution between the emotional intention variable and the other related variables, or it can provide basic rules from which the joint probability distribution is estimated.
The semantics 1209, the focused content 1201, the contextual interaction data 1211, and the current interaction environment 1212 are used as inputs to reason using the interaction intention reasoner 1214 to obtain a probability combination of interaction intents 1215. In particular, the interaction intention reasoner 1214 may reason in conjunction with the domain knowledge graph 1213. The interaction intention reasoner 1214 performs query inference within the domain knowledge graph 1213 based on the input, resulting in a probability combination of interaction intents 1215.
The emotional intent probability combination 1206, the interactive intent probability combination 1215, and the personalized features 1216 are input and inferred with a user intent reasoner 1217 to obtain a human-machine fusion user intent probability combination 1218. In particular, the user intent reasoner 1217 may be implemented using a bayesian network. The joint probability distribution matrix in the bayesian network can be initialized using the user intention rule base 1208, and then machine active learning can be performed according to the decision feedback information or man-machine cooperative optimization can be performed using empirical knowledge 1207.
From the human-machine fusion user intent probability combination 1218, individual intentions may be selected to determine the decision action 1219. Decision action 1219 may be performed directly or after confirmation by the user. Further, the user may give user feedback 1220 on decision action 1219. The user feedback 1220 may include implicit passive feedback 1221 and explicit active feedback 1222. Implicit passive feedback 1221 may refer to automatically capturing the user's reaction to the decision result, such as speech, emotion and actions. Explicit active feedback 1222 may refer to the user actively rating the decision result, for example with a score or a spoken comment.
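The following simplified sketch illustrates the kind of fusion the user intent reasoner 1217 performs, replacing the full Bayesian network with a small conditional probability table over emotional and interaction intentions; the table entries and intent labels are assumptions chosen for illustration only.

# Simplified stand-in for user intent reasoner 1217: combine the emotional intention
# and interaction intention probability combinations through a conditional probability
# table P(user intent | emotional intent, interaction intent).
P_USER_GIVEN = {
    ("soothing", "play song"): {"play relaxing song": 0.9, "play cheerful song": 0.1},
    ("excitement", "play song"): {"play relaxing song": 0.1, "play cheerful song": 0.9},
    ("soothing", "rest"): {"rest": 1.0},
    ("excitement", "rest"): {"rest": 1.0},
}

def fuse_user_intent(p_emotional, p_interaction):
    # p_emotional / p_interaction: dicts mapping intent labels to probabilities.
    fused = {}
    for (emo, inter), dist in P_USER_GIVEN.items():
        weight = p_emotional.get(emo, 0.0) * p_interaction.get(inter, 0.0)
        for user_intent, p in dist.items():
            fused[user_intent] = fused.get(user_intent, 0.0) + weight * p
    return fused

Pruning an intention on the basis of personalized features, as described in the scenario below, would simply remove the corresponding entry and renormalize the remaining probabilities.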
In a specific application scenario of the invention, the emotional intent and the basic intent can be determined by using a Bayesian network. Referring to fig. 9-11, the following description is made in detail with reference to specific interaction scenarios.
As shown in fig. 9, the user interacts with the smart speaker for the first time. The user says to the smart speaker in the office: "I have been in meetings all day today and my head aches badly; play a song." The smart speaker: "OK, please enjoy the music." The smart speaker's action: it plays a relaxing song.
In this round of interaction, the specific process of determining that the user intends to "play a relaxing song" is as follows. The probability distribution of the focus content of this interaction is obtained as: meeting 0.1; playing a song 0.5; headache 0.4. Through emotion recognition, the probability distribution of the emotional states (discrete emotional states in this example) is calculated as: neutral 0.1; fatigue 0.5; sadness 0.4. The contextual emotional state is determined to be null from the contextual interaction data. According to the emotion general knowledge base, the focus content is mapped to focus emotion types (only "headache" contributes here), and the probability of the focus emotion type is determined as: physical discomfort 1. According to a preset joint probability distribution matrix (not fully expanded) for inference from the emotional states, the focus emotion types and the contextual emotional state (null in this case), the probability distribution of the emotional intentions is calculated as: soothing 0.8; excitement 0.2. Because the current focus emotion type is "physical discomfort" (100%), "physical discomfort" is looked up in the current emotional intention joint probability matrix (the joint probability matrix is not fully expanded here, and the three emotional states are not listed); the corresponding distribution is 0.8 for the intention of needing soothing and 0.2 for the intention of needing to be cheered up, so the emotional intention is inferred as soothing 0.8 and excitement 0.2 (the focus emotion type is "physical discomfort" with probability 100%, so the table lookup can be performed directly).
When the basic intention is determined, determining the semantics of the user data as follows: today/meeting/headache/song play. Determining that the context interaction data information is Null (Null) according to the context interaction data, and the current interaction environment is as follows: time 6: 50; a local office. The probability distribution of the basic intentions is calculated according to the information (the main method is to calculate the matching probability between the interactive content and the interactive intentions in the domain knowledge graph) as follows: the song playing probability is 0.8; the rest probability is 0.2. Combining the emotional intention probability distribution, the interaction intention probability distribution and the user personalized features (for example, a user is more inclined to a certain intention, which is not considered in this example), calculating the probability distribution of the human-computer cooperative user intention according to a joint probability distribution matrix (XX represents that the variable can take any value) inferred by the user intention, wherein the probability distribution is as follows: the probability of relaxing songs is 0.74; the probability of putting a happy song is 0.26.
One user intention is then selected according to the user intention probability distribution (the two intentions are mutually exclusive, so the one with the higher probability is chosen) and mapped to the corresponding decision action according to the decision library (play a relaxing song, with a prompt phrase).
When the user's personalized features are introduced, for example the fact that in most cases the user does not want the system to reply with no feedback at all, the decision step removes the interaction intention of resting (in which the system gives no feedback); that is, the current user intention is "play a song" with probability 1. Then, by combining the emotional intention probability combination and the interaction intention combination, the probability distribution of the user intention is finally obtained according to set rules (taken from the user intention rule base), and the current intention sequence is obtained from that user intention probability distribution.
If there were no personalized information, the following three probabilities would be output: P(play relaxing song) = [P(relaxing song | soothing, play song) × P(soothing) + P(relaxing song | excitement, play song) × P(excitement)] × P(play song) = (0.9 × 0.8 + 0.1 × 0.2) × 0.8 = 0.74 × 0.8 = 0.592; P(play cheerful song) = [P(cheerful song | soothing, play song) × P(soothing) + P(cheerful song | excitement, play song) × P(excitement)] × P(play song) = (0.1 × 0.8 + 0.9 × 0.2) × 0.8 = 0.26 × 0.8 = 0.208; P(rest) = 0.2.
Because of the user's personalized information, the intention of resting is pruned, and the user intention probabilities become: P(play relaxing song) = 0.9 × 0.8 + 0.1 × 0.2 = 0.74; P(play cheerful song) = 0.1 × 0.8 + 0.9 × 0.2 = 0.26; P(rest) = 0.
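For readers who want to reproduce the arithmetic, the short snippet below recomputes the three probabilities with and without the personalization step, using the values stated above.

# Values from the example above.
p_soothing, p_excitement = 0.8, 0.2          # emotional intention distribution
p_play_song, p_rest = 0.8, 0.2               # interaction intention distribution

# Without personalized information:
p_relaxing = (0.9 * p_soothing + 0.1 * p_excitement) * p_play_song
p_cheerful = (0.1 * p_soothing + 0.9 * p_excitement) * p_play_song
print(p_relaxing, p_cheerful, p_rest)        # approx. 0.592, 0.208, 0.2

# With personalization the "rest" intention is pruned, so P(play song) = 1:
print(0.9 * p_soothing + 0.1 * p_excitement,  # approx. 0.74
      0.1 * p_soothing + 0.9 * p_excitement,  # approx. 0.26
      0.0)                                    # rest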
It should be noted that after completing the intent inference, the emotional intent and the interaction intent of the user in the scene may be recorded explicitly or implicitly and used in the subsequent interaction process. And the method can also be used as historical data to carry out reinforcement learning on the intention reasoning process or man-machine cooperative regulation and control, so as to realize gradual updating and optimization.
At this point, the first interaction between the user and the smart speaker is complete. If the user does not interact with the smart speaker again, the current round of interaction ends.
Alternatively, the user may carry out subsequent interactions with the smart speaker within a set time, such as a second interaction and a third interaction; in that case the current round of interaction includes multiple interactions. The following description uses the example in which the user continues with a second and a third interaction with the smart speaker.
Referring to fig. 10, the user interacts with the smart speaker for a second time. The user: "I'm almost falling asleep. That won't do; change the song, I still have to work overtime later." The smart speaker: "OK." The smart speaker performs the action: it plays a cheerful song.
In this round of interaction, the specific process of determining that the user intends to "play a cheerful song" is as follows. The probability distribution of the focus content of this interaction is obtained as: sleeping 0.2; changing the song 0.6; working overtime 0.2. Through emotion recognition, the probability distribution of the emotional states (discrete emotional states in this example) is calculated as: neutral 0.1; fatigue 0.3; boredom 0.6. According to the emotion general knowledge base, the focus content is mapped to focus emotion types (only "overtime" and "sleeping" contribute this time, superimposed according to their weights), and the probabilities of the focus emotion types are determined as: fatigue 0.7; irritation 0.3. The contextual emotional state is determined from the contextual interaction data as: soothing 0.8; excitement 0.2 (this is the emotional intention probability distribution calculated in the previous interaction). According to the joint probability distribution matrix (not fully expanded) for inference from the emotional states, the focus emotion types and the contextual emotional states, the probability distribution of the emotional intentions is calculated as: soothing 0.3; excitement 0.7.
When the basic intention is determined, determining the semantics of the user data as follows: sleep/go/change song/wait down/shift. Determining context interaction data information (where the context interaction data information is the interaction intention probability distribution calculated in the last interaction process) according to the context interaction data as follows: the song playing probability is 0.8; the rest probability is 0.2. And the current interaction environment is: time 6: 55; a local office. The probability distribution of the basic intentions is calculated according to the information (the main method is to calculate the matching probability between the interactive content and the interactive intentions in the domain knowledge graph) as follows: the song playing probability is 0.9; the rest probability is 0.1.
Combining the emotional intention probability distribution, the interaction intention probability distribution and the user personalized features (for example, a user is more inclined to a certain intention, which is not considered in this example), calculating the probability distribution of the human-computer cooperative user intention according to a joint probability distribution matrix (XX represents that the variable can take any value) inferred by the user intention, wherein the probability distribution is as follows: the probability of relaxing songs is 0.34; the probability of putting a happy song is 0.66.
For example, it may be determined from the context that "please listen to music" is no longer prompted, but only "good" is answered.
When the user's personalized features are introduced, for example the fact that in most cases the user does not want the system to reply with no feedback at all, the decision step removes the interaction intention of resting (in which the system gives no feedback); the probability of 0.1 for resting is therefore eliminated, and the total probability of playing a relaxing song and playing a cheerful song becomes 1.
Referring to fig. 11, the user interacts with the smart speaker for a third time. The user: "This one is good; call me to go out in half an hour." The smart speaker: "A 7:30 alarm has been set" (an alarm half an hour later). The smart speaker performs the action: it continues playing the cheerful song.
In this round of interaction, the specific process of determining that the user intends to "play a cheerful song" is as follows. The probability distribution of the focus content of this interaction is obtained as: "good" 0.2; "half an hour" 0.6; "going out" 0.2. Through emotion recognition, the probability distribution of the emotional states (discrete emotional states in this example) is calculated as: neutral 0.2; happy 0.7; boredom 0.1. According to the emotion general knowledge base, the focus content is mapped to focus emotion types (no focus content contributes this time, so the focus emotion type is empty). The contextual emotional state is determined from the contextual interaction data as: soothing 0.3; excitement 0.7 (this is the emotional intention probability distribution calculated in the previous interaction). According to the joint probability distribution matrix (not fully expanded) for inference from the emotional states, the focus emotion types and the contextual emotional states, the probability distribution of the emotional intentions is calculated as: soothing 0.3; excitement 0.7 (no new emotional intention is generated this time, so it equals the emotional intention probability distribution of the previous interaction);
when the basic intention is determined, determining the semantics of the user data as follows: this/ok/half hour/call me out. Determining context interaction data information (where the context interaction data information is the interaction intention probability distribution calculated in the last interaction process) according to the context interaction data as follows: the song playing probability is 0.9; the rest probability is 0.1. And the current interaction environment is: time 7: 00; a local office. The probability distribution of the basic intention is calculated according to the information as follows: the song playing probability is 0.4; the alarm probability is set to be 0.6.
Combining the emotional intention probability distribution, the basic intention probability distribution and the user's personalized features (for example, a user may prefer a certain intention, which is not considered in this example), the probability distribution of the human-computer cooperative user intention is calculated according to the joint probability distribution matrix for user intention inference (XX represents that the variable can take any value) as: playing a relaxing song 0.14; playing a cheerful song 0.26; setting an alarm 0.6.
According to the user intention probability distribution, two user intentions are selected (the first two are mutually exclusive, so the one with the higher probability is chosen; setting an alarm is not mutually exclusive with the first two and is also selected) and mapped to the corresponding decision actions according to the decision library: play a cheerful song (no prompt phrase is needed) and set an alarm according to the user's requirement (the time information in the scene and the "half an hour" extracted from the interaction content are taken as parameters).
Since there is no user personalization feature to assist here, both the cheerful song and the alarm are kept in the final result.
In another specific application scenario of the invention, the emotional intent can be determined by utilizing an emotional semantic library; and determining the basic intention by utilizing the semantic library. The emotion semantic library can also comprise the association relationship between the emotion state and the basic intention.
Specifically, referring to table 1, table 1 shows the relationship between emotional states and the basic intention.
TABLE 1
Basic intention        Emotional state    Emotional intention
Open the credit card   Anxiety            Desire to obtain comfort
Open the credit card   Happy              Desire to obtain encouragement
...                    ...                ...
As shown in table 1, when the basic intention is to open the credit card, the emotional intention is different according to the emotional state: when the emotional state is anxiety, the emotional intention is expected to obtain comfort; the emotional intent is the desire to obtain encouragement when the emotional state is happy. Other things are similar and will not be described here.
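A minimal sketch of such an emotion semantic library is shown below, containing only the association described for table 1; the data structure and lookup function are assumptions for illustration.

# Association between (basic intention, emotional state) and emotional intention,
# filled in only with the entries discussed above.
EMOTION_SEMANTIC_LIBRARY = {
    ("open the credit card", "anxiety"): "desire to obtain comfort",
    ("open the credit card", "happy"): "desire to obtain encouragement",
}

def lookup_emotional_intent(basic_intent, emotional_state):
    # Returns None when the library has no entry for the combination.
    return EMOTION_SEMANTIC_LIBRARY.get((basic_intent, emotional_state))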
In another embodiment of the present invention, step S103 may further include the following steps: and acquiring a basic intention corresponding to the user data through calling, and adding the basic intention into the intention information, wherein the basic intention of the user is one or more of preset transaction intention categories.
In this embodiment, the process of determining the basic intention may be processed in other devices, and the computer device may access and call the other devices through the interface to obtain the basic intention.
In a specific implementation of steps S402 and S503, the computer device may rely on rule logic and/or a learning system. Specifically, the user's emotional intention can be determined using the matching relationship between the user data, the emotional state, the contextual interaction data and the emotional intention; the user's basic intention can be determined using the matching relationship between the user data, the current interaction environment, the contextual interaction data and the basic intention. The computer device may also obtain a model through machine learning and then use the model to acquire the user's basic intention. Specifically, intention information in non-professional domains can be determined by learning from general corpora, while intention information in professional domains can be determined by combining machine learning with logic rules to improve understanding accuracy.
Specifically, referring to fig. 2, the computer device 102 extracts user data of multiple modalities from the user 103 through multiple input devices; the modalities may be selected from voice, text, body posture, physiological signals and the like. Voice, text, facial expressions and body postures contain rich information, and the semantic information they carry is extracted and fused; then, by combining the current interaction environment, the contextual interaction data and the user's interaction object, the recognized emotional state of the user and the current behavioral tendency of the user 103, that is, the intention information of the user 103, are inferred.
The processes of acquiring intention information from user data in different modalities are different, such as: the data of the text mode can be subjected to semantic analysis through algorithms such as natural language processing and the like to obtain the basic intention of the user, and then the emotional intention is obtained through the combination of the basic intention of the user and the emotional state; the voice modal data obtains a voice text through voice-to-text conversion, then carries out semantic analysis to obtain the basic intention of the user, and then obtains the emotional intention by combining the emotional state (obtained through audio data parameters); judging the basic intention and emotional intention of the user by using an image and video identification method of computer vision according to image or video data such as facial expressions, gesture actions and the like; the modal data of the physiological signal can be matched with other modal data to jointly obtain a basic intention and an emotional intention, for example, intention information of the interaction is determined by matching with the input of voice and the like of a user; or, in the dynamic emotion data processing process, there may be an initial trigger instruction, for example, the user starts interaction through a voice instruction to obtain the basic intention of the user, and then tracks the physiological signal in a period of time, and determines the emotional intention of the user at regular intervals, where the physiological signal only affects the emotional intention without changing the basic intention.
In another specific application scenario, the user cannot find the key when opening the door and anxiously blurts out: "Where is my key?" The user's action is pulling the door handle or searching for the key in a backpack pocket. At this time, the user's emotional state may be a negative emotion such as urgency or irritation. Based on the captured facial expression, voice features, physiological signals and so on, and by combining the user's action, speech ("where is the key") and emotional state (urgency), the computer device can determine that the user's basic intention should be to find the key or to ask for help opening the door, and that the emotional intention is a need for soothing.
With continued reference to fig. 1, step S104 may include the steps of: and determining executable instructions according to the emotional state and the intention information so as to be used for performing emotional feedback on the user.
In this embodiment, the process of determining executable instructions by the computer device may be a process of sentiment decision. The computer device can execute the executable instructions and can present the services and emotions required by the user. More specifically, the computer device may also determine executable instructions in connection with the intent information, interaction environment, contextual interaction data, and/or interaction objects. The interactive environment, contextual interaction data, interactive objects, etc. are invokable and selectable by the computer device.
Preferably, the executable instructions may comprise an emotional modality and an output emotional state, or the executable instructions comprise an emotional modality, an output emotional state, and an emotional intensity. Specifically, the executable instruction has a definite executable meaning and can comprise specific parameters required by the emotional presentation of the computer equipment, such as the emotional modality of the presentation, the output emotional state of the presentation, the emotional intensity of the presentation and the like. Preferably, the executable instructions can comprise at least one emotion modality and at least one output emotion type;
After the executable instruction is determined according to the emotional state and the intention information, the method may further include the following step: performing an emotional presentation of one or more output emotion types in the at least one output emotion type according to each emotion modality in the at least one emotion modality.
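As a concrete, purely illustrative container for the fields an executable instruction may carry, consider the following sketch; the class and field names are assumptions, not an interface defined by this disclosure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ExecutableInstruction:
    emotion_modalities: List[str]                     # e.g. ["text", "sound"]
    output_emotion_types: List[str]                   # e.g. ["soothing", "warning"]
    emotion_intensities: List[float] = field(default_factory=list)  # optional, aligned with the types

# Example: present "soothing" and "warning" with medium and high intensity
# through a combination of the text and sound emotion modalities.
instruction = ExecutableInstruction(
    emotion_modalities=["text", "sound"],
    output_emotion_types=["soothing", "warning"],
    emotion_intensities=[0.5, 0.9],
)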
The emotion modality in this embodiment may include at least one of a text emotion presentation modality, a sound emotion presentation modality, an image emotion presentation modality, a video emotion presentation modality, and a mechanical motion emotion presentation modality, which is not limited in this respect.
In this embodiment, the output emotional state may be expressed as an emotional category; the output emotional state can also be expressed as a preset multidimensional emotion coordinate point or area; the output emotional state may also be an output emotional type.
Wherein outputting the emotional state may include: statically outputting the emotional state and/or dynamically outputting the emotional state; the static output emotional state can be represented by a discrete emotional model or a dimension emotional model without time attribute so as to represent the output emotional state of the current interaction; the dynamic output emotional state can be represented by a discrete emotional model with a time attribute, a dimension emotional model, or other models with a time attribute, so as to represent the output emotional state at a certain time point or within a certain time period. More specifically, the static output emotional state may be represented as an emotional classification or a dimensional emotional model. The dimension emotional model can be an emotional space formed by a plurality of dimensions, each output emotional state corresponds to one point or one area in the emotional space, and each dimension is a factor for describing emotion. For example, two-dimensional space theory: activation-pleasure or three-dimensional space theory: activation-pleasure-dominance. The discrete emotion model is an emotion model in which an output emotion state is represented in a discrete tag form, for example: six basic emotions include happiness, anger, sadness, surprise, fear, nausea.
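The two families of models described above can be carried by a simple data structure such as the sketch below; the field names and example values are assumptions for illustration.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class OutputEmotionalState:
    label: Optional[str] = None                       # discrete model, e.g. "happiness"
    coordinates: Optional[Tuple[float, ...]] = None   # dimensional model, e.g. (activation, pleasure)
    timestamp: Optional[float] = None                 # set only for dynamic output emotional states

# Static output emotional state in a discrete model: one of the six basic emotions.
static_state = OutputEmotionalState(label="happiness")

# Dynamic output emotional state in a two-dimensional activation-pleasure model.
dynamic_state = OutputEmotionalState(coordinates=(0.7, 0.4), timestamp=12.5)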
The executable instructions should have a definite executable meaning and be easy to understand and accept. The content of the executable instructions may include at least one emotion modality and at least one output emotion type.
It should be noted that the final emotion presentation may use only one emotion modality, such as the text emotion modality alone, or a combination of several emotion modalities, such as text combined with sound, or text combined with sound and image.
The output emotional state can be an emotion type (also called an emotion component) of the output emotion, or an emotion classification of the output emotion, represented by a classified output emotion model or a dimensional output emotion model. The emotional states of the classified output emotion model are discrete, so the classified output emotion model is also called a discrete output emotion model; an output emotion type in the classified output emotion model may be defined as a region and/or a set of at least one point in a multi-dimensional emotion space. The dimensional output emotion model can be a multi-dimensional emotion space, each dimension of which corresponds to a psychologically defined emotion factor; under the dimensional emotion model, the output emotional state is represented by coordinate values in the emotion space. In addition, the dimensional output emotion model can be continuous or discrete.
Specifically, the discrete output emotion model is a main form and a recommended form of emotion types, which can classify emotions presented by emotion information according to fields and application scenes, and the output emotion types of different fields or application scenes may be the same or different. For example, in the general field, a basic emotion classification system is generally adopted as a dimension output emotion model, namely, a multi-dimensional emotion space comprises six basic emotion dimensions including happiness, anger, sadness, surprise, fear and nausea; in the customer service area, commonly used emotion types may include, but are not limited to, happy, sad, comforted, dissuaded, etc.; while in the field of companion care, commonly used emotion types may include, but are not limited to, happiness, sadness, curiosity, consolation, encouragement, dissuasion, and the like.
The dimension output emotion model is a complementary method of emotion types, and is only used for the situations of continuous dynamic change and subsequent emotion calculation at present, such as the situation that parameters need to be finely adjusted in real time or the situation that the calculation of the context emotional state has great influence. The advantage of the dimension output emotion model is that it is convenient to compute and fine tune, but it needs to be exploited later by matching with the presented application parameters.
In addition, each domain has an output emotion type of primary interest (which is obtained by identifying user information in the domain) and an output emotion type of primary presentation (which is an emotion type in an emotion presentation or interactive instruction), which may be two different groups of emotion classifications (classification output emotion models) or a different emotion dimension range (dimension output emotion model). Under a certain application scene, the determination of the output emotion types of the main presentation corresponding to the output emotion types mainly concerned in the field is completed through a certain emotion instruction decision process.
When the executable instructions include a plurality of emotion modalities, the at least one output emotion type is preferentially presented using a text emotion modality and then is supplementarily presented using one or more of a sound emotion modality, an image emotion modality, a video emotion modality, and a mechanical motion emotion modality. Here, the output emotion type of the supplemental presentation may be at least one output emotion type not presented by the text emotion modality, or the emotion intensity and/or emotion polarity presented by the text output emotion modality does not comply with at least one output emotion type required by the executable instruction.
It is noted that the executable instructions can specify one or more output emotion types and can be ordered according to the strength of each output emotion type to determine the primary and secondary of each output emotion type in the emotion presentation process. Specifically, if the emotion intensity of the output emotion type is less than the preset emotion intensity threshold, the emotion intensity of the output emotion type in the emotion presentation process can be considered to be not greater than other output emotion types with emotion intensity greater than or equal to the emotion intensity threshold in the executable instructions.
In an embodiment of the invention, the selection of the emotional modality depends on the following factors: the emotion output device and the application state thereof (for example, whether a display for displaying text or images is provided or not, whether a speaker is connected or not, and the like), the type of interaction scene (for example, daily chat, business consultation, and the like), the type of conversation (for example, the answer of common questions is mainly text reply, and the navigation is mainly image and voice), and the like.
Further, the output mode of the emotion presentation depends on the emotion modality. For example, if the emotion modality is a text emotion modality, the output mode of the final emotion presentation is a text mode; if the emotion mode is mainly the text emotion mode and is assisted by the sound emotion mode, the final emotion presentation output mode is a mode of combining text and voice. That is, the output of the emotional presentation may include only one emotional modality, or may include a combination of emotional modalities, which is not limited by the present invention.
According to the technical scheme provided by the embodiment of the invention, by acquiring the executable instruction, wherein the executable instruction comprises at least one emotion mode and at least one output emotion type, the at least one emotion mode comprises a text emotion mode, and the emotion presentation of one or more emotion types in the at least one emotion type is carried out according to each emotion mode in the at least one emotion mode, a multi-mode emotion presentation mode taking a text as a main mode is realized, and therefore, the user experience is improved.
In another embodiment of the present invention, the performing of the emotional rendering of the one or more of the at least one output emotional types according to each of the at least one emotional modalities includes: searching an emotion presentation database according to the at least one output emotion type to determine at least one emotion vocabulary corresponding to each output emotion type in the at least one output emotion type; and presenting at least one sentiment vocabulary.
Specifically, the emotion presentation database may be preset with manual labeling, obtained through big-data learning, obtained through semi-supervised human-machine cooperation that combines learning with manual work, or even obtained by training the whole interactive system on a large amount of emotional dialogue data. It should be noted that the emotion presentation database supports online learning and updating.
The emotion vocabulary and the output emotion type, emotion intensity and emotion polarity parameters thereof can be stored in an emotion presentation database and also can be obtained through an external interface. In addition, the emotion presentation database comprises a set of emotion vocabularies of a plurality of application scenes and corresponding parameters, so that the emotion vocabularies can be switched and adjusted according to actual application conditions.
The emotion vocabulary can be classified according to the emotion state of the concerned user under the application scene. That is, the output emotion type, emotion intensity and emotion polarity of the same emotion vocabulary are related to the application scene. Where emotion polarity may include one or more of positive, negative and neutral.
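The sketch below illustrates how such a database might be organized, with emotion vocabulary stored per application scene together with its output emotion type, emotion intensity and emotion polarity; all entries and values are assumptions.

# Illustrative emotion presentation database: vocabulary grouped by application scene.
EMOTION_PRESENTATION_DB = {
    "customer service": [
        {"word": "delighted to help", "type": "happy", "intensity": 0.8, "polarity": "positive"},
        {"word": "sorry to hear that", "type": "comforted", "intensity": 0.6, "polarity": "neutral"},
    ],
    "companion care": [
        {"word": "well done", "type": "encouragement", "intensity": 0.7, "polarity": "positive"},
    ],
}

def select_emotion_vocabulary(scene, output_emotion_types):
    # Search the database for vocabulary whose output emotion type matches the instruction.
    return [entry for entry in EMOTION_PRESENTATION_DB.get(scene, [])
            if entry["type"] in output_emotion_types]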
It will be appreciated that the executable instructions may also include functional operations that the computer device needs to perform, such as responding to answers to a user's questions, and the like.
Further, the intention information comprises a basic intention of the user, and the executable instruction comprises content matched with the basic intention, wherein the basic intention of the user is one or more of preset transaction intention categories. The method for obtaining the basic intention may refer to the embodiment shown in fig. 5, and is not described herein again.
Preferably, the emotional modality is determined according to at least one modality of the user data. Still further, the emotional modality is the same as at least one modality of the user data.
In the embodiment of the present invention, in order to ensure the fluency of the interaction, the emotional modality of the output emotional state fed back by the computer device may be consistent with the modality of the user data, in other words, the emotional modality may be selected from at least one modality of the user data.
It will be appreciated that the emotional modalities may also be determined in connection with interaction scenarios, conversation categories. For example, in the scenes of daily chat, business consultation and the like, the emotional modalities are usually voice and text; when the conversation type is question answering system (FAQ), the emotion modality is mainly text; when the dialogue type is navigation, the emotion modality is mainly images and assisted by voice.
Referring also to FIG. 12, further, determining executable instructions based on the emotional state and the intent information may include:
step S601: after the executable instruction is generated in the last round of emotion interaction, determining the executable instruction according to the emotion state and the intention information in the current interaction, or
Step S602: if the emotional state is dynamically changed and the variation of the emotional state exceeds a preset threshold value, determining an executable instruction at least according to the emotional intention corresponding to the changed emotional state;
alternatively, step S603: and if the emotional state is dynamically changed, determining the corresponding executable instruction according to the dynamically changed emotional state within a set time interval.
In this embodiment, the specific process of determining the executable instruction by the computer device may be related to an application scenario, and different policies may exist in different applications.
In the specific implementation of step S601, different interaction processes are independent from each other, and only one executable instruction is generated in one emotion interaction process. And after the executable instruction of the previous round of emotion interaction is determined, the executable instruction in the current interaction is determined.
In the implementation of step S602, for the case of a dynamically changing emotional state, the emotional state varies with time. The computer device may trigger the next interaction only after the change of the emotional state exceeds a predetermined threshold, that is, determine the executable instruction according to the emotional intention corresponding to the changed emotional state. In a specific implementation, if the emotional state changes dynamically, a first emotional state sampled at a certain instruction may serve as the reference emotional state; the emotional state is then sampled at a set sampling frequency, for example every 1 s, and only when the change between the sampled emotional state and the reference emotional state exceeds the predetermined threshold is the current emotional state input into the feedback mechanism for adjusting the interaction policy. Alternatively, the emotional state may be fed back at the set sampling frequency: starting from a certain instruction, the emotional state is sampled at the set frequency, for example every 1 s, and used in the same way as in the static case. Further, an emotional state whose change exceeds the predetermined threshold needs to be adjusted by combining it with historical data (e.g., the reference emotional state, the emotional state of the previous round of interaction) before it is used to determine the interactive instruction (e.g., to smooth the emotional transition), and feedback is then performed based on the adjusted emotional state to determine the executable instruction.
In the specific implementation of step S603, for the case of dynamically changing emotional state, the computer device may generate a discontinuous executable instruction with a change, that is, determine the corresponding executable instruction for the emotional state within the set time interval.
In addition, the dynamic emotional state change can also be stored as context interaction data and participate in the subsequent emotional interaction process.
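The threshold-triggered feedback of step S602 could be organized as in the following sketch, which samples the emotional state at a fixed interval, smooths a large change against historical data, and only then passes it on for instruction determination; the sampling function, smoothing weights and parameter names are assumptions.

import time

def monitor_dynamic_emotion(sample_fn, reference, threshold, interval=1.0):
    # sample_fn returns a scalar emotional-state value (e.g. an arousal coordinate).
    history = [reference]
    while True:
        time.sleep(interval)                      # e.g. sample every 1 s
        current = sample_fn()
        if abs(current - reference) > threshold:
            # Adjust the emotional state with historical data before feedback,
            # e.g. smoothing the emotional transition.
            smoothed = 0.7 * current + 0.3 * (sum(history) / len(history))
            history.append(smoothed)
            yield smoothed                        # drives determination of the executable instruction
        else:
            history.append(current)

Each value yielded by this generator would feed the feedback mechanism that adjusts the interaction policy.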
The executable instructions may be determined by matching with rule logic, by learning systems (e.g., neural networks, reinforcement learning), or by a combination thereof. Further, the emotional state and the intention information are matched with a preset instruction library to obtain the executable instruction through matching.
Referring to fig. 1 and 13 together, after determining the executable instruction, the emotion interaction method may further include the steps of:
step S701: when the executable instruction comprises an emotional modality and an output emotional state, executing the executable instruction, and presenting the output emotional state to the user by using the emotional modality;
step S702: when the executable instruction comprises an emotional modality, an output emotional state and emotional intensity, executing the executable instruction, and presenting the output emotional state to the user according to the emotional modality and the emotional intensity.
In this embodiment, the computer device may present corresponding content or perform corresponding operations according to specific parameters of the executable instructions.
In a specific implementation of step S701, the executable instructions include an emotional modality and an output emotional state, and the computer device will present the output emotional state in a manner indicated by the emotional modality. In the specific implementation of step S702, the computer device will also present the emotional intensity of the output emotional state.
In particular, the emotional modalities may represent channels of a user interface that output emotional state presentations, such as text, expressions, gestures, speech, and so forth. The emotional state that the computer device ultimately presents may be a single modality or a combination of modalities. The computer equipment can present texts, images or videos through a text or image output device such as a display; voice is presented through a speaker, etc. Further, for the joint presentation of output emotional states via a plurality of emotional modalities, collaborative operations are involved, such as spatial and temporal collaboration: time synchronization of the content presented by the display and the voice broadcast content; spatial and temporal synchronization: the robot needs to move to a specific location while playing/showing other modality information, etc.
It will be appreciated that the computer device may perform functional operations in addition to presenting the output emotional state. The execution function operation may be a feedback operation for basic intention understanding, and may have explicit presentation contents. Such as replying to content consulted by the user; perform an operation commanded by the user, and the like.
Further, the user's emotional intention may affect the operation corresponding to their basic intention, and the computer device may alter or modify the direct operation on the basic intention when executing the executable instructions. For example, the user commands the smart wearable device: "schedule another 30 minutes of exercise"; the basic intention is clear. The prior art, which has neither an emotion recognition function nor an emotional interaction step, would simply set the time directly. In the technical solution of the present invention, however, if the computer device detects that the user's heartbeat, blood pressure and other data deviate greatly from normal values and show serious "hyperexcitability", it may broadcast warning information by voice to prompt the user: "Your heartbeat is currently too fast, and prolonged exercise may be harmful to your health; please confirm whether to extend the exercise time", and then make a further interactive decision according to the user's reply.
It should be noted that, after the content indicated by the executable instruction is presented to the user by the computer device, the user may be prompted to carry out the next round of emotion interaction, thereby entering a new emotion interaction process. The previous interactive content, including the emotional state and the intention information, serves as the user's context interaction data in the subsequent emotion interaction process. Context interaction data may also be stored and used for iterative learning and for improving the determination of intention information.
In another specific application scenario of the invention, the intelligent wearable device performs emotion recognition by measuring physiological signals, determines intention information by intention analysis, generates executable instructions, and sends pictures, music, or warning sounds and the like matched with the executable instructions through output devices such as a display screen or a loudspeaker to perform emotion feedback, such as pleasure, surprise, encouragement and the like.
For example, a user who is running says to the smart wearable device by voice: "How long have I been running?" The smart wearable device captures the user's voice and heartbeat data through its microphone and real-time heartbeat sensor and performs emotion recognition. Analyzing the user's voice features yields the emotion of interest in this scene, "dysphoria", and analyzing the user's heartbeat features yields another emotional state, "hyperexcitability"; these emotional states can be represented by a categorical emotion model. Meanwhile, the smart wearable device converts the voice into text and obtains the user's basic intention by matching domain semantics, namely "obtain the duration of the user's exercise". This step may require a semantic library of the medical and health domain as well as personalized information.
Linking the user's emotional states "dysphoria" and "hyperexcitability" with the basic intention "obtain the duration of the user's exercise", it can be inferred that the user wants to obtain the duration of exercise, shows dysphoria, and that the current exercise may be causing discomfort symptoms such as hyperexcitability. This step may require an emotion semantic library of the medical and health domain as well as personalized information.
The final feedback of the smart wearable device needs to satisfy the requirements of the application scene. Suppose the preset emotion policy database specifies: for a user whose intention is "obtain real-time exercise data", if the user's emotional state is "dysphoria", the "real-time exercise data" should be output while the emotion "soothing" is presented; if the physiological signal shows that the emotional state is "hyperexcitability", the emotion "warning" should also be presented, with emotional intensities of medium and high respectively. At this moment, the smart wearable device designates the output devices according to the current interactive content and the state of the emotion output devices, and issues an executable instruction to output the "exercise time" on the screen and to present the emotions "soothing" and "warning" through voice broadcast, with emotional intensities of medium and high respectively.
At this moment, the voice parameters of the smart wearable device, such as the pitch and speed of its voice output, need to be adjusted according to the emotional states "soothing" and "warning" and the corresponding emotional intensities. The output conforming to this executable instruction may be a voice broadcast with a gentle pitch and slow speech speed: "You have exercised for 35 minutes this time. Well done! The duration for aerobic exercise has been reached. Your current heartbeat is slightly fast; if you feel uncomfortable symptoms such as an overly fast heartbeat, please interrupt the current exercise and take deep breaths to adjust." Considering privacy or the presentation of the interactive content, the smart wearable device may also avoid the voice broadcast and switch to plain text, or present the content through video and animation. A sketch of mapping output emotions to speech parameters is given below.
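The following sketch illustrates one possible way to derive speech parameters from an output emotional state and its intensity; the parameter names (rate, pitch, volume) and the numeric values in PROSODY_RULES are assumptions made for illustration, not parameters defined by this embodiment.

```python
# Illustrative mapping from (output emotion, intensity) to prosody parameters.
PROSODY_RULES = {
    ("soothing", "medium"): {"rate": 0.85, "pitch": 0.95, "volume": 0.8},
    ("warning",  "high"):   {"rate": 1.00, "pitch": 1.10, "volume": 1.0},
}
DEFAULT_PROSODY = {"rate": 1.0, "pitch": 1.0, "volume": 0.9}

def prosody_for(output_emotion, intensity):
    """Look up prosody settings for one output emotional state."""
    return PROSODY_RULES.get((output_emotion, intensity), dict(DEFAULT_PROSODY))

def speak(text, emotions):
    """Announce the text once per requested emotion; a real implementation
    would blend the parameters or split the text across sentences."""
    for emotion, intensity in emotions:
        params = prosody_for(emotion, intensity)
        print(f"TTS({params}): {text}")

speak("You have exercised for 35 minutes this time.",
      [("soothing", "medium"), ("warning", "high")])
```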
As shown in FIG. 14, the embodiment of the invention also discloses an emotion interaction device 80. The emotion interaction apparatus 80 can be used for the computer device 102 shown in FIG. 1. In particular, emotion interaction means 80 may be internally integrated with or externally coupled to the computer device 102.
The emotion interaction apparatus 80 may include a user data acquisition module 801, an emotion recognition module 802, an intention information determination module 803, and an interaction module 804.
The user data obtaining module 801 is configured to obtain user data; the emotion recognition module 802 is configured to perform emotion recognition on the user data to obtain an emotion state of the user; the intention information determining module 803 is used to determine intention information at least according to the user data; the interaction module 804 is used for controlling interaction with the user according to the emotional state and the intention information.
According to the embodiment of the invention, the emotion state of the user is obtained by identifying the user data of at least one mode, so that the accuracy of emotion identification can be improved; in addition, the emotional state can be used for controlling interaction with the user by combining the intention information, so that feedback aiming at user data can carry the emotional data, the interaction accuracy is further improved, and the user experience in the interaction process is improved.
Preferably, the intention information includes an emotional intention corresponding to the emotional state, and the emotional intention includes the emotional need of the emotional state. In the embodiment of the invention, the emotional need associated with the emotional state can also be obtained on the basis of user data of at least one modality; that is, the intention information includes the user's emotional need. For example, when the user's emotional state is sadness, the emotional intention may include the user's emotional need for "comfort". Using the emotional intention in the interaction with the user makes the interaction process more humanized and improves the user experience.
Preferably, referring to fig. 14 and 15 together, the intention information determining module 803 may include: a first context interaction data determining unit 8031, configured to determine context interaction data, where the context interaction data includes context emotional state and/or context intention information; an emotional intent determination unit 8032, configured to determine the emotional intent according to the user data, the emotional state, and the context interaction data, where the intent information includes the emotional intent.
In this embodiment, the context interaction data may be used to determine the emotional state. When the current emotional state is ambiguous, for example when it cannot be recognized or when several candidate emotional states cannot be distinguished, the context interaction data can be used to further discriminate between them, so as to ensure that the emotional state of the current interaction is determined.
Specifically, an ambiguous emotional state means that the user's emotional state cannot be determined in the current interaction. For example, the user's current sentence may not allow the emotional state to be judged with high confidence, while the user's emotion in the previous round of interaction was clearly excited; when the emotional state of the previous round is obvious, it is used as a reference, which avoids the situation where the emotional state of the current round cannot be acquired because the emotion judgment fails.
The context interaction data may include a context emotional state and/or context intention information. Further, the context interaction data may be null when the user is in the first round of interaction.
Contextual interaction data may also be used for intent understanding, determining basic intent. The basic intention requires context correlation; the relationship between emotional state and basic intention also needs context information to be determined.
Further, the contextual interaction data may include interaction data in previous interaction sessions and/or other interaction data in the current interaction session.
In this embodiment, the interaction data in previous interactive dialogs refers to the intention information and emotional states in those previous dialogs; the other interaction data in the current interactive dialog refers to other intention information and other emotional states in the current dialog.
In a specific implementation, the other interaction data may be the context of the user data in the current interactive dialog. For example, when the user speaks a passage, it may be divided into several segments that are processed as context for one another; likewise, a continuous data stream collected by the data acquisition device may be sampled at multiple time points that serve as context for one another.
The interaction data may be the context of multiple interactions. For example, a user has made multiple rounds of conversations with a machine, the content of each round of conversation being contextual to each other.
Contextual interaction data may also include long-term historical data. The long-term historical data can be user data formed by long-term accumulation when the time limit of the current multiple rounds of conversations is exceeded.
Further, emotion intention determination unit 8032 may include: a timing acquisition subunit (not shown) configured to acquire a timing of the user data; a determining subunit (not shown) for determining the emotional intent at least according to the timing, the emotional state, and the context interaction data.
In this embodiment, the time sequence of the user data may be obtained according to a preset time sequence rule; the time sequence of the user data can also be determined according to the time sequence of acquiring the user data; in this case, the time sequence for acquiring the user data may be directly called.
Further, the determining subunit may include: a first focus content extraction subunit, configured to extract, based on the timing of the user data, the focus content corresponding to each timing in the user data; a matching subunit, configured to match, for each timing, the focus content corresponding to the timing with the content in an emotion type library, and to determine the emotion type corresponding to the matched content as the focus emotion type of that timing; and a final determining subunit, configured to determine the emotional intention, in timing order, using the focus emotion type, the emotional state and the context interaction data corresponding to each timing.
In this embodiment, the focus content, the focus emotion type, the emotional state and the context interaction data each correspond to a timing, and the context interaction data corresponding to a timing is the emotional state and intention information of the timing preceding it.
In another preferred embodiment of the present invention, the emotional intent determination unit 8032 may further include: a first Bayesian network computing subunit to determine the emotional intent using a Bayesian network based on the user data, the emotional state, and the contextual interaction data; the first matching calculation subunit is used for matching the user data, the emotional state and the context interaction data with preset emotional intentions in an emotional semantic library to obtain the emotional intentions; the first search subunit is used for searching in a preset intention space by using the user data, the emotional state and the context interaction data to determine the emotional intention, and the preset intention space comprises a plurality of emotional intentions.
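As a rough illustration of the matching option above, the sketch below scores entries of a small, made-up emotion semantic library against the user data, the emotional state and the contextual emotional state; the library content and the scoring weights are invented for the example and do not reflect an actual library.

```python
# Minimal sketch of the "match against a preset emotional-intent library" option.
EMOTION_INTENT_LIBRARY = [
    {"emotion": "dysphoria", "keywords": {"run", "exercise", "time"},
     "intent": "needs soothing; exercise may be causing discomfort"},
    {"emotion": "sadness", "keywords": {"alone", "miss"},
     "intent": "needs comfort"},
]

def match_emotional_intent(user_text, emotional_state, context_emotion=None):
    """Score each library entry by keyword overlap plus agreement with the
    current and contextual emotional states, and return the best match."""
    tokens = set(user_text.lower().split())
    best, best_score = None, 0
    for entry in EMOTION_INTENT_LIBRARY:
        score = len(tokens & entry["keywords"])
        if entry["emotion"] == emotional_state:
            score += 2
        if context_emotion and entry["emotion"] == context_emotion:
            score += 1
        if score > best_score:
            best, best_score = entry["intent"], score
    return best

print(match_emotional_intent("how long have I been running",
                             "dysphoria", "hyperexcitability"))
```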
In a specific embodiment of the invention, the intention information includes the emotional intention and a basic intention, the emotional intention includes an emotional requirement of the emotional state and an association relationship between the emotional state and the basic intention, and the basic intention is one or more of preset transaction intention categories.
In particular implementations, the transaction intention categories may be explicit intention categories related to business and operations, depending on the application domain and scene, such as "open a bank card" and "transfer business" in the banking field, or personal-assistant categories such as "consult calendar" and "send mail". Transaction intention categories are generally unrelated to emotions.
Further, the association relationship between the emotional state and the basic intention may be preset. Specifically, when an emotional state and a basic intention are associated, the relationship is usually predetermined, and the association may affect the data that is ultimately fed back to the user. For example, when the basic intention is to control an exercise machine and the emotional state associated with it is excitement, then if the user's basic intention is to increase the operating speed of the exercise machine, the computer device's final feedback may be to prompt the user that the operation could be dangerous, out of consideration for the user's safety.
Alternatively, the association relationship between the emotional state and the basic intention may be obtained based on a preset training model, for example by determining the relevance of the emotional state to the basic intention with a trained end-to-end model. The preset training model may be a fixed deep network model whose inputs are the emotional state and the current interaction environment, and it may also be continuously updated through online learning (for example, a reinforcement learning model with an objective function and a reward function, which keeps updating and evolving as the number of human-machine interactions grows).
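A minimal sketch of the preset association relationship is given below; the rule table and the adjustment labels are hypothetical, and in practice the mapping could equally be produced by the trained model mentioned above.

```python
# Sketch of a preset association table between emotional states and basic intents.
ASSOCIATION_RULES = {
    # (emotional state, basic intent) -> how the feedback should be adjusted
    ("excitement", "increase_device_speed"): "confirm_before_execute",
    ("hyperexcitability", "extend_exercise_time"): "warn_and_confirm",
}

def adjust_feedback(emotional_state, basic_intent):
    """Return the adjustment applied to the basic intent, or execute directly
    when no association is registered for this pair."""
    return ASSOCIATION_RULES.get((emotional_state, basic_intent), "execute_directly")

print(adjust_feedback("excitement", "increase_device_speed"))  # confirm_before_execute
print(adjust_feedback("calm", "increase_device_speed"))        # execute_directly
```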
In the embodiment of the invention, the intention information includes both the user's emotional needs and the preset transaction intention categories, so that when the intention information is used to control the interaction, the user's emotional needs can be satisfied while the user's question is answered, further improving the user experience. In addition, the intention information also includes the association relationship between the emotional state and the basic intention, through which the user's current real intention can be judged; therefore, during the interaction, the association relationship can be used to determine the final feedback information or operation, improving the accuracy of the interaction process.
The context interaction data comprises interaction data in previous interaction conversations and/or other interaction data in the current interaction conversation.
In specific implementation, the current interaction environment can be determined by the application scene of emotional interaction, such as interaction place, interaction environment, dynamic change update of computer equipment and the like.
More specifically, the current interaction environment may include a preset interaction environment and a real-time interaction environment. The preset interaction environment may be a long-term scene setting, which directly influences the logic rules, semantic library and knowledge base of the application. The real-time interaction environment may be extracted from the current interaction information, i.e. derived from the user data and/or the context interaction data. For example, if a user files a report through a public service assistant, the preset interaction environment may prompt the user to choose a reporting channel such as "telephone, web page, mobile phone photo, GPS"; if the user is on site, the real-time interaction environment can be updated directly and the more convenient mobile-phone-photo and GPS channels recommended. The current interaction environment can improve the accuracy of intention understanding.
Preferably, referring to fig. 14 and 16 together, the intention information determining module 803 may include: a semantic acquiring unit 8033, configured to acquire a time sequence of the user data and a semantic of the user data of each time sequence; a context intention information determining unit 8034 to determine context intention information; a basic intention determining unit 8035, configured to determine a basic intention according to the semantics of the user data and the contextual intention information, where the intention information includes the basic intention, and the basic intention of the user is one or more of preset transaction intention categories.
Acquiring the timing of the user data means that, when the user data contains a plurality of operations or a plurality of intentions, the timing information of those operations needs to be determined, since the timing of each operation may affect the subsequent intention information.
The specific manner of obtaining the semantics of the user data of each timing may be determined by the modality of the user data. When the user data is text, its semantics can be determined directly through semantic analysis; when the user data is voice, the voice can first be converted into text and then analyzed. The user data may also be fused multi-modal data, whose semantics can be extracted in combination with the specific application scene; for example, when the user data is a picture without any text, the semantics can be obtained by image understanding techniques.
Specifically, the semantics can be obtained through a natural language processing and semantic library matching process.
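The sketch below illustrates the modality-dependent semantic extraction described above; the extractor functions merely stand in for a real ASR engine, NLP pipeline and image-understanding model, so their outputs are placeholders.

```python
# Sketch of modality-dependent semantic extraction.
def text_semantics(text):
    return {"source": "text", "semantics": text.lower()}

def speech_semantics(audio):
    transcript = f"<transcript of {audio}>"   # placeholder for ASR output
    return {"source": "speech", "semantics": transcript.lower()}

def image_semantics(image):
    return {"source": "image", "semantics": f"<description of {image}>"}

EXTRACTORS = {"text": text_semantics,
              "speech": speech_semantics,
              "image": image_semantics}

def extract_semantics(modality, payload):
    """Route user data of one modality to the matching extractor."""
    return EXTRACTORS[modality](payload)

print(extract_semantics("speech", "run_query.wav"))
```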
Further, the basic intention determining unit 8035 may include a timing acquiring subunit (not shown) for acquiring the timing of the user data and the semantics of the user data of each timing; a calculating subunit (not shown) configured to determine the basic intention at least according to the time sequence, the semantics of the user data of each time sequence, and the contextual intention information corresponding to the time sequence.
In a preferred embodiment of the present invention, the computer device may determine the basic intent in conjunction with the current interaction environment, contextual interaction data, and user data.
The basic intention determining unit 8035 may further include: the second focus content extraction subunit is used for extracting focus content corresponding to each time sequence in the user data; the current interaction environment determining subunit is used for determining the current interaction environment; the context intention information determining subunit is used for determining context intention information corresponding to the time sequence; a final calculation subunit, configured to determine, for each time sequence, a basic intention of the user using relevant information corresponding to the time sequence, where the relevant information includes: the focused content, the current interaction environment, the contextual intent information, the timing, and the semantics.
In this embodiment, the contextual intention information includes intention information in previous interactive dialogs and/or other intention information in the current interactive dialog.
To more accurately determine the basic intent of the user, the focus content, the current interaction environment, and the contextual intent information in the contextual interaction data may be combined. Especially when the basic intention of the user is ambiguous, the basic intention of the user, such as the service required to be acquired by the user, can be more accurately inferred through the current interaction environment and the context interaction data, so that the follow-up more accurate feedback on the user is facilitated.
In specific implementation, the current interaction environment can be determined by the application scene of emotional interaction, such as interaction place, interaction environment, dynamic change update of computer equipment and the like.
More specifically, the current interaction environment may include a preset interaction environment and a real-time interaction environment. The preset interaction environment may be a long-term scene setting, which directly influences the logic rules, semantic library and knowledge base of the application. The real-time interaction environment may be extracted from the current interaction information, i.e. derived from the user data and/or the context interaction data. For example, if a user files a report through a public service assistant, the preset interaction environment may prompt the user to choose a reporting channel such as "telephone, web page, mobile phone photo, GPS"; if the user is on site, the real-time interaction environment can be updated directly and the more convenient mobile-phone-photo and GPS channels recommended. The current interaction environment can improve the accuracy of intention understanding.
Further, the final calculation subunit may include: a second Bayesian network computing subunit configured to determine, for each time sequence, the basic intent using a Bayesian network based on relevant information corresponding to the time sequence; the second matching calculation subunit is used for matching the relevant information corresponding to each time sequence with a preset basic intention in a semantic library so as to obtain the basic intention; and the second searching subunit is used for searching the relevant information corresponding to the time sequence in a preset intention space to determine the basic intention, wherein the preset intention space comprises a plurality of basic intents.
Optionally, the intention information determining module 803 may further include: and the basic intention calling unit is used for obtaining a basic intention corresponding to the user data through calling and adding the basic intention into the intention information, wherein the basic intention of the user is one or more of preset transaction intention categories.
Specifically, the preset transaction intention categories may be stored in advance on a local server or a cloud server. A local server can match the user data directly using a semantic library, a search method and so on, while a cloud server can be invoked through an interface by passing parameters. More specifically, there are various ways of matching: transaction intention categories can be predefined in a semantic library and matched by computing the similarity between the user data and the preset categories; matching can also be performed through a search algorithm, or through classification by deep learning, and so on.
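The following sketch shows the similarity-matching option using a generic string-similarity measure; the category list, descriptions and threshold are illustrative assumptions rather than the semantic library actually used.

```python
# Illustrative similarity matching of user data against preset transaction
# intent categories; a deployment might call a cloud service instead.
from difflib import SequenceMatcher

TRANSACTION_INTENTS = {
    "open_bank_card": "open a bank card",
    "transfer": "transfer money to another account",
    "consult_calendar": "check my calendar schedule",
    "send_mail": "send an email",
}

def match_basic_intent(user_text, threshold=0.4):
    """Return the best-matching preset transaction intent category, or None."""
    scored = {
        intent: SequenceMatcher(None, user_text.lower(), desc).ratio()
        for intent, desc in TRANSACTION_INTENTS.items()
    }
    intent, score = max(scored.items(), key=lambda kv: kv[1])
    return intent if score >= threshold else None

print(match_basic_intent("please send an email to my manager"))
```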
Preferably, referring to fig. 14 and 17, the interaction module 804 may include an executable instruction determining unit 8041 configured to determine executable instructions according to the emotional state and the intention information, so as to perform emotional feedback on the user.
Preferably, the executable instructions comprise at least one emotion modality and at least one output emotion type;
the interaction module 804 further includes an output emotion type presentation unit, configured to present one or more output emotion types of the at least one output emotion type according to each of the at least one emotion modality.
The emotion modality may include at least one of a text emotion presentation modality, a sound emotion presentation modality, an image emotion presentation modality, a video emotion presentation modality, and a mechanical motion emotion presentation modality, which is not limited in the present invention.
In this embodiment, the output emotional state is expressed as an emotion category, or as a coordinate point or region in a preset multi-dimensional emotion space. The output emotional state may also be an output emotion type.
The output emotional state includes a static output emotional state and/or a dynamic output emotional state. A static output emotional state can be represented by a discrete emotion model or a dimensional emotion model without a time attribute, to express the output emotional state of the current interaction; a dynamic output emotional state can be represented by a discrete emotion model, a dimensional emotion model or another model with a time attribute, to express the output emotional state at a certain time point or within a certain time period. More specifically, a static output emotional state may be represented as an emotion category or within a dimensional emotion model. A dimensional emotion model is an emotion space formed by several dimensions, where each output emotional state corresponds to a point or a region in the space and each dimension is a factor describing emotion, for example the two-dimensional activation-pleasure space or the three-dimensional activation-pleasure-dominance space. A discrete emotion model represents the output emotional state in the form of discrete labels, for example the six basic emotions: happiness, anger, sadness, surprise, fear and disgust.
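The data structure below is one possible way to hold either representation; the field names are assumptions chosen for the example, and a time stamp is attached only when the state is dynamic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Sketch of the two representations described above: a discrete label
# (e.g. one of the six basic emotions) or a point in a dimensional space
# such as activation-pleasure(-dominance), optionally time-stamped.
@dataclass
class OutputEmotionalState:
    label: Optional[str] = None                      # discrete emotion model
    coordinates: Optional[Tuple[float, ...]] = None  # dimensional emotion model
    timestamp: Optional[float] = None                # present only for dynamic states

static_discrete = OutputEmotionalState(label="happiness")
dynamic_dimensional = OutputEmotionalState(coordinates=(0.7, 0.6), timestamp=12.5)
print(static_discrete, dynamic_dimensional)
```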
The executable instructions should have a well-defined executable meaning and be readily understood and accepted. The content of the executable instructions may include at least one emotion modality and at least one output emotion type.
It should be noted that the final emotion presentation may use only one emotion modality, such as the text emotion modality, or a combination of several emotion modalities, such as text combined with sound, or text combined with sound and image.
The output emotional state may be an output emotion type (also called an emotion component) or an emotion category, and can be represented by a categorical output emotion model or a dimensional output emotion model. The emotional states of a categorical output emotion model are discrete, so it is also called a discrete output emotion model; an output emotion type in the categorical model can be defined as a region and/or a set of at least one point in a multi-dimensional emotion space. A dimensional output emotion model constructs a multi-dimensional emotion space in which each dimension corresponds to a psychologically defined emotion factor; under such a model the output emotional state is represented by coordinate values in the emotion space. The dimensional output emotion model may be continuous or discrete.
Specifically, the discrete output emotion model is the main and recommended form of emotion types; it classifies the emotions presented by the emotion information according to the domain and application scene, and the output emotion types of different domains or scenes may be the same or different. For example, in the general domain a basic emotion classification system is usually adopted, i.e. a multi-dimensional emotion space comprising the six basic emotion dimensions of happiness, anger, sadness, surprise, fear and disgust; in the customer service domain, commonly used emotion types may include, but are not limited to, happiness, sadness, comfort and dissuasion; in the companion care domain, commonly used emotion types may include, but are not limited to, happiness, sadness, curiosity, consolation, encouragement and dissuasion.
The dimensional output emotion model is a complementary method; at present it is used only for continuously and dynamically changing situations and for subsequent emotion computation, for example when parameters need to be fine-tuned in real time or when the contextual emotional state has a large influence on the computation. Its advantage is that it is convenient for computation and fine-tuning, but it must later be matched with the presentation parameters of the application to be exploited.
In addition, each domain has output emotion types of primary interest (the emotion types of interest obtained by emotion recognition from user information in that domain) and output emotion types of primary presentation (the emotion types in emotion presentation or interactive instructions); these may be two different groups of emotion categories (categorical output emotion model) or different ranges of emotion dimensions (dimensional output emotion model). In a given application scene, determining the primarily presented output emotion types that correspond to the primarily concerned output emotion types of the domain is accomplished through an emotion instruction decision process.
When the executable instruction includes a plurality of emotion modalities, the at least one output emotion type is preferentially presented using the text emotion modality and then supplementarily presented using one or more of the sound, image, video and mechanical motion emotion modalities. Here, the supplementarily presented output emotion type may be an output emotion type not presented by the text emotion modality, or one whose emotional intensity and/or emotion polarity as presented by the text modality does not meet the requirements of the executable instruction.
It should be noted that the executable instruction can specify one or more output emotion types, which can be ordered according to their emotional intensities to determine which are primary and which are secondary during emotion presentation. Specifically, if the emotional intensity of an output emotion type is below a preset intensity threshold, that type can be considered no more prominent in the presentation than the other output emotion types in the instruction whose intensities reach the threshold.
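A small sketch of this ordering follows; the intensity values and the threshold are made up for illustration.

```python
# Sketch of ranking the output emotion types carried by one executable
# instruction; entries below a threshold are treated as secondary.
INTENSITY_THRESHOLD = 0.5

def rank_output_emotions(emotions):
    """Split output emotion types into primary and secondary by intensity."""
    ordered = sorted(emotions.items(), key=lambda kv: kv[1], reverse=True)
    primary = [(name, i) for name, i in ordered if i >= INTENSITY_THRESHOLD]
    secondary = [(name, i) for name, i in ordered if i < INTENSITY_THRESHOLD]
    return primary, secondary

print(rank_output_emotions({"soothing": 0.6, "warning": 0.9, "encouragement": 0.3}))
```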
In an embodiment of the invention, the selection of the emotional modality depends on the following factors: the emotion output device and the application state thereof (for example, whether a display for displaying text or images is provided or not, whether a speaker is connected or not, and the like), the type of interaction scene (for example, daily chat, business consultation, and the like), the type of conversation (for example, the answer of common questions is mainly text reply, and the navigation is mainly image and voice), and the like.
Further, the output mode of the emotion presentation depends on the emotion modality. For example, if the emotion modality is a text emotion modality, the output mode of the final emotion presentation is a text mode; if the emotion mode is mainly the text emotion mode and is assisted by the sound emotion mode, the final emotion presentation output mode is a mode of combining text and voice. That is, the output of the emotional presentation may include only one emotional modality, or may include a combination of emotional modalities, which is not limited by the present invention.
According to the technical scheme of the embodiment of the invention, an executable instruction is acquired that includes at least one emotion modality and at least one output emotion type, the at least one emotion modality including the text emotion modality, and one or more of the output emotion types are presented according to each of the emotion modalities; a multi-modal, text-centered emotion presentation is thereby realized, improving the user experience.
In another embodiment of the present invention, presenting one or more of the at least one output emotion type according to each of the at least one emotion modality includes: searching an emotion presentation database according to the at least one output emotion type to determine at least one emotion vocabulary item corresponding to each output emotion type; and presenting the at least one emotion vocabulary item.
Specifically, the emotion presentation database may be manually annotated in advance, obtained through big-data learning, obtained through semi-supervised human-machine cooperation combining learning and manual annotation, or even obtained by training the whole interactive system on a large amount of emotional dialogue data. It should be noted that the emotion presentation database allows online learning and updating.
The emotion vocabulary and the output emotion type, emotion intensity and emotion polarity parameters thereof can be stored in an emotion presentation database and also can be obtained through an external interface. In addition, the emotion presentation database comprises a set of emotion vocabularies of a plurality of application scenes and corresponding parameters, so that the emotion vocabularies can be switched and adjusted according to actual application conditions.
The emotion vocabulary can be classified according to the emotion state of the concerned user under the application scene. That is, the output emotion type, emotion intensity and emotion polarity of the same emotion vocabulary are related to the application scene. Where emotion polarity may include one or more of positive, negative and neutral.
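The sketch below shows a lookup against a toy emotion presentation database; the vocabulary entries, scene labels and parameter values are invented for the example.

```python
# Toy emotion presentation database. Each vocabulary entry records the output
# emotion type, intensity, polarity and the application scene it belongs to.
EMOTION_PRESENTATION_DB = [
    {"word": "well done",    "type": "encouragement", "intensity": 0.7,
     "polarity": "positive", "scene": "companion"},
    {"word": "take it easy", "type": "soothing",      "intensity": 0.5,
     "polarity": "positive", "scene": "medical"},
    {"word": "be careful",   "type": "warning",       "intensity": 0.8,
     "polarity": "neutral",  "scene": "medical"},
]

def vocabulary_for(output_types, scene):
    """Return candidate emotion vocabulary for each requested output emotion type."""
    return {
        t: [e["word"] for e in EMOTION_PRESENTATION_DB
            if e["type"] == t and e["scene"] == scene]
        for t in output_types
    }

print(vocabulary_for(["soothing", "warning"], scene="medical"))
```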
The executable instruction determination unit 8041 includes: a first executable instruction determining subunit 80411, configured to determine an executable instruction according to the emotional state and the intention information in the current interaction after the executable instruction is generated in the previous emotion interaction; a second executable instruction determining subunit 80412, configured to determine, when the emotional state is dynamically changed and a variation amount of the emotional state exceeds a predetermined threshold, an executable instruction according to at least an emotional intention corresponding to the changed emotional state; a third executable instruction determining subunit 80413, configured to determine, when the emotional state is dynamically changed, the corresponding executable instruction according to the dynamically changed emotional state within a set time interval.
In a specific implementation, if the emotional state changes dynamically, then after a first emotional state is sampled at some instruction as the reference emotional state, the emotional state may be sampled at a set sampling frequency, for example once per second, and only when the change between the sampled emotional state and the reference emotional state exceeds a predetermined threshold is the current emotional state fed into the feedback mechanism for adjusting the interaction policy. Further, an emotional state exceeding the predetermined threshold needs to be adjusted in combination with historical data (e.g. the reference emotional state, the emotional state of the previous round of interaction) before being used to determine the interactive instruction, for example by smoothing the emotional transition; feedback is then performed based on the adjusted emotional state to determine the executable instruction.
If the emotional state is dynamically changed, the emotional state can also be fed back by adopting a set sampling frequency. That is, starting from a certain instruction, the emotional state is sampled by adopting a set sampling frequency, for example, the emotional state is sampled every 1s, and the use condition of the emotional state is consistent with the static condition.
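The following sketch combines the two dynamic cases described above: it samples at a fixed rate, smooths each sample against the current reference, and yields an adjusted state to the feedback mechanism only when the drift exceeds the threshold. The threshold, the smoothing weight and the use of a single scalar emotion value are simplifying assumptions.

```python
# Sketch of dynamic emotional-state feedback with threshold and smoothing.
CHANGE_THRESHOLD = 0.3
SMOOTHING = 0.5  # weight given to the newest sample

def monitor(samples, reference):
    """samples: emotional-state values sampled e.g. once per second."""
    smoothed = reference
    for value in samples:
        smoothed = SMOOTHING * value + (1 - SMOOTHING) * smoothed
        if abs(smoothed - reference) > CHANGE_THRESHOLD:
            yield smoothed  # hand the adjusted state to the feedback mechanism
            reference = smoothed

# e.g. valence drifting away from a calm reference of 0.0
print(list(monitor([0.1, 0.2, 0.5, 0.9], reference=0.0)))
```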
The executable instruction determination unit 8041 may further include: the matching subunit 80414 is configured to match the emotional state and the intention information with a preset instruction library, so as to obtain the executable instruction by matching.
The executable instructions comprise an emotional modality and an output emotional state; or the executable instructions include an emotional modality, an output emotional state, and an emotional intensity. When the executable instructions include an emotional modality, an output emotional state, and an emotional intensity, the output emotional state and the emotional intensity may be represented by way of multi-dimensional coordinates or discrete states.
In the embodiment of the present invention, the executable instruction may be executed by the computer device, and the form of the data output by the computer device may be indicated in the executable instruction: emotional modality and output emotional state; that is, the data ultimately presented to the user is the output emotional state of the emotional modality, thereby enabling emotional interaction with the user. In addition, the executable instructions can also comprise emotional intensity, the emotional intensity can represent the intensity of the output emotional state, and the emotional interaction with the user can be better realized by utilizing the emotional intensity.
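A minimal sketch of such an instruction and of the library-matching strategy from the preceding paragraphs is given below; the field names, library keys and fallback behaviour are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of an executable instruction and of the "match against a preset
# instruction library" strategy.
@dataclass
class ExecutableInstruction:
    emotion_modality: str                         # e.g. "speech", "text"
    output_emotional_state: str                   # e.g. "soothing"
    emotional_intensity: Optional[float] = None   # optional, per the embodiment

INSTRUCTION_LIBRARY = {
    ("dysphoria", "get_exercise_time"):
        ExecutableInstruction("speech", "soothing", 0.6),
    ("hyperexcitability", "get_exercise_time"):
        ExecutableInstruction("speech", "warning", 0.9),
}

def decide(emotional_state, basic_intent):
    """Pick the instruction registered for this (emotion, intent) pair,
    falling back to a neutral text reply."""
    return INSTRUCTION_LIBRARY.get(
        (emotional_state, basic_intent),
        ExecutableInstruction("text", "neutral"),
    )

print(decide("dysphoria", "get_exercise_time"))
```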
Referring to FIG. 14 and FIG. 18 together, compared with the emotion interaction apparatus 80 shown in FIG. 14, the emotion interaction apparatus 110 shown in FIG. 18 may further include a first execution module 805 and/or a second execution module 806. The first execution module 805 is configured to execute the executable instruction when it includes an emotion modality and an output emotional state, the output emotional state being presented to the user through the emotion modality; the second execution module 806 is configured to execute the executable instruction when it includes an emotion modality, an output emotional state and an emotional intensity, presenting the output emotional state to the user according to the emotion modality and the emotional intensity.
More contents of the operation principle and the operation mode of the emotion interaction apparatus 80 or the emotion interaction apparatus 110 can refer to the related descriptions in fig. 1 to fig. 13, and are not described again here.
The embodiment of the invention also discloses a computer readable storage medium, which stores computer instructions, and when the computer instructions are executed, the steps of the emotion interaction method shown in fig. 1 to 13 can be executed. The storage medium may include ROM, RAM, magnetic or optical disks, etc.
The embodiment of the invention also discloses computer equipment, wherein computer instructions are stored on the computer equipment, and when the computer instructions are operated, the steps of the emotion interaction method shown in the figures 1 to 13 can be executed.
It should be understood that although one implementation form of the embodiments of the present invention described above may be a computer program product, the method or apparatus of the embodiments of the present invention may be implemented in software, hardware, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. It will be appreciated by those of ordinary skill in the art that the methods and apparatus described above may be implemented using computer executable instructions and/or embodied in processor control code, such code provided, for example, on a carrier medium such as a disk, CD or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The methods and apparatus of the present invention may be implemented in hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, or in software for execution by various types of processors, or in a combination of hardware circuitry and software, such as firmware.
It should be understood that although several modules or units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to exemplary embodiments of the invention, the features and functions of two or more modules/units described above may be implemented in one module/unit, whereas the features and functions of one module/unit described above may be further divided into implementations by a plurality of modules/units. Furthermore, some of the modules/units described above may be omitted in some application scenarios.
It should be understood that the terms "first", "second", and "third" used in the description of the embodiments of the present invention are only used for clearly illustrating the technical solutions, and are not used for limiting the protection scope of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and the like that are within the spirit and principle of the present invention are included in the present invention.

Claims (44)

1. An emotion interaction method, comprising:
acquiring user data;
performing emotion recognition on the user data to obtain the emotion state of the user;
determining intent information from at least the user data;
controlling interaction with a user according to the emotional state and the intention information;
the intention information comprises an emotional intention and a basic intention corresponding to the emotional state, and the association relationship between the emotional state and the basic intention, the emotional intention comprises the emotional requirement of the emotional state, and the basic intention is one or more of preset transaction intention categories;
the controlling the interaction with the user according to the emotional state and the intention information comprises:
determining executable instructions according to the emotional state and the intention information for performing emotional feedback on the user, wherein the executable instructions comprise an emotional mode and an output emotional state;
the determining intent information from at least the user data comprises:
determining context interaction data, wherein the context interaction data comprises context emotional state and/or context intention information;
determining the emotional intent from the user data, the emotional state, and the contextual interaction data, the intent information including the emotional intent;
the determining the emotional intent from the user data, the emotional state, and the contextual interaction data comprises: acquiring the time sequence of the user data, wherein the acquiring of the time sequence of the user data refers to determining the time sequence information of a plurality of operations included in the user data when the user data has a plurality of operations or a plurality of intents; the time sequence information is obtained according to a preset time sequence rule or is determined or preset according to the time sequence of obtaining the user data; determining the emotional intent based at least on the timing, the emotional state, and the contextual interaction data.
2. The method of claim 1, wherein the determining the emotional intent from at least the timing, the emotional state, and the contextual interaction data comprises:
extracting focus content corresponding to each time sequence in the user data based on the time sequence of the user data;
for each time sequence, matching the focus content corresponding to the time sequence with the content in an emotion type library, and determining the emotion type corresponding to the matched content as the focus emotion type corresponding to the time sequence;
and according to the time sequence, determining the emotion intention by using the focus emotion type corresponding to the time sequence, the emotion state corresponding to the time sequence and the context interaction data corresponding to the time sequence.
3. The method of claim 1, wherein the determining the emotional intent from the user data, the emotional state, and the contextual interaction data comprises:
determining the emotional intent using a Bayesian network based on the user data, the emotional state, and the contextual interaction data;
or matching the user data, the emotional state and the context interaction data with preset emotional intentions in an emotional semantic library to obtain the emotional intentions;
or searching in a preset intention space by using the user data, the emotional state and the context interaction data to determine the emotional intention, wherein the preset intention space comprises a plurality of emotional intentions.
4. The emotion interaction method of claim 1, wherein the relationship between the emotional state and the basic intention is preset, or the relationship between the emotional state and the basic intention is obtained based on a preset training model.
5. The method of claim 1, wherein determining intent information based at least on the user data comprises:
obtaining semantics of the user data;
determining contextual intent information;
determining a basic intention according to the semantics of the user data and the context intention information, wherein the intention information comprises the basic intention, and the basic intention of the user is one or more of preset transaction intention categories.
6. The emotion interaction method of claim 5, wherein the determining of the basic intention from the semantics of the user data and the contextual intention information comprises:
acquiring the time sequence of the user data and the semantics of the user data of each time sequence;
and determining the basic intention at least according to the time sequence, the semantics of the user data of each time sequence and the context intention information corresponding to the time sequence.
7. The emotion interaction method of claim 5, wherein the determining of the basic intention from the semantics of the user data and the contextual intention information comprises:
extracting focus content corresponding to each time sequence in the user data based on the time sequence of the user data;
determining a current interaction environment;
determining context intention information corresponding to the time sequence;
for each time sequence, determining the basic intention of the user by using the related information corresponding to the time sequence, wherein the related information comprises: the focused content, the current interaction environment, the contextual intent information, the timing, and the semantics.
8. The emotion interaction method of claim 7, wherein for each time sequence, determining the basic intention of the user by using the relevant information corresponding to the time sequence comprises:
for each time sequence, determining the basic intention by utilizing a Bayesian network based on the related information corresponding to the time sequence;
or, aiming at each time sequence, matching relevant information corresponding to the time sequence with a preset basic intention in a semantic library to obtain the basic intention;
or searching the related information corresponding to the time sequence in a preset intention space to determine the basic intention, wherein the preset intention space comprises a plurality of basic intents.
9. The emotion interaction method of claim 1, wherein the contextual interaction data comprises interaction data in previous interaction sessions and/or other interaction data in the current interaction session.
10. The method of emotional interaction of claim 1, wherein the determining intent information based at least on the user data further comprises:
and acquiring a basic intention corresponding to the user data through calling, and adding the basic intention into the intention information, wherein the basic intention of the user is one or more of preset transaction intention categories.
11. The emotion interaction method of claim 1, wherein the intention information includes a user intention, the user intention is determined based on the emotion intention and a basic intention, the basic intention is one or more of pre-set transaction intention categories, and the determining intention information at least from the user data further comprises:
and determining the user intention according to the emotional intention, the basic intention and user personalized information corresponding to the user data, wherein the user personalized information has an association relation with a source user ID of the user data.
12. The emotion interaction method of claim 1, wherein the executable instructions include at least one emotion modality and at least one output emotion type; after the executable instruction is determined according to the emotional state and the intention information, the method further comprises the following steps:
and performing emotional presentation of one or more output emotional types of the at least one output emotional type according to each emotional mode of the at least one emotional mode.
13. The emotion interaction method of claim 1, wherein the determining executable instructions from the emotional state and the intention information comprises:
after the executable instruction is generated in the last round of emotion interaction, determining the executable instruction according to the emotion state and the intention information in the current interaction, or
If the emotional state is dynamically changed and the variation of the emotional state exceeds a preset threshold value, determining an executable instruction at least according to the emotional intention corresponding to the changed emotional state;
or if the emotional state is dynamically changed, determining the corresponding executable instruction according to the dynamically changed emotional state within a set time interval.
14. The emotion interaction method of claim 1, wherein the executable instructions comprise an emotion modality and an output emotion state, or wherein the executable instructions comprise an emotion modality, an output emotion state and an emotion intensity.
15. The method of claim 14, wherein the user data is provided with at least one modality, and the emotional modality is determined according to the at least one modality of the user data.
16. The method of claim 14, wherein the emotional modality is the same as at least one modality of the user data.
17. The emotion interaction method of claim 1, further comprising:
when the executable instruction comprises an emotional modality and an output emotional state, executing the executable instruction, and presenting the output emotional state to the user by using the emotional modality;
when the executable instruction comprises an emotional modality, an output emotional state and emotional intensity, executing the executable instruction, and presenting the output emotional state to the user according to the emotional modality and the emotional intensity.
18. The emotion interaction method of claim 1, wherein the determining executable instructions from the emotional state and the intention information comprises:
and matching the emotional state and the intention information with a preset instruction library to obtain the executable instruction through matching.
19. The emotion interaction method of claim 1, wherein the intention information comprises a basic intention of a user, the executable instruction comprises content matched with the basic intention, and the basic intention of the user is one or more of preset transaction intention categories;
the method for acquiring the basic intention comprises the following steps:
determining a current interaction environment;
determining contextual intent information;
determining a basic intention of the user according to the user data, the current interaction environment and the context intention information;
or: and acquiring a basic intention corresponding to the user data through calling.
20. The emotion interaction method of claim 1, wherein the at least one modality of user data is selected from the group consisting of: touch click data, voice data, facial expression data, body gesture data, physiological signals, input text data.
21. The emotion interaction method of claim 1, wherein the emotional state of the user is represented as an emotion classification; or the emotional state of the user is represented as a preset multi-dimensional emotional coordinate point.
22. An emotion interaction apparatus, comprising:
the user data acquisition module is used for acquiring user data;
the emotion recognition module is used for carrying out emotion recognition on the user data to obtain the emotion state of the user;
an intent information determination module to determine intent information based at least on the user data;
the interaction module is used for controlling interaction with a user according to the emotional state and the intention information;
the intention information comprises an emotional intention corresponding to the emotional state, and the emotional intention comprises an emotional requirement of the emotional state;
the intention information comprises the emotional intention and a basic intention, the emotional intention comprises the emotional requirement of the emotional state and the association relation between the emotional state and the basic intention, and the basic intention is one or more of preset transaction intention categories;
the interaction module comprises: an executable instruction determining unit, configured to determine an executable instruction according to the emotional state and the intention information, so as to perform emotional feedback on the user, where the executable instruction includes an emotional modality and an output emotional state;
the intention information determination module includes: the first context interaction data determining unit is used for determining context interaction data, and the context interaction data comprises context emotional state and/or context intention information; an emotional intention determining unit, configured to determine the emotional intention according to the user data, the emotional state, and the context interaction data, where the intention information includes the emotional intention;
the emotion intention determination unit includes: the time sequence acquiring subunit is configured to acquire a time sequence of the user data, where acquiring the time sequence of the user data refers to determining time sequence information of multiple operations included in the user data when multiple operations or multiple intents exist in the user data; the time sequence information is obtained according to a preset time sequence rule or is determined or preset according to the time sequence of obtaining the user data; a determining subunit, configured to determine the emotional intent at least according to the timing, the emotional state, and the context interaction data.
23. The emotion interaction apparatus of claim 22, wherein the determining subunit comprises:
a first focus content extraction subunit configured to extract, based on the time sequence of the user data, focus content corresponding to each time sequence in the user data;
a matching subunit configured to match, for each time sequence, the focus content corresponding to the time sequence against content in an emotion type library, and to determine the emotion type corresponding to the matched content as the focus emotion type corresponding to the time sequence;
and a final determining subunit configured to determine, for each time sequence, the emotional intention according to the focus emotion type corresponding to the time sequence, the emotional state corresponding to the time sequence and the contextual interaction data corresponding to the time sequence.
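A toy illustration of the per-time-sequence matching described in claim 23 follows: focus content is extracted for each step, matched against an emotion type library, and the matched entry becomes that step's focus emotion type. The keyword-based library and substring matching below are simplifying assumptions.

```python
# Illustrative sketch of claim 23: per time sequence, extract focus content,
# look it up in an emotion type library, and keep the matched emotion type as
# that step's focus emotion type. Library contents and matching are assumed.
from typing import Dict, List, Optional

# Hypothetical emotion type library: emotion type -> trigger phrases.
EMOTION_TYPE_LIBRARY: Dict[str, List[str]] = {
    "complaint": ["refund", "broken", "not working", "late"],
    "praise": ["thank you", "great", "love it"],
    "worry": ["worried", "deadline", "lost"],
}


def extract_focus_content(utterance: str) -> str:
    # Placeholder focus extraction: in practice this could use parsing or
    # attention weights; here we just lower-case the utterance.
    return utterance.lower()


def match_focus_emotion_type(focus: str) -> Optional[str]:
    for emotion_type, phrases in EMOTION_TYPE_LIBRARY.items():
        if any(p in focus for p in phrases):
            return emotion_type
    return None


def focus_emotion_types_by_timing(utterances: List[str]) -> List[Optional[str]]:
    """Return one focus emotion type per time sequence (time-ordered input)."""
    return [match_focus_emotion_type(extract_focus_content(u)) for u in utterances]


if __name__ == "__main__":
    steps = ["My package is late again", "Thank you for checking"]
    print(focus_emotion_types_by_timing(steps))
    # -> ['complaint', 'praise']
```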
24. The emotion interaction apparatus of claim 22, wherein the emotional intention determining unit comprises:
a first Bayesian network computing subunit configured to determine the emotional intention using a Bayesian network based on the user data, the emotional state and the contextual interaction data;
a first matching calculation subunit configured to match the user data, the emotional state and the contextual interaction data with preset emotional intentions in an emotional semantic library to obtain the emotional intention;
and a first searching subunit configured to search a preset intention space using the user data, the emotional state and the contextual interaction data to determine the emotional intention, the preset intention space comprising a plurality of emotional intentions.
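Claim 24 lists three alternative ways to obtain the emotional intention (a Bayesian network, matching against an emotional semantic library, or searching a preset intention space). The sketch below illustrates only the third alternative with a simple scored search; the intention space contents and scoring rules are invented for illustration.

```python
# Sketch of searching a preset intention space with the user data, the emotional
# state and the contextual interaction data. Scoring and contents are assumptions.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EmotionalIntention:
    name: str                    # e.g. "needs_reassurance"
    trigger_states: List[str]    # emotional states that support this intention
    trigger_keywords: List[str]  # user-data cues that support this intention


PRESET_INTENTION_SPACE: List[EmotionalIntention] = [
    EmotionalIntention("needs_reassurance", ["frustrated", "anxious"], ["late", "lost", "refund"]),
    EmotionalIntention("wants_acknowledgement", ["happy"], ["thanks", "great"]),
    EmotionalIntention("needs_calming", ["angry"], ["unacceptable", "complaint"]),
]


def score(intention: EmotionalIntention, text: str, emotional_state: str,
          context: Dict[str, str]) -> int:
    s = 0
    if emotional_state in intention.trigger_states:
        s += 2
    s += sum(1 for kw in intention.trigger_keywords if kw in text.lower())
    # Contextual interaction data can reinforce an intention seen earlier.
    if context.get("previous_emotional_intention") == intention.name:
        s += 1
    return s


def search_emotional_intention(text: str, emotional_state: str,
                               context: Dict[str, str]) -> str:
    best = max(PRESET_INTENTION_SPACE,
               key=lambda i: score(i, text, emotional_state, context))
    return best.name


if __name__ == "__main__":
    ctx = {"previous_emotional_intention": "needs_reassurance"}
    print(search_emotional_intention("My refund is late", "frustrated", ctx))
    # -> needs_reassurance
```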
25. The emotion interaction device of claim 22, wherein the relationship between the emotional state and the basic intention is preset, or the relationship between the emotional state and the basic intention is obtained based on a preset training model.
26. The emotion interaction device of claim 22, wherein the intention information determination module further comprises:
a semantic acquisition unit configured to acquire the time sequence of the user data and the semantics of the user data of each time sequence;
a contextual intention information determining unit configured to determine contextual intention information;
and a basic intention determination unit configured to determine the basic intention according to the semantics of the user data and the contextual intention information, wherein the intention information comprises the basic intention, and the basic intention of the user is one or more of preset transaction intention categories.
27. The emotion interaction apparatus of claim 26, wherein the basic intention determination unit comprises:
a time sequence acquiring subunit configured to acquire the time sequence of the user data and the semantics of the user data of each time sequence;
and a computing subunit configured to determine the basic intention at least according to the time sequence, the semantics of the user data of each time sequence and the contextual intention information corresponding to the time sequence.
28. The emotion interaction apparatus of claim 26, wherein the basic intention determination unit comprises:
a second focus content extraction subunit configured to extract the focus content corresponding to each time sequence in the user data;
a current interaction environment determining subunit configured to determine the current interaction environment;
a contextual intention information determining subunit configured to determine the contextual intention information corresponding to the time sequence;
and a final calculation subunit configured to determine, for each time sequence, the basic intention of the user using relevant information corresponding to the time sequence, wherein the relevant information comprises the focus content, the current interaction environment, the contextual intention information, the time sequence and the semantics.
29. The emotion interaction device of claim 28, wherein the final computation subunit comprises:
a second Bayesian network computing subunit configured to determine, for each time sequence, the basic intention using a Bayesian network based on the relevant information corresponding to the time sequence;
a second matching calculation subunit configured to match the relevant information corresponding to each time sequence with preset basic intentions in a semantic library to obtain the basic intention;
and a second searching subunit configured to search a preset intention space using the relevant information corresponding to the time sequence to determine the basic intention, wherein the preset intention space comprises a plurality of basic intentions.
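As a rough illustration of the probabilistic alternative in claim 29, the snippet below scores preset basic (transactional) intentions with a naive-Bayes-style calculation over a few evidence flags derived from the relevant information of one time sequence. A full Bayesian network would model the dependencies explicitly; the priors and likelihoods here are invented for illustration only.

```python
# Naive-Bayes-style scoring of preset basic intentions from coarse evidence
# flags for one time sequence (focus content, interaction environment, context).
import math
from typing import Dict

PRIORS: Dict[str, float] = {"query_order": 0.5, "request_refund": 0.3, "small_talk": 0.2}

# P(evidence flag | intention) for a few coarse evidence flags (invented values).
LIKELIHOODS: Dict[str, Dict[str, float]] = {
    "query_order":    {"focus_order": 0.8, "env_shopping_app": 0.7, "context_order": 0.6},
    "request_refund": {"focus_order": 0.6, "env_shopping_app": 0.7, "context_order": 0.5},
    "small_talk":     {"focus_order": 0.1, "env_shopping_app": 0.4, "context_order": 0.2},
}


def posterior_scores(evidence: Dict[str, bool]) -> Dict[str, float]:
    scores: Dict[str, float] = {}
    for intention, prior in PRIORS.items():
        log_p = math.log(prior)
        for flag, present in evidence.items():
            p = LIKELIHOODS[intention].get(flag, 0.5)
            log_p += math.log(p if present else 1.0 - p)
        scores[intention] = log_p
    return scores


if __name__ == "__main__":
    # Evidence for one time sequence: the focus content mentions an order, the
    # current interaction environment is a shopping app, and the contextual
    # intention was order-related.
    evidence = {"focus_order": True, "env_shopping_app": True, "context_order": True}
    scores = posterior_scores(evidence)
    print(max(scores, key=scores.get))  # -> query_order
```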
30. The emotion interaction apparatus of claim 22, wherein the contextual interaction data comprises interaction data in previous interaction sessions and/or other interaction data in the current interaction session.
31. The emotion interaction device of claim 22, wherein the intention information determination module further comprises:
a basic intention calling unit configured to obtain, through a call, the basic intention corresponding to the user data and to add the basic intention to the intention information, wherein the basic intention of the user is one or more of preset transaction intention categories.
32. The emotion interaction device of claim 22, wherein the intention information comprises a user intention, the user intention is determined based on the emotional intention and a basic intention, the basic intention is one or more of preset transaction intention categories, and the intention information determination module further comprises:
an intention information determining unit configured to determine the user intention according to the emotional intention, the basic intention and user personalized information corresponding to the user data, wherein the user preference and the source user ID of the user data have an association relation.
33. The emotion interaction device of claim 22, wherein the executable instructions comprise at least one emotion modality and at least one output emotion type;
wherein the interaction module further comprises an output emotion type presenting unit configured to perform, for each emotion modality of the at least one emotion modality, emotion presentation of one or more output emotion types of the at least one output emotion type.
34. The emotion interaction device of claim 22, wherein the executable instruction determination unit comprises:
a first executable instruction determining subunit configured to determine, after an executable instruction has been generated in a previous round of emotional interaction, an executable instruction according to the emotional state and the intention information of the current interaction;
a second executable instruction determining subunit configured to determine, when the emotional state changes dynamically and the variation of the emotional state exceeds a preset threshold, an executable instruction at least according to the emotional intention corresponding to the changed emotional state;
and a third executable instruction determining subunit configured to determine, when the emotional state changes dynamically, the corresponding executable instruction according to the dynamically changing emotional state within a set time interval.
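The triggering conditions of claim 34 can be pictured as a small scheduler that regenerates an executable instruction when the emotional state varies beyond a preset threshold, or when a set time interval elapses during a dynamic change. The distance metric, threshold and interval in the sketch below are assumptions for illustration.

```python
# Sketch of the triggering logic: re-emit an executable instruction on the first
# round, when the emotional-state variation exceeds a preset threshold, or when
# the set time interval has elapsed. Metric, threshold and interval are assumed.
import math
import time
from typing import Optional, Tuple

EmotionPoint = Tuple[float, float]  # e.g. (valence, arousal)

CHANGE_THRESHOLD = 0.5   # preset threshold on emotional-state variation (assumed)
REFRESH_INTERVAL = 2.0   # set time interval in seconds for dynamic updates (assumed)


class InstructionScheduler:
    """Decides when a new executable instruction should be determined."""

    def __init__(self) -> None:
        self.last_state: Optional[EmotionPoint] = None
        self.last_emit_time: float = 0.0

    def should_reemit(self, state: EmotionPoint, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.last_state is None:
            decision = True                                        # first round of interaction
        elif math.dist(state, self.last_state) > CHANGE_THRESHOLD:
            decision = True                                        # variation exceeds preset threshold
        elif now - self.last_emit_time >= REFRESH_INTERVAL:
            decision = True                                        # set time interval has elapsed
        else:
            decision = False
        if decision:
            self.last_state, self.last_emit_time = state, now
        return decision


if __name__ == "__main__":
    sched = InstructionScheduler()
    print(sched.should_reemit((0.1, 0.2), now=0.0))    # True: first round
    print(sched.should_reemit((0.15, 0.25), now=0.5))  # False: small change, interval not reached
    print(sched.should_reemit((0.9, 0.9), now=0.6))    # True: change exceeds threshold
```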
35. The emotion interaction device of claim 22, wherein the executable instructions comprise an emotion modality and an output emotion state; or the executable instructions include an emotional modality, an output emotional state, and an emotional intensity.
36. The emotion interaction device of claim 35, wherein the emotion modality is determined according to at least one modality of the user data.
37. The emotion interaction device of claim 35, wherein the emotion modality is the same as at least one modality of the user data.
38. The emotion interaction device of claim 22, further comprising a first execution module and/or a second execution module:
a first execution module configured to execute the executable instruction when the executable instruction comprises an emotional modality and an output emotional state, so as to present the output emotional state to the user using the emotional modality;
and a second execution module configured to execute the executable instruction when the executable instruction comprises an emotional modality, an output emotional state and an emotional intensity, so as to present the output emotional state to the user according to the emotional modality and the emotional intensity.
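A minimal sketch of the execution step in claim 38 follows: the executable instruction is dispatched to a handler for its emotion modality, which presents the output emotional state, optionally scaled by the emotional intensity. The handlers below merely print; real ones would drive speech synthesis, facial animation or UI styling. All names and the default intensity are assumptions.

```python
# Dispatch an executable instruction to a per-modality handler that presents the
# output emotional state, using the emotional intensity when it is provided.
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ExecutableInstruction:
    emotion_modality: str            # "voice", "facial_expression", "text", ...
    output_emotional_state: str      # e.g. "calm", "cheerful"
    emotional_intensity: Optional[float] = None  # 0.0 .. 1.0 when present


def render_voice(state: str, intensity: float) -> None:
    print(f"[voice] speaking in a {state} tone (intensity={intensity:.2f})")


def render_text(state: str, intensity: float) -> None:
    print(f"[text] wording adjusted to sound {state} (intensity={intensity:.2f})")


MODALITY_HANDLERS: Dict[str, Callable[[str, float], None]] = {
    "voice": render_voice,
    "text": render_text,
}


def execute(instruction: ExecutableInstruction) -> None:
    intensity = 0.5 if instruction.emotional_intensity is None else instruction.emotional_intensity
    handler = MODALITY_HANDLERS.get(instruction.emotion_modality, render_text)
    handler(instruction.output_emotional_state, intensity)


if __name__ == "__main__":
    execute(ExecutableInstruction("voice", "calm"))                               # default intensity
    execute(ExecutableInstruction("text", "cheerful", emotional_intensity=0.9))   # explicit intensity
```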
39. The emotion interaction device of claim 22, wherein the executable instruction determination unit comprises:
a matching subunit configured to match the emotional state and the intention information against a preset instruction library so as to obtain the executable instruction through matching.
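The matching of claim 39 can be pictured as a lookup of the (emotional state, basic intention) pair in a preset instruction library, as in the sketch below; the library entries and the fallback rule are assumptions for illustration.

```python
# Look the (emotional state, basic intention) pair up in a preset instruction
# library to obtain an executable instruction; contents are invented examples.
from typing import Dict, Tuple

Instruction = Dict[str, str]

PRESET_INSTRUCTION_LIBRARY: Dict[Tuple[str, str], Instruction] = {
    ("frustrated", "query_order_status"): {
        "emotion_modality": "voice",
        "output_emotional_state": "calm",
        "content": "apologize_then_report_order_status",
    },
    ("happy", "query_order_status"): {
        "emotion_modality": "text",
        "output_emotional_state": "cheerful",
        "content": "report_order_status",
    },
}

DEFAULT_INSTRUCTION: Instruction = {
    "emotion_modality": "text",
    "output_emotional_state": "neutral",
    "content": "report_order_status",
}


def match_instruction(emotional_state: str, basic_intention: str) -> Instruction:
    return PRESET_INSTRUCTION_LIBRARY.get((emotional_state, basic_intention),
                                          DEFAULT_INSTRUCTION)


if __name__ == "__main__":
    print(match_instruction("frustrated", "query_order_status"))
```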
40. The emotion interaction device of claim 22, wherein the intention information comprises a basic intention of the user, the executable instructions comprise content matched with the basic intention, and the basic intention of the user is one or more of preset transaction intention categories;
the intention information determination module comprises: a current interaction environment determining unit configured to determine a current interaction environment;
a second contextual interaction data determining unit configured to determine contextual intention information;
and a basic intention determining unit configured to determine the basic intention of the user according to the user data, the current interaction environment and the contextual intention information;
or: a basic intention calling unit configured to obtain the basic intention corresponding to the user data through a call.
41. The emotion interaction device of claim 22, wherein the user data of the at least one modality is selected from the group consisting of: touch click data, voice data, facial expression data, body gesture data, physiological signals, input text data.
42. The emotion interaction device of claim 22, wherein the emotional state of the user is represented as an emotion classification; or the emotional state of the user is represented as a preset multi-dimensional emotional coordinate point.
43. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the emotion interaction method as recited in any of claims 1 to 21.
44. A computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the emotion interaction method as recited in any of claims 1 to 21.
CN201810077175.0A 2018-01-26 2018-01-26 Emotion interaction method and device, computer readable storage medium and computer equipment Active CN108334583B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201810077175.0A CN108334583B (en) 2018-01-26 2018-01-26 Emotion interaction method and device, computer readable storage medium and computer equipment
US16/080,301 US11226673B2 (en) 2018-01-26 2018-05-25 Affective interaction systems, devices, and methods based on affective computing user interface
JP2020562804A JP7199451B2 (en) 2018-01-26 2018-05-25 Emotional interaction system, device and method based on emotional computing user interface
PCT/CN2018/088389 WO2019144542A1 (en) 2018-01-26 2018-05-25 Affective interaction systems, devices, and methods based on affective computing user interface

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810077175.0A CN108334583B (en) 2018-01-26 2018-01-26 Emotion interaction method and device, computer readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN108334583A CN108334583A (en) 2018-07-27
CN108334583B true CN108334583B (en) 2021-07-09

Family

ID=62926514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810077175.0A Active CN108334583B (en) 2018-01-26 2018-01-26 Emotion interaction method and device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN108334583B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI702537B (en) * 2018-09-28 2020-08-21 智齡科技股份有限公司 Smart text of nursing generating system based on lexical analysis and smart nursing information platform using the same
CN111090769A (en) * 2018-10-24 2020-05-01 百度在线网络技术(北京)有限公司 Song recommendation method, device, equipment and computer storage medium
CN111199732B (en) * 2018-11-16 2022-11-15 深圳Tcl新技术有限公司 Emotion-based voice interaction method, storage medium and terminal equipment
CN109801096A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of multi-modal customer satisfaction overall evaluation system, method
CN109684634B (en) * 2018-12-17 2023-07-25 北京百度网讯科技有限公司 Emotion analysis method, device, equipment and storage medium
CN111488503A (en) * 2019-01-29 2020-08-04 阿里巴巴集团控股有限公司 Case classification method and device
CN109829499B (en) * 2019-01-31 2020-10-27 中国科学院信息工程研究所 Image-text data fusion emotion classification method and device based on same feature space
CN109767791B (en) * 2019-03-21 2021-03-30 中国—东盟信息港股份有限公司 Voice emotion recognition and application system for call center calls
CN111737670B (en) * 2019-03-25 2023-08-18 广州汽车集团股份有限公司 Method, system and vehicle-mounted multimedia device for multi-mode data collaborative man-machine interaction
CN111816211B (en) * 2019-04-09 2023-06-02 Oppo广东移动通信有限公司 Emotion recognition method and device, storage medium and electronic equipment
CN110134316B (en) * 2019-04-17 2021-12-24 华为技术有限公司 Model training method, emotion recognition method, and related device and equipment
CN110086937A (en) * 2019-04-28 2019-08-02 上海掌门科技有限公司 Display methods, electronic equipment and the computer-readable medium of call interface
CN110134577A (en) * 2019-04-30 2019-08-16 上海掌门科技有限公司 Show the method and apparatus of user emotion
CN110223714B (en) * 2019-06-03 2021-08-03 杭州哲信信息技术有限公司 Emotion recognition method based on voice
CN110363074B (en) * 2019-06-03 2021-03-30 华南理工大学 Humanoid recognition interaction method for complex abstract events
CN110379234A (en) * 2019-07-23 2019-10-25 广东小天才科技有限公司 Learning tutoring method and device
CN110706785B (en) * 2019-08-29 2022-03-15 合肥工业大学 Emotion adjusting method and system based on conversation
CN110675859B (en) * 2019-09-05 2021-11-23 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN112542180A (en) * 2019-09-20 2021-03-23 中国石油化工股份有限公司 Alarm receiving processing method and device, machine readable storage medium and processor
CN111210818B (en) * 2019-12-31 2021-10-01 北京三快在线科技有限公司 Word acquisition method and device matched with emotion polarity and electronic equipment
CN111833907B (en) * 2020-01-08 2023-07-18 北京嘀嘀无限科技发展有限公司 Man-machine interaction method, terminal and computer readable storage medium
WO2021225550A1 (en) * 2020-05-06 2021-11-11 Iren Yaser Deniz Emotion recognition as feedback for reinforcement learning and as an indicator of the explanation need of users
CN111741116B (en) * 2020-06-28 2023-08-22 海尔优家智能科技(北京)有限公司 Emotion interaction method and device, storage medium and electronic device
CN112201277B (en) * 2020-09-29 2024-03-22 中国银行股份有限公司 Voice response method, device, equipment and computer readable storage medium
CN112307179A (en) * 2020-10-21 2021-02-02 深圳技术大学 Text matching method, device, equipment and storage medium
CN112423106A (en) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 Method and system for automatically translating accompanying sound
CN112559700A (en) * 2020-11-09 2021-03-26 联想(北京)有限公司 Response processing method, intelligent device and storage medium
CN112926525A (en) * 2021-03-30 2021-06-08 中国建设银行股份有限公司 Emotion recognition method and device, electronic equipment and storage medium
CN113033450B (en) * 2021-04-02 2022-06-24 山东大学 Multi-mode continuous emotion recognition method, service inference method and system
CN113364916B (en) * 2021-06-07 2023-03-28 维沃移动通信有限公司 Method and device for determining emotion information, electronic equipment and storage medium
CN113535957B (en) * 2021-07-27 2022-08-02 哈尔滨工业大学 Conversation emotion recognition network model system based on dual knowledge interaction and multitask learning, construction method, equipment and storage medium
CN113656635B (en) * 2021-09-03 2024-04-09 咪咕音乐有限公司 Video color ring synthesis method, device, equipment and computer readable storage medium
CN114420168A (en) * 2022-02-14 2022-04-29 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium
CN114533063B (en) * 2022-02-23 2023-10-27 金华高等研究院(金华理工学院筹建工作领导小组办公室) Multi-source monitoring combined emotion computing system and method
ES2957419A1 (en) * 2022-06-03 2024-01-18 Neurologyca Science & Marketing Sl System and method for real-time detection of emotional states using artificial vision and natural language listening (Machine-translation by Google Translate, not legally binding)
CN116597821A (en) * 2023-07-17 2023-08-15 深圳市国硕宏电子有限公司 Intelligent customer service voice recognition method and system based on deep learning
CN117992597B (en) * 2024-04-03 2024-06-07 江苏微皓智能科技有限公司 Information feedback method, device, computer equipment and computer storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893344A (en) * 2016-03-28 2016-08-24 北京京东尚科信息技术有限公司 User semantic sentiment analysis-based response method and device
CN107423440A (en) * 2017-08-04 2017-12-01 逸途(北京)科技有限公司 A kind of question and answer context switching based on sentiment analysis is with strengthening system of selection

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
KR20130084543A (en) * 2012-01-17 2013-07-25 삼성전자주식회사 Apparatus and method for providing user interface
US9110501B2 (en) * 2012-04-17 2015-08-18 Samsung Electronics Co., Ltd. Method and apparatus for detecting talking segments in a video sequence using visual cues
CN105843118B (en) * 2016-03-25 2018-07-27 北京光年无限科技有限公司 A kind of robot interactive method and robot system
CN106537294A (en) * 2016-06-29 2017-03-22 深圳狗尾草智能科技有限公司 Method, system and robot for generating interactive content of robot
CN106773923B (en) * 2016-11-30 2020-04-21 北京光年无限科技有限公司 Multi-mode emotion data interaction method and device for robot
CN106683672B (en) * 2016-12-21 2020-04-03 竹间智能科技(上海)有限公司 Intelligent dialogue method and system based on emotion and semantics
CN107562816B (en) * 2017-08-16 2021-02-09 苏州狗尾草智能科技有限公司 Method and device for automatically identifying user intention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893344A (en) * 2016-03-28 2016-08-24 北京京东尚科信息技术有限公司 User semantic sentiment analysis-based response method and device
CN107423440A (en) * 2017-08-04 2017-12-01 逸途(北京)科技有限公司 A kind of question and answer context switching based on sentiment analysis is with strengthening system of selection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"仿人情感交互表情机器人研究现状及关键技术";柯显信;《智能系统学报》;20131231;第8卷(第6期);第483-488页 *

Also Published As

Publication number Publication date
CN108334583A (en) 2018-07-27

Similar Documents

Publication Publication Date Title
CN108334583B (en) Emotion interaction method and device, computer readable storage medium and computer equipment
CN108227932B (en) Interaction intention determination method and device, computer equipment and storage medium
US10977452B2 (en) Multi-lingual virtual personal assistant
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
US12019685B1 (en) Context carryover across tasks for assistant systems
JP7199451B2 (en) Emotional interaction system, device and method based on emotional computing user interface
US11715485B2 (en) Artificial intelligence apparatus for converting text and speech in consideration of style and method for the same
Wu et al. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies
US11966701B2 (en) Dynamic content rendering based on context for AR and assistant systems
US20230118412A1 (en) Stylizing Text-to-Speech (TTS) Voice Response for Assistant Systems
CN110110169A (en) Man-machine interaction method and human-computer interaction device
KR102073979B1 (en) Server and method for providing feeling analysis based emotional diary service using artificial intelligence based on speech signal
US20180101776A1 (en) Extracting An Emotional State From Device Data
US20220279051A1 (en) Generating Proactive Reminders for Assistant Systems
US20180336450A1 (en) Platform to Acquire and Represent Human Behavior and Physical Traits to Achieve Digital Eternity
Fedotov et al. From smart to personal environment: integrating emotion recognition into smart houses
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
WO2022187480A1 (en) Text editing using voice and gesture inputs for assistant systems
Karpouzis et al. Induction, recording and recognition of natural emotions from facial expressions and speech prosody
Schuller et al. Speech communication and multimodal interfaces
WO2020087534A1 (en) Generating response in conversation
US12135943B1 (en) Mood- and mental state-aware interaction with multimodal large language models
Schuller Trait Recognition: An Overview
Hackbarth Revolutionizing Augmentative and Alternative Communication with Generative Artificial Intelligence
WO2024145376A1 (en) Reading out scene analysis for users wearing head-mounted devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1255705

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant