Detailed Description
To make the objects, technical solutions, and advantages of the present disclosure clearer, embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Human-computer voice interaction can assist the user in completing tasks such as controlling system setting items, querying various types of information, converting user input into text records, and chatting with the user.
In an exemplary embodiment, fig. 1 shows a flow diagram of a voice interaction method in an exemplary embodiment according to the present disclosure. Technical terms involved in the voice interaction process are introduced based on fig. 1.
Referring to fig. 1, the following modules are involved in the voice interaction process:
The speech recognition module ASR 11: converts the voice input sent by the user into text information through Automatic Speech Recognition (ASR) technology.
The natural language understanding module NLU 12: understands the input text information through Natural Language Processing (NLP) technology and converts it into a semantic representation that the machine can understand, obtaining a structured intent and slot positions.
The dialog management module DM 13: Dialog Management (DM) maintains and remembers the history of the man-machine dialog and determines, based on the dialog state, what action the system should take next; such an action is understood as what the machine needs to express.
The natural language generation module NLG 14: converts the system action into natural language text through Natural Language Generation (NLG) technology, that is, generates feedback as text information that a human can understand.
The text-to-speech module TTS 15: converts the text information into audio through Text-To-Speech (TTS) technology, which is fed back to the user through the terminal device.
For concreteness, the cooperation of these five modules in one interaction turn is sketched below.
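The following is a minimal Python sketch of such a turn, provided only as an illustration; the DialogState fields and the asr/nlu/dm/nlg/tts interfaces are assumptions of this sketch, not components defined by the present disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    """Minimal dialog state kept by the dialog management module DM 13."""
    history: list = field(default_factory=list)   # past (user_text, reply_text) turns
    intent: str = ""                              # current structured intent
    slots: dict = field(default_factory=dict)     # collected slot position information

def voice_interaction_turn(audio: bytes, state: DialogState, asr, nlu, dm, nlg, tts) -> bytes:
    """One turn of the ASR -> NLU -> DM -> NLG -> TTS pipeline of fig. 1.

    The asr/nlu/dm/nlg/tts arguments stand for the modules 11-15; their
    interfaces here are illustrative placeholders.
    """
    text = asr.transcribe(audio)                  # module 11: speech -> text
    state.intent, state.slots = nlu.parse(text)   # module 12: text -> intent + slots
    action = dm.next_action(state)                # module 13: dialog state -> system action
    reply_text = nlg.realize(action)              # module 14: action -> natural-language text
    state.history.append((text, reply_text))
    return tts.synthesize(reply_text)             # module 15: text -> audio for the user
```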
The voice interaction scheme provided by the related art generally includes: (1) converting the user's input audio into corresponding text; (2) understanding the converted text input through classification, similarity retrieval, and the like, so that the unstructured input is mapped to a preset feedback result; and (3) executing a corresponding instruction according to the preset feedback result.
In the voice interaction process provided by the related art, only the text content converted from the user's input audio is processed, without using any characteristics of the user. However, the speech input of different users may be converted into the same text content, resulting in identical feedback to different users. That is, the voice interaction content in the related art lacks personalization and pertinence, which degrades the user experience.
In view of the technical problems in the related art, the present technical solution provides a voice interaction method, a voice interaction apparatus, and a computer-readable storage medium and a terminal for implementing the methods. The system architecture of an exemplary application environment for the voice interaction scheme provided by the present disclosure is described as follows:
fig. 2 is a schematic diagram illustrating a system architecture of an exemplary application environment to which the voice interaction scheme of an embodiment of the present disclosure may be applied.
As shown in fig. 2, the system architecture 100 may include a terminal 210, a network 220, and a server 230, where the terminal 210 and the server 230 are connected to each other via the network 220.
The terminal 210 may be, but is not limited to, a smart phone, a tablet computer, a smart speaker, or the like on which a mobile assistant application can be installed. The network 220 may be any type of communication medium capable of providing a communication link between the terminal 210 and the server 230, such as a wired communication link, a wireless communication link, or a fiber optic cable, and is not limited thereto. The server 230 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.
The voice interaction method provided by the embodiment of the present disclosure may be performed by any node in the server 230. Accordingly, the voice interaction device is generally provided in the server 230. However, it is easily understood by those skilled in the art that the voice interaction method provided in the embodiment of the present disclosure may also be executed by the terminal 210, and accordingly, the voice interaction apparatus may also be disposed in the corresponding terminal 210, which is not particularly limited in the exemplary embodiment.
For example, in the case that the voice interaction method provided by the embodiment of the present disclosure is executed by the terminal 210, the terminal 210 receives voice information, determines target user characteristics according to the voice information, and determines slot position information according to the voice information. Further, the terminal 210 determines at least one target response library from a plurality of preset response libraries according to the target user characteristics, where different user characteristics correspond to different response libraries. Further, response content to the voice information is determined in the at least one target response library according to the slot position information. In this scheme, the characteristics of the user are determined from the user's own voice information, and the target response library is then determined according to those characteristics. Therefore, the response content can match the target user's needs, the personalization and pertinence of the voice interaction content are improved, and the user experience is ultimately improved.
Based on the system architecture of the above exemplary application environment, embodiments of the voice interaction method provided by the present disclosure are introduced:
fig. 3 is a flow chart of a voice interaction method in an exemplary embodiment of the present disclosure. Referring to fig. 3, the method includes:
S310, receiving the voice information, determining the target user characteristics according to the voice information, and determining the slot position information according to the voice information.
Illustratively, the voice information is chat content or a consultation question initiated by the user on voice interaction application software (e.g., a voice assistant), such as "Hello, could you help me set an alarm clock", and the like. The target user characteristics may be the characteristics of any user who initiates such chat content or consultation questions to the voice assistant; in this embodiment, the target user is identified by the "target user characteristics". Illustratively, the user characteristics may be the user's age, gender, occupation, and the like.
S320, determining at least one target response library from a plurality of preset response libraries according to the target user characteristics, wherein different user characteristics correspond to different response libraries.
S330, determining target response content in at least one target response library according to the slot position information.
Illustratively, the slot position information is the key information to be collected for the "provide entertainment mode" intent. If the target response library is the response library corresponding to a child age group, the entertainment determined from the collected key information may be a children's amusement park; if the target response library is the response library corresponding to an adult age group, the determined entertainment may be a movie theater, and the like.
S340, determining the target response content as the response content to the voice information.
Illustratively, the target response content is displayed on the terminal display screen in text form, and may also be played in audio form through the terminal loudspeaker. For example, to increase the personalization of the response mode in the voice interaction process, when the target response library is the response library corresponding to a child age group, the response to the voice information may be rendered in a child's voice; when the target response library is the response library corresponding to an adult age group, the response may be rendered in an adult's voice.
In the voice interaction scheme provided by the embodiment shown in fig. 3, the slot position information and the target user characteristics are determined according to the voice information, the target response library is determined according to the target user characteristics, and the response content to the voice information is then determined in the at least one target response library according to the slot position information. Because different user characteristics correspond to different response libraries and the target response library is determined according to the target user characteristics, the response given to the user is more targeted, which improves the diversity of voice interaction, helps increase the personalization of the voice interaction content and response mode, and improves the user experience.
Meanwhile, because the plurality of response libraries are preset, the target response library can be rapidly determined according to the user characteristics, and the corresponding response content can be determined in the target response library in real time according to the slot position information. The overall S310-S340 flow is sketched below.
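The following sketch strings S310-S340 together under illustrative assumptions: preset_libraries maps a user characteristic identifier to a response library (here, a dict from intent to reply template), and get_user_features and get_slots stand for the feature detection and natural language understanding steps described later.

```python
def respond(voice_info, preset_libraries, get_user_features, get_slots):
    """A sketch of S310-S340; all helper and identifier names are illustrative."""
    # S310: determine target user characteristics and slot position information
    features = get_user_features(voice_info)       # e.g. ["age_group_3", "gender_2"]
    intent, slots = get_slots(voice_info)          # e.g. ("entertainment", {"place": "downtown"})
    # S320: different user characteristics correspond to different response libraries
    target_libraries = [preset_libraries[f] for f in features if f in preset_libraries]
    # S330: determine the target response content in the target libraries by slot information
    for library in target_libraries:
        if intent in library:
            # S340: the target response content becomes the reply to the voice information
            return library[intent].format(**slots)
    return None                                    # no matching library entry
```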
In an exemplary embodiment, fig. 4 shows a flow diagram of a voice interaction method according to another exemplary embodiment of the present disclosure. Fig. 4 illustrates a plurality of improvements, made on the basis of fig. 1, in the voice interaction scheme provided by the present disclosure.
Illustratively, fig. 5 is a flow chart of a voice interaction method in another exemplary embodiment of the present disclosure. Each improvement of the voice interaction scheme in fig. 4 is described in turn below with reference to fig. 5.
Referring to fig. 5, in S510, voice information is converted into text information.
Illustratively, referring to fig. 4, the voice information is converted into text information by the speech recognition module 11. Specifically, ASR technology converts the speech input uttered by the user into text information.
In S510', the first type of audio features of the voice information are extracted, and the voice information is screened according to the extracted first type of audio features.
Illustratively, referring to fig. 4, the first type of audio features of the voice information are extracted by the audio screening module 41, and the voice information is screened according to the extracted first type of audio features.
As described above, in the present technical scheme the response content is determined according to the user characteristics, so that the response to the user's voice information is more targeted and the diversity of voice interaction is improved. However, audio quality varies, and sounds fall into human voice and non-human voice. The voice information received by a terminal on which the voice interaction application software is installed may therefore contain considerable noise (non-human or low-quality sounds such as environmental noise, animal sounds, and the like). To ensure the accuracy of the response content in the voice interaction process, the present technical scheme first screens the audio and then uses the screened audio to determine the user characteristics.
For example, fig. 6 is a flowchart illustrating a method for determining a user characteristic in an exemplary embodiment of the disclosure. Wherein, S610-S630 are used for audio screening. Specifically, with reference to fig. 6:
In S610, the first type of audio features of the voice information are extracted.
In this embodiment, the voice information is subjected to framing processing to obtain a plurality of audio frames, and the first type of audio features are then determined for these audio frames. To improve audio screening efficiency, the first type of audio features are static features of a preset type. For example, the first type of audio features may be Low-Level Descriptor (LLD) features, High-level Statistics Functions (HSFs) features computed from the LLDs, the GeMAPS feature set formed from multiple HSF features, the eGeMAPS feature set that extends GeMAPS, or the ComParE feature set.
In S620, the extracted first type of audio features are input into a trained audio quality classification model. And, in S630, the voice information is screened according to the output of the audio quality classification model.
In this embodiment, the extracted ComParE features and eGeMAPS features are determined as the first type of audio features. Further, the extracted audio features are concatenated and then input into a trained audio quality classification model (such as a decision tree model), and the first type of audio features are classified by the model, where different classes correspond to different audio quality grades.
In this embodiment, the voice corresponding to the lower audio quality grades is screened out, thereby screening the received voice information. Noise such as non-human voice is thus removed, the audio quality is improved, and the accuracy of determining the response content is improved. Meanwhile, invalid audio is reduced, which reduces the subsequent amount of computation, saves computing resources, and helps achieve real-time determination of response content in the voice interaction process. A sketch of this screening step follows.
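A minimal sketch of S610-S630, assuming the openSMILE Python package (opensmile) for feature extraction and scikit-learn for the decision tree; the stand-in training data and the quality labels are illustrative assumptions, not part of the disclosure.

```python
import numpy as np
import opensmile                                   # pip install opensmile
from sklearn.tree import DecisionTreeClassifier

# Extractors for the two static feature sets named above (S610); the
# Functionals level yields one HSF-style vector per utterance.
compare_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals)
egemaps_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals)

def first_type_features(wav_path: str) -> np.ndarray:
    """Concatenate the ComParE and eGeMAPS functionals into one vector."""
    a = compare_extractor.process_file(wav_path).to_numpy().ravel()
    b = egemaps_extractor.process_file(wav_path).to_numpy().ravel()
    return np.concatenate([a, b])

# S620: a decision tree classifies the vector into audio quality grades.
# Stand-in random training data; a real system needs a labelled corpus.
rng = np.random.default_rng(0)
train_x = rng.random((8, 6373 + 88))               # published sizes of the two feature sets
train_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])       # 0 = low quality / noise, 1 = clean speech
quality_model = DecisionTreeClassifier(max_depth=8).fit(train_x, train_y)

def keep_audio(wav_path: str) -> bool:
    """S630: screen out voice information whose predicted quality grade is low."""
    return int(quality_model.predict([first_type_features(wav_path)])[0]) == 1
```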
It should be noted that, in general, the extraction of the first type of audio features is faster than converting the same speech into text information by ASR technology. Therefore, the audio screening process of the audio screening module 41 can be performed in parallel with the processing of the speech recognition module 11, which helps improve the real-time responsiveness of the voice interaction process.
With continued reference to fig. 5, in S520, the text information is structured to obtain slot information.
Illustratively, referring to fig. 4, the intent and the slot position information are obtained by structuring the text information through the natural language understanding module 12. Specifically, determining the intent corresponding to the text information is essentially classifying the purpose for which the user uttered the speech. The text information is structured according to a plurality of preset classifications (e.g., weather forecast, alarm setting, entertainment mode providing, ticket/air ticket ordering, etc.), and it is determined whether the structured information belongs to one of the preset classifications, for example, that the voice is intended to set an alarm clock or to order an air ticket. Further, the information that the task of that category still needs to collect, namely the slot position information, is determined according to the classification of the voice purpose in the intent. Specifically, for a voice intended to set an alarm clock, the slot position information includes information such as the alarm time and the reminding mode of the alarm; for a voice intended to order a ticket, the slot position information may include the departure place, the destination, the departure time, and the like. A minimal sketch of this structuring step follows.
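The sketch below illustrates the structuring step; the intent classifier and slot extractor are toy stand-ins, and the category and slot names are assumptions drawn from the examples above.

```python
from dataclasses import dataclass

# Slot schemas for two of the preset classifications named above; the
# category names and slot names are illustrative assumptions.
SLOT_SCHEMAS = {
    "set_alarm":   ["alarm_time", "reminder_mode"],
    "book_ticket": ["departure_place", "destination", "departure_time"],
}

@dataclass
class NluResult:
    intent: str   # which preset classification the utterance falls into
    slots: dict   # slot name -> extracted value, or None if still to be collected

def classify_intent(text: str) -> str:
    """Toy stand-in for the intent classifier of module 12."""
    return "set_alarm" if "alarm" in text else "book_ticket"

def fill_slot(text: str, slot_name: str):
    """Toy stand-in for slot extraction; returns None when the dialog
    still needs to collect this piece of information."""
    return "7:00" if slot_name == "alarm_time" and "7" in text else None

def structure_text(text: str) -> NluResult:
    """S520: classify the purpose of the utterance, then list the information
    the task of that category still needs to collect (the slot information)."""
    intent = classify_intent(text)
    slots = {name: fill_slot(text, name) for name in SLOT_SCHEMAS.get(intent, [])}
    return NluResult(intent=intent, slots=slots)

# e.g. structure_text("please set an alarm for 7")
#   -> NluResult(intent='set_alarm', slots={'alarm_time': '7:00', 'reminder_mode': None})
```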
In S520', the second type of audio features of the screened voice information are extracted, and the target user characteristics are determined according to the second type of audio features.
Illustratively, referring to fig. 4, the second type of audio features of the screened voice information are extracted by the feature detection module 42, and the target user characteristics are determined according to the second type of audio features.
Illustratively, S640-S660 in FIG. 6 are used to determine the target user characteristics. The target user characteristics refer to characteristics belonging to a target user, and the target user refers to any user who performs information interaction with the voice interaction application software. Specifically, with reference to fig. 6:
in S640, the second type audio features of the filtered voice information are extracted.
In the process of extracting Mel-Frequency Cepstral Coefficient (MFCC) features, the corresponding filters are designed according to the nonlinear frequency perception of human hearing, so that the audio information important to humans can be effectively extracted. To improve the accuracy of determining the user characteristics, the second type of audio features in this embodiment are MFCC features.
Illustratively, regarding the extraction of MFCC features: the screened speech is segmented by a sliding window; for example, audio 1000 ms long is divided into 100 frames each 10 ms long. For each 10 ms frame, 13-dimensional MFCC features are computed, so that the 1000 ms audio is converted into 100 features of 13 dimensions. That is, speech of a given duration is converted into a two-dimensional matrix of time segments and the MFCC features corresponding to each time segment. The length of each frame, the step between adjacent frames, the number of MFCC coefficients computed per frame, and so on may be adjusted according to the application and experimental effect, and are not limited here. A sketch of this extraction follows.
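A minimal sketch of the extraction using the librosa library; the 16 kHz sampling rate and the non-overlapping 10 ms frames are assumptions matching the example above.

```python
import librosa
import numpy as np

def mfcc_matrix(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Slide a 10 ms window over the screened speech and compute 13 MFCCs per
    frame, so that 1000 ms of audio becomes a (100, 13) two-dimensional matrix."""
    y, _ = librosa.load(wav_path, sr=sr)
    frame = int(0.010 * sr)                       # 10 ms -> 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=frame,
                                win_length=frame, center=False)
    return mfcc.T                                 # shape (num_frames, 13), e.g. (100, 13) for 1 s
```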
It should be noted that the second type of audio features may also be other types of audio features, such as the eGeMAPS and the ComParE features mentioned above, which are not limited herein. The MFCC features are employed herein in view of their benefits in improving the accuracy of the determination of user features.
In S650, the extracted second type of audio features are input into a trained user feature classification model, where the user feature classification model is used to predict the user's age group and/or the user's gender according to the input audio features. And, in S660, the target user characteristics are determined according to the output of the user feature classification model.
Considering that the second type of audio features take the form of a two-dimensional matrix, the user feature classification model may employ a convolutional neural network. After the second type of audio features are input into the trained user feature classification model, the target user characteristics, including the target user's age group, gender, and the like, can be determined from the model output. A sketch of such a model follows.
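A minimal sketch of such a convolutional user feature classification model in PyTorch; the layer sizes, the number of age groups, and the two-head design are illustrative assumptions (the disclosure only specifies that the model predicts the user's age group and/or gender from the input audio features).

```python
import torch
import torch.nn as nn

class UserFeatureCNN(nn.Module):
    """Small CNN over the (time x MFCC) matrix with one head per predicted
    user characteristic: age group and gender."""
    def __init__(self, num_age_groups: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # (16, 50, 6) for a (100, 13) input
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),          # fixed-size summary regardless of duration
            nn.Flatten(),
        )
        self.age_head = nn.Linear(32 * 4 * 4, num_age_groups)
        self.gender_head = nn.Linear(32 * 4 * 4, 2)

    def forward(self, mfcc: torch.Tensor):
        """mfcc: (batch, 1, num_frames, 13) second-type audio features."""
        h = self.backbone(mfcc)
        return self.age_head(h), self.gender_head(h)

# Usage: logits over age groups and genders for one 1-second utterance.
model = UserFeatureCNN(num_age_groups=5)
age_logits, gender_logits = model(torch.randn(1, 1, 100, 13))
target_features = (int(age_logits.argmax()), int(gender_logits.argmax()))
```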
According to this embodiment, the response content for the target user is determined according to the target user characteristics, which makes the response to the user's voice information more targeted and improves the diversity of voice interaction.
It should be noted that the user characteristic determination process of the characteristic detection module 42 is implemented in parallel with the processing process of the natural language understanding module 12, so as to further improve the real-time performance of the response in the voice interaction process.
In the exemplary embodiment of the present technical solution, the response libraries are preset: a plurality of user characteristic identifiers is defined, each with a corresponding response library, and each user characteristic identifier is associated with its corresponding user characteristic. Therefore, for different user characteristics, a target response library matching the current target user characteristics (such as the target user's age, gender, and the like) can be determined among the response libraries, and the response content for the voice uttered by the target user is determined based on the target response library and the slot positions.
Therefore, before describing the specific implementation of S530, an example of determining the preset response libraries is described with reference to figs. 7 and 8. In this embodiment, the user characteristics include: age group 1 (e.g., ages 5-15), ..., age group i, ..., age group N, and further include a first gender (e.g., male) and a second gender (e.g., female). On this basis, fig. 7 shows a flowchart of a method of determining the response libraries according to an embodiment of the present disclosure. Referring to fig. 7, the method includes S710-S730.
In S710, an ith response library for the ith age group is determined, and the user characteristics of the ith age group are associated with the ith response library, i being a positive integer not greater than N, N being a positive integer.
Illustratively, age group 1 (e.g., ages 5-15) corresponds to children; if the speech-converted text is "I want to play today", the response strategy may recommend a nearby amusement park, children's park, and the like, and such information is collected to determine the 1st response library corresponding to age group 1. Age group 3 (e.g., ages 25-35) corresponds to adults; for the same text, the response strategy may recommend other forms of entertainment, and such information is collected to determine the 3rd response library corresponding to age group 3. The response library corresponding to each age group is determined by analogy. Referring to fig. 8, N response libraries respectively associated with the N age groups are obtained.
In S720, an N+1th response library for the first gender is determined, and the user characteristics of the first gender are associated with the N+1th response library. And in S730, an N+2th response library for the second gender is determined, and the user characteristics of the second gender are associated with the N+2th response library, so as to obtain the plurality of preset response libraries 430.
Illustratively, for the first gender (e.g., male), if the speech-converted text is "I want to prepare a gift for my partner", the response strategy may recommend women's products and the like, and such information is collected to determine the N+1th response library associated with the first gender (see fig. 8). Correspondingly, for the second gender (e.g., female), for the same text the response strategy may recommend men's products and the like, and such information is collected to determine the N+2th response library corresponding to the second gender (see fig. 8).
It should be noted that S710-S730 need not be executed in the listed order and may be executed concurrently. Constructing such preset response libraries is sketched below.
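A sketch, under illustrative assumptions, of how the preset response libraries 430 of fig. 8 could be organized in code: each user characteristic identifier maps to a small library of intent-to-template entries. The identifiers, templates, and the value N = 3 are all assumptions of this sketch.

```python
N = 3  # number of age groups in this toy example

preset_libraries = {}
age_templates = {
    1: "How about the nearby children's amusement park, {place}?",    # children, ~5-15
    3: "There is a movie at {place} tonight. Shall I book tickets?",  # adults, ~25-35
}
for i in range(1, N + 1):                        # S710: the ith library for the ith age group
    template = age_templates.get(i, "Here are some activities near {place}.")
    preset_libraries[f"age_group_{i}"] = {"entertainment": template}

# S720/S730: the N+1th and N+2th libraries for the first and second gender.
# Per the example above, a male user shopping for a partner is shown
# women's products, and vice versa.
preset_libraries["gender_1"] = {"buy_gift": "How about {category} from the women's section?"}
preset_libraries["gender_2"] = {"buy_gift": "How about {category} from the men's section?"}

# Lookup at dialog time (S530/S540): user characteristics select the target
# libraries, and the slot position information fills the reply template.
libs = [preset_libraries["age_group_3"], preset_libraries["gender_2"]]
reply = next(l["entertainment"] for l in libs
             if "entertainment" in l).format(place="the Grand Cinema")
```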
In order to make the response content better fit the needs of different users and thereby optimize the feedback content of the voice interaction process, the related art generates user portraits for different users. Specifically: for each user, a corresponding user portrait is generated by recording that user's usage. If the user portrait shows that the user often uses meal-related functions such as takeout, the user may be given feedback about nearby restaurants for a text input such as "I am hungry"; and a user who mainly has chit-chat interactions with the terminal on which the voice interaction application software is installed can be served through the chat capability, without functional feedback. However, this way of determining the user portrait requires accumulating user data for different users over a long time and cannot give fine-grained, real-time content feedback to a user's query, which is unfavorable to the diversity of voice interaction.
In the technical scheme, the multiple response libraries are determined by the embodiment shown in fig. 7, so that the preset multiple response libraries 430 can be directly called when the user performs information interaction with the terminal installed with the voice interaction application software, thereby facilitating the improvement of the efficiency of determining response content in the voice interaction process and enabling the user to obtain the response content in real time.
With continued reference to fig. 5, in an exemplary embodiment, after the plurality of response libraries are determined based on the embodiment shown in fig. 7 and the user characteristics of the target user (i.e., the "target user characteristics") are determined, S530 is performed by the dialog management module 13: acquiring, according to the target user characteristics, at least one response library associated with the target user characteristics from the plurality of preset response libraries, so as to obtain at least one target response library.
Referring to fig. 4, according to the target user characteristic, at least one response library associated with the target user characteristic is obtained from a plurality of preset response libraries 430, so as to obtain at least one target response library 43.
Illustratively, referring to fig. 8, the user characteristic identifier of the ith age group is associated with the ith response library, so that after the "target user characteristics" are determined, the corresponding user characteristic identifiers can be determined, and the associated response libraries (i.e., the above-mentioned target response libraries) can then be determined. If the target user characteristics are "user characteristics of age group 3" and "user characteristics of the second gender", the 3rd response library associated with age group 3 and the N+2th response library associated with the second gender are determined as the target response libraries 43.
Further, referring to fig. 5, S540 is performed by the dialog management module 13: determining the target response content in the at least one target response library according to the slot position information, and determining the target response content as the response content to the voice information.
With continued reference to fig. 4, the dialog management module 13 determines the information corresponding to the intent and the slot positions from the screened target response library 43 and sends it to the natural language generation module 14, which converts the system action into natural language text through NLG technology; that is, the target response content collected according to the intent and the slot positions is fed back as text information that a person can understand. Further, the text information may be displayed directly on the display screen of the terminal as the reply content, without converting the text information into audio by means of the text-to-speech module 15.
Alternatively, the text information may be converted into audio through the text-to-speech module 15 and then fed back to the target user through the terminal device, the audio being played, for example, through a loudspeaker for the target user to receive.
According to the present technical scheme, the corresponding target response libraries are screened out of the preset response libraries according to the target user's own characteristics, so that response libraries matching the user's own characteristics, such as age and gender, are determined; the information corresponding to the intent and the slot positions is then determined from the screened target response libraries, yielding the response content corresponding to the voice information. Therefore, the present technical scheme can effectively improve the accuracy of the response content and improve the user experience.
Because different user characteristics correspond to different response libraries, and the target response library is determined according to the target user characteristics, the response content determined by this scheme differs for users with different characteristics even for the same text. Basing the response on the user's own characteristics makes the response to the user's voice information more targeted, improves the diversity of voice interaction, helps increase the personalization of the voice interaction content, and improves the user experience.
It is to be noted that the above figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures do not indicate or limit the chronological order of these processes. In addition, these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 9 is a schematic structural diagram of a voice interaction apparatus to which an embodiment of the present disclosure may be applied. Referring to fig. 9, the voice interaction apparatus shown in the figure may be implemented as all or a part of the terminal by software, hardware or a combination of both, and may also be integrated in the terminal or on the server as a separate module.
The voice interaction apparatus 900 in the embodiment of the present disclosure includes: a user characteristic determining unit 910, a response library determining unit 920, a response content determining unit 930, and a response unit 940, wherein:
the user characteristic determining unit 910 is configured to receive voice information, determine target user characteristics according to the voice information, and determine slot position information according to the voice information; the response library determining unit 920 is configured to determine at least one target response library from a plurality of preset response libraries according to the target user characteristics, where different user characteristics correspond to different response libraries; the response content determining unit 930 is configured to determine target response content in the at least one target response library according to the slot position information; and the response unit 940 is configured to determine the target response content as the response content to the voice information.
In an exemplary embodiment, FIG. 10 schematically illustrates a block diagram of a voice interaction device, according to another exemplary embodiment of the present disclosure. Please refer to fig. 10:
In an exemplary embodiment, based on the foregoing solution, the apparatus further includes: a slot position determining unit 950.
The slot position determining unit 950 is configured to: convert the voice information into text information; and perform structuring processing on the text information to obtain the slot position information.
In an exemplary embodiment, based on the foregoing solution, the apparatus further includes: an audio feature screening unit 960.
The audio feature screening unit 960 is specifically configured to: extracting first type audio features of the voice information, and screening the voice information according to the extracted first type audio features; and, the user characteristic determining unit 910 is further configured to: and extracting second-class audio features of the screened voice information, and determining the target user features according to the second-class audio features.
In an exemplary embodiment, based on the foregoing scheme, the slot position determining unit 950 is specifically configured to: convert the voice information into text information through the speech recognition module; the audio feature screening unit 960 is specifically configured to: extract the first type of audio features of the voice information through the audio screening module, and screen the voice information according to the extracted first type of audio features; where the audio screening module and the speech recognition module process in parallel.
In an exemplary embodiment, based on the foregoing scheme, the slot position determining unit 950 is further specifically configured to: perform structuring processing on the text information through the natural language understanding module to obtain the slot position information; the user characteristic determining unit 910 is further specifically configured to: extract the second type of audio features of the screened voice information through the feature detection module, and determine the target user characteristics according to the second type of audio features; where the feature detection module and the natural language understanding module process in parallel.
In an exemplary embodiment, based on the foregoing scheme, the user characteristic determining unit 910 is specifically configured to: extracting second class audio features of the voice information; inputting the extracted second-class audio features into a trained user feature classification model, wherein the user feature classification model is used for predicting the age bracket of the user and/or the gender of the user according to the input audio features; and determining the target user characteristics according to the output of the user characteristic classification model.
In an exemplary embodiment, the audio feature screening unit 960 is specifically configured to: before the user characteristic determining unit 910 determines a target user characteristic according to the voice information, extracting a first type of audio characteristic of the voice information; inputting the extracted first type of audio features into the trained audio quality classification model; and screening the voice information according to the output of the audio quality classification model so as to determine the characteristics of the target user through the screened voice information.
In an exemplary embodiment, based on the foregoing solution, the apparatus further includes: a response library construction unit 970. The user characteristics described above include: the user characteristics respectively corresponding to N age groups, and the first gender and the second gender.
The response library construction unit 970 is configured to: determine an ith response library for the ith age group, and associate the user characteristics of the ith age group with the ith response library, i being a positive integer not greater than N, N being a positive integer; determine an N+1th response library for the first gender, and associate the user characteristics of the first gender with the N+1th response library; and determine an N+2th response library for the second gender, and associate the user characteristics of the second gender with the N+2th response library, so as to obtain the plurality of preset response libraries.
In an exemplary embodiment, based on the foregoing scheme, the response library determining unit 920 is specifically configured to: acquire at least one response library associated with the target user characteristics from the plurality of preset response libraries, so as to obtain the at least one target response library.
In an exemplary embodiment, based on the foregoing scheme, the first type of audio features includes a preset type of static features, and the second type of audio features includes mel-frequency cepstral coefficient features.
It should be noted that, when the voice interaction apparatus provided in the foregoing embodiment executes the voice interaction method, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the voice interaction apparatus and the voice interaction method provided in the above embodiments belong to the same concept, and therefore, for details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the voice interaction method disclosed in the present disclosure, which will not be described herein again.
The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method of any of the preceding embodiments. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
The embodiment of the present disclosure further provides a terminal, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the program, the steps of any of the above-mentioned embodiments of the method are implemented.
Fig. 11 schematically shows a block diagram of a terminal in an exemplary embodiment according to the present disclosure. Referring to fig. 11, a terminal 1100 includes: a processor 1101 and a memory 1102.
In the embodiment of the present disclosure, the processor 1101 is the control center of the computer system and may be the processor of a physical machine or the processor of a virtual machine. The processor 1101 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, and the coprocessor is a low-power processor for processing data in the standby state.
In the embodiment of the present disclosure, the processor 1101 is specifically configured to:
receiving voice information, determining target user characteristics according to the voice information, and determining slot position information according to the voice information; determining at least one target response library from a plurality of preset response libraries according to the target user characteristics, where different user characteristics correspond to different response libraries; determining target response content in the at least one target response library according to the slot position information; and determining the target response content as the response content to the voice information.
Further, the determining the slot position information according to the voice information includes: converting the voice information into text information; and carrying out structuralization processing on the text information to obtain the slot position information.
Further, the determining the target user characteristics according to the voice information includes: extracting first type audio features of the voice information, and screening the voice information according to the extracted first type audio features; and extracting second type audio features of the screened voice information, and determining the target user features according to the second type audio features.
Further, the determining the slot position information according to the voice information includes: converting the voice information into text information through the speech recognition module; the extracting of the first type of audio features of the voice information and the screening of the voice information according to the extracted first type of audio features includes: extracting the first type of audio features of the voice information through the audio screening module, and screening the voice information according to the extracted first type of audio features; where the audio screening module and the speech recognition module process in parallel.
Further, the structuring the text information to obtain the slot information includes: carrying out structuralization processing on the text information through a natural language understanding module to obtain the slot position information; the extracting the second type of audio features of the screened voice information and determining the target user features according to the second type of audio features includes: extracting second type audio features of the screened voice information through a feature detection module, and determining the target user features according to the second type audio features; wherein, the characteristic detection module and the natural language understanding module process in parallel.
Further, the determining the target user characteristics according to the voice information includes: extracting second class audio features of the voice information; inputting the extracted second-class audio features into a trained user feature classification model, wherein the user feature classification model is used for predicting the age bracket of the user and/or the gender of the user according to the input audio features; and determining the target user characteristics according to the output of the user characteristic classification model.
Further, before determining the target user characteristic according to the voice information, the method further includes: extracting first class audio features of the voice information; inputting the extracted first type of audio features into the trained audio quality classification model; and screening the voice information according to the output of the audio quality classification model so as to determine the characteristics of the target user through the screened voice information.
Further, the user characteristics include: the user characteristics respectively corresponding to N age groups, and the first gender and the second gender; the method further includes: determining an ith response library for the ith age group, and associating the user characteristics of the ith age group with the ith response library, i being a positive integer not greater than N, N being a positive integer; determining an N+1th response library for the first gender, and associating the user characteristics of the first gender with the N+1th response library; and determining an N+2th response library for the second gender, and associating the user characteristics of the second gender with the N+2th response library, so as to obtain the plurality of preset response libraries.
Further, the determining at least one target response library from a plurality of preset response libraries according to the target user characteristics includes: acquiring at least one response library associated with the target user characteristics from the plurality of preset response libraries, so as to obtain the at least one target response library.
Further, the first type of audio features includes a predetermined type of static features, and the second type of audio features includes mel-frequency cepstral coefficient features.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments of the present disclosure, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement a method in embodiments of the present disclosure.
In some embodiments, the terminal 1100 further comprises: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a display 1104, a camera 1105, and an audio circuit 1106.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments of the present disclosure, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments of the present disclosure, any one or both of the processor 1101, the memory 1102, and the peripheral device interface 1103 may be implemented on separate chips or circuit boards. The embodiments of the present disclosure are not particularly limited in this regard.
The display screen 1104 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1104 is a touch display screen, the display screen 1104 also has the ability to capture touch signals on or over the surface of the display screen 1104. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1104 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments of the present disclosure, the display 1104 may be one, providing the front panel of the terminal 1100; in other embodiments of the present disclosure, the number of the display 1104 may be at least two, and each of the display 1104 is disposed on a different surface of the terminal 1100 or in a foldable design; in still other embodiments of the present disclosure, display 1104 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Even further, the display screen 1104 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 1104 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera 1105 is used to capture images or video. Optionally, the camera 1105 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments of the present disclosure, the camera 1105 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1106 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1100. The microphone may also be an array microphone or an omni-directional pick-up microphone.
Power supply 1107 is used to supply power to various components in terminal 1100. The power supply 1107 may be alternating current, direct current, disposable or rechargeable. When power supply 1107 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
The terminal block diagram shown in the embodiments of the present disclosure does not constitute a limitation on terminal 1100, and terminal 1100 may include more or fewer components than shown, or combine some components, or employ a different arrangement of components.
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present disclosure can be understood in specific instances by those of ordinary skill in the art. Further, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered within the scope of the present disclosure. Accordingly, equivalents may be resorted to as falling within the scope of the disclosure as claimed.