CN114141233A - Voice wake-up method and related devices
- Publication number: CN114141233A
- Application number: CN202111493721.7A
- Authority: CN (China)
- Prior art keywords: awakening, wake, current, segment, voice
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L 15/04: Speech recognition; Segmentation; Word boundary detection
- G10L 15/063: Speech recognition; Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L 15/16: Speech recognition; Speech classification or search using artificial neural networks
- G10L 15/22: Speech recognition; Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L 15/26: Speech recognition; Speech to text systems

All classes fall under G (Physics), G10 (Musical instruments; Acoustics), G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding).
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application discloses a voice wake-up method and related devices. In the method, after a terminal device collects a current speech segment from a voice stream in real time, it performs wake-up recognition processing on the current speech segment to obtain a current wake-up recognition result. If the current wake-up recognition result satisfies a high-threshold wake-up condition, a wake-up instruction is triggered to wake up a service item in the terminal device; if the current wake-up recognition result satisfies a low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among at least one historical wake-up recognition result, a wake-up instruction is likewise triggered. This effectively lowers the wake-up difficulty of hard-to-wake audio data, effectively raises the wake-up rate, and improves the voice wake-up effect.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a voice wake-up method and related devices.
Background
Voice wake-up technology is a common artificial intelligence technology; it is used to determine, according to user voice data, whether to wake up a service item (e.g., a TV-series search function, a movie search function, navigation, etc.). For example, when a user says "Xiao A classmate, navigate to XXX cell", the terminal device can wake up the navigation service based on this voice data, so that it can display the corresponding navigation route.
However, existing voice wake-up technology has shortcomings that result in a poor voice wake-up effect.
Disclosure of Invention
A main objective of the embodiments of the present application is to provide a voice wake-up method and related devices capable of improving the voice wake-up effect.
An embodiment of the present application provides a voice wake-up method, which includes the following steps:

acquiring a current speech segment;

performing wake-up recognition processing on the current speech segment to obtain a current wake-up recognition result;

triggering a wake-up instruction when the current wake-up recognition result satisfies a high-threshold wake-up condition;

and triggering a wake-up instruction when the current wake-up recognition result satisfies a low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among at least one historical wake-up recognition result.
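For illustration only, the dual-threshold decision described by these steps might be sketched as follows (Python; the threshold values, the history buffer size, and all names are assumptions made for exposition, not part of the disclosure):

```python
from collections import deque

# Assumed values; the method only requires second threshold < first threshold.
HIGH_THRESHOLD = 0.80  # "first probability threshold"
LOW_THRESHOLD = 0.50   # "second probability threshold"

history = deque(maxlen=8)  # recent wake-up recognition results (probabilities)

def in_low_band(p: float) -> bool:
    """Low-threshold wake-up condition: reaches the second probability
    threshold but not the first."""
    return LOW_THRESHOLD <= p < HIGH_THRESHOLD

def should_trigger(p: float) -> bool:
    """Return True if a wake-up instruction should be triggered for the
    current wake-up recognition result p."""
    if p >= HIGH_THRESHOLD:        # high-threshold wake-up condition
        history.clear()
        return True
    if in_low_band(p) and any(in_low_band(q) for q in history):
        history.clear()            # a second low-threshold hit: wake up
        return True
    history.append(p)
    return False
```

The second branch carries the idea of the method: two nearby low-threshold results are jointly treated as sufficient evidence of a wake-up attempt.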
In a possible implementation, the at least one historical wake-up recognition result includes a previous wake-up recognition result; the previous wake-up recognition result is obtained by performing wake-up recognition processing on the speech segment preceding the current speech segment, and the collection time of that preceding speech segment is earlier than the collection time of the current speech segment;

the triggering of a wake-up instruction when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result includes:

triggering a wake-up instruction when the current wake-up recognition result satisfies the low-threshold wake-up condition and the previous wake-up recognition result also satisfies the low-threshold wake-up condition.
In a possible implementation, the triggering of a wake-up instruction when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result includes:

triggering a wake-up instruction when the current wake-up recognition result satisfies the low-threshold wake-up condition, a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, and the time-difference characterization data between the target wake-up recognition result and the current wake-up recognition result satisfies a first time-difference condition.
In a possible implementation, the at least one historical wake-up recognition result is obtained by performing wake-up recognition processing on at least one historical speech segment;

the method further includes:

updating the at least one historical speech segment according to the current speech segment when the current wake-up recognition result does not satisfy the low-threshold wake-up condition, a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, and the current speech segment does not satisfy a preset message repetition condition;

and discarding the current speech segment when the current wake-up recognition result does not satisfy the low-threshold wake-up condition, a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, and the current speech segment satisfies the preset message repetition condition.
In a possible implementation, the method further includes:

determining that the current speech segment satisfies the preset message repetition condition if the message recognition result of the current speech segment is the same as the message recognition result of a to-be-referenced speech segment, where the collection time difference between the to-be-referenced speech segment and the current speech segment satisfies a second time-difference condition;

and determining that the current speech segment does not satisfy the preset message repetition condition if the message recognition result of the current speech segment differs from the message recognition result of the to-be-referenced speech segment.
In a possible implementation, the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

determining to-be-used training data according to a wake-up audio segment when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result; the wake-up audio segment includes the current speech segment and the historical speech segment corresponding to the target wake-up recognition result; the to-be-used training data is used for updating the wake-up recognition model.
In a possible implementation, the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

determining to-be-used training data according to a wake-up audio segment when the current wake-up recognition result satisfies the high-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result; the wake-up audio segment includes the current speech segment and the historical speech segment corresponding to the target wake-up recognition result; the to-be-used training data is used for updating the wake-up recognition model.
In a possible implementation, the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

determining to-be-used training data according to a wake-up audio segment when the current wake-up recognition result satisfies the low-threshold wake-up condition and the cloud wake-up recognition result of the current speech segment satisfies a normal wake-up condition; the cloud wake-up recognition result is obtained by a cloud wake-up model performing wake-up recognition on the current speech segment; the wake-up audio segment includes the current speech segment; the to-be-used training data is used for updating the wake-up recognition model and the cloud wake-up model.
In a possible implementation, the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

determining to-be-used training data according to a wake-up audio segment when the current wake-up recognition result satisfies a non-wake-up condition and the cloud wake-up recognition result of the current speech segment satisfies the normal wake-up condition; the cloud wake-up recognition result is obtained by a cloud wake-up model performing wake-up recognition on the current speech segment; the wake-up audio segment includes the current speech segment; the to-be-used training data is used for updating the wake-up recognition model and the cloud wake-up model.
In a possible implementation, the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

determining to-be-used training data according to a wake-up audio segment when the current wake-up recognition result satisfies the low-threshold wake-up condition and the cloud wake-up recognition result of the current speech segment does not satisfy the normal wake-up condition; the cloud wake-up recognition result is obtained by a cloud wake-up model performing wake-up recognition on the current speech segment; the wake-up audio segment includes the current speech segment; the to-be-used training data is used for updating the wake-up recognition model and the cloud wake-up model.
In a possible implementation, performing wake-up recognition processing on the current speech segment to obtain the current wake-up recognition result includes:

performing wake-up recognition processing on the current speech segment with a wake-up recognition model to obtain the current wake-up recognition result if in the non-awake state;

the method further includes:

if in the awake state, determining to-be-used training data according to the audio segment that triggered the awake state when it is determined that the current speech segment satisfies a preset information abnormality condition; the to-be-used training data is used to update the wake-up recognition model.
In a possible implementation, the method is applied to a terminal device, and the current wake-up recognition result is determined using a wake-up recognition model;

the method further includes:

sending a wake-up audio segment and attribute information of the wake-up audio segment to a cloud server when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, so that the cloud server determines to-be-used training data according to the wake-up audio segment and its attribute information; the wake-up audio segment includes the current speech segment; the to-be-used training data is used for updating the wake-up recognition model.
An embodiment of the present application further provides a voice wake-up apparatus, including:

a speech acquisition unit, configured to acquire a current speech segment;

a wake-up recognition unit, configured to perform wake-up recognition processing on the current speech segment to obtain a current wake-up recognition result;

a first trigger unit, configured to trigger a wake-up instruction when the current wake-up recognition result satisfies a high-threshold wake-up condition;

and a second trigger unit, configured to trigger a wake-up instruction when the current wake-up recognition result satisfies a low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among at least one historical wake-up recognition result.
An embodiment of the present application further provides an apparatus, including: a processor, a memory, and a system bus;

the processor and the memory are connected through the system bus;

the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any implementation of the voice wake-up method provided by the embodiments of the present application.

An embodiment of the present application further provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to execute any implementation of the voice wake-up method provided by the embodiments of the present application.

An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the voice wake-up method provided by the embodiments of the present application.
The technical solutions described above have the following beneficial effects:

According to the technical solutions provided by the present application, a terminal device (e.g., a smart TV, a smart phone, an in-vehicle smart system, etc.) collects a current speech segment from a voice stream in real time and performs wake-up recognition processing on the current speech segment to obtain a current wake-up recognition result. If the current wake-up recognition result satisfies a high-threshold wake-up condition, a wake-up instruction is triggered to wake up a service item in the terminal device; if the current wake-up recognition result satisfies a low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among at least one historical wake-up recognition result, a wake-up instruction is likewise triggered.

Thus, the voice wake-up method provided by the embodiments of the present application can wake up not only on audio data satisfying the high-threshold wake-up condition, but also on audio data that satisfies the low-threshold wake-up condition twice without ever reaching the high threshold. This effectively lowers the wake-up difficulty of certain audio data (e.g., hard-to-wake audio data), effectively raises the wake-up rate, and thereby improves the voice wake-up effect.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a voice stream according to an embodiment of the present application;
fig. 2 is a flowchart of a voice wake-up method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a voice wake-up technique according to an embodiment of the present application;
fig. 4 is a schematic diagram of attribute information of audio data according to an embodiment of the present application;
fig. 5 is a schematic diagram of a wake-up type according to an embodiment of the present application;
fig. 6 is a schematic diagram of false triggering audio data according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present application.
Detailed Description
In research on voice wake-up technology, the inventors found that it can be implemented by means of a pre-constructed neural network model with a wake-up recognition function: the neural network model performs wake-up recognition processing on audio data to obtain the wake-up probability of that audio data, and a wake-up instruction is triggered once the wake-up probability is determined to reach a preset threshold. The wake-up performance of a voice wake-up technology implemented on such a model therefore depends on the preset threshold. If the preset threshold is relatively high, the wake-up rate tends to be low, because the wake-up probability of much audio data cannot exceed the threshold; if the preset threshold is relatively low, the false wake-up rate tends to be high, because the wake-up probability of most audio data can exceed it.
The inventors also found that voice wake-up is usually performed on a voice stream collected in real time (such as the voice stream shown in fig. 1). Moreover, for a voice stream containing a hard-to-wake audio segment, although the wake-up probabilities of several speech segments in the stream cannot reach the preset threshold, those probabilities are nevertheless consistently rather high.
Based on the above findings, and to overcome the technical problems described in the background section, an embodiment of the present application provides a voice wake-up method: after a current speech segment is obtained from a voice stream in real time, wake-up recognition processing is performed on the current speech segment to obtain a current wake-up recognition result; if the current wake-up recognition result satisfies a high-threshold wake-up condition, a wake-up instruction is triggered to wake up a service item; and if the current wake-up recognition result satisfies a low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among at least one historical wake-up recognition result, a wake-up instruction is also triggered to wake up a service item.

Thus, the voice wake-up method provided by the embodiments of the present application can wake up not only on audio data satisfying the high-threshold wake-up condition, but also on audio data that satisfies the low-threshold wake-up condition twice without reaching the high threshold. This effectively lowers the wake-up difficulty of certain audio data (e.g., hard-to-wake audio data), effectively raises the wake-up rate, and thereby improves the voice wake-up effect.
In addition, the embodiments of the present application do not limit the execution subject of the voice wake-up method. For example, the voice wake-up method may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a personal digital assistant (PDA), a tablet computer, or the like; the server may be a stand-alone server, a cluster server, or a cloud server. As another example, in some cases the voice wake-up method may be implemented by a terminal device and a server cooperating with each other.
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments given herein without creative effort fall within the protection scope of the present application.
Method embodiment one
Referring to fig. 2, the figure is a flowchart of a voice wake-up method according to an embodiment of the present application.
The voice wake-up method provided by this embodiment of the present application includes the following steps S1-S4.

S1: Acquire the current speech segment.
The "current speech segment" is used to represent a speech segment collected in real time from a speech stream. For example, as shown in fig. 1, when a "third speech segment" is collected from the speech stream shown in fig. 1, the third speech segment may be determined as the current speech segment, so that data processing for the third speech segment can be subsequently implemented by means of a data processing procedure for the current speech segment.
In addition, the embodiment of the present application does not limit the execution subject of S1, and for example, it may be a terminal device (e.g., a smart television, a smart phone, a vehicle-mounted smart system, etc.).
Based on the above-mentioned related content of S1, for a terminal device with voice wake-up function, the terminal device can perform real-time voice acquisition processing on a voice stream to obtain a current voice segment, so that it can be determined whether a user has an intention to wake up a voice control function in the terminal device based on the current voice segment.
S2: and performing awakening identification processing on the current voice section to obtain a current awakening identification result.
The "current wake-up recognition result" refers to the wake-up probability of the current speech segment (i.e. the probability of occurrence of wake-up of the current speech segment), so that the "current wake-up recognition result" can indicate whether the current speech segment can be wake-up (i.e. wake up the voice control function in the terminal device).
In addition, the embodiment of the present application does not limit the determination process of the "current wake-up recognition result", for example, the determination process may specifically include: and performing awakening identification processing on the current voice section by utilizing a pre-constructed awakening identification model to obtain a current awakening identification result.
The awakening recognition model is used for awakening and recognizing input data of the awakening recognition model; and the "wake recognition model" is a machine learning model (e.g., Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), etc.).
It should be noted that, in the embodiment of the present application, the construction process of the "wake-up recognition model" is not limited, and the construction process may be implemented by using any existing or future construction method of a wake-up recognition model.
In addition, the embodiment of the present application does not limit the execution main body of S2, for example, in order to improve the real-time performance of voice wakeup, the execution main body of S2 may be a terminal device.
Based on the related content of S2, after the current speech segment is obtained from the speech stream in real time, the current speech segment is wakened and recognized by the wakening recognition model to obtain the current wakening recognition result, so that the current wakening recognition result can indicate the occurrence probability of the current speech segment triggering the wakening instruction, and it can be determined whether the current speech segment is wakened based on the current wakening recognition result subsequently.
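Purely as an illustrative sketch of S2 (the callable-model interface below is an assumption for exposition, not a real library API):

```python
def wake_recognition(current_segment, model) -> float:
    """Run a pre-constructed wake-up recognition model on the current
    speech segment and return its wake-up probability."""
    # `model` stands for any recognizer with this interface; the text only
    # says it is a machine learning model such as an RNN or CNN.
    return float(model(current_segment))
```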
S3: and triggering a wake-up instruction when the current wake-up identification result meets a high threshold wake-up condition.
The "high threshold wake-up condition" may be preset; and the high threshold wake-up condition may include: a first probability threshold is reached. It should be noted that, the embodiment of the present application is not limited to the above-mentioned "first probability threshold", and for example, it may be a normal threshold (for example, the above-mentioned "preset threshold") set for a machine learning model with a wake-up recognition function.
The "wake-up command" is used to generate a wake-up (i.e., to switch the voice control function in the terminal device from an un-wake-up state to a wake-up state) so that other function modules in the terminal device can know the wake-up state, so that the function modules can provide corresponding service items to the user (e.g., "navigate to XXX cell") based on voice data subsequently input by the user (e.g., "navigate to XXX cell").
In addition, the embodiment of the present application does not limit the execution main body of S3, for example, in order to improve the real-time performance of voice wakeup, the execution main body of S3 may be a terminal device.
Based on the above-mentioned related content of S3, after the current wake-up recognition result (i.e., the wake-up recognition result for the current voice segment) is obtained, it may be determined whether the current wake-up recognition result satisfies the high-threshold wake-up condition (e.g., whether the wake-up rate of the current voice segment reaches a preset threshold), so that when it is determined that the current wake-up recognition result satisfies the high-threshold wake-up condition, it may be determined that the current voice segment is woken up, and therefore, the wake-up instruction may be directly triggered to notify the wake-up message to other functional modules in the terminal device, so that the functional modules can provide corresponding service items to the user based on the voice data subsequently input by the user.
S4: and triggering a wake-up instruction when the current wake-up identification result meets the low threshold wake-up condition and a target wake-up identification result meeting the low threshold wake-up condition exists in at least one historical wake-up identification result.
The above-mentioned "low threshold wake-up condition" may be preset; and the low threshold wake-up condition may include: the second probability threshold is reached, but the first probability threshold is not reached. Wherein the second probability threshold is less than the first probability threshold.
The "historical voice segment" is obtained by performing voice segment (also called "historical voice segment") with a collection time earlier than the "current voice segment" in the voice stream. For example, for the voice stream shown in fig. 1, if the "current voice segment" is the third voice segment, the first voice segment and the second voice segment may be a historical voice segment corresponding to the "current voice segment".
In addition, the embodiment of the present application does not limit the determination process of the "at least one historical wake-up recognition result", for example, in order to improve the voice wake-up effect, the determination process of the "at least one historical wake-up recognition result" may specifically include steps 11 to 12:
step 11: and determining a historical reference interval according to the preset reference time length and the acquisition time of the current voice section.
The preset reference time length is used for describing the voice time length of a trigger word; in addition, the embodiment of the present application does not limit the obtaining process of the "preset reference duration", for example, the obtaining process may be preset, and particularly may be set according to an application scenario. As another example, the "preset reference duration" may be mined from a large amount of user speech data.
The "historical reference interval" may be expressed as [ the collection time of the current speech segment-the preset reference time length, the collection time of the current speech segment ].
Step 12: and performing awakening recognition processing on at least one voice segment collected from the voice stream in the historical reference interval to obtain at least one historical awakening recognition result.
Based on the relevant content of the at least one historical awakening identification result, the determination time of the historical awakening identification results is closer to the determination time of the current awakening identification result, so that the interference of the historical voice awakening process to the current awakening process can be effectively avoided, and the voice awakening effect is favorably improved.
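A minimal sketch of steps 11 and 12, assuming collection times are plain timestamps and each wake-up recognition result is a probability (names and types are illustrative):

```python
def historical_reference_interval(t_current: float, ref_duration: float) -> tuple:
    """Step 11: [collection time of the current speech segment - preset
    reference duration, collection time of the current speech segment]."""
    return (t_current - ref_duration, t_current)

def select_historical_results(all_results, interval):
    """Step 12 (selection part): keep the wake-up recognition results of
    segments collected inside the historical reference interval.
    `all_results` holds (collection_time, wake_probability) pairs."""
    lo, hi = interval
    return [(t, p) for (t, p) in all_results if lo <= t <= hi]
```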
The "target wake-up recognition result" refers to a historical wake-up recognition result satisfying a low threshold wake-up condition in the "at least one historical wake-up recognition result"; and the target awakening identification result is obtained by carrying out awakening identification processing on the target voice segment. Wherein, the acquisition time of the target voice segment is earlier than the acquisition time of the current voice segment.
In addition, this embodiment does not limit the execution subject of S4; for example, to improve the real-time performance of voice wake-up, it may be a terminal device.

Based on S1-S4, in the voice wake-up method provided by this embodiment of the present application, after the current speech segment is obtained from the voice stream in real time, wake-up recognition processing is performed on it to obtain the current wake-up recognition result. If the current wake-up recognition result satisfies the high-threshold wake-up condition, a wake-up instruction is triggered to wake up a service item in the terminal device; if the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, a wake-up instruction is likewise triggered.

Thus, the voice wake-up method provided by this embodiment can wake up not only on audio data satisfying the high-threshold wake-up condition, but also on audio data that satisfies the low-threshold wake-up condition twice without reaching the high threshold. This effectively lowers the wake-up difficulty of certain audio data (e.g., hard-to-wake audio data), effectively raises the wake-up rate, and thereby improves the voice wake-up effect.
Method embodiment two
In practice, for a voice stream containing a hard-to-wake audio segment, the wake-up probabilities of several consecutive speech segments, or of several speech segments whose collection times are relatively close together (e.g., the "second speech segment" and "third speech segment" shown in fig. 1), all lie between the second probability threshold and the first probability threshold. The "target speech segment" is therefore either adjacent to the "current speech segment" or, if not adjacent, separated from it by a relatively small collection-time interval.

On this basis, to improve the voice wake-up effect, this embodiment of the present application provides another possible implementation of S4, which may specifically include at least one of S41-S42:
s41: and triggering a wake-up instruction when the current wake-up identification result meets the low threshold wake-up condition and the previous wake-up identification result meets the low threshold wake-up condition.
The former awakening identification result is obtained by carrying out awakening identification processing on the previous voice section of the current voice section; and the "previous wakeup identification result" refers to the probability of wakeup of the previous speech segment (i.e., the probability of occurrence of wakeup of the previous speech segment), so that the "previous wakeup identification result" can indicate whether the previous speech segment can generate wakeup or not.
It should be noted that the determination process of the "previous wake up identification result" is similar to the determination process of the "current wake up identification result" described above.
The "previous speech segment" refers to a speech segment adjacent to the current speech segment and having an earlier acquisition time than the current speech segment. For example, for the voice stream shown in fig. 1, if the "current voice segment" is the third voice segment, the "previous voice segment" may be the second voice segment.
Based on the related content of S41, after the current wake-up recognition result (i.e., the wake-up recognition result for the current voice segment) is obtained, if it is determined that the current wake-up recognition result satisfies the low-threshold wake-up condition and it is determined that the previous wake-up recognition result satisfies the low-threshold wake-up condition, it may be determined that the wake-up recognition results of two consecutive voice segments in the voice stream both satisfy the low-threshold wake-up condition, so that it may be estimated that the wake-up voice segment including the two voice segments belongs to the audio data that is not easy to wake up, and therefore, the wake-up instruction may be directly triggered to notify other functional modules in the terminal device of the wake-up message, so that the functional modules may provide corresponding service items to the user based on the voice data subsequently continuously input by the user.
S42: and triggering an awakening instruction when the current awakening identification result meets the low threshold awakening condition, a target awakening identification result meeting the low threshold awakening condition exists in at least one historical awakening identification result, and time difference characterization data between the target awakening identification result and the current awakening identification result meets a first time difference condition.
The above "time difference characterization data" is used to describe the time interval between two wake-up recognition results; for example, when the "target wakeup identification result" is obtained by performing wakeup identification processing on a target speech segment, the "time difference characterization data between the target wakeup identification result and the current wakeup identification result" may include: the time difference between the acquisition time of the current speech segment and the acquisition time of the target speech segment.
It should be noted that the embodiment of the present application does not limit the relationship between the "current speech segment" and the "target speech segment" in S42, for example, the "current speech segment" and the "target speech segment" may be adjacent speech segments (e.g., a third speech segment and a second speech segment), or may be non-adjacent speech segments with relatively close acquisition time (e.g., a third speech segment and a first speech segment).
In addition, the embodiment of the present application does not limit the determination manner of the "time difference characterization data", and for example, the determination may be performed by using a time difference calculation formula. As another example, it may be determined by means of the timer T shown in fig. 3.
The "first time difference condition" described above may be set in advance. For example, as shown in fig. 3, the "first time difference condition" may specifically be: belonging to the time difference region [1 second, 10 seconds ].
Based on the above-mentioned related contents of S42, after obtaining the current wake-up recognition result (i.e., the wake-up recognition result for the current voice segment), when determining that the current wake-up recognition result satisfies the low-threshold wake-up condition, determining that the existing target wake-up recognition result satisfies the low-threshold wake-up condition, and determining that the time difference characterization data between the target wake-up recognition result and the current wake-up recognition result satisfies the first time difference condition, it may be determined that the wake-up recognition results of two voice segments with relatively close collection times in the voice stream both satisfy the low-threshold wake-up condition, so that it may be estimated that the wake-up audio segment including the two voice segments belongs to the difficult-to-wake-up audio data, so that a wake-up instruction may be directly triggered to notify other function modules in the terminal device of the wake-up message, so that these function modules may continue inputting voice data based on the user subsequently, and providing the corresponding service items to the user.
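A minimal sketch of the S42 check, assuming the example window of fig. 3 and the illustrative thresholds used earlier (all values are assumptions):

```python
LOW_THRESHOLD, HIGH_THRESHOLD = 0.50, 0.80  # assumed, as in the earlier sketch
MIN_GAP_S, MAX_GAP_S = 1.0, 10.0            # example first time-difference condition

def s42_should_trigger(t_current: float, p_current: float, history) -> bool:
    """`history` holds (collection_time, wake_probability) pairs of earlier
    segments. Trigger when the current result and some historical result both
    fall in the low-threshold band and their time difference is in the window."""
    if not (LOW_THRESHOLD <= p_current < HIGH_THRESHOLD):
        return False
    return any(
        LOW_THRESHOLD <= p < HIGH_THRESHOLD
        and MIN_GAP_S <= t_current - t <= MAX_GAP_S
        for (t, p) in history
    )
```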
Based on the above description of S4, for a voice stream, if the wake-up recognition results of two consecutive speech segments both satisfy the low-threshold wake-up condition, or the wake-up recognition results of two speech segments within the preset reference duration both satisfy the low-threshold wake-up condition, the wake-up audio segment comprising those two segments can be judged to be hard-to-wake audio data, and the wake-up instruction can be triggered directly to notify the other functional modules in the terminal device, so that those modules can provide corresponding service items based on the user's subsequent voice input. This effectively lowers the wake-up difficulty of hard-to-wake audio data, effectively raises the wake-up rate, and is conducive to improving the voice wake-up effect.
Method embodiment three
In some cases, the voice stream may carry repeated character content (for example, "Xiao A classmate, navigate to XXX cell" appearing more than once). To avoid the adverse effect of repeated character content on voice wake-up, this embodiment of the present application further provides another possible implementation of the voice wake-up method, described below with an example for ease of understanding.
As an example, when the at least one historical wake-up recognition result is obtained by performing wake-up recognition processing on at least one historical speech segment, the voice wake-up method may further include, in addition to S1-S4 above, S5-S6:

S5: Update the at least one historical speech segment according to the current speech segment when the current speech segment does not satisfy the preset message repetition condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result.

The "preset message repetition condition" may be preset, and its working principle is: determine whether the message recognition result of the current speech segment is the same as the message recognition result of the to-be-referenced speech segment; if yes, the current speech segment satisfies the preset message repetition condition; if no, it does not.

The "message recognition result of the current speech segment" represents the character information carried by the current speech segment. This embodiment does not limit how it is determined; for example, speech recognition may first be performed on the current speech segment to obtain at least one candidate vocabulary item, and these items may then be assembled into the message recognition result of the current speech segment.

The "to-be-referenced speech segment" is the speech segment consulted when deciding whether the current speech segment satisfies the preset message repetition condition, and the collection time difference between it and the current speech segment satisfies the second time-difference condition.

It should be noted that the "second time-difference condition" may be preset. For example, as shown in fig. 3, the time interval between the collection time of the to-be-referenced speech segment and the collection time of the current speech segment is less than 1 second; or the interval between their collection starting points is less than 200 milliseconds.

The "message recognition result of the to-be-referenced speech segment" represents the character information carried by the to-be-referenced speech segment, and it is determined in the same way as the message recognition result of the current speech segment above.

Based on the above description of the "preset message repetition condition", during the search for a second low-threshold message in the voice stream (that is, while a target wake-up recognition result satisfying the low-threshold wake-up condition already exists among the at least one historical wake-up recognition result), after it is determined that the current speech segment does not satisfy the low-threshold wake-up condition, it can be checked whether the message recognition result of the current speech segment is the same as that of the to-be-referenced speech segment. If it is, the current speech segment carries the same character content as the to-be-referenced speech segment and therefore satisfies the preset message repetition condition, so the current speech segment and its related content (e.g., its wake-up recognition result and speech recognition result) can be deleted, which helps prevent repeated characters from interfering with the search for the second low-threshold message. If it is not, the character content differs, the current speech segment does not satisfy the preset message repetition condition, and the segment and its related content can be retained so that the search for the second low-threshold message can continue with reference to them.
It should be noted that this embodiment does not limit how "updating the at least one historical speech segment according to the current speech segment" in S5 is implemented. For example, it may specifically be: deleting the historical speech segment with the earliest collection time from the at least one historical speech segment to obtain at least one remaining speech segment, and then combining the current speech segment with the at least one remaining speech segment to obtain the updated at least one historical speech segment.

S6: Discard the current speech segment when it is determined that the current speech segment satisfies the preset message repetition condition.

It should also be noted that this embodiment does not limit the execution subjects of S5-S6; for example, to improve the real-time performance of voice wake-up, they may be terminal devices.
Based on S5-S6, as shown in fig. 3, after it is determined that the current speech segment does not satisfy the low-threshold wake-up condition and that a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, it can be checked whether the message recognition result of the current speech segment is the same as that of the to-be-referenced speech segment. If it is, the current speech segment can be discarded directly, so that it does not interfere with the subsequent wake-up process; if it is not, the at least one historical speech segment is updated according to the current speech segment, so that the wake-up recognition results of the updated historical speech segments can serve as the at least one historical wake-up recognition result in the subsequent wake-up process. This realizes automatic updating of the at least one historical wake-up recognition result and is conducive to improving the voice wake-up effect.
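A minimal sketch of S5 and S6, assuming each segment's message recognition result is available as plain text and the historical speech segments sit in a bounded buffer (all names are illustrative):

```python
from collections import deque

def satisfies_repetition(current_text: str, reference_text: str) -> bool:
    """Preset message repetition condition: the message recognition results
    (recognized character content) of the two segments are identical."""
    return current_text == reference_text

def handle_segment(segment, segment_text: str, reference_text, history: deque) -> None:
    """S5/S6 for a segment whose wake-up result stayed below the low threshold
    while a target low-threshold result already exists in the history."""
    if reference_text is not None and satisfies_repetition(segment_text, reference_text):
        return  # S6: repeated character content, discard the segment
    # S5: update the historical speech segments; a bounded deque drops the
    # segment with the earliest collection time automatically.
    history.append(segment)
```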
Method embodiment four
In practice, the "wake-up recognition processing" above is usually implemented by means of a wake-up recognition model, and such a model is usually constructed from a large amount of training data.

In research on this training data, the inventors found that it can be obtained by combining pre-recorded wake-up word audio data with played-back scene noise data. However, because the noise data involved in such training data differs considerably from the noise occurring in real application scenarios, the training data also differs considerably from the voice data occurring in those scenarios, so a wake-up recognition model constructed from it cannot accurately perform wake-up recognition on real-scenario voice data, which easily leads to a poor voice wake-up effect.

To overcome this problem, the wake-up recognition model can be update-trained with voice data occurring in real application scenarios (especially hard-to-wake audio data, falsely woken audio data, and the like), so that the updated model has better wake-up recognition performance.
On this basis, the present application provides another possible implementation of the voice wake-up method. In this implementation, the method may include not only all or some of the steps described above but also S7:

S7: Determine to-be-used training data according to the wake-up audio segment when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result.

The "wake-up audio segment" refers to audio data carrying a wake-up word (e.g., "Xiao A classmate", "play", "open", "find", etc.).

In addition, when the current wake-up recognition result satisfies the low-threshold wake-up condition and a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, it can be determined that the current speech segment and the target speech segment (i.e., the historical speech segment corresponding to the target wake-up recognition result) each carry part of the characters of the wake-up word, so the wake-up audio segment may include the current speech segment and the target speech segment.

It should be noted that this embodiment does not limit how the wake-up audio segment is obtained; any existing or future determination process may be used (for example, the wake-up audio segment may be extracted from the voice stream by means of speech recognition performed on the stream in real time).

The "to-be-used training data" refers to the training data needed when update-training the above wake-up recognition model.
This embodiment does not limit how the to-be-used training data is determined. For example, if the update process of the wake-up recognition model is implemented in a cloud server, the determination may specifically include steps 21-22:

Step 21: When the terminal device determines that the current wake-up recognition result satisfies the low-threshold wake-up condition and that a target wake-up recognition result satisfying the low-threshold wake-up condition exists among the at least one historical wake-up recognition result, the terminal device sends the wake-up audio segment and the attribute information of the wake-up audio segment to the cloud server.

The attribute information of the wake-up audio segment describes the wake-up audio segment. This embodiment does not limit the attribute information; for example, as shown in fig. 4, it may include at least one of device information, data type, wake-up word label, wake-up type, speech recognition text label, time information, and whether the segment comes from a cloud wake-up.
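Purely as an illustration, such attribute information could be carried in a record like the following (the field names paraphrase the fig. 4 items and are assumptions, not the patent's actual data layout):

```python
from dataclasses import dataclass

@dataclass
class WakeAudioAttributes:
    device_info: str
    data_type: str
    wake_word_label: str
    wake_type: str          # e.g. "high-threshold wake-up" / "low-threshold wake-up"
    asr_text_label: str     # speech recognition text label
    time_info: str
    from_cloud_wakeup: bool
```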
The "wake-up word label" refers to a wake-up word labeled by the terminal device for the wake-up audio segment, so that the "wake-up word label" is used to represent the character content carried by the wake-up audio segment; in addition, the embodiment of the present application does not limit the determination manner of the "wake word label", for example, the determination manner may specifically be: matching the voice recognition text of the awakening audio frequency segment with at least one candidate awakening word; and determining the successfully matched candidate awakening words as the awakening word labels of the awakening audio segment.
It should be noted that "at least one candidate wake-up word" refers to words (e.g., xiao a classmates, play, open, search, etc.) that are pre-stored in the terminal device and can trigger a wake-up instruction, so that the terminal device can refer to the words when performing a wake-up recognition process on an audio data.
The "wake type" is used to indicate a wake mode for waking up the audio segment; furthermore, the embodiment of the present application does not limit the "wake-up type", and for example, it may be implemented by using the "wake-up type" shown in fig. 5. It should be noted that, in fig. 5, "high threshold wake-up" is used to indicate that the high threshold wake-up condition is satisfied; "Low threshold wakeup" is used to indicate that the low threshold wakeup condition is met.
Based on the related content of step 21, for the terminal device, when the current wake-up recognition result satisfies the low threshold wake-up condition and a target wake-up recognition result satisfying the low threshold wake-up condition exists in at least one historical wake-up recognition result, a wake-up audio segment including the current speech segment and the target speech segment may first be determined; the wake-up audio segment and its attribute information are then uploaded to the cloud server, so that the cloud server can determine training data for updating the wake-up recognition model by referring to the wake-up audio segment and its attribute information.
Step 22: and the cloud server determines the training data to be used according to the awakening audio segment and the attribute information of the awakening audio segment.
It should be noted that the embodiment of the present application does not limit the implementation manner of step 22. For example, it may specifically include: first, determining a hard-to-wake audio sample according to the wake-up audio segment, and determining the label information of the hard-to-wake audio sample according to the wake-up word label of the wake-up audio segment (or according to a wake-up word manually labeled for the wake-up audio segment); then, determining the to-be-used training data from the hard-to-wake audio sample and its label information, so that the to-be-used training data includes both, and model update training can subsequently be performed on the wake-up recognition model accordingly.
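One possible shape for step 22 is sketched below; the record layout and the choice to treat the whole wake-up audio segment as the hard-to-wake sample are assumptions made only for illustration.

```python
# Hypothetical step-22 sketch: wrap the hard-to-wake audio sample and its
# label information into one training record of the to-be-used training data.
def build_training_record(wake_audio_segment: bytes, wake_word_label: str) -> dict:
    hard_sample = wake_audio_segment   # assumed: the segment itself is the sample
    return {
        "audio": hard_sample,
        "label": wake_word_label,      # from the terminal or manual labeling
        "positive": True,              # hard-to-wake data serves as a positive example
    }
```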
Based on the related content of S7, after the current wake-up recognition result is obtained, if it is determined that the current wake-up recognition result satisfies the low threshold wake-up condition and that a target wake-up recognition result satisfying the low threshold wake-up condition exists in the at least one historical wake-up recognition result, it can be determined that the wake-up audio segment including the current speech segment and the target speech segment belongs to audio data that the wake-up recognition model does not easily wake up. The to-be-used training data can therefore be determined according to the wake-up audio segment, so that it includes the wake-up audio segment; the wake-up audio segment can later be used as a positive example for model update training, so that the updated wake-up recognition model performs wake-up recognition processing better on such segments. This improves the wake-up recognition performance of the wake-up recognition model and thereby the voice wake-up effect.
In addition, in order to further improve the collection effect of the audio data that is not easy to wake up, the present application provides another possible implementation manner of the voice wake-up method, in which the voice wake-up method may include not only all or part of the above steps, but also S8:
S8: when the current wake-up recognition result meets the high threshold wake-up condition and a target wake-up recognition result meeting the low threshold wake-up condition exists in at least one historical wake-up recognition result, determining the to-be-used training data according to the wake-up audio segment.
It can be seen that, after the current wake-up recognition result is obtained, if it is determined that the current wake-up recognition result satisfies the high threshold wake-up condition and that a target wake-up recognition result satisfying the low threshold wake-up condition exists in at least one historical wake-up recognition result, it can be determined that the wake-up audio segment including the current speech segment and the target speech segment belongs to audio data that the wake-up recognition model does not easily wake up. The to-be-used training data can therefore be determined according to the wake-up audio segment, so that it includes the wake-up audio segment. In this way, audio data with the wake-up type "low threshold wake-up → high threshold wake-up" can also be collected, which improves the comprehensiveness of the to-be-used training data and thereby the voice wake-up effect.
Note that "determining the training data to be used based on the wake-up audio segment" in S8 is similar to the determination process of "training data to be used" in S7.
In addition, in order to further improve the collection effect of the audio data that is not easy to wake up, the present application provides another possible implementation manner of the voice wake-up method, in which the voice wake-up method may include not only all or part of the above steps, but also S9:
S9: when the current wake-up recognition result meets the low threshold wake-up condition and the cloud wake-up recognition result of the current speech segment meets the normal wake-up condition, determining the to-be-used training data according to the wake-up audio segment.
The "cloud wake-up recognition result" is obtained by performing wake-up recognition on the current speech segment with the cloud wake-up model, and is similar to the "current wake-up recognition result" above.
The "cloud wake-up model" is used to perform wake-up recognition on its input data, and is a machine learning model.
The embodiment of the present application does not limit the relationship between the "cloud wake-up model" and the "wake-up recognition model" above. For example, the network structure of the "cloud wake-up model" may be more complex than that of the "wake-up recognition model", and the wake-up recognition performance of the cloud wake-up model may be far higher than that of the wake-up recognition model.
The embodiment of the present application does not limit the deployment relationship between the "cloud wake-up model" and the "wake-up recognition model" described above, for example, because the computing resources of the terminal device are limited, in order to ensure that the terminal device can better serve the user, the wake-up recognition model is generally deployed on the terminal device, so that the terminal device can use the wake-up recognition model to perform wake-up triggering processing; moreover, since the computing resources of the cloud server are rich, in order to improve the awakening recognition effect, the cloud awakening model can be deployed in the cloud server, so that the cloud server can verify the awakening recognition result generated in the terminal device by means of the cloud awakening model. The terminal equipment can be in data communication with the cloud server.
In addition, the embodiment of the present application does not limit the obtaining process of the "cloud wake-up recognition result", for example, the obtaining process may specifically include: after the terminal device acquires the current voice segment, the terminal device can send the current voice segment to a cloud server, so that the cloud server can utilize a cloud awakening model to perform awakening identification processing on the current voice segment to obtain a cloud awakening identification result of the current voice segment; then, the cloud server can feed back the cloud awakening identification result to the terminal device, so that the terminal device can determine whether the current awakening identification result is correct or not by means of the cloud awakening identification result.
The "normal wake-up condition" may be preset, and may include: a first probability threshold is reached.
Based on the related content of S9, after the current wake-up recognition result is obtained, if it is determined that the current wake-up recognition result satisfies the low threshold wake-up condition and that the cloud wake-up recognition result of the current speech segment satisfies the normal wake-up condition, it can be determined that the current speech segment should actually have triggered a wake-up, but the wake-up recognition model, owing to its limitations, could not accurately recognize it. The wake-up audio segment including the current speech segment therefore belongs to audio data that the wake-up recognition model does not easily wake up, and the to-be-used training data can be determined according to the wake-up audio segment, so that it includes the wake-up audio segment. In this way, audio data with the wake-up type "local low threshold wake-up → cloud wake-up" can also be collected, which improves the comprehensiveness of the hard-to-wake audio data, and thereby the comprehensiveness of the to-be-used training data and the voice wake-up effect.
Note that "determining the training data to be used based on the wake-up audio segment" in S9 is similar to the determination process of "training data to be used" in S7.
It should be further noted that, for any embodiment that refers to both the "wake recognition model" (also referred to as a local wake model) and the "cloud wake model", the "training data to be used" is used to update the wake recognition model and the cloud wake model.
It should be further noted that the embodiment of the present application does not limit the process of updating the "wake-up recognition model and the cloud wake-up model". For example, the to-be-used training data may be used to update-train the wake-up recognition model and the cloud wake-up model separately, obtaining an updated wake-up recognition model and an updated cloud wake-up model. As another example, the cloud wake-up model may first be update-trained with the to-be-used training data to obtain an updated cloud wake-up model, and then, according to a preset model simplification rule, model simplification processing may be performed on the updated cloud wake-up model to obtain the updated wake-up recognition model.
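The second update strategy can be pictured with the sketch below, where `train` and `simplify` stand in for the unspecified training procedure and the preset model simplification rule (e.g., pruning or distillation); nothing here is fixed by the embodiment.

```python
# Sketch of the two-stage update: first update-train the cloud wake-up model,
# then derive the local wake-up recognition model from it by simplification.
def update_models(cloud_model, training_data, train, simplify):
    updated_cloud = train(cloud_model, training_data)  # update the cloud model
    updated_local = simplify(updated_cloud)            # preset simplification rule
    return updated_local, updated_cloud
```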
In addition, in order to further improve the collection effect of the audio data that is not easy to wake up, the present application provides another possible implementation manner of the voice wake-up method, in which the voice wake-up method may include not only all or part of the above steps, but also S10:
S10: when the current wake-up recognition result meets the non-wake-up condition and the cloud wake-up recognition result of the current speech segment meets the normal wake-up condition, determining the to-be-used training data according to the wake-up audio segment.
The above "no-wake-up condition" may be preset; for example, it may specifically be: below a second probability threshold. It should be noted that, for the relevant content of the above "second probability threshold", please refer to S4 above.
Based on the relevant content of S10, after the current wake-up recognition result is obtained, if it is determined that the current wake-up recognition result satisfies the non-wake-up condition and that the cloud wake-up recognition result of the current speech segment satisfies the normal wake-up condition, it can be determined that the current speech segment should actually have triggered a wake-up, but the wake-up recognition model, owing to its limitations, could not wake it up at all. The wake-up audio segment including the current speech segment therefore belongs to audio data that the wake-up recognition model cannot wake up, and the to-be-used training data can be determined according to the wake-up audio segment, so that it includes the wake-up audio segment. In this way, audio data with the wake-up type "local non-wake-up → cloud wake-up" can also be collected, which improves the comprehensiveness of the hard-to-wake audio data, the comprehensiveness of the to-be-used training data, and the voice wake-up effect.
Note that "determining the training data to be used based on the wake-up audio segment" in S10 is similar to the determination process of "training data to be used" in S7.
Method example five
In fact, since voice wake-up technology suffers not only from hard-to-wake anomalies but also from false wake-up anomalies, its performance is comprehensively measured by both the wake-up rate and the false wake-up rate. Based on this, in order to improve the wake-up recognition performance of the above "wake-up recognition model", a large amount of real false wake-up audio data may be collected while the "wake-up recognition model" is in actual use, so that the model can be updated based on this false wake-up audio data.
The inventor also found, in research on voice wake-up technology, that since human-computer interaction over a voice stream usually follows the pattern "wake-up word + user intention", false wake-up audio data can be collected from the wake-up recognition result and the semantic recognition result of the voice stream. For example, as shown in fig. 6, a false wake-up audio segment typically appears in audio data following the pattern "audio segment triggers a wake-up message + recognition yields empty text / filler words / no valid semantics".
Based on this, the present application provides another possible implementation manner of the voice wake-up method, in this implementation manner, the voice wake-up method may include not only all or part of the above steps, but also S11-S12:
S11: after the current speech segment is acquired, determining whether the voice control function is in the awake state; if so, executing S12; if not, executing S2 and its subsequent steps.
The awake state is determined according to whether the wake-up instruction has been triggered. Specifically, before the wake-up instruction is triggered, the voice control function in the terminal device is in the un-awakened state; after the wake-up instruction is triggered, the voice control function in the terminal device is in the awake state.
As can be seen, for a terminal device whose voice control function is in the un-awakened state, after the current speech segment is acquired from the voice stream in real time, wake-up recognition processing may be performed on it to determine whether to switch the voice control function from the un-awakened state to the awake state. For a terminal device whose voice control function has already been switched to the awake state, however, after the current speech segment is acquired from the voice stream in real time, semantic recognition processing may be performed on it, so that other functional modules in the terminal device can provide corresponding services to the user according to the semantic recognition result (for example, providing a navigation route "to the XXX neighborhood").
S12: when the current speech segment meets the preset information abnormal condition, determining the to-be-used training data according to the trigger audio segment of the awake state.
The "preset information abnormality condition" may be set in advance. For example, it may specifically include: and the semantic recognition result of the current voice segment meets the preset semantic condition.
The "semantic recognition result of the current speech segment" is used to represent semantic information carried by a speech stream including the current speech segment.
The "preset semantic condition" may be preset. For example, the "preset semantic condition" may be: belonging to any one of empty semantics, linguistic and ineffective semantics. The "empty semantics" refers to that the "semantic recognition result of the current speech segment" does not carry any semantic information. The term "language word" refers to that the "semantic recognition result of the current speech segment" carries a language word. The "invalid semantics" refers to that the semantic information carried by the "semantic recognition result of the current speech segment" does not belong to any preset candidate function triggering semantics.
It should be noted that the "candidate function triggering semantics" is used to trigger a certain function module in the terminal device to provide a corresponding service item to the user; and each "candidate function trigger semantic" may be set in advance.
In fact, the speech recognition process for the voice stream is also performed in real time, and the speech segment from which complete semantics can be recognized (e.g., the "fifth speech segment" shown in fig. 1) comes some time after the speech segment that triggers the wake-up (e.g., the "third speech segment" shown in fig. 1). Based on this, in order to improve the accuracy of identifying false wake-up data, the embodiment of the present application further provides another possible implementation of the above "preset information abnormal condition", which may specifically be: the semantic recognition result of the current speech segment meets the preset semantic condition, and the time interval between the acquisition time of the current speech segment and the acquisition time of the switching speech segment of the awake state is greater than the preset semantic duration.
The above "switching speech segment in awake state" refers to a speech segment in a speech stream that is determined to produce an awake state (i.e., a speech segment that triggers an awake instruction). For example, for the voice stream shown in fig. 1, the above "triggered voice segment in awake state" is the "third voice segment".
The "preset semantic duration" is used to represent the speech duration of one complete semantic unit. The embodiment of the present application does not limit the "preset semantic duration"; for example, it may be preset by relevant personnel, or it may be derived from analysis of a large amount of speech data.
Based on the relevant content of this other possible implementation of the "preset information abnormal condition", for a terminal device that has switched to the awake state, after the current speech segment is acquired from the voice stream in real time, it may first be determined whether the time interval between the current speech segment and the "switching speech segment of the awake state" is greater than the preset semantic duration. If so, it can be determined that the semantic recognition result of the current speech segment fully represents the semantic information, so it can then be judged whether that result meets the preset semantic condition; when it does, the current speech segment meets the preset information abnormal condition, the awake state can be presumed to be a false trigger, and the trigger audio segment of the awake state can be determined to be false wake-up audio data.
The "trigger audio segment in the awake state" carries a trigger word used when switching to the awake state; and the "trigger audio segment of awake state" includes the "switching speech segment of awake state" described above. For example, the "trigger audio segment of an awake state" may include the "first voice segment", "second voice segment", and "third voice segment" shown in fig. 1.
It should be noted that the present embodiment does not limit the determination process of "to-be-used training data" in S12, and the determination process of "to-be-used training data" in S12 is similar to the determination process of "to-be-used training data" in S7 above.
Based on the related content of S11 to S12, for a terminal device whose voice control function is in the un-awakened state, after the current speech segment is acquired from the voice stream in real time, wake-up recognition processing may be performed on it to determine whether to switch the voice control function from the un-awakened state to the awake state. For a terminal device whose voice control function has already been switched to the awake state, however, if the current speech segment acquired in real time from the voice stream is determined to meet the preset information abnormal condition, the awake state can be presumed to be a false trigger, so the trigger audio segment of the awake state can be determined to belong to the false wake-up audio data. The to-be-used training data can then be determined according to the trigger audio segment, so that it includes the trigger audio segment; the trigger audio segment can later be used as a negative example for model update training of the wake-up recognition model, so that the updated wake-up recognition model performs wake-up recognition processing better on such segments. This improves the wake-up recognition performance of the wake-up recognition model and thereby the voice wake-up effect.
In addition, in order to further improve the collection effect of the false wake-up audio data, the embodiment of the present application further provides another possible implementation manner of the voice wake-up method, in which the voice wake-up method may include not only all or part of the above steps, but also S13:
S13: when the current wake-up recognition result meets the low threshold wake-up condition and the cloud wake-up recognition result of the current speech segment does not meet the normal wake-up condition, determining the to-be-used training data according to the wake-up audio segment.
As can be seen, for a terminal device in the un-awakened state, after the current speech segment is acquired from the voice stream in real time, if it is determined that the current wake-up recognition result satisfies the low threshold wake-up condition while the cloud wake-up recognition result of the current speech segment does not satisfy the normal wake-up condition, then the "wake-up recognition model" considers that the wake-up audio segment including the current speech segment may carry a wake-up word, whereas the "cloud wake-up model" considers that it does not. It can thus be inferred that the "wake-up recognition model" gave an incorrect wake-up recognition result for the current speech segment, so the to-be-used training data can be determined according to the wake-up audio segment including the current speech segment, so that it includes the wake-up audio segment. In this way, audio data with the wake-up type "local low threshold wake-up → cloud non-wake-up" can also be collected, which improves the comprehensiveness of the false wake-up audio data, the comprehensiveness of the to-be-used training data, and the voice wake-up effect.
Note that "determining the training data to be used based on the wake-up audio segment" in S13 is similar to the determination process of "training data to be used" in S7.
Method example six
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, the following description is made with reference to two application scenarios.
Scenario one: using only the terminal device to mine hard-to-wake audio data and false wake-up audio data from speech data collected in real time.
For a terminal device whose voice control function is in the un-awakened state, after the current speech segment is acquired from the voice stream in real time, wake-up recognition processing is performed on it with the wake-up recognition model to obtain the current wake-up recognition result. It is then judged whether the current wake-up recognition result reaches the first probability threshold; if so, it is determined that the current speech segment can trigger a wake-up, so the wake-up audio segment including the current speech segment is determined to carry a wake-up word, the wake-up instruction is triggered directly, and the voice control function in the terminal device is switched from the un-awakened state to the awake state;
if the current wake-up recognition result does not reach the first probability threshold, it can further be judged whether it reaches the second probability threshold. If so, it is determined that the current speech segment is likely intended to wake up the device, so it can be traced back whether a target speech segment satisfying the low threshold wake-up condition exists before the current speech segment. If no such target speech segment exists, the current speech segment is the first to satisfy the low threshold wake-up condition, so the wake-up audio segment including it may still trigger a wake-up later; subsequent speech segments therefore continue to be collected from the voice stream in real time to look for a second segment satisfying the low threshold wake-up condition;
if the target speech segment exists, the target speech segment and the current speech segment together satisfy two low threshold wake-ups, so the wake-up audio segment including the target speech segment and the current speech segment can be presumed to belong to the hard-to-wake audio data. A wake-up instruction can therefore be triggered to switch the voice control function in the terminal device from the un-awakened state to the awake state, and the wake-up audio segment and its attribute information can be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the wake-up audio segment carries a valid wake-up word, which helps improve the voice wake-up effect.
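The scenario-one flow above amounts to a small state machine; here is one hedged sketch of it, with the thresholds, the tracking window, and the upload callback all assumed for illustration.

```python
# Minimal sketch of the scenario-one double-threshold tracking loop.
class DoubleThresholdWake:
    HIGH, LOW = 0.90, 0.60            # assumed probability thresholds

    def __init__(self):
        self.pending = []             # earlier segments that met the low threshold

    def on_segment(self, segment, score, upload):
        if score >= self.HIGH:                    # high threshold wake-up
            self.pending.clear()
            return "trigger_wake"
        if score >= self.LOW:                     # low threshold wake-up
            if self.pending:                      # a target segment exists
                upload(self.pending + [segment])  # hard-to-wake audio data
                self.pending.clear()
                return "trigger_wake"
            self.pending.append(segment)          # first hit: keep tracing
            return "keep_listening"
        return "no_wake"                          # below the second threshold
```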
In addition, for a terminal device whose voice control function is in the awake state, after the current speech segment is acquired from the voice stream in real time, semantic recognition processing can be performed on it to obtain its semantic recognition result. When the current speech segment is determined, according to this semantic recognition result, to meet the "preset information abnormal condition", it can be presumed that the terminal device was falsely awakened. The trigger audio segment of the awake state and its attribute information can therefore be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the trigger audio segment does not carry a valid wake-up word, which helps improve the voice wake-up effect.
Scenario two: jointly using the terminal device and the cloud server to mine hard-to-wake audio data and false wake-up audio data from speech data collected in real time.
For a terminal device with a voice control function in an un-awake state, after acquiring a current voice segment from a voice stream in real time, the following two procedures may be respectively performed:
Process one: first, wake-up recognition processing is performed on the current speech segment with the local wake-up model (i.e., the above "wake-up recognition model") to obtain a local wake-up recognition result (i.e., the above "current wake-up recognition result"). It is then judged whether the local wake-up recognition result reaches the first probability threshold; if so, it is determined that the current speech segment can trigger a wake-up, so the wake-up audio segment including the current speech segment is determined to carry a wake-up word, the wake-up instruction is triggered directly, and the voice control function in the terminal device is switched from the un-awakened state to the awake state;
if the local wake-up recognition result does not reach the first probability threshold, it can further be judged whether it reaches the second probability threshold. If so, it is determined that the current speech segment is likely intended to wake up the device, so it can be traced back whether a target speech segment satisfying the low threshold wake-up condition exists before the current speech segment. If no such target speech segment exists, the current speech segment is the first to satisfy the low threshold wake-up condition, so the wake-up audio segment including it may still trigger a wake-up later; subsequent speech segments therefore continue to be collected from the voice stream in real time to look for a second segment satisfying the low threshold wake-up condition;
if the target speech segment exists, the target speech segment and the current speech segment together satisfy two low threshold wake-ups, so the wake-up audio segment including the target speech segment and the current speech segment can be presumed to belong to the hard-to-wake audio data. A wake-up instruction can therefore be triggered to switch the voice control function in the terminal device from the un-awakened state to the awake state, and the wake-up audio segment and its attribute information can be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the wake-up audio segment carries a valid wake-up word, which helps improve the voice wake-up effect.
Process two: the current speech segment is sent to the cloud server, so that the cloud server can perform wake-up recognition processing on it with the cloud wake-up model to obtain a cloud wake-up recognition result and feed that result back to the terminal device. After the terminal device receives the cloud wake-up recognition result, if the local wake-up recognition result does not reach the first probability threshold while the cloud wake-up recognition result satisfies the normal wake-up condition, it can be inferred that the wake-up audio segment including the current speech segment belongs to the hard-to-wake audio data. The wake-up audio segment and its attribute information can therefore be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the wake-up audio segment carries a valid wake-up word, which helps improve the voice wake-up effect;
if the local wake-up recognition result is between the second probability threshold and the first probability threshold while the cloud wake-up recognition result does not satisfy the normal wake-up condition, it can be presumed that the terminal device may have been falsely awakened by the current speech segment, so the wake-up audio segment including the current speech segment can be presumed to belong to the false wake-up audio data. The wake-up audio segment and its attribute information can therefore be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the wake-up audio segment does not carry a valid wake-up word, which helps improve the voice wake-up effect.
It should be noted that, for the "terminal device + cloud" scenario, if a low threshold wake-up occurs locally (for example, the low threshold wake-up condition is reached once or twice) but the cloud wakes up, a wake-up message may be sent out according to the cloud wake-up logic (that is, the wake-up instruction is triggered and the terminal device switches from the un-awakened state to the awake state), and the double-threshold policy should then clear all state data and restart counting. In addition, if the double-threshold policy is used locally for wake-up processing, the first local low threshold wake-up is not confirmed by the cloud, and a second wake-up message is generated shortly afterwards, the data from the first low threshold wake-up should not be uploaded again (under this logic it has already been uploaded once, so repeated uploading is prevented).
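The bookkeeping just described, clearing double-threshold state on a cloud wake-up and preventing repeated uploads, can be sketched as follows; all names are illustrative assumptions.

```python
# Hypothetical state clearing and upload deduplication for the
# "terminal device + cloud" scenario.
class UploadBookkeeper:
    def __init__(self, tracker):
        self.tracker = tracker        # e.g. the DoubleThresholdWake sketch above
        self.uploaded_ids = set()

    def on_cloud_wake(self):
        self.tracker.pending.clear()  # clear all state data, restart counting

    def upload_once(self, segment_id, segments, upload):
        if segment_id in self.uploaded_ids:
            return                    # already uploaded once: prevent re-upload
        self.uploaded_ids.add(segment_id)
        upload(segments)
```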
In addition, for a terminal device whose voice control function is in the awake state, after the current speech segment is acquired from the voice stream in real time, semantic recognition processing can be performed on it to obtain its semantic recognition result. When the current speech segment is determined, according to this semantic recognition result, to meet the "preset information abnormal condition", it can be presumed that the terminal device was falsely awakened. The trigger audio segment of the awake state and its attribute information can therefore be uploaded to the cloud server, so that the cloud server can generate the to-be-used training data by referring to them. The to-be-used training data can later be used to update-train the wake-up recognition model, so that the updated model accurately recognizes that the trigger audio segment does not carry a valid wake-up word, which helps improve the voice wake-up effect.
In addition, for the two scenarios, the cloud server may process the data uploaded by the terminal device by using a data processing policy shown below.
(1) Data screening
a. For the wake-up audio data uploaded by the client, communicate with the cloud speech recognition interface and store the speech recognition text together with the wake-up audio data as its data labeling information. During storage, a column of confidence score information may be added to the table shown in fig. 4, whose value is the edit distance between the speech recognition text and the wake-up word label (a sketch of this edit distance computation follows the policy list below).
b. Data generated by local wake-up, local recognition, and local semantics are uploaded separately and independently. For each piece of uploaded local wake-up data, the speech recognition result or semantic recognition result immediately following its generation time point is queried, and the queried result is stored together with the wake-up data as its recognition labeling information.
(2) Data storage
The data are stored in categories according to the calculated confidence score information.
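The confidence score of item a is an edit distance; a standard Levenshtein implementation such as the sketch below would compute it, and the bucketing rule for classified storage shown afterwards is an assumption for illustration.

```python
# Levenshtein edit distance between the speech recognition text and the
# wake-word label, used here as the confidence score information.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def storage_bucket(asr_text: str, wake_word_label: str) -> str:
    score = edit_distance(asr_text, wake_word_label)
    return "exact_or_near" if score <= 1 else "needs_review"  # assumed rule

# e.g. edit_distance("xiao a classmate", "xiao a classmates") == 1
```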
Based on the relevant content of the above two scenarios, the terminal device can perform wake-up processing with a double-threshold policy, realize wake-up analysis of the voice stream, and collect and upload hard-to-wake audio data and false wake-up data. The cloud server can thus obtain, from the terminal devices, a large amount of hard-to-wake audio data and false wake-up data with which to update-train the existing local wake-up model and cloud wake-up model, so that the updated models have better wake-up recognition performance. In this way, model training can draw on a large amount of speech data from real scenarios, which accelerates the iterative evolution of the core effect and improves the voice wake-up effect.
Based on the voice wake-up method provided by the above method embodiment, an embodiment of the present application further provides a voice wake-up apparatus, which is explained and explained with reference to the accompanying drawings.
Device embodiment
The embodiment of the apparatus is described with reference to the voice wake-up apparatus, and please refer to the above method embodiment for related contents.
Referring to fig. 7, the figure is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present application.
The voice wake-up apparatus 700 provided in the embodiment of the present application includes:
a voice obtaining unit 701, configured to obtain a current voice segment;
a wakeup identifying unit 702, configured to perform wakeup identification processing on the current voice segment to obtain a current wakeup identification result;
a first triggering unit 703, configured to trigger a wake-up instruction when the current wake-up identification result meets a high threshold wake-up condition;
a second triggering unit 704, configured to trigger a wake-up instruction when the current wake-up identification result meets a low threshold wake-up condition and a target wake-up identification result meeting the low threshold wake-up condition exists in at least one historical wake-up identification result.
In one possible embodiment, the at least one historical wake up recognition result comprises a previous wake up recognition result; the former awakening identification result is obtained by carrying out awakening identification processing on a previous voice section of the current voice section; the acquisition time of the previous voice segment is earlier than that of the current voice segment;
the second trigger unit 704 is specifically configured to: and triggering a wake-up instruction when the current wake-up identification result meets a low threshold wake-up condition and the previous wake-up identification result meets the low threshold wake-up condition.
In a possible implementation manner, the second triggering unit 704 is specifically configured to: and triggering an awakening instruction when the current awakening identification result meets a low threshold awakening condition, a target awakening identification result meeting the low threshold awakening condition exists in the at least one historical awakening identification result, and time difference characterization data between the target awakening identification result and the current awakening identification result meets a first time difference condition.
In a possible implementation manner, the at least one historical wake-up recognition result is obtained by performing wake-up recognition processing on at least one historical voice segment;
the voice wake-up apparatus 700 further comprises:
a message updating unit, configured to update the at least one historical speech segment according to the current speech segment when the current wake-up recognition result does not satisfy the low threshold wake-up condition, a target wake-up recognition result satisfying the low threshold wake-up condition exists in the at least one historical wake-up recognition result, and the current speech segment does not satisfy a preset message repetition condition; and to discard the current speech segment when the current wake-up recognition result does not satisfy the low threshold wake-up condition, a target wake-up recognition result satisfying the low threshold wake-up condition exists in the at least one historical wake-up recognition result, and the current speech segment satisfies the preset message repetition condition.
In a possible embodiment, the voice wake-up apparatus 700 further includes:
a repeated identification unit, configured to determine that the current speech segment satisfies the preset message repetition condition if the message identification result of the current speech segment is the same as the message identification result of a to-be-referred speech segment, wherein the acquisition time difference between the to-be-referred speech segment and the current speech segment satisfies a second time difference condition;
and if the message identification result of the current voice segment is different from the message identification result of the voice segment to be referred, determining that the current voice segment does not meet the preset message repetition condition.
In a possible embodiment, the current wake up recognition result is determined using a wake up recognition model;
the voice wake-up apparatus 700 further comprises:
a first determining unit, configured to determine the training data to be used according to the wake-up audio segment when the current wake-up recognition result meets the low threshold wake-up condition and a target wake-up recognition result meeting the low threshold wake-up condition exists in at least one historical wake-up recognition result; wherein the wake-up audio segment comprises the current speech segment and the historical speech segment with the target wake-up recognition result; the training data to be used is used for updating the wake-up recognition model.
In a possible embodiment, the current wake up recognition result is determined using a wake up recognition model;
the voice wake-up apparatus 700 further comprises:
a second determining unit, configured to determine the training data to be used according to the wake-up audio segment when the current wake-up recognition result meets the high threshold wake-up condition and a target wake-up recognition result meeting the low threshold wake-up condition exists in at least one historical wake-up recognition result; wherein the wake-up audio segment comprises the current speech segment and the historical speech segment with the target wake-up recognition result; the training data to be used is used for updating the wake-up recognition model.
In a possible embodiment, the current wake up recognition result is determined using a wake up recognition model;
the voice wake-up apparatus 700 further comprises:
a third determining unit, configured to determine the training data to be used according to the wake-up audio segment when the current wake-up recognition result meets a low threshold wake-up condition and the cloud wake-up recognition result of the current speech segment meets a normal wake-up condition; the cloud wake-up recognition result is obtained by performing wake-up recognition on the current speech segment with a cloud wake-up model; the wake-up audio segment comprises the current speech segment; the training data to be used is used for updating the wake-up recognition model and the cloud wake-up model.
In a possible embodiment, the current wake up recognition result is determined using a wake up recognition model;
the voice wake-up apparatus 700 further comprises:
a fourth determining unit, configured to determine training data to be used according to the wakeup audio segment when the current wakeup identification result meets a non-wakeup condition and the cloud wakeup identification result of the current voice segment meets a normal wakeup condition; the cloud end awakening identification result is obtained by awakening and identifying the current voice section by a cloud end awakening model; the wake-up audio segment comprises the current speech segment; the training data to be used is used for updating the wakeup identification model and the cloud wakeup model.
In a possible embodiment, the current wake up recognition result is determined using a wake up recognition model;
the voice wake-up apparatus 700 further comprises:
a fifth determining unit, configured to determine training data to be used according to the wakeup audio segment when the current wakeup identification result meets a low threshold wakeup condition and the cloud wakeup identification result of the current voice segment does not meet a normal wakeup condition; the cloud end awakening identification result is obtained by awakening and identifying the current voice section by a cloud end awakening model; the wake-up audio segment comprises the current speech segment; the training data to be used is used for updating the wakeup identification model and the cloud wakeup model.
In a possible implementation manner, the wake-up identifying unit 702 is specifically configured to: if the voice control function is in the un-awakened state, perform wake-up recognition processing on the current speech segment with the wake-up recognition model to obtain the current wake-up recognition result;
the voice wake-up apparatus 700 further comprises:
a false trigger recognition unit, configured to, if the voice control function is in the awake state, determine the training data to be used according to the trigger audio segment of the awake state when the current speech segment is determined to satisfy the preset information abnormal condition; wherein the training data to be used is used to update the wake-up recognition model.
In one possible implementation, the method is applied to a terminal device; the current awakening identification result is determined by utilizing an awakening identification model;
the voice wake-up apparatus 700 further comprises:
the data sending unit is used for sending attribute information of the awakening audio segment and the awakening audio segment to a cloud server when the current awakening identification result meets a low threshold awakening condition and a target awakening identification result meeting the low threshold awakening condition exists in at least one historical awakening identification result, so that the cloud server determines training data to be used according to the attribute information of the awakening audio segment and the awakening audio segment; wherein the wake-up audio segment comprises the current speech segment; the training data to be used is used for updating the wake-up recognition model.
Further, an embodiment of the present application further provides an apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice wake-up method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation method of the above voice wakeup method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above voice wakeup method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (16)
1. A voice wake-up method, the method comprising:
acquiring a current voice section;
performing awakening identification processing on the current voice section to obtain a current awakening identification result;
when the current awakening identification result meets a high threshold awakening condition, triggering an awakening instruction;
and triggering a wake-up instruction when the current wake-up identification result meets a low threshold wake-up condition and a target wake-up identification result meeting the low threshold wake-up condition exists in at least one historical wake-up identification result.
2. The method of claim 1, wherein the at least one historical wake up recognition result comprises a previous wake up recognition result; the former awakening identification result is obtained by carrying out awakening identification processing on a previous voice section of the current voice section; the acquisition time of the previous voice segment is earlier than that of the current voice segment;
when the current awakening identification result meets the low threshold awakening condition and a target awakening identification result meeting the low threshold awakening condition exists in at least one historical awakening identification result, triggering an awakening instruction, comprising the following steps:
and triggering a wake-up instruction when the current wake-up identification result meets a low threshold wake-up condition and the previous wake-up identification result meets the low threshold wake-up condition.
3. The method of claim 1, wherein triggering a wake-up command when the current wake-up recognition result satisfies a low threshold wake-up condition and a target wake-up recognition result satisfying the low threshold wake-up condition exists in at least one historical wake-up recognition result comprises:
and triggering an awakening instruction when the current awakening identification result meets a low threshold awakening condition, a target awakening identification result meeting the low threshold awakening condition exists in the at least one historical awakening identification result, and time difference characterization data between the target awakening identification result and the current awakening identification result meets a first time difference condition.
4. The method according to claim 1, wherein the at least one historical wake-up recognition result is obtained by performing wake-up recognition processing on at least one historical speech segment;
the method further comprises the following steps:
when the current awakening identification result does not meet the low threshold awakening condition, a target awakening identification result meeting the low threshold awakening condition exists in the at least one historical awakening identification result, and the current voice section does not meet the preset message repetition condition, updating the at least one historical voice section according to the current voice section;
and when the current awakening identification result does not meet the low threshold awakening condition, a target awakening identification result meeting the low threshold awakening condition exists in the at least one historical awakening identification result, and the current voice segment meets a preset message repetition condition, discarding the current voice segment.
5. The method according to claim 4, further comprising:
acquiring a speech segment to be referenced, wherein the acquisition time difference between the speech segment to be referenced and the current speech segment meets a second time-difference condition;
determining that the current speech segment meets the preset message repetition condition if the message recognition result of the current speech segment is the same as that of the speech segment to be referenced;
and determining that the current speech segment does not meet the preset message repetition condition if the message recognition result of the current speech segment is different from that of the speech segment to be referenced.
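Under the assumption that each stored segment carries a message recognition result (e.g., transcribed text) and an acquisition timestamp, the repetition test of claim 5 might look like this; the window length is invented:

```python
def meets_repetition_condition(current_msg, current_ts, reference_segments,
                               second_gap=10.0):
    """Claim 5: the current segment repeats a message if a reference segment
    acquired within the second time-difference window produced the same
    message recognition result."""
    return any(abs(current_ts - ts) <= second_gap and msg == current_msg
               for ts, msg in reference_segments)
```

Presumably this prevents a replayed message, heard twice in quick succession, from being mistaken for a fresh corroborating segment.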
6. The method according to claim 1, wherein the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
determining training data to be used according to a wake-up audio segment when the current wake-up recognition result meets the low-threshold wake-up condition and a target wake-up recognition result meeting the low-threshold wake-up condition exists in the at least one historical wake-up recognition result; wherein the wake-up audio segment comprises the current speech segment and the historical speech segment having the target wake-up recognition result; and the training data to be used is used for updating the wake-up recognition model.
7. The method according to claim 1, wherein the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
determining training data to be used according to a wake-up audio segment when the current wake-up recognition result meets the high-threshold wake-up condition and a target wake-up recognition result meeting the low-threshold wake-up condition exists in the at least one historical wake-up recognition result; wherein the wake-up audio segment comprises the current speech segment and the historical speech segment having the target wake-up recognition result; and the training data to be used is used for updating the wake-up recognition model.
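Claims 6 and 7 both promote a corroborated wake-up into a candidate training sample whose audio spans the current segment and the historical segment behind the target result. A sketch, with the record layout invented for illustration:

```python
def build_training_sample(current_segment: bytes, target_segment: bytes) -> dict:
    """Bundle the wake-up audio segment of claims 6-7 as one training candidate."""
    return {
        "audio": target_segment + current_segment,  # historical hit first, then current
        "source": "two_stage_wake",                 # assumed provenance tag
    }
```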
8. The method according to claim 1, wherein the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
determining training data to be used according to a wake-up audio segment when the current wake-up recognition result meets the low-threshold wake-up condition and a cloud wake-up recognition result of the current speech segment meets a normal wake-up condition; wherein the cloud wake-up recognition result is obtained by performing wake-up recognition on the current speech segment with a cloud wake-up model; the wake-up audio segment comprises the current speech segment; and the training data to be used is used for updating the wake-up recognition model and the cloud wake-up model.
9. The method according to claim 1, wherein the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
determining training data to be used according to a wake-up audio segment when the current wake-up recognition result meets a non-wake-up condition and the cloud wake-up recognition result of the current speech segment meets a normal wake-up condition; wherein the cloud wake-up recognition result is obtained by performing wake-up recognition on the current speech segment with a cloud wake-up model; the wake-up audio segment comprises the current speech segment; and the training data to be used is used for updating the wake-up recognition model and the cloud wake-up model.
10. The method according to claim 1, wherein the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
determining training data to be used according to a wake-up audio segment when the current wake-up recognition result meets the low-threshold wake-up condition and the cloud wake-up recognition result of the current speech segment does not meet a normal wake-up condition; wherein the cloud wake-up recognition result is obtained by performing wake-up recognition on the current speech segment with a cloud wake-up model; the wake-up audio segment comprises the current speech segment; and the training data to be used is used for updating the wake-up recognition model and the cloud wake-up model.
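Claims 8 through 10 cross-check the on-device result against a cloud wake-up model and harvest training data from weak agreements and disagreements. A sketch of the three combinations; the string labels are assumptions:

```python
def label_training_sample(local_result, cloud_result, segment):
    """Map the (device, cloud) result pair of claims 8-10 to a training label."""
    if local_result == "low_threshold" and cloud_result == "normal_wake":
        return (segment, "weak_positive")   # claim 8: device unsure, cloud confirms
    if local_result == "no_wake" and cloud_result == "normal_wake":
        return (segment, "missed_wake")     # claim 9: device missed a real wake-up
    if local_result == "low_threshold" and cloud_result != "normal_wake":
        return (segment, "false_alarm")     # claim 10: device over-eager, cloud rejects
    return None                             # other combinations yield no sample
```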
11. The method according to claim 1, wherein performing wake-up recognition processing on the current speech segment to obtain the current wake-up recognition result comprises:
performing wake-up recognition processing on the current speech segment with a wake-up recognition model to obtain the current wake-up recognition result if in a non-awake state;
the method further comprises:
if in an awake state, determining training data to be used according to the audio segment that triggered the awake state when the current speech segment is determined to meet a preset abnormal-information condition; wherein the training data to be used is used for updating the wake-up recognition model.
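Claim 11 covers the already-awake case: if the current speech looks abnormal (suggesting the wake-up that put the device into the awake state was false), the triggering audio is recycled as training data. A minimal sketch; the predicate and queue are hypothetical:

```python
def handle_awake_state(triggering_segment, info_is_abnormal, training_queue):
    """Recycle a suspected false wake-up's audio as training data (claim 11)."""
    if info_is_abnormal:
        training_queue.append((triggering_segment, "suspected_false_wake"))
```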
12. The method according to claim 1, wherein the method is applied to a terminal device, and the current wake-up recognition result is determined using a wake-up recognition model;
the method further comprises:
sending a wake-up audio segment and attribute information of the wake-up audio segment to a cloud server when the current wake-up recognition result meets the low-threshold wake-up condition and a target wake-up recognition result meeting the low-threshold wake-up condition exists in the at least one historical wake-up recognition result, so that the cloud server determines training data to be used according to the wake-up audio segment and its attribute information; wherein the wake-up audio segment comprises the current speech segment; and the training data to be used is used for updating the wake-up recognition model.
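Claim 12 pushes the training-data assembly to a cloud server: the terminal uploads the wake-up audio segment together with its attribute information. A standard-library sketch; the endpoint and payload layout are invented and no real service is implied:

```python
import json
import urllib.request

def upload_wake_audio(audio, attributes, url="https://example.invalid/wake-audio"):
    """Send the wake-up audio segment and its attributes to a cloud server."""
    payload = json.dumps({
        "attributes": attributes,   # e.g. scores, timestamps, device state
        "audio_hex": audio.hex(),   # naive bytes encoding, fine for a sketch
    }).encode("utf-8")
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # assumes such an endpoint exists
        return resp.status == 200
```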
13. A voice wake-up apparatus, comprising:
a voice acquisition unit, configured to acquire a current speech segment;
a wake-up recognition unit, configured to perform wake-up recognition processing on the current speech segment to obtain a current wake-up recognition result;
a first trigger unit, configured to trigger a wake-up instruction when the current wake-up recognition result meets a high-threshold wake-up condition;
and a second trigger unit, configured to trigger a wake-up instruction when the current wake-up recognition result meets a low-threshold wake-up condition and a target wake-up recognition result meeting the low-threshold wake-up condition exists in at least one historical wake-up recognition result.
14. An apparatus, characterized in that the apparatus comprises: a processor, a memory, and a system bus;
the processor and the memory are connected through the system bus;
and the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 12.
15. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of any one of claims 1 to 12.
16. A computer program product, characterized in that it, when run on a terminal device, causes the terminal device to perform the method of any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111493721.7A (published as CN114141233A) | 2021-12-08 | 2021-12-08 | Voice awakening method and related equipment thereof
Publications (1)
Publication Number | Publication Date |
---|---|
CN114141233A true CN114141233A (en) | 2022-03-04 |
Family
ID=80385268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111493721.7A (Pending) | Voice awakening method and related equipment thereof | 2021-12-08 | 2021-12-08
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141233A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107610702A (en) * | 2017-09-22 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | Terminal device standby wakeup method, apparatus and computer equipment |
US10777189B1 (en) * | 2017-12-05 | 2020-09-15 | Amazon Technologies, Inc. | Dynamic wakeword detection |
CN110097876A (en) * | 2018-01-30 | 2019-08-06 | 阿里巴巴集团控股有限公司 | Voice wakes up processing method and is waken up equipment |
US20190251963A1 (en) * | 2018-02-09 | 2019-08-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice awakening method and device |
CN110047485A (en) * | 2019-05-16 | 2019-07-23 | 北京地平线机器人技术研发有限公司 | Identification wakes up method and apparatus, medium and the equipment of word |
CN111128155A (en) * | 2019-12-05 | 2020-05-08 | 珠海格力电器股份有限公司 | Awakening method, device, equipment and medium for intelligent equipment |
CN111897584A (en) * | 2020-08-14 | 2020-11-06 | 苏州思必驰信息科技有限公司 | Wake-up method and device for voice equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114817582A (en) * | 2022-03-25 | 2022-07-29 | 青岛海尔科技有限公司 | Resource information pushing method and electronic device |
CN117594046A (en) * | 2023-10-19 | 2024-02-23 | 摩尔线程智能科技(北京)有限责任公司 | A model training method, wake-up method, device and storage medium |
Similar Documents
Publication | Title
---|---
CN107665708B (en) | Intelligent voice interaction method and system
CN110334347B (en) | Information processing method based on natural language recognition, related equipment and storage medium
CN108962262B (en) | Voice data processing method and device
CN113327609B (en) | Method and apparatus for speech recognition
CN107437415B (en) | Intelligent voice interaction method and system
CN108182937B (en) | Keyword recognition method, device, equipment and storage medium
CN111324727B (en) | User intention recognition method, device, equipment and readable storage medium
CN101923854B (en) | An interactive speech recognition system and method
CN112382285B (en) | Voice control method, voice control device, electronic equipment and storage medium
CN110111789B (en) | Voice interaction method and device, computing equipment and computer readable medium
CN110570840B (en) | Intelligent device awakening method and device based on artificial intelligence
CN111768783A (en) | Voice interaction control method, device, electronic equipment, storage medium and system
CN108039175B (en) | Voice recognition method and device and server
CN110060693A (en) | Model training method and device, electronic equipment and storage medium
CN111797632A (en) | Information processing method and device and electronic equipment
CN111192590B (en) | Voice wake-up method, device, equipment and storage medium
CN114141233A (en) | Voice awakening method and related equipment thereof
CN109933198A (en) | Semantic recognition method and device
CN111161714A (en) | Voice information processing method, electronic equipment and storage medium
CN108595406B (en) | User state reminding method and device, electronic equipment and storage medium
CN111178081A (en) | Method, server, electronic device and computer storage medium for semantic recognition
CN113012687B (en) | Information interaction method and device and electronic equipment
CN110808050A (en) | Speech recognition method and smart device
CN112509580A (en) | Voice processing method, device, equipment, storage medium and computer program product
CN114255754A (en) | Speech recognition method, electronic device, program product, and storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination