CN111179974A - Improved decoding network, command word recognition method and device - Google Patents
- Publication number: CN111179974A
- Application number: CN201911391217.9A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
  - G10L15/08—Speech classification or search
    - G10L2015/088—Word spotting
  - G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
    - G10L2015/223—Execution procedure of a spoken command
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
  - G10L25/78—Detection of presence or absence of voice signals
    - G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The invention discloses an improved decoding network and a command word recognition method and device. The command word recognition method comprises: in response to detecting a human voice in the input audio, sending the audio to the improved decoding network for decoding, wherein the improved decoding network is a decoding network generated from command words and emits an integrated output when decoding is completed; acquiring the integrated output, calculating its score, and judging whether the score is greater than a preset threshold; and if the score is greater than the preset threshold, outputting the integrated output as a command word. Because the method and the device do not rely on VAD to judge whether the speech has ended, they avoid the delay caused by the VAD observing the audio for an additional period of time, and can output the result quickly with reduced latency.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to an improved decoding network, a command word recognition method and a command word recognition device.
Background
In the related art, command word recognition on the market is based on a scheme combining VAD (voice activity detection) with grammar-network-based speech recognition. In this scheme, when audio is input, VAD first judges whether speech is present; if so, recognition is started on a grammar network built from the command words together with a specific acoustic model, and recognition continues until the VAD judges that the speech has ended, at which point recognition finishes and the recognition result is obtained.
The inventors discovered in the course of practicing the present application that such speech recognition schemes exhibit a significant delay between the moment the user finishes speaking a command word and the moment the device responds.
Disclosure of Invention
Embodiments of the present invention provide an improved decoding network, a command word recognition method and an apparatus, so as to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides an improved decoding network, where the decoding network includes decoding paths and branch outputs distributed along the decoding paths, and the improvement comprises: moving the branch outputs on each decoding path to the position of the last branch output of that path and merging them into a single integrated output.
In a second aspect, an embodiment of the present invention provides a command word recognition method, including: in response to detecting a human voice in the input audio, sending the audio to the improved decoding network of the first aspect for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed; acquiring the integrated output, calculating its score, and judging whether the score is greater than a preset threshold; and if the score is greater than the preset threshold, outputting the integrated output as a command word.
In a third aspect, an embodiment of the present invention provides a command word recognition apparatus, including: a decoding module configured to, in response to detecting a human voice in the input audio, send the audio to the improved decoding network of the first aspect for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed; a calculation module configured to acquire the integrated output, calculate its score, and judge whether the score is greater than a preset threshold; and a command word output module configured to output the integrated output as a command word if the score is greater than the preset threshold.
In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the command word recognition method of any of the embodiments of the present invention.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to execute the steps of the command word recognition method according to any embodiment of the present invention.
The method and the device provided by the application send the audio in which a voice has been detected to the improved decoding network, and then calculate a score for the output of the improved decoding network; if the score is greater than the preset threshold, the output of the decoding network can be emitted as a command word. Because there is no need for VAD to judge whether the speech has ended, the delay caused by the VAD observing the audio for an additional period of time is avoided, and the result can be output quickly with reduced latency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a decoding diagram of a decoding network of the prior art;
fig. 2 is a decoding diagram of a decoding network according to an embodiment of the present invention;
FIG. 3 is a flowchart of a command word recognition method according to an embodiment of the present invention;
fig. 4 is a block diagram of a command word recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 and fig. 2, fig. 1 shows a decoding schematic diagram of a decoding network in the prior art, and fig. 2 shows a decoding schematic diagram of a decoding network according to an embodiment of the present invention.
In this embodiment, fig. 1 shows the prior-art situation: the output labels of the command word (the output labels on the WFST side) are dispersed along each decoding path. Taking "dehumidification mode" as the command word, the network emits "dehumidification" as the first output, "wet" as the second, "mode" as the third and "equation" as the fourth (the successive sub-word labels of the command word), so no single output tells the decoder that a complete path has been traversed; validity of the result is then judged against a threshold, and only once the result is judged valid is it output.
As shown in fig. 2, the improved decoding network, in which the decoding network includes decoding paths and branch outputs distributed along them, moves the branch outputs on each decoding path to the last branch output of that path and merges them into one integrated output. For example, taking "dehumidification mode" as the command word, the intermediate outputs such as "dehumidification" are moved directly to the last branch and merged into a single integrated output, so that the network of fig. 2 can give an output result more quickly than that of fig. 1: once the decoder emits anything at all, a complete path has necessarily been traversed.
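The label-moving idea above can be sketched on a toy path representation. The following Python sketch is purely illustrative, not the patent's actual WFST implementation; the function name `push_outputs_to_end`, the (input, output) arc tuples, and the input symbols are all hypothetical.

```python
def push_outputs_to_end(path):
    """Move every output label on a decoding path to its final arc,
    merged into one integrated output (a sketch of the patent's idea)."""
    labels = [out for _, out in path if out]          # collect the scattered outputs
    stripped = [(inp, "") for inp, _ in path]         # clear intermediate outputs
    if stripped:
        last_inp, _ = stripped[-1]
        stripped[-1] = (last_inp, " ".join(labels))   # integrated output at path end
    return stripped

# Prior-art style path: the four sub-word outputs of the command word
# are scattered along the arcs (input symbols i1..i4 are placeholders).
path = [("i1", "dehumidification"), ("i2", "wet"), ("i3", "mode"), ("i4", "equation")]
improved = push_outputs_to_end(path)
# In the improved path, all intermediate arcs are output-free, so the
# appearance of any output implies the whole path has been traversed.
```

With this transformation, the decoder's first emission is already the complete command word, which is what lets it finish actively instead of waiting for VAD.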
Referring to fig. 3, a flowchart of a command word recognition method according to an embodiment of the invention is shown. The method can be applied to equipment needing to recognize command words, such as an intelligent voice television, an intelligent voice mobile phone, an intelligent sound box, an intelligent story machine and the like.
As shown in fig. 3, in step 301, in response to detecting a human voice in the input audio, the audio is sent to the improved decoding network described above for decoding;
in step 302, obtaining the integrated output and calculating a score of the integrated output, and determining whether the score is greater than a preset threshold;
in step 303, if the score is greater than the preset threshold, the integrated output is output as a command word.
In this embodiment, for step 301, in response to detecting a voice in the input audio, the command word recognition device sends the audio to the improved decoding network of fig. 2 for decoding. For example, the user starts inputting audio, the device uses VAD to detect whether a voice is present, and once a voice is detected it sends the audio to the improved decoding network, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed.
Then, for step 302, the integrated output is acquired, its score is calculated, and it is judged whether the score is greater than a preset threshold. Finally, for step 303, if the score is greater than the preset threshold, the integrated output is output as a command word. For example, after the device obtains the integrated output emitted by the decoding network, it instructs its calculation module to compute the score of that output and then judges whether the score is greater than the preset threshold; if the calculated score exceeds the threshold for command word output, the integrated output emitted by the decoding network is output as the command word.
According to the scheme of this embodiment, the audio in which a voice has been detected is sent to the improved decoding network, and a score is calculated for the network's output; if the score is greater than the preset threshold, the output of the decoding network can be emitted as a command word without relying on VAD to judge whether the speech has ended. This avoids the delay caused by the VAD observing the audio for an additional period of time, so the result can be output quickly and the latency is reduced.
In some optional embodiments, after judging whether the score is greater than the preset threshold, the method further comprises: if the score is not greater than the preset threshold, judging whether to finish decoding based on the VAD detection result.
In some optional embodiments, the decoding network comprises a WFST decoding network.
In some optional embodiments, the method is for an intelligent appliance.
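The threshold decision with VAD fallback described in the optional embodiments above can be sketched as follows. This is a hedged illustration with a hypothetical function name and return values, not code from the patent:

```python
def decide_output(score, threshold, vad_end_of_speech):
    """Accept the integrated output as a command word when its score clears
    the preset threshold; otherwise fall back to VAD for end-of-decoding."""
    if score > threshold:
        return "output_command_word"      # decoder ends actively, no VAD wait
    # Score too low: only the VAD result can decide whether decoding ends.
    return "end_decoding" if vad_end_of_speech else "keep_decoding"
```

The first branch is the fast path the patent aims for; the second branch preserves the conventional VAD-based behavior when the score does not clear the threshold.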
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
The inventors found in the process of implementing the present application that the delay in the prior art is mainly caused by the VAD's judgment that the speech has ended, because the VAD must observe an additional period of audio in order to decide whether speech is still present.
When facing the above technical problem, those skilled in the art would usually consider optimizing the VAD itself, for example by asking whether the VAD's observation time can be shortened to reduce the delay.
The solution of the present application is not easily conceived by those skilled in the art, because the recognizer is a stable structure whose internal decoding process is complicated; it is difficult to think of optimizations from inside the recognizer, such as the one proposed in the embodiments of the present application: actively ending recognition early during decoding without depending on VAD.
The scheme of the embodiments of the present application makes a specific change to the decoding network: the output of the command word (the output label on the WFST side) is moved to the end of each path, so that once the decoder starts to output a result, a complete path has necessarily been traversed; whether the result is valid is then judged against a threshold, and once valid, the result is output immediately.
With continued reference to fig. 2, a flow diagram of an OneShot optimization scheme of an embodiment of the present application is shown. As shown in fig. 2, taking the command word "dehumidification mode" as an example:
step1, starting to input audio, and detecting whether a person has voice by VAD;
step2, once human voice is detected, sending audio to a recognition network, such as the decoding network, and taking the path above because of the 'dehumidification mode' (the traditional network 'dehumidification mode' disperses on the path above, so that no information can see when the path is finished, the improved network moves the output to the end of the path, and once the output is output, the path is finished), until a command word is output, then calculating a score through an acoustic model, when the score is greater than a threshold value, the decoder actively finishes outputting the result, otherwise, entering Step 3;
step3, when the threshold judgment fails in the previous Step, i.e. the result cannot be output actively by the decoder, the judgment of the end still needs to be carried out by VAD, but a lot of experiments show that 90% of the situations can not occur.
The embodiments of the present application achieve the following direct effect: for more than 90% of command words, the decoder can actively finish and directly output the result without VAD judging the end of speech, so the result is output quickly and the latency is reduced.
The application also achieves a deeper practical effect: in real products, command word recognition is essentially the input terminal of an intelligent household appliance, whose working state is controlled through command words. The time from the moment the customer finishes speaking the command word to the moment the appliance reacts is the latency the customer actually perceives; if the end of speech were determined by VAD alone, this latency would be very noticeable and would greatly harm the customer experience. Because the scheme based on the decoder's active termination does not depend on VAD to end recognition, the latency is greatly reduced (by about 400 ms on average in actual measurements), the user experience is much better, and the competitiveness of the product is greatly improved.
The latency reduction achieved by the embodiments of the present application comes mainly from the decoder being able to actively finish and emit the result, and most essentially from the improvement to the decoding network.
Referring to fig. 4, a block diagram of a command word recognition apparatus according to an embodiment of the invention is shown.
As shown in fig. 4, the command word recognition apparatus 400 includes a decoding module 410, a calculating module 420, and a command word output module 430.
The decoding module 410 is configured to, in response to detecting a human voice in the input audio, send the audio to the improved decoding network for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed; the calculation module 420 is configured to acquire the integrated output, calculate its score, and judge whether the score is greater than a preset threshold; and the command word output module 430 is configured to output the integrated output as a command word if the score is greater than the preset threshold.
In some optional embodiments, the command word recognition apparatus 400 further includes: a VAD decision module (not shown in the figure) configured to decide whether to end the decoding based on the VAD detection result if the score is not greater than the preset threshold.
In some optional embodiments, the decoding network comprises a WFST decoding network.
It should be understood that the modules depicted in fig. 4 correspond to various steps in the method described with reference to fig. 3. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 4, and are not described again here.
It should be noted that the module descriptions in the embodiments of the present application do not limit the solution of the present application; for example, a word segmentation module may be described as a module that divides received sentence text into a sentence and at least one entry. In addition, the related functional modules may also be implemented by a hardware processor; for example, the word segmentation module may likewise be implemented by a processor, which is not described again here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the command word recognition method in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to detecting a human voice in the input audio, sending the audio to an improved decoding network for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed;
acquiring the integrated output, calculating its score, and judging whether the score is greater than a preset threshold;
and if the score is greater than the preset threshold, outputting the integrated output as a command word.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the command word recognition apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the command word recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes any one of the above command word recognition methods.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as an example in fig. 5. The device implementing the command word recognition method may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. By running the non-volatile software programs, instructions and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the server, i.e. implements the command word recognition method of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the command word recognition device. The output device 540 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a command word recognition apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to detecting a human voice in the input audio, send the audio to an improved decoding network for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed;
acquire the integrated output, calculate its score, and judge whether the score is greater than a preset threshold;
and if the score is greater than the preset threshold, output the integrated output as a command word.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capability and mainly aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: belonging to the category of personal computers, with computation and processing functions, and generally with mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., the iPad.
(3) Portable entertainment devices: able to display and play multimedia content. Such devices include audio and video players (e.g., iPod), handheld game consoles, electronic books, smart toys, and portable car navigation devices.
(4) Servers: similar in architecture to general computers, but with higher requirements on processing capability, stability, reliability, security, scalability, and manageability, because of the need to provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. An improved decoding network, wherein the decoding network includes decoding paths and branch outputs distributed over the decoding paths, the improvement comprising: moving the branch outputs on each decoding path to the last branch output of that path and merging them into an integrated output.
2. A command word recognition method, comprising:
in response to detecting a human voice in the input audio, sending the audio to the improved decoding network of claim 1 for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is completed;
acquiring the integrated output, calculating the score of the integrated output, and judging whether the score is greater than a preset threshold value;
and if the score is larger than the preset threshold value, outputting the integrated output as a command word.
3. The method of claim 2, wherein after determining whether the score is greater than a preset threshold, the method further comprises:
and if the score is not larger than the preset threshold, judging whether to finish decoding or not based on the VAD detection result.
4. The method of claim 2 or 3, wherein the decoding network comprises a WFST decoding network.
5. The method of claim 4, wherein the method is for an intelligent appliance.
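The decision flow of claims 2 and 3 can be sketched as below. The decoder, scoring function, and VAD interfaces are hypothetical stand-ins; the patent does not fix these signatures:

```python
def handle_integrated_output(integrated_output, score, threshold, speech_ended):
    """Decide what to do once the decoder emits its integrated output.

    Returns the command word when the score clears the preset threshold
    (claim 2); otherwise uses the VAD result to decide whether decoding
    ends with no command or continues on more audio (claim 3).
    """
    if score > threshold:
        return ("command", integrated_output)   # accept: output the command word
    if speech_ended:                            # VAD says the utterance is over
        return ("reject", None)                 # end decoding without a command
    return ("continue", None)                   # keep decoding incoming audio

print(handle_integrated_output("turn on dehumidification", 0.92, 0.5, False))
```

The threshold comparison acts as a rejection gate: low-scoring hypotheses are never surfaced, and the VAD result prevents the decoder from waiting indefinitely once speech has ended.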
6. A command word recognition apparatus comprising:
a decoding module configured to, in response to detecting a human voice in the input audio, send the audio to the improved decoding network of claim 1 for decoding, wherein the improved decoding network is a decoding network generated from command words and emits the integrated output when decoding is complete;
a calculation module configured to acquire the integrated output, calculate a score for the integrated output, and determine whether the score is greater than a preset threshold;
and a command word output module configured to output the integrated output as a command word if the score is greater than the preset threshold.
7. The apparatus of claim 6, further comprising:
and a VAD judging module configured to determine, if the score is not greater than the preset threshold, whether to end decoding based on the VAD detection result.
8. The apparatus of claim 6 or 7, wherein the decoding network comprises a WFST decoding network.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.
10. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911391217.9A CN111179974B (en) | 2019-12-30 | 2019-12-30 | Command word recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111179974A true CN111179974A (en) | 2020-05-19 |
CN111179974B CN111179974B (en) | 2022-08-09 |
Family
ID=70648977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911391217.9A Active CN111179974B (en) | 2019-12-30 | 2019-12-30 | Command word recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111179974B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115497484A (en) * | 2022-11-21 | 2022-12-20 | 深圳市友杰智新科技有限公司 | Voice decoding result processing method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103971685A (en) * | 2013-01-30 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and system for recognizing voice commands |
CN105321518A (en) * | 2014-08-05 | 2016-02-10 | 中国科学院声学研究所 | Rejection method for low-resource embedded speech recognition |
US20170206895A1 (en) * | 2016-01-20 | 2017-07-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Wake-on-voice method and device |
CN107644638A (en) * | 2017-10-17 | 2018-01-30 | 北京智能管家科技有限公司 | Audio recognition method, device, terminal and computer-readable recording medium |
US20190005954A1 (en) * | 2017-06-30 | 2019-01-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Wake-on-voice method, terminal and storage medium |
CN109473104A (en) * | 2018-11-07 | 2019-03-15 | 苏州思必驰信息科技有限公司 | Speech recognition network delay optimization method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province, 215123. Applicant after: Sipic Technology Co., Ltd. Applicant before: AI SPEECH Co., Ltd. (same address)
| GR01 | Patent grant |