CN111128166B - Optimization method and device for continuous awakening recognition function - Google Patents
Optimization method and device for continuous awakening recognition function
- Publication number
- CN111128166B CN201911379635.6A
- Authority
- CN
- China
- Prior art keywords
- voice
- audio
- recognition result
- voice recognition
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
- H04L67/5683—Storage of data provided by user terminals, i.e. reverse caching
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses an optimization method and device for a continuous wake-up recognition function. The method comprises: continuously receiving audio until a wake-up word is detected; performing speech recognition on a first audio containing the wake-up word to form a first speech recognition result, while caching, within a preset time, a second audio received after the first audio; determining whether the first speech recognition result contains speech other than the wake-up word; if it does not, determining whether voice activity detection on the second audio has timed out; if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result; and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result via callback. The scheme optimizes the existing continuous wake-up recognition function and improves the user experience.
Description
Technical Field
The invention belongs to the technical field of voice wake-up recognition, and in particular relates to a method and a device for optimizing a continuous wake-up recognition function.
Background
In the related art, OneShot, colloquially "saying it all in one breath", integrates "wake-up word + speech semantic recognition" so that there is zero interval, zero delay, and a seamless connection between the wake-up word and voice control. It abandons the traditional question-and-answer mode and greatly reduces the steps a user needs to issue a voice command and receive feedback, which simplifies operation.
OneShot integrates wake-up recognition with semantic understanding, which keeps the voice interaction unified and continuous while completing the control action. That is, the user can issue an instruction directly, without first opening the interaction through a question-and-answer exchange as in earlier voice interaction methods. Because the OneShot function handles "wake-up word + speech semantic recognition" in a single utterance, it is much more efficient than traditional voice interaction.
Technologies similar to OneShot in the prior art include the "wake-up recognition" feature of one vendor and the "wake-up recognition continuous speech" feature of another.
In the course of implementing the present application, the inventors found that these technologies disclose no solution for the cases in which OneShot fails. Although these schemes achieve the OneShot function in a relatively ideal acoustic environment, only the wake-up word is recognized and the command word is discarded when any of the following occurs:
1) Acoustic echo cancellation (AEC) leaves residual echo;
2) The environmental noise is loud;
3) The user speaks slowly, so that the silence between the wake-up word and the command word is too long.
Disclosure of Invention
An embodiment of the present invention provides an optimization method and apparatus for a continuous wake-up recognition function, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides an optimization method for a continuous wake-up recognition function, including: continuously receiving audio until a wake-up word is detected; performing speech recognition on a first audio containing the wake-up word to form a first speech recognition result, while continuing to cache, within a preset time, a second audio received after the first audio; determining whether the first speech recognition result contains speech other than the wake-up word; if the first speech recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out; if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result; and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result via callback.
In a second aspect, an embodiment of the present invention provides an apparatus for optimizing a continuous wake-up recognition function, including: a wake-up detection module configured to continuously receive audio until a wake-up word is detected; a first recognition module configured to perform speech recognition on a first audio containing the wake-up word to form a first speech recognition result, and to continue caching, within a preset time, a second audio received after the first audio; a recognition determination module configured to determine whether the first speech recognition result contains speech other than the wake-up word; a timeout determination module configured to determine whether voice activity detection on the second audio has timed out if the first speech recognition result does not contain speech other than the wake-up word; a second recognition module configured to perform speech recognition on the second audio to form a second speech recognition result if the voice activity detection has not timed out; and a callback module configured to return the second speech recognition result via callback if the second speech recognition result contains speech other than the wake-up word.
In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a continuous wake up identification function of any embodiment of the present invention.
In a fourth aspect, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the steps of the optimization method of the continuous wake recognition function according to any embodiment of the present invention.
In the scheme provided by the method and device of the present application, while speech recognition is performed on the audio containing the wake-up word, a second audio received after the first audio continues to be cached within the preset time. When the first speech recognition result contains no speech other than the wake-up word, speech recognition is performed on the second audio to form a second speech recognition result, which captures the content spoken after the wake-up word. The command word of a user who speaks slowly can therefore still be recognized, so the scheme can serve as a compensating optimization for existing continuous wake-up recognition.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an optimization method for a continuous wake-up recognition function according to an embodiment of the present invention;
Fig. 2a, Fig. 2b and Fig. 2c are flowcharts illustrating a specific example of the optimization method for a continuous wake-up recognition function according to an embodiment of the present invention;
Fig. 3 is a block diagram of an optimization apparatus for a continuous wake-up recognition function according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of an embodiment of the optimization method for a continuous wake-up recognition function of the present application. The method of this embodiment may be applied to voice devices with a continuous wake-up recognition function or a OneShot function, such as smart speakers, story machines, smart voice televisions, smart voice phones, and other smart voice devices.
As shown in Fig. 1, in step 101, audio is continuously received until a wake-up word is detected;
in step 102, speech recognition is performed on the first audio containing the wake-up word to form a first speech recognition result, and a second audio received after the first audio is cached within a preset time;
in step 103, it is determined whether the first speech recognition result contains speech other than the wake-up word;
in step 104, if the first speech recognition result does not contain speech other than the wake-up word, it is determined whether voice activity detection on the second audio has timed out;
in step 105, if the voice activity detection has not timed out, speech recognition is performed on the second audio to form a second speech recognition result;
in step 106, if the second speech recognition result contains speech other than the wake-up word, the second speech recognition result is returned via callback.
In this embodiment, for step 101, the optimization device for the continuous wake-up recognition function may run voice activity detection until the user's voice is detected and then send the user's voice to the wake-up engine; a successful wake-up by the wake-up engine indicates that the wake-up word has been detected. For step 102, the device may send the audio containing the wake-up word to the speech recognition engine to obtain a first speech recognition result. The audio containing the wake-up word is the audio buffered, after the wake-up word is detected, until voice activity detection finds no human voice; for example, 50 ms of silence following the wake-up word may be taken to indicate that no human voice is present. While the audio containing the wake-up word is being sent to the speech recognition engine, the device may also cache a second audio received after the first audio within a preset time. The preset time may be set by the developer and is not limited here. The preset time should not be too long, however; otherwise the user will perceive the system as slow to respond, which harms the user experience.
Then, in step 103, the optimization device determines whether the first speech recognition result contains speech other than the wake-up word. In step 104, if it does not, voice activity detection is performed on the second audio, and it is determined whether that detection has timed out. A timeout means that voice activity detection has found no human voice for a long time, that is, the second audio contains no human voice.
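The timeout check of step 104 can be illustrated with a minimal sketch. It assumes that the VAD emits one boolean per fixed-length frame; the function name `vad_timed_out`, the 10 ms frame length, and the 500 ms timeout are hypothetical values chosen for illustration and are not specified in the patent.

```python
from typing import Iterable


def vad_timed_out(frame_is_speech: Iterable[bool],
                  frame_ms: int = 10,
                  timeout_ms: int = 500) -> bool:
    """Return True if no speech frame appears within timeout_ms of audio.

    frame_ms and timeout_ms are illustrative assumptions.
    """
    elapsed_ms = 0
    for is_speech in frame_is_speech:
        if is_speech:
            return False      # human voice found before the timeout
        elapsed_ms += frame_ms
        if elapsed_ms >= timeout_ms:
            return True       # silence lasted the whole timeout window
    # The audio ran out before any speech was seen: treat as timed out.
    return True
```

Under these assumptions, 100 ms of silence followed by a speech frame does not time out, whereas 500 ms or more of uninterrupted silence does.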
Then, in step 105, if the voice activity detection has not timed out, the second audio contains a human voice, and speech recognition is performed on it to form a second speech recognition result. Because the second audio is audio that continued to be cached for a period of time after the audio containing the wake-up word was sent to the speech recognition engine, the second speech recognition result may contain the user's command word.
Finally, in step 106, if the second speech recognition result contains speech other than the wake-up word, it contains content that the first speech recognition result missed, such as the user's command word, so the second speech recognition result can be returned via callback. In the OneShot scenario, if the first speech recognition result contains only the wake-up word, it is unclear whether the user spoke a command word, so a second speech recognition is required.
In the method of this embodiment, while the audio containing the wake-up word is sent for speech recognition, the second audio received after the first audio continues to be cached within the preset time. When the first speech recognition result contains no speech other than the wake-up word, speech recognition is performed on the second audio to form a second speech recognition result, which captures the content spoken after the wake-up word. The command word of a user who speaks slowly can therefore still be recognized, and the method can serve as a compensating optimization for existing continuous wake-up recognition.
In some optional embodiments, after speech recognition is performed on the second audio to form the second speech recognition result, the method further includes: if the second speech recognition result does not contain speech other than the wake-up word, emitting an empty recognition result. In this case the user probably said only the wake-up word and no command word. An "empty" recognition result is emitted to signal to the system that continuous wake-up recognition did not succeed, so that the system can take other measures, for example responding to the user's wake-up word as soon as possible.
In some embodiments, after determining whether the first speech recognition result contains speech other than the wake-up word, the method further includes: if it does, returning the first speech recognition result via callback. Enabling the optimization therefore does not affect the existing OneShot scheme.
In a further optional embodiment, after determining, when the first speech recognition result does not contain speech other than the wake-up word, whether the voice activity detection has timed out, the method further includes: if the voice activity detection has timed out, emitting an empty recognition result. A timeout of voice activity detection on the second audio means that no human voice was detected for a long time; since the first speech recognition result also contained no speech other than the wake-up word, the flow can end early. The second speech recognition on the second audio is then unnecessary, which shortens the flow and the system processing time and improves the user experience.
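The branches of steps 102 to 106, together with the optional embodiments above, can be summarized in a short sketch. This is an illustration under stated assumptions rather than the patented implementation: the recognizer and the VAD timeout check are injected as callables, and the names `continuous_wake_recognition`, `contains_command`, `recognize`, and `vad_timed_out` are hypothetical. An empty string stands in for the "empty" recognition result.

```python
from typing import Callable


def contains_command(text: str, wake_word: str) -> bool:
    """True if the text has content other than the wake-up word."""
    return text.replace(wake_word, "").strip() != ""


def continuous_wake_recognition(
    first_audio: bytes,
    second_audio: bytes,
    wake_word: str,
    recognize: Callable[[bytes], str],
    vad_timed_out: Callable[[bytes], bool],
) -> str:
    """Return the text to pass to the callback; "" means an empty result."""
    first_result = recognize(first_audio)            # step 102
    if contains_command(first_result, wake_word):    # step 103
        return first_result                          # existing OneShot path
    if vad_timed_out(second_audio):                  # step 104
        return ""                                    # end early, no second ASR
    second_result = recognize(second_audio)          # step 105
    if contains_command(second_result, wake_word):   # step 106
        return second_result
    return ""                                        # only the wake-up word was spoken
```

Passing the recognizer and the VAD check as parameters keeps the sketch independent of any particular engine; a real system would substitute its own ASR and VAD calls.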
In further optional embodiments, caching the second audio received after the first audio within the preset time includes: after the wake-up word is detected and the first audio is sent for the first recognition, caching the second audio received after the first audio until the first speech recognition result is returned. Caching during this period takes no additional time and does not lengthen the system's processing time, so the optimization has no negative effect on the original continuous wake-up recognition function and produces no side effects.
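A minimal sketch of this caching window follows, assuming audio arrives as byte frames on a capture thread while the recognition result arrives on another thread. The class name `SecondAudioCache` and its methods `start`, `feed`, and `stop` are hypothetical and are not taken from the patent.

```python
import threading
from typing import List


class SecondAudioCache:
    """Caches frames that arrive while the first recognition is pending."""

    def __init__(self) -> None:
        self._frames: List[bytes] = []
        self._active = False
        self._lock = threading.Lock()

    def start(self) -> None:
        """Call when the first audio is sent to the recognizer."""
        with self._lock:
            self._frames.clear()
            self._active = True

    def feed(self, frame: bytes) -> None:
        """Call for every incoming frame; frames outside the window are ignored."""
        with self._lock:
            if self._active:
                self._frames.append(frame)

    def stop(self) -> bytes:
        """Call when the first recognition result returns; yields the second audio."""
        with self._lock:
            self._active = False
            return b"".join(self._frames)
```

Because the cache only appends frames that are being captured anyway, it adds no waiting time of its own; the window closes as soon as the first result is available.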
The following describes some problems the inventors encountered in implementing the present invention, together with a specific example of the final solution, so that those skilled in the art can better understand the solution of the present application.
In the course of implementing the present application, the inventors found that the defects in the prior art are mainly caused by the following. OneShot recognizes "wake-up word + command word". The OneShot function relies on VAD to detect human voice: the audio in which voice is detected is sent to the wake-up engine, and after wake-up the same audio is sent for recognition. If VAD detects the end of speech right after the wake-up word, the subsequent command word is lost.
Because technologies offered to the public are mostly general-purpose and have a limited range of use, those skilled in the art usually work around these defects by improving audio quality.
The OneShot optimization scheme gives the customer an option that is not needed in most cases. When the defects described above occur, however, the present application provides a solution, and the solution has no side effects on the original OneShot function.
In the scheme provided by the present application, after the user enables the OneShot optimization function, ASR is first performed on the audio sent after wake-up while the audio following the wake-up word is cached. If the recognition result contains only the wake-up word, the cached audio is sent to VAD; if a human voice is present, ASR is performed again and the recognized result is returned to the user.
In the embodiment of the present application, the audio sent to the wake-up engine need not be processed by VAD first. The audio may be sent to the wake-up engine directly, and once the wake-up engine detects a wake-up, the same segment of audio and the subsequent audio are sent to VAD for detection. If the interval between the wake-up word and the command word is somewhat long, VAD may decide that speech has ended right after the wake-up word, and the subsequent command word is lost. Whether the command word was lost must be determined by the second recognition, which therefore compensates for a command word lost in the first recognition.
Please refer to Fig. 2a, which shows a flowchart of a OneShot optimization scheme according to an embodiment of the present application. As shown in Fig. 2a:
Step 1: Voice is input to the wake-up engine until a wake-up word is detected.
Step 2: The wake-up audio is sent to the ASR engine for VAD detection.
Step 3: When the wake-up audio is detected as human voice, ASR recognition is performed and the audio following the wake-up audio is cached.
Step 4: After the ASR server returns the recognition result, caching of audio stops.
Step 5: If the recognition result is empty (the ASR service filters out the wake-up word), the wake-up audio contains only the wake-up word and the flow proceeds to the next step; if it is not empty, the recognition result is returned via callback and the flow ends.
Step 6: A second recognition is performed on the cached audio and the subsequent audio: the local VAD performs voice detection, and ASR recognition is performed if a human voice is present.
Step 7: The recognition result is returned via callback, and the flow ends.
With continued reference to Fig. 2b and Fig. 2c, these show flowcharts that explain some of the highlighted steps of Fig. 2a.
OneShot refers to "wake-up word + command word". In the example below, the utterance is "hello relax, turn on the light": the wake-up word is "hello relax" and the command word is "turn on the light".
The OneShot optimization flow covers the following three cases:
1. The user says "hello relax, turn on the light". The first recognition returns "turn on the light", which is returned via callback, and the flow ends. This case works as is and needs no optimization. (The wake-up word is filtered out by the server; the same applies below.)
2. The user says "hello relax, turn on the light". The first recognition captures only "hello relax", meaning the audio contains only the wake-up word, so the second recognition starts. The second recognition returns "turn on the light", which is returned via callback, and the flow ends. This is the case the optimization targets.
3. The user says only "hello relax". The first recognition captures only "hello relax", meaning the audio contains only the wake-up word, so the second recognition starts. The second recognition returns an empty result, or the second VAD times out, meaning the user said only the wake-up word; an empty result is returned via callback, and the flow ends. The optimization flow does not affect the case in which only the wake-up word is spoken.
The OneShot optimization flow mainly improves case 2 without affecting cases 1 and 3.
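The three cases can be replayed with a small text-level simulation. This sketch makes several simplifying assumptions: each audio segment is represented by a transcript of what it contains, `fake_asr` is a stand-in for an ASR service that filters out the wake-up word, an empty cached segment models a second VAD pass that finds no voice, and the names `fake_asr` and `oneshot_with_fallback` are hypothetical.

```python
WAKE_WORD = "hello relax"  # wake-up word from the example in the text


def fake_asr(spoken: str) -> str:
    """Stand-in for the ASR service, which filters out the wake-up word."""
    return spoken.replace(WAKE_WORD, "").strip()


def oneshot_with_fallback(first_segment: str, cached_segment: str) -> str:
    """Return the text passed to the callback; "" models an empty result."""
    first = fake_asr(first_segment)
    if first:                        # case 1: command recognized the first time
        return first
    if not cached_segment:           # second VAD finds no voice (case 3)
        return ""
    return fake_asr(cached_segment)  # case 2: command recovered from the cache


cases = [
    ("hello relax turn on the light", ""),  # case 1
    ("hello relax", "turn on the light"),   # case 2
    ("hello relax", ""),                    # case 3
]
results = [oneshot_with_fallback(first, cached) for first, cached in cases]
```

Only case 2 takes the second recognition path and recovers the command; cases 1 and 3 produce the same callback values they would without the optimization.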
The audio sent to ASR is described as follows:
The audio sent to ASR for the first recognition comprises the wake-up audio and the audio after it; the specific end point is determined by VAD. (VAD performs human-voice detection and sets the end point of the ASR audio when no voice is present. The same applies below.)
The audio sent to ASR for the second recognition comprises the audio cached during the first recognition and the subsequent audio; the specific end point is determined by VAD.
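How a VAD might fix the end point of an ASR segment can be sketched as follows. The sketch assumes per-frame speech decisions and ends the segment after a run of silent frames; the 50 ms silence threshold follows the example figure given earlier in the description, while the 10 ms frame length and the function name `find_endpoint` are assumptions made for illustration.

```python
from typing import Optional, Sequence


def find_endpoint(frame_is_speech: Sequence[bool],
                  frame_ms: int = 10,
                  silence_ms: int = 50) -> Optional[int]:
    """Return the index of the first frame after the speech segment ends.

    The segment ends once speech has started and is followed by
    silence_ms of consecutive non-speech frames. Returns None if no
    end point is found in the given frames.
    """
    needed = max(1, silence_ms // frame_ms)  # silent frames required
    started = False
    silent_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            started = True
            silent_run = 0
        elif started:
            silent_run += 1
            if silent_run == needed:
                return i - needed + 1        # first silent frame of the run
    return None
```

With these values, a pause of five or more silent frames after the wake-up word closes the first segment, which is exactly how a slow speaker's command word ends up in the cached audio rather than in the first recognition.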
With the scheme provided by the embodiment of the present application, the existing OneShot function can be optimized without being adversely affected.
Referring to fig. 3, a block diagram of an apparatus for optimizing a continuous wake up identification function according to an embodiment of the present invention is shown.
As shown in fig. 3, the apparatus 300 for optimizing the continuous wake-up recognition function includes a wake-up detection module 310, a first recognition module 320, a recognition determination module 330, a timeout determination module 340, a second recognition module 350, and a callback module 360.
The wake-up detection module 310 is configured to continuously receive audio until a wake-up word is detected; the first recognition module 320 is configured to perform speech recognition on the first audio containing the wake-up word to form a first speech recognition result, and to continue caching, within a preset time, a second audio received after the first audio; the recognition determination module 330 is configured to determine whether the first speech recognition result contains speech other than the wake-up word; the timeout determination module 340 is configured to determine whether voice activity detection on the second audio has timed out if the first speech recognition result does not contain speech other than the wake-up word; the second recognition module 350 is configured to perform speech recognition on the second audio to form a second speech recognition result if the voice activity detection has not timed out; and the callback module 360 is configured to return the second speech recognition result via callback if the second speech recognition result contains speech other than the wake-up word.
In some optional embodiments, the apparatus further comprises: a result emitting module (not shown in the figure) configured to emit an empty recognition result if the second speech recognition result does not contain speech other than the wake-up word.
In some other optional embodiments, the callback module is further configured to: return the first speech recognition result via callback if the first speech recognition result contains speech other than the wake-up word.
It should be understood that the modules depicted in fig. 3 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 3, and are not described again here.
It should be noted that the modules in the embodiments of the present application do not limit the solution of the present application; for example, the recognition determination module may be described as a module for determining whether the first speech recognition result contains speech other than the wake-up word. In addition, the related functional modules may be implemented by a hardware processor; for example, the recognition determination module may be implemented by a processor, which is not described again here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may perform the optimization method for the continuous wake-up recognition function in any of the above method embodiments;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
continuously receiving audio until a wake-up word is detected;
performing speech recognition on a first audio containing the wake-up word to form a first speech recognition result, and continuing to cache, within a preset time, a second audio received after the first audio;
determining whether the first speech recognition result contains speech other than the wake-up word;
if the first speech recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result;
and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result via callback.
The non-volatile computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the optimizing means of the continuous wake up identifying function, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located from the processor, and these remote memories may be connected over a network to the optimizing device for continuous wake recognition functionality. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product. The computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to execute any one of the above optimization methods for the continuous wake-up recognition function.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device includes one or more processors 410 and a memory 420; one processor 410 is taken as an example in Fig. 4. The device for carrying out the method may further include an input device 430 and an output device 440. The processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or by other means; a bus connection is taken as an example in Fig. 4. The memory 420 is the non-volatile computer-readable storage medium described above. By running the non-volatile software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the server, that is, it implements the optimization method for the continuous wake-up recognition function of the above method embodiment. The input device 430 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the optimization device for the continuous wake-up recognition function. The output device 440 may include a display device such as a display screen.
The product can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, reference may be made to the method provided in the embodiments of the present invention.
As an embodiment, the electronic device is applied to an optimization device for a continuous wake-up recognition function, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
continuously receive audio until a wake-up word is detected;
perform speech recognition on a first audio containing the wake-up word to form a first speech recognition result, and continue caching, within a preset time, a second audio received after the first audio;
determine whether the first speech recognition result contains speech other than the wake-up word;
if the first speech recognition result contains no speech other than the wake-up word, determine whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, perform speech recognition on the second audio to form a second speech recognition result; and
if the second speech recognition result contains speech other than the wake-up word, return the second speech recognition result through a callback.
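Caching the second audio within a preset time can be pictured as a buffer that accepts frames only inside a fixed window after the wake-up word is detected. The following is a minimal sketch, assuming frames arrive as byte strings with explicit timestamps; the class name `SecondAudioCache`, the `preset_seconds` parameter, and its 3-second default are hypothetical and do not come from the patent. The `stop` method reflects claim 5, in which caching ends once the first speech recognition result is received.

```python
from typing import List


class SecondAudioCache:
    """Caches audio frames that arrive after the wake-up word (hypothetical helper)."""

    def __init__(self, wake_time: float, preset_seconds: float = 3.0) -> None:
        # Frames are accepted until this moment (wake-up time plus the preset window).
        self.deadline = wake_time + preset_seconds
        self.frames: List[bytes] = []
        self.stopped = False

    def push(self, frame: bytes, now: float) -> bool:
        """Store the frame if caching is still active; report whether it was stored."""
        if self.stopped or now > self.deadline:
            self.stopped = True
            return False
        self.frames.append(frame)
        return True

    def stop(self) -> None:
        """Stop caching early, e.g. once the first recognition result has arrived."""
        self.stopped = True

    def audio(self) -> bytes:
        """Return the cached second audio as one contiguous buffer."""
        return b"".join(self.frames)
```

Passing timestamps in explicitly, rather than reading a clock inside `push`, keeps the sketch deterministic and easy to test.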
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily intended to provide voice and data communication. Such terminals include smartphones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones, among others.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) A portable entertainment device: such devices can display and play multimedia content. They include audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: similar in architecture to a general-purpose computer, but with higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like, because it must provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative. Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An optimization method for a continuous wake-up recognition function, comprising:
continuously receiving audio until a wake-up word is detected;
performing speech recognition on a first audio containing the wake-up word to form a first speech recognition result, and continuing to cache, within a preset time, a second audio received after the first audio;
determining whether the first speech recognition result contains speech other than the wake-up word;
if the first speech recognition result contains no speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result; and
if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result through a callback.
2. The method of claim 1, wherein after performing speech recognition on the second audio to form the second speech recognition result, the method further comprises:
if the second speech recognition result contains no speech other than the wake-up word, throwing an empty recognition result.
3. The method according to claim 1, wherein after determining whether the first speech recognition result contains speech other than the wake-up word, the method further comprises:
if the first speech recognition result contains speech other than the wake-up word, returning the first speech recognition result through a callback.
4. The method according to any one of claims 1 to 3, wherein after determining, if the first speech recognition result contains no speech other than the wake-up word, whether voice activity detection on the second audio has timed out, the method further comprises:
if the voice activity detection has timed out, throwing an empty recognition result.
5. The method of claim 4, wherein continuing to cache, within the preset time, the second audio received after the first audio comprises:
after the wake-up word is detected, continuing to cache the second audio received after the first audio, and stopping caching once the first speech recognition result is received.
6. An apparatus for optimizing a continuous wake-up recognition function, comprising:
a wake-up detection module configured to continuously receive audio until a wake-up word is detected;
a first recognition module configured to perform speech recognition on a first audio containing the wake-up word to form a first speech recognition result, and to continue caching, within a preset time, a second audio received after the first audio;
a recognition judging module configured to determine whether the first speech recognition result contains speech other than the wake-up word;
a timeout judging module configured to determine, if the first speech recognition result contains no speech other than the wake-up word, whether voice activity detection on the second audio has timed out;
a second recognition module configured to perform speech recognition on the second audio to form a second speech recognition result if the voice activity detection has not timed out; and
a callback module configured to return the second speech recognition result through a callback if the second speech recognition result contains speech other than the wake-up word.
7. The apparatus of claim 6, wherein the apparatus further comprises:
an error throwing module configured to throw an empty recognition result if the second speech recognition result contains no speech other than the wake-up word.
8. The apparatus of claim 6, wherein the callback module is further configured to:
return the first speech recognition result through a callback if the first speech recognition result contains speech other than the wake-up word.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 5.
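The module split in claims 6 to 8 can be pictured as one object whose collaborators are injected. This is a rough illustration only, assuming text stands in for audio: `ContinuousWakeUpDevice`, `EmptyRecognition`, and all field names are hypothetical, and the timeout branch borrows from method claim 4 because the apparatus claims do not state it. Here the "empty recognition result" is modeled as a raised exception and the callback as a plain function.

```python
from dataclasses import dataclass
from typing import Callable


class EmptyRecognition(Exception):
    """Signals that recognition produced nothing beyond the wake-up word."""


@dataclass
class ContinuousWakeUpDevice:
    wake_word: str
    recognize: Callable[[str], str]        # stands in for both recognition modules
    vad_timed_out: Callable[[str], bool]   # stands in for the timeout judging module
    on_result: Callable[[str], None]       # stands in for the callback module

    def has_command(self, text: str) -> bool:
        """Recognition judging: is there speech other than the wake-up word?"""
        return bool(text.lower().replace(self.wake_word, "").strip(" ,."))

    def process(self, first_audio: str, second_audio: str) -> None:
        first = self.recognize(first_audio)
        if self.has_command(first):
            self.on_result(first)  # first result already carries a command
            return
        if self.vad_timed_out(second_audio):
            # Assumption: mirrors method claim 4 for the apparatus.
            raise EmptyRecognition("voice activity detection timed out")
        second = self.recognize(second_audio)
        if not self.has_command(second):
            raise EmptyRecognition("second result holds no command")
        self.on_result(second)
```

Injecting the recognizer, detector, and callback keeps each module replaceable, which matches the claims' framing of them as separate units.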
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911379635.6A CN111128166B (en) | 2019-12-27 | 2019-12-27 | Optimization method and device for continuous awakening recognition function |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911379635.6A CN111128166B (en) | 2019-12-27 | 2019-12-27 | Optimization method and device for continuous awakening recognition function |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111128166A (en) | 2020-05-08 |
CN111128166B (en) | 2022-11-25 |
Family
ID=70504254
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201911379635.6A | Active | CN111128166B (en) | 2019-12-27 | 2019-12-27 | Optimization method and device for continuous awakening recognition function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111128166B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897584B (en) * | 2020-08-14 | 2022-07-08 | 思必驰科技股份有限公司 | Wake-up method and device for voice equipment |
CN114049896A (en) * | 2021-11-08 | 2022-02-15 | 西安链科信息技术有限公司 | Vehicle-mounted cloud intelligent voice interaction system, method, equipment and terminal |
CN114155857A (en) * | 2021-12-21 | 2022-03-08 | 思必驰科技股份有限公司 | Voice wake-up method, electronic device and storage medium |
CN115512700A (en) * | 2022-09-07 | 2022-12-23 | 广州小鹏汽车科技有限公司 | Voice interaction method, voice interaction device, vehicle and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109994106B (en) * | 2017-12-29 | 2023-06-23 | 阿里巴巴集团控股有限公司 | Voice processing method and equipment |
CN108962262B (en) * | 2018-08-14 | 2021-10-08 | 思必驰科技股份有限公司 | Voice data processing method and device |
CN109378000B (en) * | 2018-12-19 | 2022-06-07 | 科大讯飞股份有限公司 | Voice wake-up method, device, system, equipment, server and storage medium |
CN110473539B (en) * | 2019-08-28 | 2021-11-09 | 思必驰科技股份有限公司 | Method and device for improving voice awakening performance |
2019-12-27: CN application CN201911379635.6A filed; patent CN111128166B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN111128166A (en) | 2020-05-08 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111128166B (en) | Optimization method and device for continuous awakening recognition function | |
AU2019246868B2 (en) | Method and system for voice activation | |
CN108962262B (en) | Voice data processing method and device | |
AU2015390534B2 (en) | Speech recognition method, speech wakeup apparatus, speech recognition apparatus, and terminal | |
CN109147779A (en) | Voice data processing method and device | |
CN112201246B (en) | Intelligent control method and device based on voice, electronic equipment and storage medium | |
WO2014208231A1 (en) | Voice recognition client device for local voice recognition | |
CN113362828B (en) | Method and apparatus for recognizing speech | |
CN102591455A (en) | Selective Transmission of Voice Data | |
KR20160005050A (en) | Adaptive audio frame processing for keyword detection | |
CN110968353A (en) | Central processing unit awakening method and device, voice processor and user equipment | |
US20230317096A1 (en) | Audio signal processing method and apparatus, electronic device, and storage medium | |
CN109697981B (en) | Voice interaction method, device, equipment and storage medium | |
CN112735398B (en) | Man-machine conversation mode switching method and system | |
CN111816190A (en) | Voice interaction method and device for upper computer and lower computer | |
CN109545211A (en) | Voice interactive method and system | |
CN112634911B (en) | Man-machine conversation method, electronic device and computer readable storage medium | |
JP6817386B2 (en) | Voice recognition methods, voice wakeup devices, voice recognition devices, and terminals | |
CN109686372B (en) | Resource playing control method and device | |
WO2021077528A1 (en) | Method for interrupting human-machine conversation | |
CN113362845B (en) | Method, apparatus, device, storage medium and program product for noise reduction of sound data | |
CN112447177B (en) | Full duplex voice conversation method and system | |
CN114743546B (en) | Method and device for reducing intelligent voice false wake-up rate and electronic equipment | |
CN114155857A (en) | Voice wake-up method, electronic device and storage medium | |
CN108922523B (en) | Position prompting method and device, storage medium and electronic equipment |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant after: Sipic Technology Co.,Ltd.; Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant before: AI SPEECH Co.,Ltd. |
GR01 | Patent grant | |