CN104464723B

CN104464723B - Voice interaction method and system

Info

Publication number: CN104464723B
Application number: CN201410782284.4A
Authority: CN
Inventors: 张凯; 陈盛
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2014-12-16
Filing date: 2014-12-16
Publication date: 2018-03-20
Anticipated expiration: 2034-12-16
Also published as: CN104464723A

Abstract

The invention discloses a voice interaction method and system. The method includes: recording audio data input by a user; performing endpoint detection on the audio data until a speech start point (front endpoint) is detected; performing prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, the prefix word being a word that reflects the type of action to be performed; obtaining the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction; performing speech recognition on the voice instruction; and, if the speech recognition result is valid, performing the operation corresponding to it. Because the speech segment beginning at the start point of the prefix word speech is taken as the voice instruction, and a word reflecting the type of action to be performed serves as the prefix word, the prefix word and the voice instruction are combined. This effectively avoids the failure to obtain a valid speech recognition result that forced segmentation of the voice instruction can cause, and improves the efficiency of voice interaction.

Description

Voice interaction method and system

Technical field

The present invention relates to the field of voice interaction, and more particularly to a voice interaction method and system.

Background art

To avoid mistaking surrounding speech noise for a voice instruction, a mobile device such as a mobile phone, when in standby, must complete the following operations each time the user starts its voice interaction function: (1) record the audio data input by the user; (2) obtain the audio data and perform wake-up detection until wake-up succeeds; (3) after a successful wake-up, prompt the user to input a voice instruction; (4) after prompting the user, record the audio data input by the user again; (5) obtain the speech segment in the newly recorded audio data as the voice instruction; (6) perform speech recognition on the voice instruction to obtain a speech recognition result; (7) determine whether the speech recognition result is valid and, if so, execute it. Correspondingly, each time the user starts the voice interaction function of the mobile device, the user must: (1) say the wake-up word to wake the mobile device; (2) when the mobile device prompts for a voice instruction, say the voice instruction, for example "call Zhang San". This voice interaction method is therefore inconvenient to use.

To address this inconvenience, a voice interaction method based on a wake-up word has also been proposed. After a successful wake-up, this method directly processes the voice instruction that the user speaks immediately after the wake-up word. Correspondingly, the user only needs to say the wake-up word and the voice instruction in succession. For example, to "call Zhang San", the user needs to say "language point leads to, call Zhang San", where "language point leads to" is the preset fixed wake-up word and "call Zhang San" is the voice instruction. Although this method is somewhat more convenient, users generally speak continuously, so the wake-up word tends to run into the voice instruction that follows it. Forced segmentation, in which the speech segment in the audio data after a successful wake-up is taken as the voice instruction, is therefore likely to yield an incomplete voice instruction. The speech recognition module then fails to obtain a valid speech recognition result and its recognition accuracy drops, which reduces the efficiency of voice interaction to some extent. In addition, this method works only with a fixed wake-up word: the user must memorize the configured wake-up word or the voice interaction process cannot be started at all. The convenience of this method therefore still needs to be improved.

Summary of the invention

The embodiments of the present invention aim to overcome the low voice interaction efficiency of existing voice interaction methods by providing an efficient voice interaction method based on prefix words.

To achieve the above object, the technical solution adopted by the present invention is a voice interaction method, including:

Recording audio data input by a user;

Performing endpoint detection on the audio data until a speech start point (front endpoint) is detected;

Performing prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed, and the prefix word is combined with the voice instruction that expresses the user's intent;

Obtaining the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected;

Performing speech recognition on the voice instruction to obtain a speech recognition result;

Determining whether the speech recognition result is valid and, if so, performing the operation corresponding to the speech recognition result.

Preferably, the method further includes:

Performing noise reduction on the audio data before performing endpoint detection on the audio data.

Preferably, performing prefix word detection on the audio data from the speech start point onward includes:

Based on a parallel search network that includes a prefix word model and a filler model, detecting whether the prefix word speech is present in the audio data from the speech start point onward.

Preferably, determining whether the speech recognition result is valid includes:

Determining whether a command word matching the speech recognition result exists in a command word network and, if so, determining that the speech recognition result is valid.

Preferably, the instruction acquisition termination event includes: the speech segment ending, and the speech segment having lasted for a set time.

To achieve the above object, the technical solution adopted by the present invention is also a voice interaction system, including:

A recording module, configured to record audio data input by a user;

An endpoint detection module, configured to perform endpoint detection on the audio data until a speech start point (front endpoint) is detected;

A prefix word detection module, configured to perform prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed, and the prefix word is combined with the voice instruction that expresses the user's intent;

A voice activity detection module, configured to obtain the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected;

A speech recognition module, configured to perform speech recognition on the voice instruction to obtain a speech recognition result;

A judgment module, configured to determine whether the speech recognition result is valid; and

An execution module, configured to perform the operation corresponding to a valid speech recognition result.

Preferably, the system further includes:

A noise reduction module, connected to the recording module and to the endpoint detection module, configured to perform noise reduction on the audio data recorded by the recording module and to send the noise-reduced audio data to the endpoint detection module.

Preferably, the prefix word detection module is specifically configured to detect, based on a parallel search network that includes a prefix word model and a filler model, whether the prefix word speech is present in the audio data from the speech start point onward.

Preferably, the judgment module is specifically configured to determine whether a command word matching the speech recognition result exists in a command word network and, if so, to determine that the speech recognition result is valid.

The beneficial effects of the present invention are as follows. The voice interaction method and system of the present invention take the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, and use words that reflect the type of action to be performed, such as "call", "send a text message to", and "open QQ", as prefix words. The prefix word and the voice instruction are thereby combined. This effectively avoids the failure to obtain a valid speech recognition result that forced segmentation of the voice instruction can cause, and so improves the efficiency of voice interaction. Moreover, because the prefix words follow everyday language habits, the user does not need to memorize a fixed wake-up word. The user only needs to say the action to be performed in the usual way to both wake up voice interaction and have the action executed, which further improves the convenience of voice interaction.

Brief description of the drawings

Fig. 1 shows a flow chart of an embodiment of the voice interaction method according to the present invention;

Fig. 2 shows a block diagram of an implementation structure of the voice interaction system according to the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the claims.

To solve the problem in existing voice interaction methods that forced segmentation of the voice instruction reduces voice interaction efficiency, the present invention provides a more efficient voice interaction method. As shown in Fig. 1, the method includes the following steps:

Step S1: Record the audio data input by the user.

Here, the recorded audio data may be stored in a fixed-length circular buffer, and the storage address recorded, so that subsequent steps can obtain the audio data.
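The fixed-length circular buffer mentioned above can be pictured with the following minimal sketch. The class name `RingBuffer`, its fields, and the choice of one list slot per sample are assumptions made for illustration; the patent does not specify a data layout.

```python
class RingBuffer:
    """Fixed-length circular buffer of audio samples (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data = [0] * capacity
        self.write_pos = 0  # storage address of the next sample to be written

    def write(self, samples) -> None:
        # New samples overwrite the oldest ones once the buffer is full.
        for s in samples:
            self.data[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity

    def read(self, start: int, length: int) -> list:
        """Read `length` samples beginning at storage address `start`, wrapping around."""
        if length > self.capacity:
            raise ValueError("cannot read more than the buffer holds")
        return [self.data[(start + i) % self.capacity] for i in range(length)]


buf = RingBuffer(capacity=8)
buf.write(range(1, 11))              # 10 samples into 8 slots: the 2 oldest are overwritten
latest = buf.read(buf.write_pos, 8)  # oldest-to-newest view of what is still stored
```

Because the recorded storage address is just an index modulo the capacity, later steps can read any stored span without copying the whole recording.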

Step S2: Perform endpoint detection on the audio data until a speech start point (front endpoint) is detected.

The speech start point is the boundary frame from a non-speech segment to a speech segment. When the audio data is processed, it is first divided into frames, and an energy feature is then computed for each frame. A frame whose energy feature exceeds a set value is considered speech; otherwise it is considered non-speech.
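The framing and energy test described above can be sketched as follows. This is a simplified illustration, not the patented detector: the function names, the non-overlapping frames, the mean-squared-amplitude energy measure, and the threshold value are all assumptions.

```python
def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames; return mean squared amplitude per frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_speech_start(samples, frame_len, threshold):
    """Return the index of the first frame whose energy exceeds threshold, or None."""
    for i, energy in enumerate(frame_energies(samples, frame_len)):
        if energy > threshold:
            return i
    return None


silence = [0.01, -0.01] * 4  # 8 low-energy samples, i.e. 2 frames of 4
speech = [0.5, -0.5] * 4     # 8 high-energy samples
start_frame = find_speech_start(silence + speech, frame_len=4, threshold=0.01)
```

A practical detector would likely require several consecutive speech frames before declaring a start point, so that a single noisy frame does not trigger detection.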

Here, audio data may be stored in the circular buffer continuously as recording proceeds and, as it is stored, may be read continuously from the circular buffer for endpoint detection. Endpoint detection on the audio data can therefore run essentially in parallel with storing the recorded audio data in the circular buffer, which improves processing efficiency.

Step S3: Perform prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed, so that the prefix word used to wake up voice interaction and the voice instruction used to express the user's intent are organically combined. Words that reflect the type of action to be performed are, for example, "call", "send a text message to", "open QQ", "open WeChat", and other words that follow everyday language habits.

The main function of prefix word detection is to decide whether to wake up the voice interaction operation: if prefix word speech is detected, speech recognition is started so that the corresponding action can be performed according to the user's intent.

The prefix word detection method may, for example, include the following steps:

Step S31, acoustic feature extraction: Extract from the audio information (prefix word detection is generally performed in units of speech segments) features that are discriminative and based on the characteristics of human hearing. The MFCC (Mel-Frequency Cepstral Coefficient) features used in speech recognition are generally chosen as the acoustic features.

Step S32, prefix word detection: Using the trained acoustic model, compute acoustic scores for the extracted acoustic features on the prefix word detection network. If the path with the best acoustic score contains the prefix word to be detected, a prefix word is determined to have been detected; otherwise, return to step S31 and continue extracting acoustic features.

On the basis of steps S31 and S32, to reduce the false detection rate of the prefix word, the following step S33 may also be performed after a prefix word is determined to have been detected.

Step S33, prefix word confirmation: Using the trained acoustic model, perform prefix word confirmation on the extracted acoustic features on the prefix word confirmation network to obtain a final confirmation score. To determine whether the detected prefix word is a real prefix word, compare its final confirmation score with a preset threshold. If the final confirmation score is greater than or equal to the threshold, the prefix word is considered real and voice wake-up succeeds. If the final confirmation score is less than the threshold, the prefix word is considered false, and the method returns to step S31 to continue extracting acoustic features.
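The control flow of steps S31 to S33 can be sketched as a loop over speech segments. The function `wake_on_prefix_word` and the three callables it takes are hypothetical stand-ins for the feature extractor, the detection network, and the confirmation network; the toy implementations below exist only so that the sketch runs and do not model real acoustics.

```python
from typing import Callable, Iterable, Optional, Sequence


def wake_on_prefix_word(
    segments: Iterable[Sequence[float]],
    extract: Callable[[Sequence[float]], Sequence[float]],
    detect: Callable[[Sequence[float]], Optional[str]],
    confirm: Callable[[Sequence[float], str], float],
    threshold: float,
) -> Optional[str]:
    """Return the confirmed prefix word, or None if no segment wakes the system."""
    for segment in segments:
        features = extract(segment)                # S31: acoustic feature extraction
        word = detect(features)                    # S32: prefix word detection
        if word is None:
            continue                               # nothing detected: back to S31
        if confirm(features, word) >= threshold:   # S33: confirmation against threshold
            return word                            # wake-up succeeds
        # false prefix word: fall through and return to S31
    return None


# Toy stand-ins (assumptions), not real models.
def toy_extract(segment):
    return list(segment)


def toy_detect(features):
    return "call" if sum(features) > 1.0 else None


def toy_confirm(features, word):
    return max(features)


segments = [[0.1, 0.2], [0.6, 0.6], [0.9, 0.8]]
result = wake_on_prefix_word(segments, toy_extract, toy_detect, toy_confirm, threshold=0.7)
```

In this toy run the second segment is detected but rejected at confirmation, which is the false-detection case that step S33 is meant to filter out.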

Here, words that follow everyday language habits and reflect the type of action to be performed may be added to the prefix word detection network and the prefix word confirmation network. The method of the present invention also supports the user adding such words to both networks according to personal speech habits. The method is thus no longer restricted to a fixed wake-up word, which further improves the convenience of the present invention.

The prefix word detection network described above may be implemented by computing the best-scoring path, using the following formula:

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}$$

where X denotes the acoustic feature vectors extracted from the audio data, W denotes a candidate word sequence, and $\hat{W}$ denotes the best-scoring word sequence. The conditional probability P(X|W) is the acoustic model score, computed by the trained acoustic model. The prior probability P(W) is the language model score, which acts as a penalty applied to the different acoustic models. P(X) is the total probability, which is a constant once the acoustic model and the prefix word detection network have been fixed. On this basis, the prefix word confirmation network is implemented as follows:

A) Decode the detected prefix word down to the phoneme level and record all scores:

$(\mathrm{Score}_{phone_1}, \mathrm{Score}_{phone_2}, \ldots, \mathrm{Score}_{phone_N})$, where N is the total number of phonemes in the prefix word and $\mathrm{Score}_{phone_1}, \mathrm{Score}_{phone_2}, \ldots, \mathrm{Score}_{phone_N}$ are the decoding scores of the individual phonemes in the prefix word.

B) Compute the confirmation score of each phoneme of the prefix word, as follows:

where $K_{i,\mathrm{start}}$ and $K_{i,\mathrm{end}}$ are the start time and end time of the i-th phoneme; $CM_{phone_i}$ is the confirmation score of the i-th phoneme, the subscript $phone_i$ denoting the i-th phoneme; $\mathrm{Score}_{phone_i}$ is the decoding score of the i-th phoneme, as above; and $\mathrm{Score}_{frame_k}$ is the score of the k-th frame obtained by decoding with the prefix word confirmation network.

C) Compute the final confirmation score $CM_{word}$ of the prefix word, as follows:

To improve the efficiency and accuracy of prefix word detection, training of the above acoustic model may be divided into two parts: a prefix word model and a filler model. The prefix word model may be trained with the acoustic model training methods of conventional speech recognition: a database is selected, and the model is trained based on MLE (Maximum Likelihood Estimation) and under the MPE (Minimum Phone Error) discriminative training criterion. The filler model is used to absorb speech other than the prefix word. Accordingly, performing prefix word detection on the audio data from the speech start point onward in step S3 may further include: based on a parallel search network that includes a prefix word model and a filler model, detecting whether prefix word speech is present in the audio data from the speech start point onward.
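As a toy illustration of choosing between a prefix word branch and a filler branch, the sketch below picks the candidate W that maximizes log P(X|W) + log P(W). The term log P(X) is identical for every candidate and is therefore dropped. The function name, the candidate labels, and all score values are invented for illustration; a real parallel search network decodes over frames rather than comparing two precomputed totals.

```python
import math


def best_word_sequence(acoustic_logp, lm_logp):
    """Return the candidate maximizing log P(X|W) + log P(W)."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + lm_logp[w])


# Hypothetical log scores for one speech segment.
acoustic_logp = {"call": -10.0, "<filler>": -10.5}
lm_logp = {"call": math.log(0.3), "<filler>": math.log(0.7)}

best = best_word_sequence(acoustic_logp, lm_logp)
best_uniform = best_word_sequence(acoustic_logp, {"call": 0.0, "<filler>": 0.0})
```

With these numbers the prior tips the decision to the filler branch even though the prefix word has the better acoustic score, while a uniform prior selects the prefix word. This shows how P(W) acts as a penalty on the competing models.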

Those skilled in the art will understand that the present invention may also detect prefix word speech using other word detection means commonly used in the field of voice interaction; the embodiments of the present invention place no limitation on this.

Step S4: Obtain the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected, so as to combine the prefix word with the voice instruction.

Here, the operation of step S1 continues uninterrupted after prefix word speech is detected (that is, after a successful wake-up), and the action of obtaining the voice instruction is triggered by the successful wake-up. In this step, the speech segment in the audio data is obtained directly from the circular buffer after wake-up succeeds.

To make the speech segment easy to obtain, after prefix word speech is detected, the storage address in the circular buffer of the end point (rear endpoint) of the prefix word speech and the length of the prefix word speech may be recorded. From these, the storage address in the circular buffer of the start point (front endpoint) of the prefix word speech can be calculated, so that the speech segment in the audio data beginning at the start point of the prefix word speech can be obtained accurately.
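The address calculation described above reduces to modular arithmetic on the circular buffer. In this sketch the function name is hypothetical, and it is assumed that addresses and lengths are counted in samples and that the recorded end address is inclusive; the patent fixes neither convention.

```python
def prefix_start_address(end_addr: int, length: int, capacity: int) -> int:
    """Storage address of the first sample of the prefix word speech.

    end_addr: address of the last sample of the prefix word speech (inclusive)
    length:   number of samples occupied by the prefix word speech
    capacity: size of the circular buffer in samples
    """
    if not 0 < length <= capacity:
        raise ValueError("length must be between 1 and capacity")
    # Python's % always returns a non-negative result, so wrap-around is handled.
    return (end_addr - length + 1) % capacity


addr = prefix_start_address(end_addr=2, length=5, capacity=8)
```

Here the prefix word speech occupies addresses 6, 7, 0, 1, 2, so the calculation wraps past the start of the buffer and yields 6.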

Step S5: Perform speech recognition on the voice instruction to obtain a speech recognition result.

Step S6: Determine whether the speech recognition result is valid. If it is valid, perform the operation corresponding to the speech recognition result; if it is invalid, end this voice interaction. In the latter case, the user may be notified that the interaction failed and reminded to input a correct voice instruction again.

The voice interaction method of the present invention takes the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, and uses a word that reflects the type of action to be performed as the prefix word, so that the prefix word and the voice instruction are combined. This effectively avoids the failure to obtain a valid speech recognition result that forced segmentation of the voice instruction can cause, and so improves the efficiency of voice interaction. Moreover, because the prefix words follow everyday language habits, the user does not need to memorize a fixed wake-up word. The user only needs to say the action to be performed in the usual way to both wake up voice interaction and have the action executed, which further improves the convenience of voice interaction.

To improve the accuracy of start point detection, prefix word detection, and speech recognition, and to improve the noise immunity of the voice interaction method of the present invention, the method may also perform noise reduction on the audio data before performing endpoint detection, so as to obtain clean audio data. In this case, step S3 specifically performs prefix word detection on the clean audio data from the speech start point onward, and step S4 specifically obtains the speech segment in the clean audio data that begins at the start point of the prefix word speech as the voice instruction.

Determining whether the speech recognition result is valid in step S6 may further include the following steps:

Step S61: Load the command word network.

The method of the present invention supports the user extending the command word network as needed.

Step S62: Determine whether a command word matching the speech recognition result exists in the command word network and, if so, determine that the speech recognition result is valid.

Here, a matching score between the speech recognition result and each command word may be obtained by computing the similarity between them. If a matching score is greater than a set threshold, the speech recognition result is considered valid; otherwise, the speech recognition result is considered invalid.
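The similarity-and-threshold check described above can be sketched with Python's standard `difflib.SequenceMatcher`. The patent does not name a similarity measure, so the character-level ratio used here, the function name `match_command`, the flat list standing in for the command word network, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def match_command(result, command_words, threshold=0.8):
    """Return (best_command, score); best_command is None if score does not exceed threshold."""
    best, best_score = None, 0.0
    for command in command_words:
        score = SequenceMatcher(None, result, command).ratio()
        if score > best_score:
            best, best_score = command, score
    return (best if best_score > threshold else None), best_score


# Hypothetical command word list; a user could extend it as needed.
commands = ["call Zhang San", "open QQ", "open WeChat"]
hit, hit_score = match_command("call Zhang San", commands)
miss, miss_score = match_command("play some music", commands)
```

A recognition result that matches no command word closely enough returns `None`, which corresponds to the invalid case in which the interaction ends and the user may be prompted to try again.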

The above instruction acquisition termination event may be set as needed and includes, for example: the speech segment ending, and the speech segment having lasted for a set time. Therefore, after prefix word speech is detected, speech recognition, end point detection, and duration timing may be performed simultaneously on the speech segment in the audio data that begins at the start point of the prefix word speech. Those skilled in the art may set the set time to a fixed value according to the application, or let it be determined by user input. The set time is typically chosen in the range of 800 ms to 2000 ms, for example 1000 ms. The speech segment ending means that the end point (rear endpoint) of the speech segment has been detected. If the end point has still not been detected when the speech segment has lasted for the set time, the speech segment is likewise considered to have ended. Here, the beginning and end of each speech segment correspond to its start point and end point respectively: the start point is the boundary frame from a non-speech segment to a speech segment, and the end point is the boundary frame from a speech segment to a non-speech segment. A speech segment is therefore obtained when a continuous run of frames of a certain length all meet the requirement for speech.
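The two termination events can be sketched as a scan over per-frame speech flags. The function name, the boolean-flag input, and the frame duration are assumptions; the flags are taken to begin at the start point of the prefix word speech, so the first frame is expected to be speech.

```python
def instruction_end_frame(is_speech, frame_ms, set_time_ms):
    """Return the index one past the last frame of the voice instruction.

    is_speech:   per-frame speech/non-speech flags from the prefix word start point
    frame_ms:    duration of one frame in milliseconds
    set_time_ms: set time after which the segment is treated as ended
    """
    max_frames = set_time_ms // frame_ms
    for i, flag in enumerate(is_speech):
        if not flag:             # end point: boundary from speech to non-speech
            return i
        if i + 1 >= max_frames:  # the segment has lasted for the set time
            return i + 1
    # Flags ran out before either event; a live system would keep waiting for audio.
    return len(is_speech)


ended_by_silence = instruction_end_frame([True] * 5 + [False] * 3, frame_ms=100, set_time_ms=1000)
ended_by_timeout = instruction_end_frame([True] * 20, frame_ms=100, set_time_ms=1000)
```

With 100 ms frames and a 1000 ms set time, the first call stops at the end point after five speech frames, and the second stops after ten frames because no end point was found in time.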

Corresponding to the above voice interaction method, the voice interaction system of the present invention, as shown in Fig. 2, includes a recording module 1, an endpoint detection module 2, a prefix word detection module 3, a voice activity detection module 4, a speech recognition module 5, a judgment module 6, and an execution module 7. The recording module 1 is used to record audio data input by a user. The endpoint detection module 2 is used to perform endpoint detection on the audio data until a speech start point is detected. The prefix word detection module 3 is used to perform prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed. The voice activity detection module 4 is used to obtain the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected. The speech recognition module 5 is used to perform speech recognition on the voice instruction to obtain a speech recognition result. The judgment module 6 is used to determine whether the speech recognition result is valid. The execution module 7 is used to perform the operation corresponding to a valid speech recognition result.

The system of the present invention may further include a noise reduction module (not shown). The noise reduction module is connected to the recording module 1 and to the endpoint detection module 2, and is used to perform noise reduction on the audio data recorded by the recording module 1 and to send the noise-reduced audio data to the endpoint detection module 2.

Further, the prefix word detection module 3 may also be used to detect, based on a parallel search network that includes a prefix word model and a filler model, whether the prefix word speech is present in the audio data from the speech start point onward.

Further, the judgment module 6 may also be used to determine whether a command word matching the speech recognition result exists in the command word network and, if so, to determine that the speech recognition result is valid.

The above instruction acquisition termination event may, for example, include the speech segment ending and the speech segment having lasted for a set time. For this purpose, the endpoint detection module 2 may also be used to detect the end point of the speech segment and the duration of the speech segment.

The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the description of the method embodiment. The system embodiment described above is merely illustrative. The modules or units described as separate components may or may not be physically separate, and the components shown as modules or units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The construction, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The above are only preferred embodiments of the present invention, and the scope of implementation of the present invention is not limited to what is shown in the drawings. Any change made according to the concept of the present invention, or any modification into an equivalent embodiment, that does not depart from the spirit covered by the specification and drawings shall fall within the scope of protection of the present invention.

Claims

1. A voice interaction method, characterized by including:

Recording audio data input by a user;

Performing endpoint detection on the audio data until a speech start point is detected;

Performing prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed, and the prefix word is combined with the voice instruction that expresses the user's intent;

Obtaining the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected;

Performing speech recognition on the voice instruction to obtain a speech recognition result;

Determining whether the speech recognition result is valid and, if so, performing the operation corresponding to the speech recognition result.

2. The method according to claim 1, characterized in that the method further includes:

Performing noise reduction on the audio data before performing endpoint detection on the audio data.

3. The method according to claim 1, characterized in that performing prefix word detection on the audio data from the speech start point onward includes:

Based on a parallel search network that includes a prefix word model and a filler model, detecting whether the prefix word speech is present in the audio data from the speech start point onward.

4. The method according to claim 1, characterized in that determining whether the speech recognition result is valid includes:

Determining whether a command word matching the speech recognition result exists in a command word network and, if so, determining that the speech recognition result is valid.

5. The voice interaction method according to any one of claims 1 to 4, characterized in that the instruction acquisition termination event includes: the speech segment ending, and the speech segment having lasted for a set time.
6. A voice interaction system, characterized by including:

A recording module, configured to record audio data input by a user;

An endpoint detection module, configured to perform endpoint detection on the audio data until a speech start point is detected;

A prefix word detection module, configured to perform prefix word detection on the audio data from the speech start point onward until prefix word speech is detected, wherein the prefix word is a word that reflects the type of action to be performed, and the prefix word is combined with the voice instruction that expresses the user's intent;

A voice activity detection module, configured to obtain the speech segment in the audio data that begins at the start point of the prefix word speech as the voice instruction, until an instruction acquisition termination event is detected;

A speech recognition module, configured to perform speech recognition on the voice instruction to obtain a speech recognition result;

A judgment module, configured to determine whether the speech recognition result is valid; and

An execution module, configured to perform the operation corresponding to a valid speech recognition result.

7. The system according to claim 6, characterized in that the system further includes:

A noise reduction module, connected to the recording module and to the endpoint detection module, configured to perform noise reduction on the audio data recorded by the recording module and to send the noise-reduced audio data to the endpoint detection module.

8. The system according to claim 6, characterized in that the prefix word detection module is specifically configured to detect, based on a parallel search network that includes a prefix word model and a filler model, whether the prefix word speech is present in the audio data from the speech start point onward.

9. The system according to claim 6, characterized in that the judgment module is specifically configured to determine whether a command word matching the speech recognition result exists in a command word network and, if so, to determine that the speech recognition result is valid.

10. The system according to any one of claims 6 to 9, characterized in that the instruction acquisition termination event includes: the speech segment ending, and the speech segment having lasted for a set time.