CN108428448A

CN108428448A - A kind of sound end detecting method and audio recognition method

Info

Publication number: CN108428448A
Application number: CN201710076757.2A
Authority: CN
Inventors: 范利春
Original assignee: Yutou Technology Hangzhou Co Ltd
Current assignee: Yutou Technology Hangzhou Co Ltd
Priority date: 2017-02-13
Filing date: 2017-02-13
Publication date: 2018-08-21
Also published as: WO2018145584A1; TWI659409B; TW201830377A

Abstract

The invention discloses a kind of sound end detecting method and audio recognition methods, belong to technical field of voice recognition；Method includes：It extracts the phonetic feature of voice data and is input in silence model；Silence model is according to phonetic feature output label for indicating whether voice data is mute frame；The sound end of one section of voice is confirmed according to the label of the voice data of successive frame：In unactivated state, if the length for the voice data of non-mute frame continuously occur is more than a preset first threshold, judge that first frame is the starting endpoint that the voice data of non-mute frame is one section of voice；In state of activation, if the continuous length for the voice data of mute frame occur is more than a preset second threshold, judge that first frame is the end caps that the voice data of mute frame is one section of voice.The advantageous effect of above-mentioned technical proposal is：Solve the problems, such as that speech terminals detection is inaccurate and excessively high for detection environmental requirement in the prior art.

Description

A kind of sound end detecting method and audio recognition method

Technical field

The present invention relates to technical field of voice recognition more particularly to a kind of sound end detecting method and speech recognition sides Method.

Background technology

With the development of speech recognition technology, the application of speech recognition in people's lives is more and more extensive.Work as user When using speech recognition technology in handheld device, it will usually coordinate speech recognition button to control the voice paragraph for needing to identify Beginning and end time, but when user be in smart home environment use speech recognition technology when, can because away from From pick up facility the beginning endpoint and end caps of voice paragraph can not be determined manually by the way of button cooperation farther out, this When just need another mode to carry out automatic decision namely speech terminals detection technology to time of voice beginning and end (Voice Active Detection, VAD).

Traditional end-point detecting method is based primarily upon sub-belt energy progress, that is, calculates per frame voice data in a certain frequency range Energy, and be compared to preset energy threshold to judge the beginning endpoint of voice and end caps.This endpoint inspection Survey method is more demanding to detection environment, speech recognition must be carried out in quiet environment and just can guarantee the language detected The accuracy of voice endpoint.And in relatively noisy noise circumstance, different types of noise can generate different sub-belt energies It influences, to be brought to above-mentioned end-point detecting method in the noise circumstance of larger interference, especially low signal-to-noise ratio and non-stationary, Very big interference can be caused to the calculating of sub-belt energy, so that final testing result is inaccurate.And only guaranteed end-speech The accuracy of point detection just can guarantee that voice is correctly collected, and then correct identified.The result of end-point detection is not allowed to truly have can Voice can be made to be truncated or the more noises of typing, can cause speech recognition that cannot be decoded to whole word, leaked to bring The problems such as report or wrong report, or even the decoding whole mistake of whole word can be caused, reduce the accuracy of voice recognition result.

Invention content

According to the above-mentioned problems in the prior art, a kind of sound end detecting method and audio recognition method are now provided Technical solution, it is intended to solve that speech terminals detection in the prior art is inaccurate and environmental requirement is excessively high asks for detection Topic.Above-mentioned technical proposal specifically includes：

A kind of sound end detecting method, wherein training in advance forms one for judging whether voice data is mute frame Silence model, then obtain one section of voice of the externally input voice data including successive frame, and execute following steps Suddenly：

Step S1 extracts the phonetic feature of voice data described in each frame, and the phonetic feature is input to described quiet In sound model；

Step S2, the silence model are associated with the mark of voice data described in each frame according to phonetic feature output Label, the label is for indicating whether the voice data is mute frame；

Step S3 confirms the sound end of one section of voice according to the label of the voice data of successive frame：

When the pick up facility for acquiring the voice is in unactivated state, if continuously there is the voice of non-mute frame The length of data is more than a preset first threshold, then judges that first frame be the voice data of the non-mute frame is one section The starting endpoint of the voice；

When the pick up facility for acquiring the voice is active, if continuously there is the voice of the mute frame The length of data is more than a preset second threshold, then judges that first frame be the voice data of the mute frame is one section of institute The end caps of predicate sound.

Preferably, sound end detecting method, wherein training forms the silence model in advance by following methods：

Step A1 inputs preset multiple training voice data, and extracts the language of each training voice data Sound feature；

Step A2 carries out automatic marking for being trained described in every frame according to the corresponding phonetic feature with voice data Operation obtains a label of voice data described in corresponding every frame；The label is for indicating voice data described in a corresponding frame For mute frame or non-mute frame；

Step A3 obtains the silence model according to training voice data and the corresponding label training；

It is provided with first node and second node on the output layer of the silence model；

The label that the first node is used to indicate to correspond to the mute frame；

The label that the second node is used to indicate to correspond to the non-mute frame.

Preferably, sound end detecting method, wherein corresponding externally input each training voice data is equal A mark text is pre-set, to mark the corresponding content of text of the training voice data；

Then the step A2 is specifically included：

Step A21 obtains the phonetic feature and the corresponding mark text；

Step A22, the acoustic model formed using advance training is to the phonetic feature and the corresponding mark text Pressure alignment is carried out, to obtain the output label that phonetic feature described in every frame corresponds to phone；

Step A23, to being post-processed with voice data by the training for forcing alignment, by mute phone The output label be mapped on the label for indicating the mute frame, and by the output label of non-mute phone It is mapped on the label for indicating the non-mute frame.

Preferably, the sound end detecting method, wherein in the step A22, the acoustic mode of training formation in advance Type is gauss hybrid models-Hidden Markov Model, or is deep neural network-Hidden Markov Model.

Preferably, sound end detecting method, wherein the silence model is the depth god for including multilayer neural network Through network model.

Preferably, sound end detecting method, wherein wrapped before every two layers neural network of the silence model Include at least one nonlinear transformation.

Preferably, sound end detecting method, wherein every layer of neural network of the silence model is full connection Neural network either convolutional neural networks or recurrent neural network.

Preferably, sound end detecting method, wherein the silence model is the depth god for including multilayer neural network Through network model；

The label that the second node is used to indicate to correspond to non-mute frame；

Then the step S2 is specifically included：

Step S21 passes through the forward calculation of neural network described in multilayer after the phonetic feature inputs the silence model It respectively obtains the first value for being associated with the first node in the output layer and is associated with the second of the second node Value；

First value is compared by step S22 with second value：

If first value is more than second value, using the first node as described in the voice data Label simultaneously exports；

If first value is less than second value, using the second node as described in the voice data Label simultaneously exports.

A kind of audio recognition method, wherein detect to obtain need to identify one using above-mentioned sound end detecting method The starting endpoint of Duan Yuyin and the end caps.

The advantageous effect of above-mentioned technical proposal is：A kind of sound end detecting method is provided, can be solved in the prior art Speech terminals detection is inaccurate and for the excessively high problem of detection environmental requirement, therefore promotes the accurate of speech terminals detection Property, the wide usage of extension endpoint detection method, to improve entire speech recognition process.

Description of the drawings

Fig. 1 is a kind of overall procedure schematic diagram of sound end detecting method in the preferred embodiment of the present invention；

Fig. 2 is in the preferred embodiment of the present invention, and training forms the flow diagram of silence model；

Fig. 3 on the basis of Fig. 2, is marked automatically with voice data to training in the preferred embodiment of the present invention The flow diagram of note；

Fig. 4 is to include the structural schematic diagram of the silence model of multilayer neural network in the preferred embodiment of the present invention；

Fig. 5 on the basis of Fig. 1, is handled and is exported and be associated with voice data in the preferred embodiment of the present invention The flow diagram of label.

Specific implementation mode

Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete Site preparation describes, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.It is based on Embodiment in the present invention, those of ordinary skill in the art obtained under the premise of not making creative work it is all its His embodiment, shall fall within the protection scope of the present invention.

It should be noted that in the absence of conflict, the feature in embodiment and embodiment in the present invention can phase Mutually combination.

The invention will be further described in the following with reference to the drawings and specific embodiments, but not as limiting to the invention.

According to the above-mentioned problems in the prior art, a kind of sound end detecting method is now provided, in this method, in advance Training formed one for judge voice data whether be mute frame silence model, then obtain it is externally input include successive frame Voice data one section of voice, and execute following step as shown in Figure 1：

Step S1, extracts the phonetic feature of each frame voice data, and phonetic feature is input in silence model；

Step S2, silence model are associated with the label of each frame voice data according to phonetic feature output, and label is used for table Show whether voice data is mute frame；

When the pick up facility for acquiring voice is in unactivated state, if continuously there is the length of the voice data of non-mute frame Degree is more than a preset first threshold, then judges that first frame is the starting endpoint that the voice data of non-mute frame is one section of voice；

When the pick up facility for acquiring voice is active, if the continuous length for the voice data of mute frame occur is big In a preset second threshold, then judge that first frame is the end caps that the voice data of mute frame is one section of voice.

Specifically, in the present embodiment, it is initially formed a silence model, which can be used for judging in one section of voice Every frame voice data whether be mute frame.So-called mute frame refers to not including the efficient voice for needing to carry out speech recognition Voice data；So-called non-mute frame refers to the voice data for including the efficient voice for needing to carry out speech recognition.

Then, in the present embodiment, after training forms silence model, each frame voice in externally input one section of voice is extracted The phonetic feature of data, and the phonetic feature extracted is input in silence model, which is associated with output Label.In the present embodiment, one co-exists in two labels, is respectively used to indicate that the frame voice data is mute frame/non-mute frame.

In the present embodiment, after having obtained the mute and non-mute classification of each frame voice data, then sound end is judged. However the non-mute frame of a frame not occur can think that one section of voice starts, and a frame mute frame can not occur and be considered as one section of language Sound terminates, but needs to judge according to the frame number of continuous mute frame/non-mute frame the starting endpoint and end of one section of voice Point.Specially：

In the preferred embodiment of the present invention, above-mentioned first threshold can be taken with value 30, above-mentioned second threshold Value 50.I.e.：

When the pick up facility for acquiring voice is in unactivated state, if the length for non-mute frame continuously occur is more than 30 (the continuous non-mute frame of 30 frame occur), then judge the non-mute frame of first frame for the starting endpoint of this section of voice.

When the pick up facility for acquiring voice is active, (occur if the continuous length for mute frame occur is more than 50 Continuous 50 frame mute frame), then judge first frame mute frame for the end caps of this section of voice.

In another preferred embodiment of the present invention, above-mentioned first threshold equally can be with value 70, above-mentioned second threshold It can be with value 50.

In the other embodiment of the present invention, taking for first threshold and second threshold can be freely set according to actual conditions Value, to meet the needs of speech terminals detection under varying environment.

In the preferred embodiment of the present invention, it can be formed by following methods as shown in Figure 2 training in advance mute Model：

Step A1 inputs preset multiple training voice data, and the voice for extracting each training voice data is special Sign；

Step A2 carries out automatic marking operation for every frame training according to corresponding phonetic feature with voice data, obtains A corresponding label per frame voice data；Label is for indicating that corresponding frame voice data is mute frame or non-mute frame；

Step A3 carries out automatic marking operation for every frame training according to corresponding phonetic feature with voice data, obtains A corresponding label per frame voice data；Label is for indicating that corresponding frame voice data is mute frame or non-mute frame；

It is provided with first node and second node on the output layer of silence model；

First node is used to indicate the label of corresponding mute frame；

The label that second node is used to indicate to correspond to non-mute frame.

Specifically, in the present embodiment, preset multiple training voice data are inputted first.So-called training voice number According to referring to the voice data that its content of text is known in advance.During the training can be according to having trained with voice data in advance The Chinese speech data set of the speech recognition system of text extracts to obtain, and possesses the mark text of corresponding training voice data This.That is the voice applied when the training voice data inputted in above-mentioned steps A1 and the acoustic model of training subsequent speech recognition Data are identical.

In the present embodiment, after inputting training voice data, its language is extracted respectively with voice data for each training Sound feature.The extraction of phonetic feature can also use the phonetic feature extracted when the acoustic model of trained speech recognition.It is common Phonetic feature may include mel-frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC), Perceive linear prediction (Perceptual Linear Predictive, PLP) or filter group (Filter-Bank, FBANK) Feature.Similarly, in other embodiments of the invention, other similar phonetic features may be used to complete silence model Training.

In the present embodiment, in above-mentioned steps A2, before the training input parameter as silence model, it is necessary first to upper It states training and carries out automatic marking operation with voice data, so that per frame speech data frame alignment.In above-mentioned automatic marking operation, often One frame voice data can all obtain a label, and the processing method of above-mentioned automatic marking can hereinafter be described in detail, by automatic After labeling operation, so that it may to train silence model.

In the preferred embodiment of the present invention, corresponding externally input each training pre-sets a mark with voice data Explanatory notes sheet, to mark the corresponding content of text of training voice data；

Then above-mentioned steps A2 is specific as shown in figure 3, may include：

Step A21 obtains phonetic feature and corresponding mark text；

Step A22, the acoustic model formed using advance training force phonetic feature and corresponding mark text Alignment, to obtain the output label that every frame phonetic feature corresponds to phone；

Step A23 post-processes the training by pressure alignment with voice data, by the output mark of mute phone Label are mapped on the label for indicating mute frame, and the output label of non-mute phone is mapped to the label for indicating non-mute frame On.

Specifically, in the present embodiment, automatic marking operation is carried out with voice data to training according to craft, then needs to consume Take a large amount of cost of labor, and for noise be labeled in the annotation results of different mark personnel also will appear it is inconsistent The case where, to influence the process of follow-up training pattern.Therefore it is provided in technical solution of the present invention a kind of efficiently feasible automatic Mask method.

In the above method, the phonetic feature of each frame training voice data and corresponding mark text are obtained first, Pressure alignment then is carried out to phonetic feature and mark text.

In the present embodiment, the acoustic model (acoustic model that i.e. training is formed in advance) of subsequent speech recognition can be utilized right Phonetic feature and mark text carry out pressure alignment.The acoustic model of speech recognition in the present invention can be Gaussian Mixture mould Type-Hidden Markov Model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM), can also It is deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN- ) or other suitable models HMM.Modeling unit in above-mentioned acoustic model is phone (phone) rank, such as up and down Literary independent phone (Context Independent Phone, ci-phone) or context-sensitive phone (Context Dependent Phone, cd-phone).It carries out forcing alignment operation can will be to training voice using above-mentioned acoustic model Data frame alignment is to phone rank.

In the present embodiment, in above-mentioned steps A23, to by forcing the training of alignment to be carried out with voice data post-processing it Afterwards, you can obtain the voice data that frame corresponds to mute label.In above-mentioned post-processing operation, usually regard part phone as quiet Sound phone regards other phones as non-mute phone, and after above-mentioned mapping, each frame voice data can be with quiet Sound/non-mute label is mapped.

In the preferred embodiment of the present invention, followed by the label energy of phonetic feature and the frame alignment above obtained Enough train silence model.Above-mentioned silence model can be the deep neural network model for including multilayer neural network.It is above-mentioned quiet Every layer of sound model can be the neural network connected entirely, convolutional neural networks, recurrent neural network etc., every two layers of neural network Can include one or more nonlinear transformations, such as signoid nonlinear transformations, tanh nonlinear transformations, maxpool before Nonlinear transformation, RELU nonlinear transformations or softmax nonlinear transformations.

In the preferred embodiment of the present invention, as shown in figure 4, the silence model includes multilayer neural network 41, and Including an output layer 42.First node 421 and second node 422 are set in the output layer 42 of the silence model.Above-mentioned first Node 421 is used to indicate the label of corresponding mute frame, the label that second node 422 is used to indicate to correspond to non-mute frame.It is exporting Softmax nonlinear transformations or other nonlinear transformations behaviour can be carried out on the first node 421 and second node 422 of layer 42 Make, can not also be operated using nonlinear transformation.

Then in preferred embodiment of the invention, above-mentioned steps S2 is specific as shown in figure 5, including：

Step S21 respectively obtains output after phonetic feature inputs silence model by the forward calculation of multilayer neural network It is associated with the first value of first node in layer and is associated with the second value of second node；

First value is compared by step S22 with the second value：

If the first value is more than the second value, using first node as the label of voice data and output；

If the first value is less than the second value, using second node as the label of voice data and output.

Specifically, in the present embodiment, phonetic feature is input in trained silence model, multilayer neural network carries out Forward calculation, and finally obtain the value of two output nodes (first node and second node) in output layer, i.e., first takes Value and the second value.Then compare the size of the first value and the second value：

If the first value is larger, label and output of the first node as voice data are selected, i.e. voice data at this time For mute frame；

Correspondingly, label and output of the second node as voice data are selected if the second value is larger, i.e. language at this time Sound data are non-mute frame.

In the preferred embodiment of the present invention, in an entire flow following article of above-mentioned sound end detecting method It is described：

Prepare the good Chinese speech recognizing system of a precondition first, the speech recognition system selected here has Chinese Voice data collection, and possess the mark text of voice data.

The training that the acoustic model of above-mentioned speech recognition system uses is characterized as FBANK features with voice, therefore training is quiet FBANK features are still used when sound model.

Training is extracted into phonetic feature with voice data, and with being carried out in corresponding mark text input speech recognition system Alignment is forced, each frame phonetic feature is corresponded into phone grade distinguishing label, is then mapped to non-mute phone in alignment result On non-mute label, mute phone is mapped on mute label, the training data label to complete silence model prepares.

Then, silence model is formed using above-mentioned training voice data and corresponding label training

When the silence model formed using above-mentioned training carries out the detection of sound end, by each frame in one section of voice It is sent into trained silence model after voice data extraction phonetic feature, is exported after the forward calculation of multilayer neural network First value of first node and the second value of second node, then compare the size of two values, larger pair of output value Label of the label for the node answered as the frame voice data, to indicate the frame voice data as mute frame/non-mute frame.

Finally, mute frame/non-mute frame of successive frame is judged whether：

When the pick up facility for acquiring voice is in unactivated state, if there are the continuous non-mute frames of 30 frame, by the company The starting endpoint of the first frame voice data voice to be identified as whole section in the non-mute frame of continuous 30 frames；

When acquire voice pick up facility be active when, if there are continuous 50 frame mute frame, by this continuous 50 The end caps of the first frame voice data voice to be identified as whole section in frame mute frame.

In the preferred embodiment of the present invention, a kind of audio recognition method is also provided, wherein being examined using above-mentioned sound end Survey method detects to obtain the starting endpoint and end caps for one section of voice for needing to identify, to determine the model for the voice for needing to identify It encloses, then this section of voice is identified using existing speech recognition technology again.

The foregoing is merely preferred embodiments of the present invention, are not intended to limit embodiments of the present invention and protection model It encloses, to those skilled in the art, should can appreciate that all with made by description of the invention and diagramatic content Equivalent replacement and obviously change obtained scheme, should all be included within the scope of the present invention.

Claims

1. a kind of sound end detecting method, which is characterized in that training in advance forms one for judging whether voice data is quiet The silence model of sound frame then obtains one section of voice of the externally input voice data including successive frame, and under execution State step：

Step S1 extracts the phonetic feature of voice data described in each frame, and the phonetic feature is input to the mute mould In type；

Step S2, the silence model are associated with the label of voice data described in each frame, institute according to phonetic feature output Label is stated for indicating whether the voice data is mute frame；

When the pick up facility for acquiring the voice is in unactivated state, if continuously there is the voice data of non-mute frame Length be more than a preset first threshold, then judge that first frame be the voice data of the non-mute frame is described in one section The starting endpoint of voice；

When the pick up facility for acquiring the voice is active, if continuously there is the voice data of the mute frame Length be more than a preset second threshold, then judge that first frame be the voice data of the mute frame is one section of institute's predicate The end caps of sound.

2. sound end detecting method as described in claim 1, which is characterized in that by following methods, training forms institute in advance State silence model：

Step A2 carries out automatic marking operation for being trained described in every frame according to the corresponding phonetic feature with voice data, Obtain a label of voice data described in corresponding every frame；The label is for indicating that voice data described in a corresponding frame is mute Frame or non-mute frame；

3. sound end detecting method as claimed in claim 2, which is characterized in that corresponding externally input each training A mark text is pre-set with voice data, to mark the corresponding content of text of the training voice data；

Then the step A2 is specifically included：

Step A21 obtains the phonetic feature and the corresponding mark text；

Step A22, the acoustic model formed using advance training carry out the phonetic feature and the corresponding mark text Alignment is forced, to obtain the output label that phonetic feature described in every frame corresponds to phone；

Step A23, to being post-processed with voice data by the training for forcing alignment, by the institute of mute phone It states output label to be mapped on the label for indicating the mute frame, and the output label of non-mute phone is mapped Onto the label for indicating the non-mute frame.

4. sound end detecting method as claimed in claim 3, which is characterized in that in the step A22, training in advance is formed The acoustic model be gauss hybrid models-Hidden Markov Model, or be deep neural network-Hidden Markov mould Type.

5. sound end detecting method as described in claim 1, which is characterized in that the silence model be include multilayer nerve The deep neural network model of network.

6. sound end detecting method as claimed in claim 5, which is characterized in that every two layers god of the silence model Through including at least one nonlinear transformation before network.

7. sound end detecting method as claimed in claim 5, which is characterized in that every layer of nerve of the silence model Network is the neural network that connects entirely either convolutional neural networks or recurrent neural network.

8. sound end detecting method as claimed in claim 2, which is characterized in that the silence model be include multilayer nerve The deep neural network model of network；

Then the step S2 is specifically included：

Step S21 after the phonetic feature inputs the silence model, is distinguished by the forward calculation of neural network described in multilayer The second value for obtaining the first value for being associated with the first node in the output layer and being associated with the second node；

First value is compared by step S22 with second value：

If first value is more than second value, using the first node as the label of the voice data And it exports；

If first value is less than second value, using the second node as the label of the voice data And it exports.

9. a kind of audio recognition method, which is characterized in that using the sound end detecting method inspection as described in claim 1-8 Measure the starting endpoint for one section of voice for needing to identify and the end caps.