CN108428448A - Voice endpoint detection method and speech recognition method - Google Patents
Voice endpoint detection method and speech recognition method
- Publication number
- CN108428448A (application number CN201710076757.2A)
- Authority
- CN
- China
- Prior art keywords
- voice data
- frame
- label
- voice
- mute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/04 — Speech recognition; Segmentation; Word boundary detection
- G10L15/05 — Speech recognition; Word boundary detection
- G10L15/16 — Speech recognition; Speech classification or search using artificial neural networks
- G10L25/87 — Speech or voice analysis techniques; Detection of discrete points within a voice signal
Abstract
The invention discloses a voice endpoint detection method and a speech recognition method, belonging to the technical field of speech recognition. The method comprises: extracting the speech features of voice data and inputting them into a silence model; the silence model outputting, according to the speech features, a label indicating whether the voice data is a silent frame; and confirming the voice endpoints of a speech segment according to the labels of consecutive frames of voice data. In the inactive state, if the run of consecutive frames of non-silent voice data is longer than a preset first threshold, the first frame of the non-silent voice data is judged to be the starting endpoint of the speech segment; in the active state, if the run of consecutive frames of silent voice data is longer than a preset second threshold, the first frame of the silent voice data is judged to be the ending endpoint of the speech segment. The beneficial effect of this technical solution is that it solves the problems in the prior art of inaccurate voice endpoint detection and excessively demanding detection environments.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice endpoint detection method and a speech recognition method.
Background art
With the development of speech recognition technology, speech recognition is used ever more widely in daily life. When using speech recognition on a handheld device, the user usually presses a button to mark the beginning and end of the speech passage to be recognized. In a smart-home environment, however, the user is often too far from the sound pickup device to mark the starting and ending endpoints of the speech passage manually with a button, so the beginning and end of the speech must be determined automatically, i.e., by voice activity detection (Voice Activity Detection, VAD) technology.
Traditional endpoint detection is based mainly on sub-band energy: the energy of each frame of voice data within a certain frequency band is calculated and compared with a preset energy threshold to determine the starting and ending endpoints of the speech. This endpoint detection method places high demands on the detection environment; speech recognition must be performed in a quiet environment to guarantee the accuracy of the detected voice endpoints. In noisier environments, different types of noise affect the sub-band energies differently, so that under strong interference, especially low signal-to-noise ratio and non-stationary noise, the calculation of the sub-band energy is severely disturbed and the final detection result becomes inaccurate. Yet only accurate endpoint detection guarantees that the speech is collected correctly and hence recognized correctly. Inaccurate endpoint detection may truncate the speech or record excessive noise, so that the recognizer cannot decode the whole utterance, causing missed or false detections, or even causing the entire utterance to be decoded incorrectly, which reduces the accuracy of the recognition result.
Summary of the invention
In view of the above problems in the prior art, a voice endpoint detection method and a speech recognition method are now provided, which aim to solve the problems in the prior art of inaccurate voice endpoint detection and excessively demanding detection environments. The technical solution specifically comprises:
A voice endpoint detection method, wherein a silence model for judging whether voice data is a silent frame is trained in advance, a speech segment consisting of consecutive frames of externally input voice data is then obtained, and the following steps are executed:
Step S1, extracting the speech feature of each frame of the voice data, and inputting the speech feature into the silence model;
Step S2, the silence model outputting, according to the speech feature, a label associated with each frame of the voice data, the label indicating whether the voice data is a silent frame;
Step S3, confirming the voice endpoints of the speech segment according to the labels of the consecutive frames of voice data:
when the sound pickup device collecting the speech is in an inactive state, if the run of consecutive frames of non-silent voice data is longer than a preset first threshold, judging the first frame of the non-silent voice data to be the starting endpoint of the speech segment;
when the sound pickup device collecting the speech is in an active state, if the run of consecutive frames of silent voice data is longer than a preset second threshold, judging the first frame of the silent voice data to be the ending endpoint of the speech segment.
Preferably, in the voice endpoint detection method, the silence model is trained in advance by the following method:
Step A1, inputting a plurality of preset training voice data, and extracting the speech feature of each training voice datum;
Step A2, performing an automatic labeling operation on each frame of the training voice data according to the corresponding speech feature, to obtain a label corresponding to each frame of the voice data, the label indicating whether the corresponding frame of voice data is a silent frame or a non-silent frame;
Step A3, training the silence model from the training voice data and the corresponding labels;
wherein a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame.
Preferably, in the voice endpoint detection method, an annotation text is preset for each externally input training voice datum to record its corresponding text content;
the step A2 then specifically comprises:
Step A21, obtaining the speech feature and the corresponding annotation text;
Step A22, performing forced alignment on the speech feature and the corresponding annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of the speech feature;
Step A23, post-processing the forcibly aligned training voice data by mapping the output labels of silent phones onto the label indicating the silent frame, and mapping the output labels of non-silent phones onto the label indicating the non-silent frame.
Preferably, in the voice endpoint detection method, in the step A22, the acoustic model trained in advance is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.
Preferably, in the voice endpoint detection method, the silence model is a deep neural network model comprising multiple layers of neural networks.
Preferably, in the voice endpoint detection method, at least one nonlinear transformation is included between every two layers of the neural networks of the silence model.
Preferably, in the voice endpoint detection method, each layer of the neural networks of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.
Preferably, in the voice endpoint detection method, the silence model is a deep neural network model comprising multiple layers of neural networks;
a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame;
the step S2 then specifically comprises:
Step S21, after the speech feature is input into the silence model, obtaining, through the forward computation of the multiple layers of neural networks, a first value associated with the first node of the output layer and a second value associated with the second node;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, outputting the first node as the label of the voice data;
if the first value is less than the second value, outputting the second node as the label of the voice data.
A speech recognition method, wherein the starting endpoint and the ending endpoint of a speech segment to be recognized are detected using the above voice endpoint detection method.
The beneficial effect of the above technical solution is that it provides a voice endpoint detection method that solves the problems in the prior art of inaccurate voice endpoint detection and excessively demanding detection environments, thereby improving the accuracy of voice endpoint detection, broadening the applicability of endpoint detection, and improving the entire speech recognition process.
Description of the drawings
Fig. 1 is an overall flow diagram of a voice endpoint detection method in a preferred embodiment of the present invention;
Fig. 2 is a flow diagram of training the silence model in a preferred embodiment of the present invention;
Fig. 3 is, on the basis of Fig. 2, a flow diagram of automatically labeling the training voice data in a preferred embodiment of the present invention;
Fig. 4 is a structural diagram of the silence model comprising multiple layers of neural networks in a preferred embodiment of the present invention;
Fig. 5 is, on the basis of Fig. 1, a flow diagram of processing the voice data and outputting the associated label in a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that, provided there is no conflict, the embodiments of the present invention and the features therein may be combined with each other.
The present invention is further described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
In view of the above problems in the prior art, a voice endpoint detection method is now provided. In this method, a silence model for judging whether voice data is a silent frame is trained in advance, a speech segment consisting of consecutive frames of externally input voice data is then obtained, and the following steps, shown in Fig. 1, are executed:
Step S1, extracting the speech feature of each frame of voice data, and inputting the speech feature into the silence model;
Step S2, the silence model outputting, according to the speech feature, a label associated with each frame of voice data, the label indicating whether the voice data is a silent frame;
Step S3, confirming the voice endpoints of the speech segment according to the labels of the consecutive frames of voice data:
when the sound pickup device collecting the speech is in an inactive state, if the run of consecutive frames of non-silent voice data is longer than a preset first threshold, judging the first frame of the non-silent voice data to be the starting endpoint of the speech segment;
when the sound pickup device collecting the speech is in an active state, if the run of consecutive frames of silent voice data is longer than a preset second threshold, judging the first frame of the silent voice data to be the ending endpoint of the speech segment.
Specifically, in this embodiment, a silence model is first formed, which can be used to judge whether each frame of voice data in a speech segment is a silent frame. A silent frame is a frame of voice data that does not contain valid speech requiring recognition; a non-silent frame is a frame of voice data that does contain valid speech requiring recognition.
After the silence model has been trained, the speech feature of each frame of voice data in an externally input speech segment is extracted and fed into the silence model, and the silence model outputs the associated label. In this embodiment there are two labels in total, indicating respectively that the frame of voice data is a silent frame or a non-silent frame.
In this embodiment, after each frame of voice data has been classified as silent or non-silent, the voice endpoints are judged. The appearance of a single non-silent frame cannot be taken to mean that a speech segment has started, nor can a single silent frame be taken to mean that it has ended; rather, the starting and ending endpoints of the speech segment must be judged from the number of consecutive silent/non-silent frames. Specifically:
when the sound pickup device collecting the speech is in an inactive state, if the run of consecutive frames of non-silent voice data is longer than a preset first threshold, the first frame of the non-silent voice data is judged to be the starting endpoint of the speech segment;
when the sound pickup device collecting the speech is in an active state, if the run of consecutive frames of silent voice data is longer than a preset second threshold, the first frame of the silent voice data is judged to be the ending endpoint of the speech segment.
In a preferred embodiment of the present invention, the first threshold may take the value 30 and the second threshold the value 50. That is:
when the sound pickup device collecting the speech is in an inactive state, if the run of consecutive non-silent frames exceeds 30 (30 consecutive non-silent frames appear), the first non-silent frame is judged to be the starting endpoint of the speech segment;
when the sound pickup device collecting the speech is in an active state, if the run of consecutive silent frames exceeds 50 (50 consecutive silent frames appear), the first silent frame is judged to be the ending endpoint of the speech segment.
In another preferred embodiment of the present invention, the first threshold may likewise take the value 70 and the second threshold the value 50.
In other embodiments of the present invention, the values of the first threshold and the second threshold can be set freely according to actual conditions, to meet the needs of voice endpoint detection in different environments.
In a preferred embodiment of the present invention, the silence model can be trained in advance by the following method, shown in Fig. 2:
Step A1, inputting a plurality of preset training voice data, and extracting the speech feature of each training voice datum;
Step A2, performing an automatic labeling operation on each frame of the training voice data according to the corresponding speech feature, to obtain a label corresponding to each frame of voice data, the label indicating whether the corresponding frame of voice data is a silent frame or a non-silent frame;
Step A3, training the silence model from the training voice data and the corresponding labels;
wherein a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame.
Specifically, in this embodiment, a plurality of preset training voice data are input first. Training voice data are voice data whose text content is known in advance. They can be extracted from the Chinese speech data set of a speech recognition system whose annotation texts have already been prepared, so each training voice datum has a corresponding annotation text. In other words, the training voice data input in step A1 are the same voice data used to train the acoustic model of the subsequent speech recognizer.
In this embodiment, after the training voice data are input, the speech feature of each training voice datum is extracted. The feature extraction can use the same speech features as those used to train the acoustic model of the speech recognizer. Common speech features include Mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC), perceptual linear prediction (Perceptual Linear Predictive, PLP), and filter-bank (Filter-Bank, FBANK) features. Likewise, in other embodiments of the invention, other similar speech features may be used to train the silence model.
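As an illustration, the per-frame features named above can be computed with an off-the-shelf library. The sketch below uses librosa (an assumption; the patent does not name a toolkit) to extract MFCC or log-mel filter-bank features; the window and hop sizes are illustrative.

```python
import librosa
import numpy as np

def extract_features(wav_path, kind="fbank", n_coeffs=40):
    """Return a (num_frames, n_coeffs) feature matrix for one utterance.

    Uses 25 ms windows with a 10 ms hop (illustrative values; the patent
    only requires the same features used for the recognizer's acoustic model).
    """
    y, sr = librosa.load(wav_path, sr=16000)
    win, hop = int(0.025 * sr), int(0.010 * sr)
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs,
                                     n_fft=win, hop_length=hop)
    else:  # FBANK: log mel filter-bank energies
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_coeffs,
                                             n_fft=win, hop_length=hop)
        feats = np.log(mel + 1e-10)
    return feats.T  # one row per frame
```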
In this embodiment, in the above step A2, before the training voice data can serve as input for training the silence model, an automatic labeling operation must first be performed on them so that every frame of speech data is aligned. In this automatic labeling operation, every frame of voice data receives a label; the processing method is described in detail below. After the automatic labeling operation, the silence model can be trained.
In a preferred embodiment of the present invention, an annotation text is preset for each externally input training voice datum to record its corresponding text content;
the above step A2 then, as shown in Fig. 3, may comprise:
Step A21, obtaining the speech feature and the corresponding annotation text;
Step A22, performing forced alignment on the speech feature and the corresponding annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of the speech feature;
Step A23, post-processing the forcibly aligned training voice data by mapping the output labels of silent phones onto the label indicating the silent frame, and mapping the output labels of non-silent phones onto the label indicating the non-silent frame.
Specifically, in this embodiment, labeling the training voice data manually would consume considerable labor cost, and different annotators would label noise inconsistently, which would affect the subsequent model training. The technical solution of the present invention therefore provides an efficient and practical automatic labeling method.
In this method, the speech feature of each frame of training voice data and the corresponding annotation text are obtained first, and the speech feature and the annotation text are then forcibly aligned.
In this embodiment, the acoustic model of the subsequent speech recognizer (i.e., the acoustic model trained in advance) can be used to perform the forced alignment. The acoustic model of the speech recognizer in the present invention can be a Gaussian mixture model-hidden Markov model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM), a deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN-HMM), or another suitable model. The modeling unit of the acoustic model is at the phone level, e.g., context-independent phones (Context Independent Phone, ci-phone) or context-dependent phones (Context Dependent Phone, cd-phone). Performing the forced alignment with this acoustic model aligns the training voice data frame by frame to the phone level.
In this embodiment, in the above step A23, after the forcibly aligned training voice data are post-processed, voice data with a per-frame silence label are obtained. In the post-processing operation, some phones are treated as silent phones and the remaining phones as non-silent phones; after this mapping, each frame of voice data is associated with a silent/non-silent label.
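A minimal sketch of this post-processing step follows. The phone names ("sil", "sp", "spn", "nsn") are conventions borrowed from common ASR toolkits, not identifiers given in the patent, which only says that some phones are treated as silent.

```python
# Phones treated as silence; the set is an assumption borrowed from
# common ASR conventions (the patent only says "some phones").
SILENT_PHONES = {"sil", "sp", "spn", "nsn"}

SILENT, NON_SILENT = 0, 1  # label ids matching the two output nodes

def phones_to_frame_labels(frame_phones):
    """Map one forced-alignment phone per frame to a silent/non-silent label."""
    return [SILENT if p in SILENT_PHONES else NON_SILENT
            for p in frame_phones]

# e.g. phones_to_frame_labels(["sil", "sil", "b", "a", "sil"])
# -> [0, 0, 1, 1, 0]
```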
In a preferred embodiment of the present invention, the silence model can then be trained from the speech features and the frame-aligned labels obtained above. The silence model can be a deep neural network model comprising multiple layers of neural networks. Each layer of the silence model can be a fully connected neural network, a convolutional neural network, a recurrent neural network, etc., and one or more nonlinear transformations can be included between every two layers, such as sigmoid, tanh, max-pooling, ReLU, or softmax nonlinear transformations.
In a preferred embodiment of the present invention, as shown in Fig. 4, the silence model comprises multiple layers of neural networks 41 and an output layer 42. A first node 421 and a second node 422 are provided in the output layer 42 of the silence model. The first node 421 indicates the label corresponding to the silent frame, and the second node 422 indicates the label corresponding to the non-silent frame. A softmax or other nonlinear transformation can be applied to the first node 421 and the second node 422 of the output layer 42, or no nonlinear transformation need be applied.
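A minimal sketch of such a silence model in PyTorch is given below (PyTorch is an assumption; the patent does not prescribe a framework). It stacks fully connected layers with a nonlinearity between every two layers and ends in a two-node output layer; since the patent states that the output nonlinearity is optional, the sketch returns raw scores.

```python
import torch
import torch.nn as nn

class SilenceModel(nn.Module):
    """DNN with a two-node output layer: node 0 = silent, node 1 = non-silent."""

    def __init__(self, feat_dim=40, hidden_dim=256, num_hidden=4):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 2))  # first node / second node
        self.net = nn.Sequential(*layers)

    def forward(self, features):          # features: (batch, feat_dim)
        return self.net(features)         # raw scores for the two nodes
```

Training such a model would minimize a cross-entropy loss between these two-node outputs and the frame labels produced in step A23; the layer sizes here are illustrative.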
In a preferred embodiment of the present invention, the above step S2 then, as shown in Fig. 5, comprises:
Step S21, after the speech feature is input into the silence model, obtaining, through the forward computation of the multiple layers of neural networks, a first value associated with the first node of the output layer and a second value associated with the second node;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, outputting the first node as the label of the voice data;
if the first value is less than the second value, outputting the second node as the label of the voice data.
Specifically, in this embodiment, the speech feature is input into the trained silence model, the multiple layers of neural networks perform the forward computation, and the values of the two output nodes (the first node and the second node) of the output layer, i.e., the first value and the second value, are finally obtained. The first value and the second value are then compared:
if the first value is larger, the first node is selected as the label of the voice data and output, i.e., the voice data is a silent frame;
correspondingly, if the second value is larger, the second node is selected as the label of the voice data and output, i.e., the voice data is a non-silent frame.
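Continuing the PyTorch sketch above, the per-frame classification of step S22 reduces to comparing the two output values; the helper below is illustrative.

```python
def classify_frames(model, features):
    """Label each frame by comparing the two output-node values (step S22).

    features: tensor of shape (num_frames, feat_dim).
    Returns a list of booleans, True = non-silent frame.
    """
    model.eval()
    with torch.no_grad():
        scores = model(features)              # (num_frames, 2)
    # Second node larger -> non-silent; ties are broken toward silent here.
    return (scores[:, 1] > scores[:, 0]).tolist()
```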
In a preferred embodiment of the present invention, an entire workflow of the above voice endpoint detection method is as follows:
First, a pretrained Chinese speech recognition system is prepared; the system chosen here has a Chinese voice data set together with the annotation texts of the voice data.
The training speech features used by the acoustic model of this speech recognition system are FBANK features, so FBANK features are also used when training the silence model.
Speech features are extracted from the training voice data and input, together with the corresponding annotation texts, into the speech recognition system for forced alignment, which gives each frame of speech features a phone-level label. Non-silent phones in the alignment result are then mapped onto the non-silent label and silent phones onto the silent label, completing the preparation of the training data labels for the silence model.
Then, the silence model is trained from the above training voice data and the corresponding labels.
When the trained silence model is used for voice endpoint detection, the speech feature of each frame of voice data in a speech segment is extracted and fed into the trained silence model. After the forward computation of the multiple layers of neural networks, the first value of the first node and the second value of the second node are output; the two values are compared, and the label of the node with the larger value is output as the label of that frame, indicating whether the frame of voice data is a silent or non-silent frame.
Finally, the silent/non-silent labels of consecutive frames are examined:
when the sound pickup device collecting the speech is in an inactive state, if 30 consecutive non-silent frames appear, the first frame of voice data in those 30 consecutive non-silent frames is taken as the starting endpoint of the whole speech segment to be recognized;
when the sound pickup device collecting the speech is in an active state, if 50 consecutive silent frames appear, the first frame of voice data in those 50 consecutive silent frames is taken as the ending endpoint of the whole speech segment to be recognized.
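Tying the illustrative sketches above together, a whole detection pass over one recording might look like the following; all names are the hypothetical helpers defined earlier, and a trained `SilenceModel` instance `model` is assumed.

```python
# Assumes: model = SilenceModel(feat_dim=40), already trained.
feats = torch.tensor(extract_features("utterance.wav", kind="fbank")).float()
frame_labels = classify_frames(model, feats)           # True = non-silent
segments = detect_endpoints(frame_labels,
                            start_threshold=30, end_threshold=50)
for start, end in segments:
    # Each (start, end) pair delimits one speech segment, in frame indices,
    # ready to be handed to the recognizer.
    print(f"speech from frame {start} to frame {end}")
```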
In a preferred embodiment of the present invention, a speech recognition method is also provided, wherein the starting and ending endpoints of a speech segment to be recognized are detected with the above voice endpoint detection method to determine the extent of the speech to be recognized, and this speech segment is then recognized using existing speech recognition technology.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the embodiments or the protection scope of the present invention. Those skilled in the art should appreciate that all schemes obtained by equivalent replacement or obvious variation based on the description and drawings of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. A voice endpoint detection method, characterized in that a silence model for judging whether voice data is a silent frame is trained in advance, a speech segment consisting of consecutive frames of externally input voice data is then obtained, and the following steps are executed:
Step S1, extracting the speech feature of each frame of the voice data, and inputting the speech feature into the silence model;
Step S2, the silence model outputting, according to the speech feature, a label associated with each frame of the voice data, the label indicating whether the voice data is a silent frame;
Step S3, confirming the voice endpoints of the speech segment according to the labels of the consecutive frames of voice data:
when the sound pickup device collecting the speech is in an inactive state, if the run of consecutive frames of non-silent voice data is longer than a preset first threshold, judging the first frame of the non-silent voice data to be the starting endpoint of the speech segment;
when the sound pickup device collecting the speech is in an active state, if the run of consecutive frames of silent voice data is longer than a preset second threshold, judging the first frame of the silent voice data to be the ending endpoint of the speech segment.
2. The voice endpoint detection method according to claim 1, characterized in that the silence model is trained in advance by the following method:
Step A1, inputting a plurality of preset training voice data, and extracting the speech feature of each training voice datum;
Step A2, performing an automatic labeling operation on each frame of the training voice data according to the corresponding speech feature, to obtain a label corresponding to each frame of the voice data, the label indicating whether the corresponding frame of voice data is a silent frame or a non-silent frame;
Step A3, training the silence model from the training voice data and the corresponding labels;
wherein a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame.
3. The voice endpoint detection method according to claim 2, characterized in that an annotation text is preset for each externally input training voice datum to record its corresponding text content;
the step A2 then specifically comprises:
Step A21, obtaining the speech feature and the corresponding annotation text;
Step A22, performing forced alignment on the speech feature and the corresponding annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of the speech feature;
Step A23, post-processing the forcibly aligned training voice data by mapping the output labels of silent phones onto the label indicating the silent frame, and mapping the output labels of non-silent phones onto the label indicating the non-silent frame.
4. The voice endpoint detection method according to claim 3, characterized in that, in the step A22, the acoustic model trained in advance is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.
5. The voice endpoint detection method according to claim 1, characterized in that the silence model is a deep neural network model comprising multiple layers of neural networks.
6. The voice endpoint detection method according to claim 5, characterized in that at least one nonlinear transformation is included between every two layers of the neural networks of the silence model.
7. The voice endpoint detection method according to claim 5, characterized in that each layer of the neural networks of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.
8. The voice endpoint detection method according to claim 2, characterized in that the silence model is a deep neural network model comprising multiple layers of neural networks;
a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame;
the step S2 then specifically comprises:
Step S21, after the speech feature is input into the silence model, obtaining, through the forward computation of the multiple layers of neural networks, a first value associated with the first node of the output layer and a second value associated with the second node;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, outputting the first node as the label of the voice data;
if the first value is less than the second value, outputting the second node as the label of the voice data.
9. A speech recognition method, characterized in that the starting endpoint and the ending endpoint of a speech segment to be recognized are detected using the voice endpoint detection method according to any one of claims 1 to 8.
Priority Applications (3)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710076757.2A | 2017-02-13 | 2017-02-13 | Voice endpoint detection method and speech recognition method
PCT/CN2018/074311 | 2017-02-13 | 2018-01-26 | Voice activity detection method and voice recognition method
TW107104564A | 2017-02-13 | 2018-02-08 | Speech point detection method and speech recognition method
Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710076757.2A | 2017-02-13 | 2017-02-13 | Voice endpoint detection method and speech recognition method
Publications (1)

Publication Number | Publication Date
---|---
CN108428448A | 2018-08-21
Family
ID=63107183
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201710076757.2A (Pending) | Voice endpoint detection method and speech recognition method | 2017-02-13 | 2017-02-13
Country Status (3)

Country | Link
---|---
CN (1) | CN108428448A (en)
TW (1) | TWI659409B (en)
WO (1) | WO2018145584A1 (en)
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036459A (en) * | 2018-08-22 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method, device, computer equipment, computer storage medium |
CN109119070A (en) * | 2018-10-19 | 2019-01-01 | 科大讯飞股份有限公司 | A kind of sound end detecting method, device, equipment and storage medium |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN110634483A (en) * | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
CN110827858A (en) * | 2019-11-26 | 2020-02-21 | 苏州思必驰信息科技有限公司 | Voice endpoint detection method and system |
CN110875033A (en) * | 2018-09-04 | 2020-03-10 | 蔚来汽车有限公司 | Method, apparatus, and computer storage medium for determining a voice end point |
CN110910905A (en) * | 2018-09-18 | 2020-03-24 | 北京京东金融科技控股有限公司 | Mute point detection method and device, storage medium and electronic equipment |
CN111063356A (en) * | 2018-10-17 | 2020-04-24 | 北京京东尚科信息技术有限公司 | Electronic equipment response method and system, sound box and computer readable storage medium |
CN111128174A (en) * | 2019-12-31 | 2020-05-08 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN111583933A (en) * | 2020-04-30 | 2020-08-25 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
WO2020192009A1 (en) * | 2019-03-25 | 2020-10-01 | 平安科技(深圳)有限公司 | Silence detection method based on neural network, and terminal device and medium |
CN112151073A (en) * | 2019-06-28 | 2020-12-29 | 北京声智科技有限公司 | Voice processing method, system, device and medium |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN112652296A (en) * | 2020-12-23 | 2021-04-13 | 北京华宇信息技术有限公司 | Streaming voice endpoint detection method, device and equipment |
CN112967739A (en) * | 2021-02-26 | 2021-06-15 | 山东省计算中心(国家超级计算济南中心) | Voice endpoint detection method and system based on long-term and short-term memory network |
CN115910043A (en) * | 2023-01-10 | 2023-04-04 | 广州小鹏汽车科技有限公司 | Voice recognition method and device and vehicle |
CN116469413A (en) * | 2023-04-03 | 2023-07-21 | 广州市迪士普音响科技有限公司 | Compressed audio silence detection method and device based on artificial intelligence |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | Voice endpoint detection method and speech recognition method |
US11227601B2 (en) * | 2019-09-21 | 2022-01-18 | Merry Electronics(Shenzhen) Co., Ltd. | Computer-implement voice command authentication method and electronic device |
CN111667817A (en) * | 2020-06-22 | 2020-09-15 | 平安资产管理有限责任公司 | Voice recognition method, device, computer system and readable storage medium |
US20220103199A1 (en) * | 2020-09-29 | 2022-03-31 | Sonos, Inc. | Audio Playback Management of Multiple Concurrent Connections |
CN112365899B (en) * | 2020-10-30 | 2024-07-16 | 北京小米松果电子有限公司 | Voice processing method, device, storage medium and terminal equipment |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001086633A1 (en) * | 2000-05-10 | 2001-11-15 | Multimedia Technologies Institute - Mti S.R.L. | Voice activity detection and end-point detection |
WO2002061727A2 (en) * | 2001-01-30 | 2002-08-08 | Qualcomm Incorporated | System and method for computing and transmitting parameters in a distributed voice recognition system |
CN1953050A (en) * | 2005-10-19 | 2007-04-25 | 株式会社东芝 | Device, method, and program for determining speech/non-speech |
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
CN101625857A (en) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | Self-adaptive voice endpoint detection method |
CN101740024A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Method for automatic evaluation based on generalized fluent spoken language fluency |
CN102034475A (en) * | 2010-12-08 | 2011-04-27 | 中国科学院自动化研究所 | Method for interactively scoring open short conversation by using computer |
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | 中国科学院声学研究所 | Modeling approach and modeling system of acoustic model used in speech recognition |
CN104681036A (en) * | 2014-11-20 | 2015-06-03 | 苏州驰声信息科技有限公司 | System and method for detecting language voice frequency |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
CN105206258A (en) * | 2015-10-19 | 2015-12-30 | 百度在线网络技术(北京)有限公司 | Generation method and device of acoustic model as well as voice synthetic method and device |
CN105374350A (en) * | 2015-09-29 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Speech marking method and device |
CN105957518A (en) * | 2016-06-16 | 2016-09-21 | 内蒙古大学 | Mongolian large vocabulary continuous speech recognition method |
WO2018145584A1 (en) * | 2017-02-13 | 2018-08-16 | 芋头科技(杭州)有限公司 | Voice activity detection method and voice recognition method |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100580770C (en) * | 2005-08-08 | 2010-01-13 | 中国科学院声学研究所 | Voice end detection method based on energy and harmonic |
TWI299855B (en) * | 2006-08-24 | 2008-08-11 | Inventec Besta Co Ltd | Detection method for voice activity endpoint |
CN103730110B (en) * | 2012-10-10 | 2017-03-01 | 北京百度网讯科技有限公司 | A kind of method and apparatus of detection sound end |
JP5753869B2 (en) * | 2013-03-26 | 2015-07-22 | 富士ソフト株式会社 | Speech recognition terminal and speech recognition method using computer terminal |
CN103886871B (en) * | 2014-01-28 | 2017-01-25 | 华为技术有限公司 | Detection method of speech endpoint and device thereof |
CN104409080B (en) * | 2014-12-15 | 2018-09-18 | 北京国双科技有限公司 | Sound end detecting method and device |
CN105869628A (en) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | Voice endpoint detection method and device |
CN105976810B (en) * | 2016-04-28 | 2020-08-14 | Tcl科技集团股份有限公司 | Method and device for detecting end point of effective speech segment of voice |
Application timeline:
- 2017-02-13: CN application CN201710076757.2A filed (published as CN108428448A, status Pending)
- 2018-01-26: PCT application PCT/CN2018/074311 filed (published as WO2018145584A1)
- 2018-02-08: TW application TW107104564A filed (published as TWI659409B)
Non-Patent Citations (1)

Title
---
Tian Wanglan et al., "Improved voice endpoint detection method using deep belief networks", Computer Engineering and Applications *
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036459B (en) * | 2018-08-22 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | Voice endpoint detection method and device, computer equipment and computer storage medium |
CN109036459A (en) * | 2018-08-22 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method, device, computer equipment, computer storage medium |
CN110875033A (en) * | 2018-09-04 | 2020-03-10 | 蔚来汽车有限公司 | Method, apparatus, and computer storage medium for determining a voice end point |
CN110910905B (en) * | 2018-09-18 | 2023-05-02 | 京东科技控股股份有限公司 | Mute point detection method and device, storage medium and electronic equipment |
CN110910905A (en) * | 2018-09-18 | 2020-03-24 | 北京京东金融科技控股有限公司 | Mute point detection method and device, storage medium and electronic equipment |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN111063356B (en) * | 2018-10-17 | 2023-05-09 | 北京京东尚科信息技术有限公司 | Electronic equipment response method and system, sound box and computer readable storage medium |
CN111063356A (en) * | 2018-10-17 | 2020-04-24 | 北京京东尚科信息技术有限公司 | Electronic equipment response method and system, sound box and computer readable storage medium |
CN109119070A (en) * | 2018-10-19 | 2019-01-01 | 科大讯飞股份有限公司 | A kind of sound end detecting method, device, equipment and storage medium |
WO2020192009A1 (en) * | 2019-03-25 | 2020-10-01 | 平安科技(深圳)有限公司 | Silence detection method based on neural network, and terminal device and medium |
CN112151073A (en) * | 2019-06-28 | 2020-12-29 | 北京声智科技有限公司 | Voice processing method, system, device and medium |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN112259089B (en) * | 2019-07-04 | 2024-07-02 | 阿里巴巴集团控股有限公司 | Speech recognition method and device |
CN110634483A (en) * | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
CN110634483B (en) * | 2019-09-03 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
US11620984B2 (en) | 2019-09-03 | 2023-04-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Human-computer interaction method, and electronic device and storage medium thereof |
CN110827858A (en) * | 2019-11-26 | 2020-02-21 | 苏州思必驰信息科技有限公司 | Voice endpoint detection method and system |
CN111128174A (en) * | 2019-12-31 | 2020-05-08 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN111583933A (en) * | 2020-04-30 | 2020-08-25 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN111583933B (en) * | 2020-04-30 | 2023-10-27 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN112652296A (en) * | 2020-12-23 | 2021-04-13 | 北京华宇信息技术有限公司 | Streaming voice endpoint detection method, device and equipment |
CN112652296B (en) * | 2020-12-23 | 2023-07-04 | 北京华宇信息技术有限公司 | Method, device and equipment for detecting streaming voice endpoint |
CN112967739A (en) * | 2021-02-26 | 2021-06-15 | 山东省计算中心(国家超级计算济南中心) | Voice endpoint detection method and system based on long-term and short-term memory network |
CN115910043A (en) * | 2023-01-10 | 2023-04-04 | 广州小鹏汽车科技有限公司 | Voice recognition method and device and vehicle |
CN116469413A (en) * | 2023-04-03 | 2023-07-21 | 广州市迪士普音响科技有限公司 | Compressed audio silence detection method and device based on artificial intelligence |
CN116469413B (en) * | 2023-04-03 | 2023-12-01 | 广州市迪士普音响科技有限公司 | Compressed audio silence detection method and device based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
WO2018145584A1 (en) | 2018-08-16 |
TWI659409B (en) | 2019-05-11 |
TW201830377A (en) | 2018-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108428448A (en) | Voice endpoint detection method and speech recognition method | |
CN103578468B (en) | The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition | |
CN105374356B (en) | Audio recognition method, speech assessment method, speech recognition system and speech assessment system | |
US20170140750A1 (en) | Method and device for speech recognition | |
CN102005070A (en) | Voice identification gate control system | |
CN103165129B (en) | Method and system for optimizing voice recognition acoustic model | |
KR20190045278A (en) | A voice quality evaluation method and a voice quality evaluation apparatus | |
CN107886968B (en) | Voice evaluation method and system | |
CN104252864A (en) | Real-time speech analysis method and system | |
CN104036774A (en) | Method and system for recognizing Tibetan dialects | |
CN101510423B (en) | Multilevel interactive pronunciation quality estimation and diagnostic system | |
CN109065046A (en) | Method, apparatus, electronic equipment and the computer readable storage medium that voice wakes up | |
CN104318921A (en) | Voice section segmentation detection method and system and spoken language detecting and evaluating method and system | |
CN103106061A (en) | Voice input method and device | |
CN106782508A (en) | The cutting method of speech audio and the cutting device of speech audio | |
CN105225665A (en) | A kind of audio recognition method and speech recognition equipment | |
CN104823235A (en) | Speech recognition device | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
CN112992191B (en) | Voice endpoint detection method and device, electronic equipment and readable storage medium | |
CN103680505A (en) | Voice recognition method and voice recognition system | |
CN111883181A (en) | Audio detection method and device, storage medium and electronic device | |
CN109670148A (en) | Collection householder method, device, equipment and storage medium based on speech recognition | |
CN109243427A (en) | A kind of car fault diagnosis method and device | |
CN104103280A (en) | Dynamic time warping algorithm based voice activity detection method and device | |
CN105575402A (en) | Network teaching real time voice analysis method |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1252735; Country of ref document: HK
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-08-21