CN106601229A - Voice awakening method based on soc chip
- Publication number
- CN106601229A (application CN201611003861.0A)
- Authority
- CN
- China
- Prior art keywords
- mfcc
- model
- frame
- likelihood value
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
All classifications fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING:
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L19/26—Pre-filtering or post-filtering
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The invention discloses a voice wake-up method based on an SoC chip, comprising the following steps: S1, the chip collects voice data and samples it, converting the analog signal into a digital signal; S2, MFCC feature extraction is performed on the digitized voice data; S3, voice activity detection is performed on the MFCC feature values, judging whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step; S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model, and the controlled device is woken if the recognition result is a valid instruction, otherwise the method returns to step S2. The real-time system of the invention, implemented with a highly robust algorithm, achieves a high recognition rate and meets the requirements of low power consumption and high performance.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice wake-up method based on an SoC chip.
Background technology
With the development of the times, more and more electronic devices are entering daily life. While enjoying the convenience these devices bring, people expect them to be more intelligent and to support interaction without touch controls.
Voice wake-up means that the user speaks a preset voice instruction and a device in a dormant state enters the command-ready state directly. With this technology, anyone can activate a device in any environment and at any time simply by speaking the preset wake-up word, achieving low power consumption and touch-free interaction.
However, most existing voice wake-up technology is implemented on computers and mobile phones and requires a powerful processor, making it unsuitable for industrial applications. Voice wake-up based on an MCU is inexpensive, but the limited processor performance prevents it from achieving good results.
The content of the invention
The technical problem to be solved by the present invention is to provide a voice wake-up method based on an SoC chip in which a real-time system, implemented with a highly robust algorithm, achieves a high recognition rate and meets the requirements of low power consumption and high performance.
To solve the above technical problem, the present invention provides the following technical scheme: a voice wake-up method based on an SoC chip, comprising the following steps:
S1, the chip collects voice data, samples it, and converts the analog signal into a digital signal;
S2, MFCC feature extraction is performed on the digitized voice data;
S3, voice activity detection is performed on the MFCC feature values: it is judged whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and the data is discarded; if so, the MFCC feature values enter the next step of processing;
S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model; if the recognition result is a valid instruction, the controlled device is woken; otherwise the method returns to step S2.
Further, the MFCC feature extraction in step S2 is specifically:
1) preprocessing of the digital signal, including pre-emphasis, framing and windowing;
2) performing an FFT on each frame signal to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) applying the Mel filter bank Wl(k) to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are respectively the lower limit, center and upper limit frequency of the l-th triangular filter;
4) taking the logarithm of all filter outputs m(l) and applying a discrete cosine transform to obtain the MFCC values:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order, with i running from 1 to 12, giving 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

where Xn(k) is the amplitude; this yields 13 feature parameters, namely 12 cepstral features plus 1 log energy;
5) the 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics are described by differences of these static features; the first-order difference dtm(i) and the second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i).
The 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature parameters, completing the MFCC feature extraction.
Further, the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise follow Gaussian mixture distributions in a specific feature space, and builds a silence model and a non-silence model in that feature space; the new frame of MFCC data is then scored, computing the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model; P1 and P2 are compared, and the current MFCC frame is a speech frame if P2 is greater than P1, otherwise a silence frame.
Further, after the current MFCC frame is judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame. Likewise, after the current MFCC frame is judged a silence frame, the likelihood values P1 and P2 are each multiplied by the corresponding transition probability when judging the next frame, and the frame is a speech frame if the product for P2 exceeds the product for P1, otherwise a silence frame.
The corresponding transition probabilities are pre-set model data.
Further, the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows.
The silence model and the non-silence model each consist of 13 39-dimensional Gaussian models. The probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (M = 13), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector), bi(X) is a component distribution and ωi is its mixture weight. Each component is a D-dimensional joint Gaussian probability distribution, as in formula 3.2:

$$b(X)=\frac{1}{(2\pi)^{D/2}\Big(\prod_{i=1}^{D}\sigma_i^{2}\Big)^{1/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\right)\qquad(3.2)$$

where μi is the mean of the i-th dimension, σi² its variance, xi the i-th dimension of the input MFCC feature vector, and D the total dimension (D = 39).
Since formula 3.2 is too costly to compute directly, it is simplified. Taking the logarithm of both sides gives

$$\ln b(X)=-\frac{1}{2}\Big(D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\Big)-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

The term to the left of the plus sign involves only parameters that are known once the model is trained, so it can be precomputed and stored as the model parameter gconst:

$$\mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}$$

Formula 3.2 is thus transformed into formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big)\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

Substituting the MFCC frame and the model parameters into these formulas yields the frame's likelihood value for the silence model and for the non-silence model.
Further, substituting the MFCC frame and the model parameters to obtain the likelihood values of the silence model and the non-silence model proceeds in the following concrete steps:
1) the MFCC feature vector of each speech frame is matched against the silence model and the non-silence model: first (xi-μi)²/σi² is computed and accumulated over the dimensions, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$\mathrm{fa}=\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) from the result of the previous step, the log-likelihood of the multidimensional Gaussian distribution is obtained as

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model, completing the computation of the multidimensional Gaussian log-likelihood ln bi(X) of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so 13 iterations of steps 1 and 2 produce the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model.
Further, the speech recognition algorithm based on an HMM model in step S4 is specifically:
S41, load the HMM model and construct the recognition network of HMM chains;
S42, match the MFCC feature values against the recognition network of the HMM model and compute initial likelihood values;
S43, from the initial likelihood values, find the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
S44, judge whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
With the above technical scheme, the present invention has at least the following advantages:
(1) by moving part of the computation of the original algorithm into the log domain, a large number of multiplications are converted into additions, reducing latency when running on the microprocessor; the complex parts of the algorithm are accelerated by dedicated hardware, further lowering latency and finally achieving real-time recognition;
(2) the real-time system, implemented with a highly robust algorithm, achieves a high recognition rate;
(3) the algorithm is easy to upgrade: it is divided into three independent modules, feature extraction, voice activity detection and speech recognition, so the system can later be improved by replacing a single submodule with a better-performing algorithm.
Description of the drawings
Fig. 1 is the overall flow chart of the voice wake-up method based on an SoC chip of the present invention;
Fig. 2 is a schematic diagram of a triangular filter;
Fig. 3 is a schematic diagram of the triangular filter bank;
Fig. 4 is the flow chart of voice activity detection;
Fig. 5 is a schematic diagram of the parameter layout of a 39-dimensional Gaussian model;
Fig. 6 is the step-by-step flow chart of voice activity detection;
Fig. 7 is a schematic diagram of the pre-trained model data used in voice activity detection;
Fig. 8 is the overall step flow chart of the speech recognition algorithm;
Fig. 9 is a schematic diagram of an example HMM chain used in the speech recognition algorithm.
Specific embodiments
It should be noted that, provided there is no conflict, the embodiments of this application and the features within them may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the overall algorithm flow chart of the present invention; the processing flow of each module is as follows:
1. Speech front-end processing:
Front-end processing converts the analog voice signal into a digital signal by sampling. In this scheme the sample rate is 16 kHz. The digital voice signal is in PCM (Pulse Code Modulation) format, the most basic and original voice format, obtained by sampling and quantizing the analog speech signal. In the present invention the ADC is integrated in the SoC chip; speech detection processing is performed every 10 ms, the sampling frequency is 16 k samples per second, and the data width is 16 bits. A minimal sketch of this chunking is given below.
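As an illustration of these front-end parameters, the following minimal Python sketch segments a 16 kHz, 16-bit PCM buffer into 10 ms processing chunks; the zero-filled buffer stands in for real ADC output, and the use of NumPy is an assumption for the example.

```python
import numpy as np

FS = 16000            # sampling frequency: 16 k samples per second
BITS = 16             # data width: 16-bit PCM
CHUNK = FS // 100     # 10 ms of audio = 160 samples

# stand-in for the on-chip ADC output: one second of 16-bit PCM
pcm = np.zeros(FS, dtype=np.int16)

# hand one 10 ms chunk at a time to the detection pipeline
for start in range(0, len(pcm) - CHUNK + 1, CHUNK):
    chunk = pcm[start:start + CHUNK].astype(np.float32) / 2.0 ** (BITS - 1)
    # ... MFCC extraction and VAD follow, as described in the next sections
```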
2. MFCC feature extraction:
1) preprocessing of the signal, including pre-emphasis (Preemphasis), frame blocking (Frame Blocking) and windowing (Windowing); the sampling frequency of the voice signal is fs = 16 kHz, and since a voice signal can be considered stationary over 10-30 ms, each frame is set to 10 ms, giving a frame length of 160 points and a frame shift of half a frame length, i.e. 80 points;
2) a 256-point FFT is applied to each frame to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) the Mel filter bank Wl(k) is applied to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are the lower limit, center and upper limit frequency of the l-th triangular filter, as shown in Fig. 2. In the present invention the Mel filter bank consists of 26 triangular filters whose parameters are computed in advance; the triangular filter bank is shown in Fig. 3, where the abscissa corresponds to the FFT points and the ordinate is Wl(k). Since the FFT of a real signal is symmetric, only the first half of the points is used to compute the spectrum, which is then fed into the triangular filters;
4) the logarithm (Logarithm) of every filter output m(l) is taken, and a discrete cosine transform then yields the MFCCs:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order; the present invention takes i up to 12, giving 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

This yields 13 feature parameters (12 cepstral features plus 1 log energy);
5) these 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics can be described by differences of these static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i). The 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature vector, completing MFCC feature extraction. A minimal sketch of this pipeline is given below.
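The extraction steps above can be illustrated with the following minimal NumPy sketch. The Mel-scale filter-edge computation, the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here, since the patent does not specify them; the 26-band triangular filter bank, 12-coefficient DCT, log energy and difference features follow the formulas above.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=256, fs=16000):
    """Build the 26 triangular filters Wl(k); the edges o(l), c(l), h(l) are
    taken from equally spaced points on the Mel scale (an assumed choice)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor(edges / (fs / 2.0) * (n_fft // 2)).astype(int)
    W = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(n_filters):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        W[l, o:c + 1] = (np.arange(o, c + 1) - o) / max(c - o, 1)  # rising edge
        W[l, c:h + 1] = (h - np.arange(c, h + 1)) / max(h - c, 1)  # falling edge
    return W

def mfcc_frame(frame, W, n_ceps=12):
    """12 cepstra plus log energy for one 160-sample frame, per steps 2)-4)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 256))  # |Xn(k)|
    logm = np.log(np.maximum(W @ spec, 1e-10))            # log filter outputs
    L = W.shape[0]
    i = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * i * (np.arange(L)[None, :] + 0.5) / L)
    ceps = np.sqrt(2.0 / L) * dct @ logm                  # discrete cosine transform
    log_e = np.log(np.maximum(np.sum(spec ** 2), 1e-10))  # 13th parameter
    return np.append(ceps, log_e)

def add_deltas(feats):
    """Append first- and second-order differences: 13 -> 39 dimensions."""
    def delta(x):
        pad = np.pad(x, ((2, 2), (0, 0)), mode='edge')
        return ((pad[3:-1] - pad[1:-3]) + 2.0 * (pad[4:] - pad[:-4])) / 10.0
    d = delta(feats)
    return np.hstack([feats, d, delta(d)])

# 160-sample frames with an 80-sample shift, 0.97 pre-emphasis per frame
pcm = np.zeros(16000)                        # stand-in for one second of audio
W = mel_filterbank()
frames = [pcm[s:s + 160] for s in range(0, len(pcm) - 159, 80)]
feats13 = np.array([mfcc_frame(np.append(f[0], f[1:] - 0.97 * f[:-1]), W)
                    for f in frames])
feats39 = add_deltas(feats13)                # the 39-dim MFCC feature vectors
```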
3. Voice activity detection (VAD):
The present invention uses a voice activity detection method based on GMM models. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a specific feature space; their GMM models are built in that feature space, and model matching is then used to detect the valid speech segments in the measured signal. The algorithm flow is shown in Fig. 4.
The models are trained in advance with the HTK toolkit. A single 39-dimensional Gaussian model consists of 1 weight (MIXTURE), 39 means (MEAN), 39 variances (VARIANCE) and 1 gconst, as shown in Fig. 5; a sketch of this parameter layout is given after this paragraph.
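The parameter layout of Fig. 5 can be illustrated by a small data structure; the field names mirror the MIXTURE/MEAN/VARIANCE/gconst layout described above, the gconst formula follows the derivation given earlier, and the random initial values are placeholders.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class GaussComponent:
    """One 39-dimensional Gaussian of Fig. 5: weight, means, variances, gconst."""
    weight: float                 # MIXTURE weight
    mean: np.ndarray              # 39 MEAN values
    var: np.ndarray               # 39 VARIANCE values
    gconst: float = field(init=False)

    def __post_init__(self):
        # gconst = D*ln(2*pi) + sum(ln(sigma_i^2)), precomputed at training time
        d = len(self.mean)
        self.gconst = d * np.log(2.0 * np.pi) + float(np.sum(np.log(self.var)))

# a silence model or a non-silence model is a set of 13 such components
rng = np.random.default_rng(0)
silence_model = [GaussComponent(1.0 / 13, rng.normal(size=39), np.ones(39))
                 for _ in range(13)]
```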
The silence model and the non-silence model are each made up of 13 multidimensional Gaussian models of the kind shown in Fig. 5. When a new frame of speech data enters the system, its new 39-dimensional MFCC feature vector is scored against the silence model and the non-silence model; the two likelihood values are compared, the model with the larger likelihood is the matching model of the current frame, and the frame is accordingly judged to be a speech frame or not. The detailed VAD process is shown in Fig. 6.
The transition probabilities a11, a12, a21, a22 are pre-trained model data, as shown in Fig. 7: a11 is the probability that the current frame is a silence frame given that the previous frame was a silence frame; a12 is the probability that the current frame is a speech frame given that the previous frame was a silence frame; a21 is the probability that the current frame is a silence frame given that the previous frame was a speech frame; a22 is the probability that the current frame is a speech frame given that the previous frame was a speech frame.
The most complex computation in the whole process is the likelihood computation, introduced below.
The probability density function of the 13-order multidimensional Gaussian mixture model is the weighted sum of 13 multidimensional Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (13 in the present invention), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector mentioned above), bi(X) is a component distribution and ωi is its mixture weight. Each component is a D-dimensional joint Gaussian probability distribution:

$$b(X)=\frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right)\qquad(3.2)$$

In one dimension, μ is the expectation and σ² the variance; in D dimensions, D is the dimension of X and Σ is the D×D covariance matrix, defined as Σ = E[(X-μ)(X-μ)^T], with |Σ| the value of its determinant. With the diagonal covariance used here, formula 3.2 reduces to a per-dimension form with means μi and variances σi²; taking the logarithm of both sides and precomputing the constant term as gconst gives formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big),\qquad \mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\qquad(3.3)$$

and formula 3.1 correspondingly reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$
The concrete computation steps of the VAD algorithm are then:
1) the 39-dimensional MFCC feature vector of each speech frame is matched against the silence model and the non-silence model: first (xi-μi)²/σi² is computed and the 39 results are accumulated, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models (this computation is completed by a hardware acceleration IP):

$$\mathrm{fa}=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) from the result of the previous step, the log-likelihood of each multidimensional Gaussian distribution is obtained as

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model; this completes the multidimensional Gaussian log-likelihood of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so 13 iterations of steps 1 and 2 produce the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model;
4) finally the transition probabilities a are applied:
if the previous frame was a speech frame, the probability that the current frame is a speech frame is a22·P2, and the probability that it is a silence frame is a21·P1;
if the previous frame was a silence frame, the probability that the current frame is a speech frame is a12·P2, and the probability that it is a silence frame is a11·P1;
the speech-frame probability is compared with the silence-frame probability: if the speech-frame probability is larger, the current frame is a speech frame; otherwise it is a silence frame. This completes the VAD algorithm; a minimal sketch of the per-frame decision is given below.
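The per-frame VAD decision can be summarized in the following sketch, assuming the model layout above; the toy models and transition values are placeholders, and the computation follows formulas 3.3 and 3.4 with the final transition-probability comparison of step 4.

```python
import numpy as np

def log_gauss(x, mean, var, gconst):
    """Formula 3.3: ln b(X) = -(gconst + sum((x-mu)^2 / sigma^2)) / 2."""
    fa = np.sum((x - mean) ** 2 / var)      # exponential part (fa0 or fa1)
    return -0.5 * (gconst + fa)

def gmm_likelihood(x, weights, means, variances, gconsts):
    """Formula 3.4: P = sum_i w_i * exp(ln b_i(X)) over the 13 components."""
    log_bs = np.array([log_gauss(x, m, v, g)
                       for m, v, g in zip(means, variances, gconsts)])
    return float(np.sum(weights * np.exp(log_bs)))

def vad_step(x, silence, speech, a, prev_is_speech):
    """One frame of the Fig. 6 decision; a holds the probabilities a11..a22."""
    p1 = gmm_likelihood(x, *silence)        # silence model likelihood P1
    p2 = gmm_likelihood(x, *speech)         # non-silence model likelihood P2
    if prev_is_speech:
        p_speech, p_silence = a['a22'] * p2, a['a21'] * p1
    else:
        p_speech, p_silence = a['a12'] * p2, a['a11'] * p1
    return p_speech > p_silence             # True means speech frame

# toy models: 13 components of 39 dimensions, uniform weights, unit variances
rng = np.random.default_rng(1)
make = lambda shift: (np.full(13, 1.0 / 13),
                      rng.normal(shift, 1.0, (13, 39)),
                      np.ones((13, 39)),
                      np.full(13, 39.0 * np.log(2.0 * np.pi)))
silence, speech = make(0.0), make(3.0)
a = {'a11': 0.9, 'a12': 0.1, 'a21': 0.2, 'a22': 0.8}
frame = rng.normal(3.0, 1.0, 39)            # a frame resembling speech
print(vad_step(frame, silence, speech, a, prev_is_speech=False))
```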
4. Speech recognition algorithm:
The flow of this module is shown in Fig. 8. The loading of the model and the construction of the HMM chains are completed at program initialization and need not be repeated afterwards; this module is entered for computation only when the upstream VAD module detects valid voice. Each state of the HMM model used by this module consists of 24 GMMs. The flow is as follows:
(1) load the HMM model and construct the recognition network of HMM chains;
(2) match the MFCC feature values against the recognition network of the HMM model and compute initial likelihood values;
(3) from the initial likelihood values, find the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
(4) judge whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
The whole flow is described below, taking "shutdown" (Chinese "guan ji") as an example; the HMM chain corresponding to "shutdown" is as follows (a real HMM chain is longer and each syllable consists of multiple states; it is simplified here for ease of explanation). "Shutdown" is split into the syllables "g", "uan", "j", "i"; the 4 syllables are described by 4 HMM states, which are connected to obtain the HMM chain shown in Fig. 9.
a. The starting point of the network (state "g") initializes its token value Pg = 0.
b. When the first frame of MFCC data arrives, token passing starts. For the first frame only the token value Pg exists, and it is passed to states "g" and "uan":
Pg = Pg + a11 + log(GMMg)
Puan = Pg + a12 + log(GMMuan)
where log(GMMg) is the likelihood of the MFCC data for state "g" and log(GMMuan) is the likelihood for state "uan"; the likelihood computation is the same as in the VAD (see formulas 3.3 and 3.4), and transitions and likelihoods are added because the computation is carried out in the log domain.
c. When the second frame of data arrives, states "g" and "uan" both hold token values, so tokens are passed to the states each of them connects to.
The token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
Since the left side of state "uan" is connected to "g" and the state is also connected to itself, two token values arrive; they are compared and the larger one is kept.
The token of state "j" is updated:
Pj = Puan + a23 + log(GMMj)
d. When the third frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
after the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pi = Pj + a34 + log(GMMi)
e. When the fourth frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
after the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pj→i = Pj + a34
Pi→i = Pi + a44
after the update: Pi = max(Pj→i, Pi→i) + log(GMMi)
At this point all frames of the voice instruction have been input and token comparison starts: the token values of the four states are sorted by size. If the token value of the last state of the HMM chain (state "i") is the largest, the input voice instruction matches this HMM chain and the decoding result is "shutdown"; otherwise the input is considered invalid voice.
The whole decoding process shows that, as the frame count grows, tokens diffuse steadily from the left end to the right end; during this process each state holds a token, and tokens are passed to adjacent states and updated. When the specified number of frames is reached (the frame count is determined by the length of the preset voice instruction: it is small for a short instruction such as "shutdown", and larger for a longer instruction such as "open sesame"), the tokens of all states are sorted; if the largest token value lies in the end state of an HMM chain, the currently input voice matches that HMM chain. In practical applications the number of recognizable voice instructions can be increased, giving multiple HMM chains; at the last frame, all states of all HMM chains are sorted to determine which specific instruction was spoken. A minimal sketch of this token-passing decoder is given below.
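The decoding described above can be illustrated by the following sketch of log-domain token passing over a single left-to-right chain; the frame labels and the scoring function are stand-ins for real MFCC frames scored by the per-state GMMs, and the transition values are placeholders.

```python
import numpy as np

def token_passing(frames, n_states, log_a_self, log_a_next, log_gmm):
    """Left-to-right token passing over an HMM chain such as g-uan-j-i.

    log_gmm(state, frame) stands in for the per-state GMM log-likelihood
    log(GMM_state) of the patent; transition values and likelihoods are
    added in the log domain, exactly as in the update equations above.
    """
    tokens = np.full(n_states, -np.inf)
    tokens[0] = 0.0                               # token starts at state "g"
    for frame in frames:
        new = np.full(n_states, -np.inf)
        for j in range(n_states):
            stay = tokens[j] + log_a_self[j]      # e.g. P_uan->uan = P_uan + a22
            enter = tokens[j - 1] + log_a_next[j - 1] if j > 0 else -np.inf
            best = max(stay, enter)               # keep the larger incoming token
            if best > -np.inf:
                new[j] = best + log_gmm(j, frame)
        tokens = new
    return tokens  # the instruction is valid if the last state holds the largest token

# toy example: each frame is labelled with the state it matches best,
# a stand-in for real MFCC frames scored by the per-state GMMs
states = ["g", "uan", "j", "i"]
score = lambda j, f: 0.0 if states[j] == f else -5.0   # hypothetical scorer
frames = ["g", "g", "uan", "uan", "j", "j", "i", "i"]  # a spoken "guan ji"
toks = token_passing(frames, 4, np.log(np.full(4, 0.6)),
                     np.log(np.full(4, 0.4)), score)
print("matched 'shutdown' chain" if toks.argmax() == len(states) - 1
      else "invalid voice")
```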
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the invention is defined by the appended claims and their equivalents.
Claims (7)
1. A voice wake-up method based on an SoC chip, characterised by comprising the following steps:
S1, the chip collects voice data, samples it, and converts the analog signal into a digital signal;
S2, MFCC feature extraction is performed on the digitized voice data;
S3, voice activity detection is performed on the MFCC feature values: it is judged whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and the data is discarded; if so, the MFCC feature values enter the next step of processing;
S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model; if the recognition result is a valid instruction, the controlled device is woken; otherwise the method returns to step S2.
2. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the MFCC feature extraction in step S2 is specifically:
1) preprocessing of the digital signal, including pre-emphasis, framing and windowing;
2) performing an FFT on each frame signal to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) applying the Mel filter bank Wl(k) to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are respectively the lower limit, center and upper limit frequency of the l-th triangular filter;
4) taking the logarithm of all filter outputs m(l) and applying a discrete cosine transform to obtain the MFCC values:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order, with i running from 1 to 12, giving 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

where Xn(k) is the amplitude; this yields 13 feature parameters, namely 12 cepstral features plus 1 log energy;
5) the 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics are described by differences of these static features; the first-order difference dtm(i) and the second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i);
the 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature parameters, completing the MFCC feature extraction.
3. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise follow Gaussian mixture distributions in a specific feature space and builds a silence model and a non-silence model in that feature space; the new frame of MFCC data is then scored, computing the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model; P1 and P2 are compared, and the current MFCC frame is a speech frame if P2 is greater than P1, otherwise a silence frame.
4. The voice wake-up method based on an SoC chip as claimed in claim 3, characterised in that, after the current MFCC frame is judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame;
after the current MFCC frame is judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are likewise each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame;
the corresponding transition probabilities are pre-set model data.
5. The voice wake-up method based on an SoC chip as claimed in claim 3, characterised in that the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows:
the silence model and the non-silence model each consist of 13 39-dimensional Gaussian models; the probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (M = 13), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector), bi(X) is a component distribution and ωi is its mixture weight; each component is a D-dimensional joint Gaussian probability distribution, as in formula 3.2:

$$b(X)=\frac{1}{(2\pi)^{D/2}\Big(\prod_{i=1}^{D}\sigma_i^{2}\Big)^{1/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\right)\qquad(3.2)$$

where μi is the mean of the i-th dimension, σi² its variance, xi the i-th dimension of the input MFCC feature vector, and D the total dimension (D = 39);
since formula 3.2 is too costly to compute directly, it is simplified by taking the logarithm of both sides:

$$\ln b(X)=-\frac{1}{2}\Big(D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\Big)-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

the term to the left of the plus sign involves only parameters known once the model is trained, so it is precomputed and stored as the model parameter gconst:

$$\mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}$$

so that formula 3.2 is transformed into formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big)\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

substituting the MFCC frame and the model parameters into these formulas yields the frame's likelihood value for the silence model and for the non-silence model.
6. The voice wake-up method based on an SoC chip as claimed in claim 5, characterised in that substituting the MFCC frame and the model parameters to obtain the likelihood values of the silence model and the non-silence model comprises the concrete steps of:
1) matching the MFCC feature vector of each speech frame against the silence model and the non-silence model: first computing (xi-μi)²/σi² and accumulating the results, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$\mathrm{fa}=\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) computing from the result of the previous step the log-likelihood of the multidimensional Gaussian distribution:

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model, completing the multidimensional Gaussian log-likelihood ln bi(X) of formula 3.3;
3) since the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, 13 iterations of steps 1 and 2 yield the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model.
7. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the speech recognition algorithm based on an HMM model in step S4 is specifically:
S41, loading the HMM model and constructing the recognition network of HMM chains;
S42, matching the MFCC feature values against the recognition network of the HMM model and computing initial likelihood values;
S43, finding, from the initial likelihood values, the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
S44, judging whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201611003861.0A | 2016-11-15 | 2016-11-15 | Voice awakening method based on soc chip
Publications (1)
Publication Number | Publication Date
---|---
CN106601229A | 2017-04-26

Family
ID=58590197

Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201611003861.0A | Voice awakening method based on soc chip | 2016-11-15 | 2016-11-15

Country Status (1)
Country | Link
---|---
CN | CN106601229A (en), Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1455387A (en) * | 2002-11-15 | 2003-11-12 | 中国科学院声学研究所 | Rapid decoding method for voice identifying system |
CN101051462A (en) * | 2006-04-07 | 2007-10-10 | 株式会社东芝 | Feature-vector compensating apparatus and feature-vector compensating method |
CN203253172U (en) * | 2013-03-18 | 2013-10-30 | 北京承芯卓越科技有限公司 | Intelligent voice communication toy |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105206271A (en) * | 2015-08-25 | 2015-12-30 | 北京宇音天下科技有限公司 | Intelligent equipment voice wake-up method and system for realizing method |
CN105869628A (en) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | Voice endpoint detection method and device |
Non-Patent Citations (1)
Title
---|
Jiang Nan, "Research and Implementation of a Voice Activity Detection Algorithm for a Mobile-Phone Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886957A (en) * | 2017-11-17 | 2018-04-06 | 广州势必可赢网络科技有限公司 | Voice wake-up method and device combined with voiceprint recognition |
CN111868825B (en) * | 2018-03-12 | 2024-05-28 | 赛普拉斯半导体公司 | Dual pipeline architecture for wake phrase detection with speech start detection |
CN111868825A (en) * | 2018-03-12 | 2020-10-30 | 赛普拉斯半导体公司 | Dual pipeline architecture for wake phrase detection with voice onset detection |
CN108615535A (en) * | 2018-05-07 | 2018-10-02 | 腾讯科技(深圳)有限公司 | Sound enhancement method, device, intelligent sound equipment and computer equipment |
CN108986822A (en) * | 2018-08-31 | 2018-12-11 | 出门问问信息科技有限公司 | Audio recognition method, device, electronic equipment and non-transient computer storage medium |
CN109088611A (en) * | 2018-09-28 | 2018-12-25 | 咪付(广西)网络技术有限公司 | A kind of auto gain control method and device of acoustic communication system |
CN112102848A (en) * | 2019-06-17 | 2020-12-18 | 华为技术有限公司 | Method, chip and terminal for identifying music |
CN112102848B (en) * | 2019-06-17 | 2024-04-26 | 华为技术有限公司 | Method, chip and terminal for identifying music |
CN110580919A (en) * | 2019-08-19 | 2019-12-17 | 东南大学 | voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
CN110580919B (en) * | 2019-08-19 | 2021-09-28 | 东南大学 | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
CN111028831B (en) * | 2019-11-11 | 2022-02-18 | 云知声智能科技股份有限公司 | Voice awakening method and device |
CN111028831A (en) * | 2019-11-11 | 2020-04-17 | 云知声智能科技股份有限公司 | Voice awakening method and device |
CN111124511A (en) * | 2019-12-09 | 2020-05-08 | 浙江省北大信息技术高等研究院 | Wake-up chip and wake-up system |
CN114141272A (en) * | 2020-08-12 | 2022-03-04 | 瑞昱半导体股份有限公司 | Sound event detection system and method |
CN115132231A (en) * | 2022-08-31 | 2022-09-30 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
CN115132231B (en) * | 2022-08-31 | 2022-12-13 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106601229A (en) | Voice awakening method based on soc chip | |
US12080315B2 (en) | Audio signal processing method, model training method, and related apparatus | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN105976812B (en) | A kind of audio recognition method and its equipment | |
CN107767861B (en) | Voice awakening method and system and intelligent terminal | |
CN103117059B (en) | Voice signal characteristics extracting method based on tensor decomposition | |
US20170154640A1 (en) | Method and electronic device for voice recognition based on dynamic voice model selection | |
CN110675859B (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN109754790B (en) | Speech recognition system and method based on hybrid acoustic model | |
CN106653056A (en) | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof | |
CN112786004A (en) | Speech synthesis method, electronic device, and storage device | |
CN111210807A (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
CN106782502A (en) | A kind of speech recognition equipment of children robot | |
CN112382301B (en) | Noise-containing voice gender identification method and system based on lightweight neural network | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN113823323A (en) | Audio processing method and device based on convolutional neural network and related equipment | |
Sagi et al. | A biologically motivated solution to the cocktail party problem | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN110580897A (en) | audio verification method and device, storage medium and electronic equipment | |
CN114400006B (en) | Speech recognition method and device | |
Zhu et al. | Continuous speech recognition based on DCNN-LSTM | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
Hu et al. | Speaker Recognition Based on 3DCNN-LSTM. | |
CN114333790A (en) | Data processing method, device, equipment, storage medium and program product |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170426