CN103778914B

CN103778914B - Anti-noise speech recognition method and device based on SNR-weighted template feature matching

Info

Publication number: CN103778914B
Application number: CN201410040474.9A
Authority: CN
Inventors: 宁更新; 吴丽菲; 宁小娟
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-01-27
Filing date: 2014-01-27
Publication date: 2017-02-15
Anticipated expiration: 2034-01-27
Also published as: CN103778914A

Abstract

The invention discloses an anti-noise speech recognition method and device based on signal-to-noise-ratio (SNR) weighted template feature matching. The method comprises the following steps: (1) preprocessing the input speech signal and obtaining the phase coefficient; (2) computing the feature of the input speech, namely the phase MFCC; (3) performing SNR-based template feature matching. The invention further discloses a device implementing the method. The device comprises a power module, a display module, a storage module, a DSP/ARM digital processing module, a microphone, an A/D converter and a USB interface. The method and device offer a wide range of application, high accuracy, low cost, convenient use and strong adaptability.

Description

Anti-noise speech recognition method and device based on SNR-weighted template feature matching

Technical field

The present invention relates to speech signal processing technology, and in particular to an anti-noise speech recognition method and device based on SNR-weighted template feature matching.

Background art

Speech recognition is applied very widely and touches almost every aspect of daily life, for example voice dialing systems, seat reservation systems, medical services, banking services, dictation machines, computer control, industrial control and voice communication systems. Speech recognition technology is profoundly changing daily life in fields such as industry, household appliances, communication, medical care and home services. Real environments now place ever higher demands on the noise robustness of speech recognition; extracting feature vectors that are robust and strongly discriminative is therefore of great importance to a speech recognition system.

The features currently used for speech recognition are all based on the power spectrum of the speech signal, which describes the energy distribution of the signal over frequency. When external noise is present, this energy distribution also contains the energy of the noise. The corresponding feature vectors are therefore very sensitive to external noise, and speech recognition systems perform poorly in noisy environments.

Methods for reducing the sensitivity of feature vectors to external noise fall into two groups: feature-based and model-based. Feature-based methods work at the front end of the speech recognition system and make the generated feature vectors as independent of noise as possible. Model-based methods work at the back end: using a small amount of adaptation data from the test environment, they adjust the model parameters and gradually transform them toward the actual environment, thereby improving the recognition rate. Feature-based solutions include spectral subtraction and RASTA processing. Model-based methods include parallel model combination (PMC), adaptation based on vector Taylor series (VTS), and signal decomposition.

Two kinds of feature parameters are mainly extracted from the speech signal for recognition at present: linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC). LPCC parameters represent speech effectively and are fast to compute, but they do not take into account how the human auditory system processes speech. The Mel frequency band division is an engineering simulation of human auditory characteristics, so MFCC models, to a certain extent, how the human ear processes speech.

However, existing speech recognition features, whether MFCC or LPCC, perform poorly in low-SNR environments. To overcome this weakness, the present invention first proposes a new feature that is more robust at low SNR by changing the correlation measure: the angle between two time-delayed signal vectors is used as the correlation measure. Because the angle is a nonlinear transformation of the conventional autocorrelation coefficient (a scalar product), the phase-based measure emphasizes peaks in the frequency domain, and peaks are comparatively robust to noise. Then, since the conventional feature suits high SNR and the new feature suits low SNR, an SNR-weighted template matching computation is proposed, and finally a corresponding device is proposed.

Summary of the invention

The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an anti-noise speech recognition method based on SNR-weighted template feature matching; the method has a wide range of application and high accuracy.

Another object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a device that implements the anti-noise speech recognition method based on SNR-weighted template feature matching. The device runs on a DSP/ARM7 chip and can be implemented with the TI TMS320C6711 or the Samsung ARM7 S3C44B0.

The primary object of the present invention is achieved by the following technical solution: an anti-noise speech recognition method based on SNR-weighted template feature matching, comprising the following steps:

Step 1: Preprocess the input speech signal and obtain the phase coefficient;

The digitized speech signal s[n] is divided into frames, and each frame is windowed with a Hamming window. The signal is divided into T frames,

\{s_0[n], s_1[n], ..., s_t[n], ..., s_{T-1}[n]\},

where

s_t[n] = \{s[Kt], s[Kt+1], ..., s[Kt+N-1]\},

K is the frame shift, N is the frame length, and s_t[n] is the frame signal sequence at time t.
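The framing and windowing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names `hamming` and `split_frames` are hypothetical, the Hamming window uses the common 0.54/0.46 coefficients, and dropping a trailing partial frame is an assumption the source does not state.

```python
import math


def hamming(N):
    """Hamming window of length N (common 0.54 - 0.46*cos form)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def split_frames(s, N, K):
    """Split s into windowed frames of length N with frame shift K.

    Frame t covers s[K*t], ..., s[K*t + N - 1].
    Assumption: a trailing partial frame is dropped.
    """
    if len(s) < N:
        return []
    window = hamming(N)
    T = 1 + (len(s) - N) // K
    return [[s[K * t + n] * window[n] for n in range(N)] for t in range(T)]


# Illustrative input: 400 samples of a sine, frame length 160, frame shift 80.
signal = [math.sin(0.1 * n) for n in range(400)]
frames = split_frames(signal, N=160, K=80)
print(len(frames), len(frames[0]))
```

Overlapping frames (K smaller than N) are the usual choice so that no part of the signal is attenuated by the window edges in every frame.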

A speech signal is short-term stationary, so each frame can be regarded as stationary. Each frame is periodically extended to \tilde{s}_t[n], which gives the autocorrelation function

R[k] = \sum_{n=0}^{N-1} \tilde{s}_t[n] \tilde{s}_t[n+k], k = 0, 1, ..., N-1.

As the formula above shows, R[k] is the dot product of two N-dimensional vectors,

x_0 = \{\tilde{s}_t[0], \tilde{s}_t[1], ..., \tilde{s}_t[N-1]\},

x_k = \{\tilde{s}_t[k], ..., \tilde{s}_t[N-1], \tilde{s}_t[0], ..., \tilde{s}_t[k-1]\},

R[k] = x_0^T x_k = ||x||^2 \cos(\theta_k),

where ||x||^2 = ||x_0||^2 = ||x_k||^2 is the frame energy and \theta_k is the angle between the vectors x_0 and x_k in N-dimensional space.

Applying the nonlinear arccosine transformation to the normalized autocorrelation coefficient R_n[k] = R[k] / ||x||^2 gives the phase coefficient

P[k] = \theta_k = \cos^{-1}(\frac{R[k]}{||x||^2}).

P[k] takes values between 0 and π; normalizing it to the range 0 to 1 gives the normalized phase autocorrelation function

P_n[k] = \frac{P[k]}{\pi} = \frac{\cos^{-1}(R_n[k])}{\pi} = \frac{\cos^{-1}(\frac{R[k]}{||x||^2})}{\pi}.

P_n[k] improves robustness at low SNR, but at high SNR, and especially for clean speech, it performs worse than R_n[k].
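A minimal sketch of the autocorrelation and phase coefficient computation for one frame follows. The function name `phase_autocorrelation` is hypothetical. Two details are assumptions not taken from the source: the cosine is clamped to [-1, 1] to absorb floating-point rounding before the arccosine, and an all-zero frame (for which the angle is undefined) is returned as fully correlated.

```python
import math


def phase_autocorrelation(frame):
    """Return (R_n, P_n) for one windowed frame.

    R_n[k] is the normalized circular autocorrelation cos(theta_k);
    P_n[k] = arccos(R_n[k]) / pi lies in [0, 1].
    """
    N = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        # Assumption: a silent frame is treated as fully correlated.
        return [1.0] * N, [0.0] * N
    R_n, P_n = [], []
    for k in range(N):
        # Periodic extension: index n + k wraps around modulo N.
        r = sum(frame[n] * frame[(n + k) % N] for n in range(N))
        c = max(-1.0, min(1.0, r / energy))  # guard against rounding
        R_n.append(c)
        P_n.append(math.acos(c) / math.pi)
    return R_n, P_n


R_n, P_n = phase_autocorrelation([1.0, 0.0, -1.0, 0.0])
print(R_n, P_n)
```

The direct double loop costs O(N^2) per frame; it is kept here because it mirrors the dot-product definition term by term.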

Step 2: Compute the feature of the input speech, namely the phase MFCC;

A DFT is applied to P_n[k] to obtain the phase power spectrum S_p[l]:

S_p[l] = \sum_{k=0}^{N-1} P_n[k] \exp(-j \frac{2\pi}{N} k l).

S_p[l] is called the phase power spectrum, and the Mel-frequency cepstral coefficients derived from it are called the phase MFCC: the spectrum is filtered by a Mel-scale filter bank and the logarithm is taken. With the information of each frequency band thus separated, a discrete cosine transform (DCT) converts the frequency-domain features to the time domain, giving the phase MFCC parameters.

The phase MFCC parameters consist of L static cepstral coefficients together with their first and second derivatives, 3L dimensions in total.
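The chain DFT, Mel filter bank, logarithm, DCT can be sketched for the static coefficients as below. This is an illustration under stated assumptions rather than the patent's exact procedure: the function name `phase_mfcc` is hypothetical, the Mel mapping uses the common 2595*log10(1 + f/700) formula, the filters are triangular, the magnitude of S_p[l] is used and floored before the logarithm (the source does not say how non-positive values are handled), and the DCT is an unnormalized DCT-II.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def phase_mfcc(P_n, fs=8000, n_filters=13, L=13, floor=1e-10):
    """Static phase MFCC of one frame from its normalized phase autocorrelation."""
    N = len(P_n)
    half = N // 2 + 1
    # DFT of P_n; only bins 0..N/2 are needed. Assumption: use the magnitude.
    S_p = [abs(sum(P_n[k] * cmath.exp(-2j * math.pi * k * l / N)
                   for k in range(N))) for l in range(half)]
    freqs = [l * fs / N for l in range(half)]
    # Filter edges equally spaced on the Mel scale between 0 and fs/2.
    mel_max = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    log_energy = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        e = 0.0
        for f, s in zip(freqs, S_p):
            if lo < f <= mid:
                e += s * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                e += s * (hi - f) / (hi - mid)   # falling slope
        log_energy.append(math.log(max(e, floor)))
    # Unnormalized DCT-II of the log filter-bank energies.
    return [sum(log_energy[m] * math.cos(math.pi * q * (m + 0.5) / n_filters)
                for m in range(n_filters)) for q in range(L)]


# Illustrative input only: a synthetic symmetric-looking sequence in [0, 0.5].
coeffs = phase_mfcc([abs((k % 20) / 20.0 - 0.5) for k in range(160)])
print(len(coeffs))
```

In practice an FFT would replace the direct DFT sum; the explicit sum is used here only to match the formula for S_p[l].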

Step 3: SNR-based template feature matching;

The reference database holds j reference speech templates, each containing a 3M-dimensional MFCC feature and a 3L-dimensional phase MFCC feature. The Euclidean distance between the test template and the i-th reference template is D_{Mi} for the 3M-dimensional MFCC feature vector and P_{Li} for the 3L-dimensional phase MFCC feature vector, i = 0, 1, ..., j-1.

The 3L-dimensional phase MFCC feature vector is known to be more robust at low SNR, whereas at high SNR, and especially for clean speech, the 3M-dimensional MFCC feature vector is more robust.

Accordingly, the present invention uses an SNR-weighted method: different weights are used under different SNR conditions to combine the template distances of the two feature vectors into a weighted distance C_i.

C_i = (1-w) D_{Mi} + w P_{Li}, i = 0, 1, ..., j-1, (Formula 5)

Template matching is a search over the j reference templates for the template that attains \min\{C_i\}, i = 0, 1, ..., j-1.
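The search for the minimum weighted distance can be sketched as follows. The names `euclidean` and `match_template` are hypothetical, and each template is represented here as a single fixed-length vector per feature type, which is a simplifying assumption; the source does not say how templates of different durations are aligned.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_template(test_mfcc, test_pmfcc, references, w):
    """Return (index, C) of the reference template with the smallest C_i.

    references: list of (mfcc_vector, phase_mfcc_vector) pairs.
    w: weight of the phase MFCC distance, 0 <= w <= 1.
    """
    best_i, best_c = -1, float("inf")
    for i, (ref_mfcc, ref_pmfcc) in enumerate(references):
        D = euclidean(test_mfcc, ref_mfcc)     # D_Mi
        P = euclidean(test_pmfcc, ref_pmfcc)   # P_Li
        C = (1.0 - w) * D + w * P              # weighted distance C_i
        if C < best_c:
            best_i, best_c = i, C
    return best_i, best_c


# Toy two-template database (illustrative values only).
refs = [([0.0, 0.0], [1.0, 1.0]), ([1.0, 1.0], [0.0, 0.0])]
print(match_template([0.0, 0.0], [0.0, 0.0], refs, w=0.0))
print(match_template([0.0, 0.0], [0.0, 0.0], refs, w=1.0))
```

With w = 0 only the conventional MFCC distance counts, and with w = 1 only the phase MFCC distance counts, so the same test vector can match different templates depending on the SNR.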

w is the weight of the phase MFCC template distance, and its value is determined by the signal-to-noise ratio SNR, which is obtained as follows:

SNR = \log_{10}(\frac{||Y||^2}{||N||^2}) \cong \log_{10}(\frac{||Y||^2}{\overline{||N||^2}}), (Formula 6)

||Y||^2 = ||X||^2 + ||N||^2 \cong ||X||^2 + \overline{||N||^2}, (Formula 7)

where ||Y||^2 is the frame energy of the speech in the actual environment, ||N||^2 is the energy of the noise signal sampled in the actual environment, and \overline{||N||^2} is the estimate of that energy.
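The SNR estimate can be sketched as below. The function names are hypothetical, and taking the noise estimate as the mean frame energy over speech-free frames is an assumption; the source only states that an estimate of the noise energy is used. Note that the formula uses a plain base-10 logarithm of the energy ratio, without the factor of 10 used for decibels.

```python
import math


def frame_energy(frame):
    """Frame energy: sum of squared samples."""
    return sum(x * x for x in frame)


def estimate_noise_energy(noise_frames):
    """Assumption: average the frame energy over speech-free frames."""
    return sum(frame_energy(f) for f in noise_frames) / len(noise_frames)


def estimate_snr(noisy_frame, noise_energy):
    """SNR = log10(||Y||^2 / noise energy estimate)."""
    if noise_energy <= 0.0:
        raise ValueError("noise energy estimate must be positive")
    return math.log10(frame_energy(noisy_frame) / noise_energy)


# Illustrative values: noise frames of amplitude 0.1, speech frame of amplitude 1.
noise = estimate_noise_energy([[0.1] * 4, [0.1] * 4])
print(estimate_snr([1.0] * 4, noise))
```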

The value of w is determined by the SNR,

w = f(SNR), (Formula 8)

f(SNR) expresses the relation between the weight coefficient w and the SNR. f(SNR) takes values in (0, 1), and w is negatively correlated with the SNR; the relation may be linear or nonlinear. It can be expressed in either of the following two ways:

Mode one:

w = f(SNR) = \exp(-\frac{SNR - \alpha}{\gamma}) \cdot u(SNR - \alpha) + u(\alpha - SNR),

Mode two:

w = f(SNR) = 1 - \frac{1}{1 + \exp[-\frac{(SNR - \beta)}{\theta}]},

u(·) is the unit step function. α, with values in (1, 5), is the SNR threshold: when the SNR is below α the weight coefficient w is 1, and when the SNR is above α, w is negatively correlated with the SNR and decays exponentially, converging to 0 as the SNR grows. β, with values in (1, 10), is the critical SNR at which the conventional MFCC and the phase MFCC have equal weight. γ and θ, with values in (0.1, 1), both control the rate of change: the larger the value, the slower the change.
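The two weighting modes can be sketched as follows, following the behaviour described above (w equal to 1 below the threshold α and decaying exponentially above it for mode one; a logistic curve crossing 0.5 at β for mode two). The function names are hypothetical, the default parameter values are the ones used in the embodiments, and the branch on the sign of the exponent in mode two is only a numerical safeguard against overflow, not part of the source.

```python
import math


def weight_mode_one(snr, alpha=3.0, gamma=0.5):
    """w = 1 for SNR <= alpha, exp(-(SNR - alpha) / gamma) above it."""
    if snr <= alpha:
        return 1.0
    return math.exp(-(snr - alpha) / gamma)


def weight_mode_two(snr, beta=3.0, theta=0.5):
    """w = 1 - 1 / (1 + exp(-(SNR - beta) / theta)), computed stably."""
    z = (snr - beta) / theta
    if z >= 0.0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


for snr in (1.0, 3.0, 3.5, 5.0):
    print(snr, weight_mode_one(snr), weight_mode_two(snr))
```

Mode one keeps the full weight on the phase MFCC up to the threshold and then hands over quickly, while mode two changes smoothly on both sides of β.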

Another object of the present invention is achieved by the following technical solution: a device implementing the anti-noise speech recognition method based on SNR-weighted template feature matching, comprising a power module, a display module, a storage module, a DSP/ARM digital processing module, a microphone, an A/D converter and a USB interface. The storage module, the USB interface, the display module, the power module and one end of the A/D converter are each electrically connected to the DSP/ARM digital processing module, and the microphone is electrically connected to the other end of the A/D converter. The microphone inputs the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module displays the result, and the USB interface connects to a computer.

The A/D converter is an ADC0832 chip; the DSP/ARM digital processing module is a DSP/ARM7 chip.

The DSP/ARM7 chip is the TI TMS320C6711 or the Samsung ARM7 S3C44B0.

In addition to computing MFCC parameters from the conventional autocorrelation coefficients, the present invention replaces the autocorrelation coefficients with phase coefficients to obtain phase MFCC parameters, derives the corresponding feature vectors, and proposes an SNR-weighted template matching computation.

Compared with the prior art, the present invention has the following advantages and effects:

1. Wide range of application. The invention can be applied very widely and touches almost every aspect of daily life.

2. High accuracy. The invention exploits the fact that the phase MFCC is more robust at low SNR while the conventional MFCC is more robust at high SNR, especially for clean speech. It improves the distance measure used in feature matching and thereby the recognition accuracy, particularly at low SNR.

3. Low cost. All computation can be done on a single ordinary DSP or ARM chip.

4. Easy to use. The device plugs into any equipment that has a USB interface and is plug and play.

5. Strong adaptability. The device places no special requirements on the operating environment and works normally in most environments.

Brief description of the drawings

Fig. 1 is a module block diagram of the device of the invention.

Fig. 2 is a flow chart of preprocessing and feature extraction in the device of the invention.

Fig. 3 is a flow chart of the template matching module of the device of the invention.

Fig. 4 is a hardware structure diagram of the device of the invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but embodiments of the present invention are not limited thereto.

Embodiment 1

As shown in Fig. 1, the test speech first enters the preprocessing module and then the feature extraction module, which yields the feature vectors of the test speech. The MFCC and the PAC-MFCC (the phase MFCC) are input to the template matching module, which computes the weighted distance to each template in the reference database (the template matching flow is shown in Fig. 3) and finds the matching template with the smallest weighted distance. The result is finally output to the display module.

The preprocessing and feature extraction flows are shown in Fig. 2. Preprocessing performs pre-emphasis, digitization, framing and windowing. Feature extraction then extracts the features of each test speech frame: the autocorrelation coefficients and phase coefficients are computed, an FFT is applied, the result passes through the Mel filter bank and then a logarithmic transform and a discrete cosine transform (DCT), giving the conventional MFCC and the phase MFCC. In addition, the noise energy is estimated while no test speech is present in the actual environment, and the SNR of the environment is obtained from it.

The speech recognition device is implemented in the following steps:

Step 1: The test speech is digitized at a sampling frequency of 8 kHz and then pre-emphasized; each frame is 20 ms long, the frame shift is 10 ms, and the window is a Hamming window.
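As a worked illustration of these settings, the frame length N and frame shift K in samples follow directly from the sampling frequency. The one-second utterance length used for the frame count T is an illustrative choice, not a value from the source, and the count assumes a trailing partial frame is discarded.

```latex
N = 8000\,\text{Hz} \times 0.020\,\text{s} = 160 \text{ samples}, \qquad
K = 8000\,\text{Hz} \times 0.010\,\text{s} = 80 \text{ samples}

T = 1 + \left\lfloor \frac{8000 - 160}{80} \right\rfloor = 1 + 98 = 99 \text{ frames for one second of speech}
```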

Step 2: Each speech frame is analyzed: it is first periodically extended, and the normalized autocorrelation coefficients and phase coefficients are then computed according to (Formulas 1-3).

Step 3: An FFT is applied to the coefficients obtained to get the corresponding power spectra. Each of the two spectra is filtered by a 13-order Mel-scale filter bank and then passed through a logarithmic transform and a DCT, giving 13 static cepstral coefficients of the MFCC and 13 static cepstral coefficients of the phase MFCC. Their first and second derivatives are taken, giving 39-dimensional MFCC parameters and 39-dimensional phase MFCC parameters as the feature vectors.
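Going from 13 static coefficients to a 39-dimensional vector per frame can be sketched as below. The source does not specify how the derivatives are computed, so the central difference over neighbouring frames (with the edge frames repeated) is purely an assumption, and the names `deltas` and `add_derivatives` are hypothetical.

```python
def deltas(seq):
    """Central difference along time for a list of per-frame vectors.

    Assumption: edge frames are repeated, so the first and last
    differences span a single frame step.
    """
    T = len(seq)
    out = []
    for t in range(T):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_derivatives(static):
    """Append first and second derivatives to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Illustrative input: 5 frames of 13 coefficients that grow linearly in time.
static = [[float(t)] * 13 for t in range(5)]
features = add_derivatives(static)
print(len(features), len(features[0]))
```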

Step 4: While no test speech is present, the noise signal in the actual environment is collected and the noise energy is obtained. The SNR of the actual environment with test speech present is then estimated using (Formula 6) and (Formula 7).

Step 5: Compute the Euclidean distance D_M between the test template and the reference template for the 39-dimensional MFCC feature vector, and the Euclidean distance P_L between the test template and the reference template for the 39-dimensional phase MFCC feature vector.

Step 6: Compute the weight for the template distances of the two feature vectors according to (Formula 8), and finally obtain the weighted distance C according to (Formula 5).

The weight is computed as follows:

w = f(SNR) = \exp(-\frac{SNR - \alpha}{\gamma}) \cdot u(SNR - \alpha) + u(\alpha - SNR),

with the parameters α = 3, γ = 0.5.
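A short worked evaluation with these parameters, using the behaviour described for mode one (w = 1 below the threshold α, exponential decay above it). The three SNR values are illustrative choices, not values from the source.

```latex
SNR = 2 < \alpha: \quad w = 1

SNR = 3.5: \quad w = \exp\left(-\frac{3.5 - 3}{0.5}\right) = e^{-1} \approx 0.368

SNR = 5: \quad w = \exp\left(-\frac{5 - 3}{0.5}\right) = e^{-4} \approx 0.018
```

The weight of the phase MFCC distance therefore falls from 1 to under 2% within two SNR units above the threshold, and the conventional MFCC distance dominates from there on.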

As shown in Fig. 4, a device implementing the anti-noise speech recognition method based on SNR-weighted template feature matching comprises a power module, a display module, a storage module, a DSP/ARM digital processing module, a microphone, an A/D converter and a USB interface. The storage module, the USB interface, the display module, the power module and one end of the A/D converter are each electrically connected to the DSP/ARM digital processing module, and the microphone is electrically connected to the other end of the A/D converter. The microphone inputs the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module displays the result, and the USB interface connects to a computer. The A/D converter is an ADC0832 chip; the DSP/ARM digital processing module is a DSP/ARM7 chip. The DSP/ARM7 chip is the TI TMS320C6711 or the Samsung ARM7 S3C44B0.

Embodiment 2

This embodiment is identical to Embodiment 1 except for the following:

The weight is computed as follows:

w = f(SNR) = 1 - \frac{1}{1 + \exp[-\frac{(SNR - \beta)}{\theta}]},

with the parameters β = 3, θ = 0.5.
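A short worked evaluation of the mode-two expression with these parameters. The three SNR values are illustrative choices, not values from the source.

```latex
SNR = 2: \quad w = 1 - \frac{1}{1 + e^{2}} \approx 0.881

SNR = 3 = \beta: \quad w = 1 - \frac{1}{1 + e^{0}} = 0.5

SNR = 4: \quad w = 1 - \frac{1}{1 + e^{-2}} \approx 0.119
```

The curve is symmetric about β, so the conventional MFCC and the phase MFCC exchange roles smoothly rather than switching at a hard threshold.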

The embodiments above are preferred embodiments of the present invention, but embodiments of the present invention are not limited by them. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the scope of protection of the present invention.

Claims

1. An anti-noise speech recognition method based on SNR-weighted template feature matching, comprising the following steps:

Step 1: Preprocess the input speech signal and obtain the phase coefficient;

Step 2: Compute the feature of the input speech, namely the phase MFCC;

Step 3: Perform SNR-based template feature matching;

characterized in that Step 1 comprises the following steps:

Step A: the digitized speech signal s[n] is divided into frames and windowed with a Hamming window, giving T frames:

\{s_0[n], s_1[n], ..., s_t[n], ..., s_{T-1}[n]\},

where:

s_t[n] = \{s[Kt], s[Kt+1], ..., s[Kt+N-1]\}, K is the frame shift, N is the frame length, and s_t[n] is the frame signal sequence at time t;

Step B: each frame is periodically extended, giving the autocorrelation function:

R[k] = \sum_{n=0}^{N-1} \tilde{s}_t[n] \tilde{s}_t[n+k], k = 0, 1, ..., N-1;

From the expression of the autocorrelation function, R[k] is the dot product of two N-dimensional vectors,

x_0 = \{\tilde{s}_t[0], \tilde{s}_t[1], ..., \tilde{s}_t[N-1]\},

x_k = \{\tilde{s}_t[k], ..., \tilde{s}_t[N-1], \tilde{s}_t[0], ..., \tilde{s}_t[k-1]\},

R[k] = x_0^T x_k = ||x||^2 \cos(\theta_k),

where ||x||^2 = ||x_0||^2 = ||x_k||^2 is the frame energy and \theta_k is the angle between the vectors x_0 and x_k in N-dimensional space;

Step C: the nonlinear arccosine transformation is applied to the normalized autocorrelation coefficient, giving the phase coefficient:

P[k] = \theta_k = \cos^{-1}(\frac{R[k]}{||x||^2}),

P[k] takes values between 0 and π and is normalized to the range 0 to 1, giving the normalized phase autocorrelation function:

P_n[k] = \frac{P[k]}{\pi} = \frac{\cos^{-1}(R_n[k])}{\pi} = \frac{\cos^{-1}(\frac{R[k]}{||x||^2})}{\pi},

where P_n[k] is used to improve robustness at low SNR.

2. The anti-noise speech recognition method based on SNR-weighted template feature matching according to claim 1, characterized in that Step 2 comprises the following steps:

Step I: a DFT is applied to P_n[k], giving the phase power spectrum S_p[l]:

S_p[l] = \sum_{k=0}^{N-1} P_n[k] \exp(-j \frac{2\pi}{N} k l),

where S_p[l] denotes the phase power spectrum, and the Mel-frequency cepstral coefficients obtained from it are called the phase MFCC, that is: the spectrum is filtered by a Mel-scale filter bank and the logarithm is then taken;

Step II: after the information of each frequency band has been separated, a discrete cosine transform converts the frequency-domain features to the time domain, giving the phase MFCC parameters; the phase MFCC parameters consist of L static cepstral coefficients together with their first and second derivatives, 3L dimensions in total.

3. The anti-noise speech recognition method based on SNR-weighted template feature matching according to claim 1, characterized in that Step 3 comprises the following steps:

Step (1): the reference database holds j reference speech templates, each containing a 3M-dimensional MFCC feature vector and a 3L-dimensional phase MFCC feature vector; the Euclidean distance between the test template and the i-th reference template is D_{Mi} for the 3M-dimensional MFCC feature vector and P_{Li} for the 3L-dimensional phase MFCC feature vector, i = 0, 1, ..., j-1;

Step (2): under different SNR conditions, different weights are used to obtain the weighted distance C_i of the template distances of the two feature vectors:

C_i = (1-w) D_{Mi} + w P_{Li}, i = 0, 1, ..., j-1,

where w is the weight of the phase MFCC template distance; template matching is a search over the j reference templates for the template that attains \min\{C_i\}, i = 0, 1, ..., j-1;

The signal-to-noise ratio SNR is obtained from:

SNR = \log_{10}(\frac{||Y||^2}{||N||^2}) \cong \log_{10}(\frac{||Y||^2}{\overline{||N||^2}}),

||Y||^2 = ||X||^2 + ||N||^2 \cong ||X||^2 + \overline{||N||^2},

where ||Y||^2 is the frame energy of the speech in the actual environment, ||N||^2 is the energy of the noise signal sampled in the actual environment, and \overline{||N||^2} is the estimate of that energy;

The value of w is determined by the SNR:

w = f(SNR),

where f(SNR) expresses the relation between the weight coefficient w and the SNR, f(SNR) takes values in (0, 1), and w is negatively correlated with the SNR, linearly or nonlinearly.

4. The anti-noise speech recognition method based on SNR-weighted template feature matching according to claim 3, characterized in that f(SNR) is expressed as follows:

w = f(SNR) = \exp(-\frac{SNR - \alpha}{\gamma}) \cdot u(SNR - \alpha) + u(\alpha - SNR),

where u(·) is the unit step function; α, with values in (1, 5), is the SNR threshold; when the SNR is below α the weight coefficient w is 1; when the SNR is above α, w is negatively correlated with the SNR and decays exponentially, converging to 0 as the SNR grows.

5. The anti-noise speech recognition method based on SNR-weighted template feature matching according to claim 3, characterized in that f(SNR) is expressed as follows:

w = f(SNR) = 1 - \frac{1}{1 + \exp[-\frac{(SNR - \beta)}{\theta}]},

where β, with values in (1, 10), is the critical SNR at which the conventional MFCC and the phase MFCC have equal weight; γ and θ take values in (0.1, 1) and control the rate of change: the larger the value of γ or θ, the slower the change.

6. A device implementing the anti-noise speech recognition method based on SNR-weighted template feature matching according to claim 1, characterized by comprising: a power module, a display module, a storage module, a DSP/ARM digital processing module, a microphone, an A/D converter and a USB interface; the storage module, the USB interface, the display module, the power module and one end of the A/D converter are each electrically connected to the DSP/ARM digital processing module, and the microphone is electrically connected to the other end of the A/D converter; the microphone inputs the test speech, the A/D converter digitizes the test speech, the DSP/ARM chip extracts the features and performs template matching, the storage module stores the reference database, the display module displays the result, and the USB interface connects to a computer.

7. The device according to claim 6, characterized in that the A/D converter is an ADC0832 chip and the DSP/ARM digital processing module is a DSP/ARM7 chip.

8. The device according to claim 7, characterized in that the DSP/ARM7 chip is the TI TMS320C6711 or the Samsung ARM7 S3C44B0.