CN102565759B - Binaural sound source localization method based on sub-band signal to noise ratio estimation

Info

Publication number: CN102565759B
Application number: CN 201110448129
Authority: CN
Inventors: 周琳; 周菲菲; 吴镇扬
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2011-12-29
Filing date: 2011-12-29
Publication date: 2013-10-30
Anticipated expiration: 2031-12-29
Also published as: CN102565759A

Abstract

A binaural sound source localization method based on sub-band signal-to-noise ratio (SNR) estimation. It is an improved sound source localization method in which the mean interaural time difference (ITD) of each azimuth serves as the localization cue for building an azimuth mapping model. During actual localization, a two-channel sound signal is input and first transformed to the frequency domain; the frequency domain is divided into several sub-bands and the SNR is estimated in each sub-band. According to the sub-band SNR, the power spectrum of the selected sub-bands is used to calculate the ITD parameter of each frame. The ITD feature parameter is matched one by one against the azimuth feature model built by the training module, and the azimuth is output based on the Euclidean distance measure. The method improves sound source localization performance in noisy environments.

Description

A binaural sound source localization method based on sub-band SNR estimation

Technical field

The invention belongs to the technical field of sound source localization and relates to a binaural sound source localization method based on sub-band SNR estimation.

Background technology

As an emerging interdisciplinary field, sound source localization can assist the transmission and recognition of visual information and increase the fidelity of simulated three-dimensional environments. The main localization algorithms at present are those based on multi-microphone arrays and those based on two-channel (binaural) signals. Multi-microphone array algorithms suffer from a large computational load, a large array size, and strong sensitivity to factors such as reverberation. Localization methods based on two-channel sound signals simulate the auditory characteristics of the human ear and can achieve fairly accurate localization. The most representative of these is interaural time difference (ITD) estimation based on cross-correlation; for noisy signals, however, the localization performance of cross-correlation-based ITD estimation degrades sharply.

Summary of the invention

The problem to be solved by the invention is as follows: existing multi-microphone array localization algorithms suffer from a large computational load, a large array size, and strong sensitivity to factors such as reverberation, while existing localization methods based on two-channel sound signals perform inadequately on noisy signals.

The technical solution of the invention is a binaural sound source localization method based on sub-band SNR estimation. Data training is performed first: the training data are sound signals of known azimuth; through feature extraction, the ITD parameter of the sound signal at each azimuth is estimated, and the mean ITD over the multiple frames of each azimuth is taken as the parameter of the vector quantization (VQ) model of that azimuth's ITD, thereby building the azimuth mapping model. During actual localization, a two-channel sound signal is input and first transformed to the frequency domain, where it is divided into several sub-bands; the SNR of each sub-band is estimated and compared with a set SNR threshold; the sub-bands whose SNR exceeds the threshold are selected and the sub-band ITD feature parameter is calculated; this parameter is matched one by one against the azimuth mapping model built by data training, and the azimuth is estimated and output based on the Euclidean distance.

The specific steps are:

1) Data training:

11) Use the head-related impulse response (HRIR) data of the KEMAR small pinna for 37 azimuths on the right side of the horizontal plane, i.e. θ = 0° to 180°, and convolve them with white noise to generate virtual sound of known azimuth;

12) Preprocess the virtual sound obtained in step 11), including amplitude normalization, pre-emphasis, framing and windowing, to obtain a stationary single-frame signal from each frame of the sound signal of each azimuth;

13) Perform endpoint detection on the stationary single-frame signals of step 12) to obtain valid single-frame signals;

14) Calculate the ITD feature parameter of each single-frame signal to obtain the ITD training samples;

15) From the ITD training samples of step 14), take the mean of the ITD training samples over the multiple frames of each azimuth as the parameter of the VQ model of the corresponding azimuth's ITD, and build the azimuth mapping model;

2) Localization of the sound source to be localized:

21) Preprocess the acquired sound signal, including amplitude normalization, pre-emphasis, framing and windowing, to obtain a stationary single-frame signal from each frame of the sound signal;

22) Perform endpoint detection on the single-frame signals of step 21) to obtain valid single-frame signals;

23) Apply the FFT to the valid single-frame signals of step 22), divide the spectrum into several sub-bands, and estimate the SNR of each sub-band; the sub-bands are divided uniformly, into 7 to 13 sub-bands;

24) Compare the SNR of each sub-band with the set SNR threshold, set the amplitude of the sub-bands below the threshold to 0, select the sub-bands whose SNR exceeds the threshold, and calculate the sub-band ITD feature parameter;

25) Match the sub-band ITD feature parameter one by one against the azimuth mapping model built by data training, and estimate and output the azimuth according to the Euclidean distance.

Compared with existing localization techniques based on two-channel sound signals, the proposed method markedly improves sound source localization performance under noise. At an SNR of 0 dB, the correct localization rate of the invention reaches 89%, whereas that of the original method is only 63%; at an SNR of 10 dB, the correct localization rate of the invention reaches 94%, whereas that of the original method is 82%.

Description of drawings

Fig. 1 is a schematic diagram of the spatial coordinate system used for sound source localization in the invention.

Fig. 2 is a block diagram of the localization system of the invention.

Embodiment

The invention performs data training first. The training data are sound signals of known azimuth; through feature extraction, the ITD parameter of the sound signal at each azimuth is estimated, and the mean ITD over the multiple frames of each azimuth is taken as the parameter of the vector quantization (VQ) model of that azimuth's ITD, thereby building the azimuth mapping model. During actual localization, a two-channel sound signal is input and first transformed to the frequency domain by the Fast Fourier Transform (FFT); the frequency domain is divided into several sub-bands, the SNR of each sub-band is estimated and compared with a set SNR threshold, the sub-bands whose SNR exceeds the threshold are selected, and the sub-band ITD feature parameter is calculated. This parameter is matched one by one against the azimuth mapping model built by data training, and the azimuth is estimated and output based on the Euclidean distance.

Fig. 1 is a schematic diagram of the spatial coordinate system used for sound source localization. In the invention, the sound source position is uniquely determined by the coordinates (r, θ, φ). Here 0 ≤ r < +∞ is the distance between the sound source and the origin. The elevation angle −90° ≤ φ ≤ +90° is the angle between the direction vector and the horizontal plane; φ = −90°, 0° and +90° denote directly below, the horizontal plane and directly above, respectively. The azimuth angle 0° ≤ θ < 360° is the angle between the projection of the direction vector onto the horizontal plane and the median plane. In the horizontal plane, θ = 0° denotes directly ahead, and, proceeding clockwise, θ = 90°, 180° and 270° denote directly right, directly behind and directly left, respectively.

The method of the invention comprises two major steps, data training and sound source localization:

1) data training:

The implementation of the technical solution is described in detail below with reference to the drawings, following the implementation steps of the invention:

Fig. 2 shows the block diagram of sound source localization from two-channel sound signals based on SNR estimation. The HRTF (Head-Related Transfer Function) is convolved with white noise to produce the directional virtual sound signals used for training. The processing flows for the sound signal in the training and test stages are marked separately in the figure; the function and implementation of each module are described below.

1. Preprocessing module, corresponding to the preprocessing in steps 12) and 21):

Because the sound signal captured by the acquisition device may contain considerable electronic noise and background noise, preprocessing is needed to limit the influence of noise on subsequent signal analysis. The preprocessing in this method comprises amplitude normalization, pre-emphasis, framing and windowing. The invention uses a frame length of 30 ms and a frame shift of 10 ms.

Pre-emphasis uses the first-order digital filter H(z) = 1 − μ z^{-1}, where μ = 0.97. The method applies a Hamming window to the framed sound signal. The n-th frame after windowing can be expressed as

x_{n} (m) = w_{H} (m) x (n N + m), \quad 0 \leq m < N

where N = 1323 is the number of samples in one frame and

w_{H} (m) = \begin{cases} 0.54 - 0.46 \cos [2 \pi m / (N - 1)] & 0 \leq m < N \\ 0 & m \geq N \end{cases}

is the Hamming window.
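The preprocessing chain described above can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the patent's implementation: the function name `preprocess` is hypothetical, the 44.1 kHz sampling rate is inferred from N = 1323 samples per 30 ms frame, and frames are advanced by the stated 10 ms frame shift (441 samples).

```python
import numpy as np

def preprocess(signal, frame_len=1323, frame_shift=441, mu=0.97):
    """Amplitude-normalize, pre-emphasize, frame and Hamming-window a 1-D signal.

    Assumes a non-empty signal sampled at 44.1 kHz, so that 30 ms is
    1323 samples and 10 ms is 441 samples.
    """
    x = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                              # amplitude normalization
    # Pre-emphasis H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]
    y = np.append(x[0], x[1:] - mu * x[:-1])
    if len(y) < frame_len:
        return np.empty((0, frame_len))           # too short for one frame
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    # Row n holds samples n*frame_shift ... n*frame_shift + frame_len - 1
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    window = np.hamming(frame_len)                # 0.54 - 0.46 cos(2*pi*m/(N-1))
    return y[idx] * window
```

Each row of the returned array is one windowed frame x_n(m), ready for endpoint detection and the FFT.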

2. Endpoint detection module, corresponding to the endpoint detection in steps 13) and 22):

The purpose of endpoint detection is to find the start and end points of the valid signal within a received segment of sound, so that only the valid signal is processed. Accurate endpoint detection not only reduces the amount of stored data and the processing time, but also excludes interference from silent segments and noise. The method combines the short-time energy and zero-crossing rate features and performs detection on a single-ear signal. Endpoint detection by combining short-time energy and zero-crossing rate is prior art and is only briefly introduced here:

Short-time energy is the energy contained in one frame of the signal, calculated as

E_{n} = Σ_{m = 0}^{N - 1} {| x_{n} (m) |}^{2} = Σ_{k = 0}^{N - 1} {| X_{n} (k) |}^{2}

where x_n(m), m = 0, 1, …, N−1, is the n-th frame of the acquired sound signal after preprocessing, and X_n(k), k = 0, 1, …, N−1, is the corresponding frequency-domain signal. The short-time energy threshold may be set to a fixed value, or the multi-frame average energy may be used as the decision threshold.

The short-time zero-crossing rate is the number of times the waveform of one frame crosses the zero level, expressed as a fraction of the frame length. For a discrete signal it suffices to compare the signs of adjacent samples, and the calculation formula is

Z_{n} = \frac{1}{2 N} Σ_{m = 1}^{N - 1} | sgn {x_{n} (m)} - sgn {x_{n} (m - 1)} |

where sgn(x) is the sign function. The decision thresholds used in the invention are Z_min = 0.01 and Z_max = 0.4; the lower limit Z_min is set to filter out part of the silent frames.

Frames whose short-time energy and zero-crossing rate lie within the decision thresholds are valid signal, from which the start and end positions of the sound segment can be determined.
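The two endpoint-detection features can be computed per frame as in the sketch below. The function names are hypothetical, and the way the energy threshold is applied (a frame is kept when its energy is at least the threshold, which defaults to the multi-frame average) is an assumption, since the text only states which thresholds may be used.

```python
import numpy as np

def short_time_energy(frame):
    """E_n = sum over m of |x_n(m)|^2 for one frame."""
    return float(np.sum(np.abs(frame) ** 2))

def zero_crossing_rate(frame):
    """Z_n = (1 / 2N) * sum over m of |sgn x_n(m) - sgn x_n(m-1)|."""
    s = np.sign(frame)
    return float(np.sum(np.abs(s[1:] - s[:-1])) / (2 * len(frame)))

def valid_frames(frames, energy_threshold=None, z_min=0.01, z_max=0.4):
    """Boolean mask of frames passing both the energy and the ZCR test."""
    energy = np.array([short_time_energy(f) for f in frames])
    zcr = np.array([zero_crossing_rate(f) for f in frames])
    if energy_threshold is None:
        energy_threshold = energy.mean()   # multi-frame average energy
    # Assumption: "within the thresholds" means energy at or above the
    # threshold and Z_min <= Z_n <= Z_max.
    return (energy >= energy_threshold) & (zcr >= z_min) & (zcr <= z_max)
```

The first and last `True` entries of the mask then mark the start and end of the valid sound segment.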

3. Sub-band signal-to-noise ratio (SNR) estimation module, corresponding to step 23):

The valid signal found by endpoint detection is transformed to the frequency domain, the frequency domain is divided into several sub-bands, and the SNR is estimated in each sub-band. The sub-bands are divided uniformly, into 7 to 13 sub-bands in the invention. The specific formulas are as follows:

The frequency-domain signal model can be written in vector form:

X (k) = S (k) + N (k)

X (k) = \{X_{l} (k), X_{r} (k)\}^{T}

S (k) = \{S_{l} (k), S_{r} (k)\}^{T}

where X(k) is the noisy speech, S(k) is the clean sound signal, N(k) is the noise, and k is the frequency index. The subscripts l and r denote the left and right channel signals, respectively.

For two-channel sound signals, the propagation path attenuates sound of different frequencies differently. Moreover, unlike multi-microphone array methods, binaural localization has only two channels of signal. The method therefore estimates the SNR within sub-bands: one frame is divided into several sub-bands in the frequency domain, the covariance matrix of each sub-band is estimated, and the SNR of each frequency band is then calculated from the covariance matrix. From the vector form of the frequency-domain signal model, the covariance matrix of the i-th sub-band is

R = \begin{bmatrix} R_{1} & R_{2} \\ R_{3} & R_{4} \end{bmatrix} = E \{X_{i} (k) X_{i}^{T} (k)\} = \begin{bmatrix} P_{li} + σ^{2} & \sqrt{P_{li} P_{ri}} \\ \sqrt{P_{li} P_{ri}} & P_{ri} + σ^{2} \end{bmatrix} = \begin{bmatrix} P_{li} + σ^{2} & \sqrt{IID} \, P_{li} \\ \sqrt{IID} \, P_{li} & IID \cdot P_{li} + σ^{2} \end{bmatrix}

where X_i(k) is the frequency-domain vector formed by the left and right channel signals of the i-th sub-band; P_li, P_ri and σ² denote the energies of the left and right sound signals and the noise power spectral density of the i-th sub-band, respectively; and IID is the interaural intensity difference of the sound signal in that sub-band.

The signal and noise power spectral densities of the i-th sub-band can be derived from the above expression. Solving the equation

P_{li}^{2} + (R_{4} - R_{1}) P_{li} - R_{2}^{2} = 0

yields P_li, and then

σ^{2} = R_{1} - P_{li}, \quad P_{ri} = R_{4} - σ^{2}

From these, the SNR of the i-th sub-band is obtained.
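The covariance-based estimate can be illustrated with the following sketch, which solves the quadratic for P_li and then recovers σ² and P_ri. Several points are assumptions that the text does not specify: the function names, the use of the sample mean over the bins of a sub-band as the expectation, taking |R_2| for the off-diagonal term (so that an interaural phase shift does not matter), and the final SNR definition (P_li + P_ri) / (2σ²) in dB.

```python
import numpy as np

def powers_from_covariance(R1, R2, R4):
    """Solve P_l^2 + (R4 - R1) P_l - |R2|^2 = 0 for the positive root,
    then sigma^2 = R1 - P_l and P_r = R4 - sigma^2."""
    b = R4 - R1
    P_l = 0.5 * (-b + np.sqrt(b * b + 4.0 * abs(R2) ** 2))
    sigma2 = R1 - P_l
    P_r = R4 - sigma2
    return P_l, P_r, sigma2

def subband_snr_db(X_left, X_right, floor=1e-12):
    """SNR estimate (dB) for one sub-band from its left/right spectra."""
    X = np.vstack([X_left, X_right])
    R = X @ X.conj().T / X.shape[1]        # sample estimate of the 2x2 covariance
    P_l, P_r, sigma2 = powers_from_covariance(R[0, 0].real, R[0, 1], R[1, 1].real)
    sigma2 = max(sigma2, floor)            # guard against non-positive estimates
    # Assumed SNR definition: mean signal power of both ears over noise power.
    return 10.0 * np.log10(max(P_l + P_r, floor) / (2.0 * sigma2))
```

For example, with R_1 = 4.5, R_2 = 2 and R_4 = 1.5 the quadratic gives P_li = 4, hence σ² = 0.5 and P_ri = 1, which is consistent with R_2 = sqrt(P_li P_ri).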

In sub-band SNR estimation, the spectrum of the binaural signal itself exhibits an interaural intensity difference that differs between sub-bands. The sub-band size therefore determines the performance of the algorithm.

The choice of the number of sub-bands depends on factors such as the type of source signal and the SNR level. The number must be moderate. On the one hand, if there are too many sub-bands, each contains too few frequency bins, so that at low SNR more unreliable frequency bins are admitted, which degrades the algorithm. On the other hand, because the frequency data of an entire sub-band are discarded when its average SNR is low, the number of sub-bands should not be too small either. Simulation test environments with different parameters were set up, and, weighing the test results, the invention adopts 7 to 13 sub-bands.
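Uniform division of the spectrum into sub-bands can be sketched as below. The helper name `uniform_subbands` and the default of 10 sub-bands (a value inside the stated 7 to 13 range) are illustrative choices, and the bin count of 662 assumes a real-input FFT of a 1323-sample frame.

```python
import numpy as np

def uniform_subbands(n_bins, n_subbands=10):
    """Split FFT bin indices 0..n_bins-1 into contiguous, near-equal sub-bands."""
    return np.array_split(np.arange(n_bins), n_subbands)

# A 1323-sample real frame has 1323 // 2 + 1 = 662 non-negative-frequency bins.
bands = uniform_subbands(662, 10)
```

With fewer sub-bands each band holds more bins, so one low SNR estimate discards more of the spectrum; with more sub-bands each estimate rests on fewer bins. This is the trade-off described above.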

4. ITD feature extraction module, corresponding to the calculation of the ITD feature parameter in steps 14) and 24):

After preprocessing and endpoint detection, the binaural sound signal is input to the ITD feature extraction module together with the SNR of each sub-band of each frame. A constant SNR threshold is used, and the ITD is calculated from the frequency bands whose SNR exceeds the threshold. When extracting the localization cue, the spectral components with high SNR are selected for ITD estimation and those with low SNR are discarded, which effectively improves the accuracy of cue extraction from noisy signals and thus the localization performance.

The ITD estimation procedure and formulas for the i-th frame are as follows:

(1) From the sub-band SNR and the threshold, calculate the SNR indicator SNRIndex of each frequency bin; SNRIndex is 1 for bins in sub-bands whose SNR is above the threshold and 0 otherwise.

(2) Modify the left and right channel spectra according to the SNR indicator. In the binaural signal spectrum, the spectrum of sub-bands whose SNR is below the threshold is set to 0:

P_ll = P_l .* SNRIndex

P_rr = P_r .* SNRIndex

where P_l and P_r are the left and right channel spectra, and P_ll and P_rr are the modified left and right channel spectra.

(3) Estimate the ITD using the generalized cross-correlation method.

The cross-spectral density P_lr of the left and right sound signals is calculated as P_lr = P_ll^* P_rr. Applying the IFFT to P_lr gives the cross-correlation function R_lr(k), where R_lr(k) denotes the cross-correlation of the binaural signal at a time difference of k samples.

From this, the ITD estimate of the i-th frame can be calculated.
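Steps (1) to (3) can be sketched for a single frame as follows. This is an illustration under stated assumptions, not the patent's exact procedure: the function name is hypothetical, SNRIndex is taken as 1 for bins of sub-bands above the threshold and 0 otherwise, the left spectrum is the one conjugated in the cross-spectrum, the ITD is read off as the lag that maximizes R_lr(k) within ±`max_lag` samples (about 1 ms at an assumed 44.1 kHz), and the frame-length FFT makes the correlation circular.

```python
import numpy as np

def estimate_itd(left, right, band_snr_db, bands, threshold_db=0.0,
                 fs=44100, max_lag=44):
    """ITD (seconds) of one frame from SNR-masked generalized cross-correlation.

    bands       : list of index arrays over the rfft bins, one per sub-band
    band_snr_db : estimated SNR of each sub-band, in dB
    max_lag     : search range in samples, must be at least 1
    """
    n = len(left)
    P_l = np.fft.rfft(left)
    P_r = np.fft.rfft(right)
    # (1) SNR indicator per frequency bin
    snr_index = np.zeros(P_l.shape[0])
    for band, snr in zip(bands, band_snr_db):
        if snr > threshold_db:
            snr_index[band] = 1.0
    # (2) zero the spectrum of low-SNR sub-bands
    P_ll = P_l * snr_index
    P_rr = P_r * snr_index
    # (3) cross-spectrum, then inverse FFT gives the cross-correlation R_lr(k)
    P_lr = np.conj(P_ll) * P_rr
    r_lr = np.fft.irfft(P_lr, n=n)
    lags = np.concatenate([np.arange(0, max_lag + 1), np.arange(-max_lag, 0)])
    values = np.concatenate([r_lr[:max_lag + 1], r_lr[-max_lag:]])
    # Positive result: the right-channel signal lags the left-channel signal.
    return lags[np.argmax(values)] / fs
```

With this sign convention, a right channel that is a 5-sample delayed copy of the left channel yields an ITD of 5 / fs seconds, and the peak stays at the same lag when some sub-bands are masked out.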

5. Training module, corresponding to step 15):

The training module builds the statistical model of the localization feature. Its input is sound signals of known azimuth; through the feature extraction process, the ITD parameter of the sound signal at each azimuth is estimated. The mean ITD over the multiple frames of each azimuth is taken as the parameter of the VQ model of that azimuth's ITD.

The invention uses as training data the virtual sound generated by convolving white noise with the HRIR data measured by the MIT Media Lab. The virtual sound signals for training are obtained from the KEMAR small-pinna HRIR data for 37 azimuths on the right side of the horizontal plane (θ from 0° to 180°), with an angular interval of 5°.
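The training stage can be sketched as two small helpers: one that generates a virtual binaural signal by convolving white noise with a left/right HRIR pair, and one that reduces the per-frame ITDs of each azimuth to their mean. The function names are hypothetical and the HRIR arrays are supplied by the caller; no loader for the MIT Media Lab data is shown, because its file format is outside the scope of this description.

```python
import numpy as np

# 37 training azimuths: 0 to 180 degrees in 5 degree steps
AZIMUTHS_DEG = np.arange(0, 181, 5)

def make_virtual_sound(hrir_left, hrir_right, n_samples=44100, seed=0):
    """Convolve one white-noise sequence with a left/right HRIR pair."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    return np.convolve(noise, hrir_left), np.convolve(noise, hrir_right)

def train_vq_model(frame_itds_by_azimuth):
    """Map each azimuth to the mean ITD of its training frames."""
    return {az: float(np.mean(itds))
            for az, itds in frame_itds_by_azimuth.items()}
```

In use, each virtual sound would pass through preprocessing, endpoint detection and ITD extraction, and the resulting per-frame ITDs for each of the 37 azimuths would be handed to `train_vq_model`.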

6. Localization module, corresponding to step 25):

The localization module matches the sound signal under test one by one against the feature model of each azimuth built by the training module and finds the azimuth of maximum likelihood. Localization proceeds in the following steps:

(1) Calculate the SNR of each sub-band of each frame of the sound signal to be localized;

(2) Apply the FFT to the sound signal to be localized and set the amplitude of the frequency bands below the SNR threshold to 0;

(3) Extract the ITD feature parameter of the sound signal to be localized;

(4) Using the ITD feature parameter, search for the minimum Euclidean distance within the azimuth ranges 0° to 90° and 270° to 360°, and output the localized azimuth:

p^{*} = \arg \min_{1 \leq p \leq P} d (x, λ_{p})

In the above formula, λ_p (p = 1, 2, …, P, where P is the number of azimuths) is the model ITD value, x is the measured ITD value, and p* is the output position of the frontal sound source.
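The minimum-distance search can be sketched as a nearest-neighbour lookup over the trained model. The function name `localize`, the dictionary representation of the model and the example ITD values are hypothetical; they are not measured data from the invention.

```python
import numpy as np

def localize(x, model, candidates=None):
    """Return the azimuth p* whose model ITD is closest to the measured x.

    model      : dict mapping azimuth (degrees) to model ITD value(s)
    candidates : optional subset of azimuths to restrict the search to
    """
    azimuths = sorted(model) if candidates is None else list(candidates)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Euclidean distance d(x, lambda_p); for a scalar ITD this is |x - lambda_p|
    dists = [np.linalg.norm(x - np.atleast_1d(model[az])) for az in azimuths]
    return azimuths[int(np.argmin(dists))]

# Hypothetical model ITDs in seconds, for illustration only
example_model = {0: 0.0, 30: 2.5e-4, 60: 5.0e-4, 90: 6.5e-4}
best = localize(4.8e-4, example_model)
```

The `candidates` argument is one way to confine the search to a chosen set of azimuths, such as the frontal ranges named in step (4).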

A localization system was built according to the above framework; data training was performed first, and the system was then used for binaural sound source localization. Experimental comparison shows that, compared with existing localization techniques based on two-channel sound signals, the proposed method markedly improves sound source localization performance under noise. At an SNR of 0 dB, the correct localization rate of the invention reaches 89%, whereas that of the prior-art method is only 63%; at an SNR of 10 dB, the correct localization rate of the invention reaches 94%, whereas that of the prior-art method is 82%.

Claims

1. A binaural sound source localization method based on sub-band SNR estimation, characterized in that data training is performed first, the training data being sound signals of known azimuth; through feature extraction, the ITD parameter of the sound signal at each azimuth is estimated, and the mean ITD over the multiple frames of each azimuth is taken as the parameter of the vector quantization (VQ) model of that azimuth's ITD, thereby building the azimuth mapping model; during actual localization, a two-channel sound signal is input and first transformed to the frequency domain, where it is divided into several sub-bands; the SNR of each sub-band is estimated and compared with a set SNR threshold; the sub-bands whose SNR exceeds the threshold are selected and the sub-band ITD feature parameter is calculated; this parameter is matched one by one against the azimuth mapping model built by data training, and the azimuth is estimated and output based on the Euclidean distance; the specific steps comprising:

1) Data training:

12) preprocessing the virtual sound obtained in step 11), including amplitude normalization, pre-emphasis, framing and windowing, to obtain a stationary single-frame signal from each frame of the sound signal of each azimuth;

13) performing endpoint detection on the stationary single-frame signals of step 12) to obtain valid single-frame signals;