CN104599679A - Speech signal based focus covariance matrix construction method and device - Google Patents
- Publication number
- CN104599679A CN104599679A CN201510052368.7A CN201510052368A CN104599679A CN 104599679 A CN104599679 A CN 104599679A CN 201510052368 A CN201510052368 A CN 201510052368A CN 104599679 A CN104599679 A CN 104599679A
- Authority
- CN
- China
- Prior art keywords
- matrix
- covariance matrix
- sampling frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Abstract
The invention discloses a speech signal based focus covariance matrix construction method and device. The method includes: determining the sampling frequency points adopted when a microphone array collects a voice signal; for any one of the determined sampling frequency points, calculating a first covariance matrix of the voice signal acquired at that sampling frequency point, a focus transformation matrix, and the conjugate transpose matrix of the focus transformation matrix, and taking the product of the first covariance matrix, the focus transformation matrix and the conjugate transpose matrix of the focus transformation matrix as the focus covariance matrix of the voice signal acquired at that sampling frequency point; and taking the sum of the focus covariance matrices calculated for all sampling frequency points as the focus covariance matrix of the voice signal. Because the incidence angle of the sound source does not need to be predicted when the focus covariance matrix is constructed, and such prediction inevitably carries errors, the accuracy of the constructed focus covariance matrix is improved.
Description
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a method and a device for constructing a focus covariance matrix based on voice signals.
Background
Compared with a single microphone, the microphone array can utilize time domain and frequency domain information of a sound source and also can utilize spatial information of the sound source, so that the microphone array has the advantages of strong anti-interference capability, flexible application and the like, has stronger advantages in the aspects of solving the problems of sound source positioning, speech enhancement, speech recognition and the like, and is widely applied to the fields of audio and video conference systems, vehicle-mounted systems, hearing-aid devices, human-computer interaction systems, robot systems, security monitoring, military reconnaissance and the like.
In microphone-array-based speech processing technology, the number of sound sources often needs to be known in order to obtain high processing performance; if the number of sound sources is unknown, or is assumed to be too large or too small, the accuracy of the processing result of the speech acquired by the microphone array is degraded.
In order to improve the accuracy of the processing result of the speech acquired by the microphone array, a method for estimating the number of sound sources has been proposed, and a focus covariance matrix needs to be constructed in the course of this estimation. At present, however, constructing the focus covariance matrix requires first predicting the incident angle of the sound source; the focus covariance matrix is then constructed according to the predicted incident angle, and the number of sound sources is estimated. If the error in the predicted incident angle of the sound source is large, the accuracy of the constructed focus covariance matrix is low.
Disclosure of Invention
The embodiment of the invention provides a method and a device for constructing a focusing covariance matrix based on a voice signal, which are used for overcoming the prior-art defect that the constructed focusing covariance matrix has low accuracy.
The embodiment of the invention provides the following specific technical scheme:
in a first aspect, a method for constructing a focus covariance matrix based on a speech signal is provided, which includes:
determining sampling frequency points adopted when a microphone array collects voice signals;
aiming at any one of the determined sampling frequency points, calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signal acquired at the any one sampling frequency point, and taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as the focusing covariance matrix of the voice signal acquired at the any one sampling frequency point;
and taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
With reference to the first aspect, in a first possible implementation manner, the calculating the first covariance matrix specifically includes:
calculating the first covariance matrix as follows:

$\hat{R}(k) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k)\,X_i^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, k represents the arbitrary sampling frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k)$ represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the k-th sampling frequency point, $X_i^H(k)$ represents the conjugate transpose matrix of $X_i(k)$, and N represents the number of sampling frequency points included in any one frame, the number of sampling frequency points included in any two different frames being the same.
With reference to the first aspect and the first possible implementation manner of the first aspect, in a second possible implementation manner, before the calculating the focus transformation matrix, the method further includes:
determining a focusing frequency point of sampling frequency points adopted when the microphone array collects voice signals;
calculating a second covariance matrix of the voice signals collected by the microphone array at the focusing frequency point;
calculating the focus transformation matrix specifically comprises:
decomposing an eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transformation matrix.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the calculating the second covariance matrix specifically includes:
calculating the second covariance matrix as follows:

$\hat{R}(k_0) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k_0)\,X_i^H(k_0)$

wherein $\hat{R}(k_0)$ represents the second covariance matrix, $k_0$ represents the focusing frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k_0)$ represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and $X_i^H(k_0)$ represents the conjugate transpose matrix of $X_i(k_0)$.
With reference to the second or third possible implementation manner of the first aspect, in a fourth possible implementation manner, decomposing the eigenvalue of the first covariance matrix specifically includes:
decomposing eigenvalues for the first covariance matrix in the following manner:

$\hat{R}(k) = U(k)\,\Lambda\,U^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, U(k) represents the first eigenvector matrix of $\hat{R}(k)$, $\Lambda$ represents the diagonal matrix formed by the eigenvalues of $\hat{R}(k)$ arranged from large to small, and $U^H(k)$ represents the conjugate transpose matrix of U(k).
With reference to the second to fourth possible implementation manners of the first aspect, in a fifth possible implementation manner, decomposing the eigenvalue of the second covariance matrix specifically includes:
decomposing eigenvalues for the second covariance matrix in the following manner:

$\hat{R}(k_0) = U(k_0)\,\Lambda_0\,U^H(k_0)$

wherein $\hat{R}(k_0)$ represents the second covariance matrix, $U(k_0)$ represents the second eigenvector matrix of $\hat{R}(k_0)$, $\Lambda_0$ represents the diagonal matrix formed by the eigenvalues of $\hat{R}(k_0)$ arranged from large to small, and $U^H(k_0)$ represents the conjugate transpose matrix of $U(k_0)$.
With reference to the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner, $X_i(k)$ is in the following form:

$X_i(k) = [X_{i1}(k), X_{i2}(k), \ldots, X_{iL}(k)]^T,\ i = 0, 1, 2, \ldots, P-1$

wherein $X_{i1}(k)$ represents the DFT value of the 1st array element of the microphone array at the i-th frame and the k-th sampling frequency point, $X_{i2}(k)$ represents the DFT value of the 2nd array element of the microphone array at the i-th frame and the k-th sampling frequency point, ..., $X_{iL}(k)$ represents the DFT value of the L-th array element of the microphone array at the i-th frame and the k-th sampling frequency point, and L is the number of array elements included in the microphone array.
In a second aspect, an apparatus for constructing a focus covariance matrix based on a speech signal is provided, including:
the determining unit is used for determining sampling frequency points adopted when the microphone array collects voice signals;
the first calculation unit is used for, for any one of the determined sampling frequency points, calculating a first covariance matrix of the voice signal acquired at that sampling frequency point, a focus transformation matrix and the conjugate transpose matrix of the focus transformation matrix, and taking the product of the first covariance matrix, the focus transformation matrix and the conjugate transpose matrix of the focus transformation matrix as the focus covariance matrix of the voice signal acquired at that sampling frequency point;
and the second calculation unit is used for taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
With reference to the second aspect, in a first possible implementation manner, when the first calculating unit calculates the first covariance matrix, specifically:
calculating the first covariance matrix as follows:

$\hat{R}(k) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k)\,X_i^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, k represents the arbitrary sampling frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k)$ represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the k-th sampling frequency point, $X_i^H(k)$ represents the conjugate transpose matrix of $X_i(k)$, and N represents the number of sampling frequency points included in any one frame, the number of sampling frequency points included in any two different frames being the same.
With reference to the second aspect and the first possible implementation manner of the second aspect, in a second possible implementation manner, the determining unit is further configured to determine a focusing frequency point of a sampling frequency point adopted when the microphone array collects a voice signal;
the first calculation unit is further configured to calculate a second covariance matrix of the voice signals collected by the microphone array at the focused frequency point;
when the first calculating unit calculates the focus transformation matrix, the method specifically includes:
decomposing an eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transformation matrix.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, when the first calculating unit calculates the second covariance matrix, specifically:
calculating the second covariance matrix as follows:

$\hat{R}(k_0) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k_0)\,X_i^H(k_0)$

wherein $\hat{R}(k_0)$ represents the second covariance matrix, $k_0$ represents the focusing frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k_0)$ represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and $X_i^H(k_0)$ represents the conjugate transpose matrix of $X_i(k_0)$.
With reference to the second or third possible implementation manner of the second aspect, in a fourth possible implementation manner, when the first calculating unit decomposes the eigenvalue of the first covariance matrix, specifically:
decomposing eigenvalues for the first covariance matrix in the following manner:

$\hat{R}(k) = U(k)\,\Lambda\,U^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, U(k) represents the first eigenvector matrix of $\hat{R}(k)$, $\Lambda$ represents the diagonal matrix formed by the eigenvalues of $\hat{R}(k)$ arranged from large to small, and $U^H(k)$ represents the conjugate transpose matrix of U(k).
With reference to the second to fourth possible implementation manners of the second aspect, in a fifth possible implementation manner, when the first calculating unit decomposes the eigenvalue of the second covariance matrix, specifically:
decomposing eigenvalues for the second covariance matrix in the following manner:

$\hat{R}(k_0) = U(k_0)\,\Lambda_0\,U^H(k_0)$

wherein $\hat{R}(k_0)$ represents the second covariance matrix, $U(k_0)$ represents the second eigenvector matrix of $\hat{R}(k_0)$, $\Lambda_0$ represents the diagonal matrix formed by the eigenvalues of $\hat{R}(k_0)$ arranged from large to small, and $U^H(k_0)$ represents the conjugate transpose matrix of $U(k_0)$.
With reference to the first to fifth possible implementation manners of the second aspect, in a sixth possible implementation manner, $X_i(k)$ is in the following form:

$X_i(k) = [X_{i1}(k), X_{i2}(k), \ldots, X_{iL}(k)]^T,\ i = 0, 1, 2, \ldots, P-1$

wherein $X_{i1}(k)$ represents the DFT value of the 1st array element of the microphone array at the i-th frame and the k-th sampling frequency point, $X_{i2}(k)$ represents the DFT value of the 2nd array element of the microphone array at the i-th frame and the k-th sampling frequency point, ..., $X_{iL}(k)$ represents the DFT value of the L-th array element of the microphone array at the i-th frame and the k-th sampling frequency point, and L is the number of array elements included in the microphone array.
The invention has the following beneficial effects:
the main idea of constructing the focus covariance matrix based on the voice signal provided by the embodiment of the invention is as follows: determining sampling frequency points adopted when a microphone array collects voice signals; aiming at any one of the determined sampling frequency points, calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signals acquired at any one of the sampling frequency points, and taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as a focusing covariance matrix of the voice signals acquired at any one of the sampling frequency points; according to the scheme, the sum of the focus covariance matrixes of the voice signals acquired at each sampling frequency point is used as the focus covariance matrix of the voice signals, and in the scheme, when the focus covariance matrix is constructed, the incident angle of a sound source does not need to be predicted, and an error exists when the incident angle of the sound source is predicted, so that the accuracy of the constructed focus covariance matrix is improved.
Drawings
FIG. 1A is a flow chart of a method for constructing a focus covariance matrix based on a speech signal according to an embodiment of the invention;
FIG. 1B is a diagram illustrating frame shifting according to an embodiment of the present invention;
FIG. 1C is a schematic diagram of a comparison of the CSM-GDE calculated number of sound sources with the number of sound sources calculated according to an embodiment of the present invention;
FIG. 1D is a schematic diagram of another comparison of the CSM-GDE computed number of sound sources with the computed number of sound sources provided by an embodiment of the present invention;
FIG. 2 is an embodiment of constructing a focus covariance matrix based on a speech signal in accordance with an embodiment of the present invention;
FIG. 3A is a schematic structural diagram of an apparatus for constructing a focus covariance matrix based on a speech signal according to an embodiment of the present invention;
FIG. 3B is a schematic structural diagram of another apparatus for constructing a focus covariance matrix based on a speech signal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The term "and/or" herein merely describes an association between associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are merely for illustrating and explaining the present invention, and are not intended to limit the present invention, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1A, in the embodiment of the present invention, a process of constructing a focus covariance matrix based on a speech signal is as follows:
step 100: determining sampling frequency points adopted when a microphone array collects voice signals;
step 110: aiming at any one of the determined sampling frequency points, calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of a focusing transformation matrix of the voice signal acquired at any one of the sampling frequency points, and taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as a focusing covariance matrix of the voice signal acquired at any one of the sampling frequency points;
step 120: and taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
In the embodiment of the present invention, in order to improve the accuracy of the constructed focus covariance matrix, after acquiring a speech signal acquired by a microphone array at any sampling frequency point, before calculating a first covariance matrix, a focus transformation matrix, and a conjugate transpose matrix of the focus transformation matrix of the speech signal acquired at any sampling frequency point, the following operations are further included:
pre-emphasis processing is carried out on the collected voice signals;
at this time, the first covariance matrix, the focus transformation matrix, and the conjugate transpose matrix of the focus transformation matrix of the speech signal collected at any sampling frequency point are calculated, optionally, the following method may be adopted:
pre-emphasis processing is carried out on the voice signals collected at any sampling frequency point;
and calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signal after pre-emphasis processing.
In the embodiment of the present invention, optionally, the speech signal may be pre-emphasized in the following manner:

$y(k) = x(k) - a\,x(k-1),\ k = 1, 2, \ldots, N-1$

wherein y(k) is the pre-emphasized speech signal at the k-th sampling frequency point, x(k) is the speech signal acquired at the k-th sampling frequency point, x(k-1) is the speech signal acquired at the (k-1)-th sampling frequency point, N is the number of sampling frequency points, and a is the pre-emphasis coefficient; optionally, a = 0.9375.
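As a hedged illustration of this step: the patent's own formula image is not reproduced in this text, so the standard first-order pre-emphasis filter implied by the variables listed above (difference with coefficient a = 0.9375) is assumed:

```python
def pre_emphasis(x, a=0.9375):
    """First-order pre-emphasis: y[n] = x[n] - a * x[n-1], with y[0] = x[0]
    (the handling of the first sample is an illustrative assumption)."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]
```

For a constant input of 1.0, every output sample after the first equals 1 − 0.9375 = 0.0625, which shows how the filter attenuates low-frequency content while leaving rapid changes intact.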
Wherein, optionally, $X_i(k)$ is in the form shown in formula two:

$X_i(k) = [X_{i1}(k), X_{i2}(k), \ldots, X_{iL}(k)]^T,\ i = 0, 1, 2, \ldots, P-1$ (formula two)

wherein $X_{i1}(k)$ represents the DFT value of the 1st array element of the microphone array at the i-th frame and the k-th sampling frequency point, $X_{i2}(k)$ represents the DFT value of the 2nd array element of the microphone array at the i-th frame and the k-th sampling frequency point, ..., $X_{iL}(k)$ represents the DFT value of the L-th array element of the microphone array at the i-th frame and the k-th sampling frequency point, L is the number of array elements included in the microphone array, and P represents the number of frames of the speech signal collected by the microphone array.
In the embodiment of the present invention, in order to improve the accuracy of the constructed focus covariance matrix, after acquiring a voice signal acquired by a microphone array at any sampling frequency point, before calculating a first covariance matrix, a focus transformation matrix, and a conjugate transpose matrix of the focus transformation matrix of the voice signal acquired at any sampling frequency point, the following operations are further included:
performing framing processing on the collected voice signals;
when the first covariance matrix, the focus transformation matrix, and the conjugate transpose matrix of the focus transformation matrix of the speech signal acquired at any sampling frequency point are calculated, optionally, the following method may be adopted:
performing frame processing on the voice signals collected at any sampling frequency point;
and calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signal subjected to framing processing.
In the embodiment of the present invention, framing is performed in an overlapping manner, that is, adjacent frames overlap; the overlapped portion is called the frame shift. Optionally, the frame shift is selected to be half of the frame length. The overlapping of frames is shown in fig. 1B.
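A minimal sketch of this overlapped framing, assuming (as stated above) a frame shift of half the frame length:

```python
def frame_signal(x, frame_len):
    """Split x into overlapping frames with a frame shift of half the
    frame length, so adjacent frames share frame_len // 2 samples."""
    shift = frame_len // 2
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, shift)]
```

For example, an 8-sample signal with a frame length of 4 yields three frames starting at samples 0, 2, and 4, each sharing two samples with its neighbor.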
In the embodiment of the present invention, in order to further improve the accuracy of the constructed focus covariance matrix, after the framing processing is performed on the received speech signal, the windowing processing needs to be performed on the speech signal subjected to the framing processing.
When performing windowing on the speech signal subjected to framing processing, the following method can be adopted:

the speech signal after framing is multiplied by a Hamming window function w(n). Optionally, the Hamming window function w(n) is shown in formula three:

$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right),\ n = 0, 1, \ldots, N-1$ (formula three)

wherein k is any sampling frequency point, N represents the number of sampling frequency points included in any one frame, and the number of sampling frequency points included in any two different frames is the same.
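The windowing step can be sketched as below; since the patent's own formula image is not preserved in this text, the common symmetric Hamming definition w(n) = 0.54 − 0.46·cos(2πn/(N−1)) is assumed:

```python
import math

def hamming(N):
    """Symmetric Hamming window of length N (assumed standard form)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def window_frame(frame):
    """Multiply one framed speech segment by the Hamming window."""
    return [s * wn for s, wn in zip(frame, hamming(len(frame)))]
```

The window equals 0.08 at both endpoints and 1.0 at the center, which tapers each frame's edges and reduces spectral leakage in the subsequent DFT.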
In practical applications, some of the speech signals collected by the microphone array may be emitted by a target object and some by a non-target object. For example, in a meeting there is some noise before a speaker speaks, which is a speech signal emitted by a non-target object; once the speaker starts to speak, the speech signals collected by the microphone array are emitted by the target object. The accuracy of a focusing covariance matrix constructed from the speech signals emitted by the target object is higher. Therefore, in the embodiment of the present invention, after the speech signals collected by the microphone array are obtained, and before calculating the first covariance matrix, the focusing transformation matrix, and the conjugate transpose matrix of the focusing transformation matrix of the speech signal collected at any sampling frequency point, the following operations are further included:
calculating the energy value of the voice signal collected at any sampling frequency point and any frame;
determining a frame where the voice signal with the corresponding energy value reaching a preset energy threshold value is located;
when the first covariance matrix, the focus transformation matrix, and the conjugate transpose matrix of the focus transformation matrix of the speech signal acquired at any sampling frequency point are calculated, optionally, the following method may be adopted:
and calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signals collected at any sampling frequency point and the determined frame.
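A minimal sketch of this energy gate; the threshold value and the energy definition (sum of squared samples per frame) are illustrative assumptions:

```python
def select_frames(frames, threshold):
    """Return indices of frames whose energy (sum of squared samples)
    reaches the preset energy threshold - frames below it are treated
    as non-target noise and excluded from covariance estimation."""
    return [i for i, f in enumerate(frames)
            if sum(s * s for s in f) >= threshold]
```

Only the frames whose indices are returned would then enter the first-covariance-matrix calculation described above.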
In the embodiment of the present invention, there are various ways to calculate the first covariance matrix, and optionally, the following ways may be adopted:
the first covariance matrix is calculated as follows:

$\hat{R}(k) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k)\,X_i^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, k represents any sampling frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k)$ represents the DFT (Discrete Fourier Transform) value of the microphone array at the i-th frame and the k-th sampling frequency point, $X_i^H(k)$ represents the conjugate transpose matrix of $X_i(k)$, and N represents the number of sampling frequency points included in any one frame, the number of sampling frequency points included in any two different frames being the same.
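The per-bin covariance average can be sketched as follows, with X_k holding the P per-frame DFT vectors at one sampling frequency point (the array layout is an illustrative assumption):

```python
import numpy as np

def covariance_at_bin(X_k):
    """R(k) = (1/P) * sum_i X_i(k) X_i(k)^H, where X_k is a (P, L)
    complex array: one row per frame, one column per microphone."""
    P = X_k.shape[0]
    return sum(np.outer(X_k[i], X_k[i].conj()) for i in range(P)) / P
```

For two frames [1, j] and [1, −j], the cross terms cancel and the average is the identity matrix, which is a convenient sanity check.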
In the embodiment of the present invention, before calculating the focus transformation matrix, the following operations are further included:
determining a focusing frequency point of sampling frequency points adopted when a microphone array collects voice signals;
calculating a second covariance matrix of the voice signals collected by the microphone array at the focusing frequency point;
at this time, when calculating the focus transformation matrix, optionally, the following manner may be adopted:
decomposing the eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as a focusing transformation matrix.
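A sketch of this construction using NumPy's Hermitian eigendecomposition. Note that the translated claim leaves the multiplication order ambiguous, so the order T(k) = U(k0)·U(k)^H is assumed here, which maps the eigenvector basis of bin k onto that of the focusing frequency point:

```python
import numpy as np

def focusing_matrix(R_k, R_k0):
    """Build the focusing transformation from the first covariance matrix
    R_k (at bin k) and the second covariance matrix R_k0 (at the focusing
    frequency point). Eigenvectors are ordered by descending eigenvalue,
    as the patent specifies for the diagonal matrices Lambda, Lambda_0."""
    def eigvecs_desc(R):
        vals, vecs = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
        return vecs[:, ::-1]            # reorder columns to descending
    U_k = eigvecs_desc(R_k)
    U_k0 = eigvecs_desc(R_k0)
    return U_k0 @ U_k.conj().T          # assumed order: U(k0) U^H(k)
```

When the two covariance matrices coincide, the transformation reduces to the identity, and in general it is unitary, since both eigenvector matrices are.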
In the embodiment of the present invention, when calculating the second covariance matrix, optionally, the following manner may be adopted:
the second covariance matrix is calculated as follows:

$\hat{R}(k_0) = \frac{1}{P}\sum_{i=0}^{P-1} X_i(k_0)\,X_i^H(k_0)$

wherein $\hat{R}(k_0)$ represents the second covariance matrix, $k_0$ represents the focusing frequency point, P represents the number of frames of the speech signal collected by the microphone array, $X_i(k_0)$ represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and $X_i^H(k_0)$ represents the conjugate transpose matrix of $X_i(k_0)$.
In the embodiment of the present invention, when decomposing the eigenvalue of the first covariance matrix, optionally, the following manner may be adopted:
decomposing eigenvalues for the first covariance matrix in the following way:

$\hat{R}(k) = U(k)\,\Lambda\,U^H(k)$

wherein $\hat{R}(k)$ represents the first covariance matrix, U(k) represents the first eigenvector matrix of $\hat{R}(k)$, $\Lambda$ represents the diagonal matrix formed by the eigenvalues of $\hat{R}(k)$ arranged in descending order, and $U^H(k)$ represents the conjugate transpose matrix of U(k).
In the embodiment of the present invention, when decomposing the eigenvalue of the second covariance matrix, optionally, the following manner may be adopted:
decomposing the eigenvalues of the second covariance matrix in the following way:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

wherein R̂(k₀) represents the second covariance matrix, U(k₀) represents the second eigenvector matrix of R̂(k₀), Λ₀ represents the diagonal matrix formed by arranging the eigenvalues of R̂(k₀) in descending order, and U^H(k₀) represents the conjugate transpose matrix of U(k₀).
In the embodiment of the invention, optionally, X_i(k) takes the form shown in formula two. In the embodiment of the present invention, after the focusing covariance matrix is obtained by calculation, the number of sound sources may be calculated according to the obtained focusing covariance matrix; when doing so, optionally, the following manner may be adopted:
calculating the number of sound sources from the obtained focusing covariance matrix by using the Gerschgorin disk criterion. For example, consider an indoor environment with a room size of 10m × 10m × 3m, whose vertex coordinates include (0,0,0), (0,10,2.5), (0,0,2.5), (10,0,0), (10,10,2.5) and (10,0,2.5). A uniform linear array of 10 microphones is placed between the points (2,4,1.3) and (2,4.9,1.3) with an element spacing of 0.1m; the array elements are isotropic omnidirectional microphones. The positions of the 6 speakers are (8,1,1.3), (8,2.6,1.3), (8,4.2,1.3), (8,5.8,1.3), (8,7.4,1.3) and (8,9,1.3), and the background noise is assumed to be white Gaussian noise. The microphone array and the speakers' voices are processed with the Image simulation model, the voice signals are sampled at a sampling frequency of 8 kHz, and the microphone-array received signals are acquired. The folding-resampling coefficient γ is 0.8 and the number of iterations is 20. The speakers' voice signals are long enough that different data can be taken for each of 50 trials, and the detection probability is:
detection probability = (number of trials in which the number of sound sources is correctly detected) / 50 (formula eight)
If the actual number of speakers is 2, any frame includes 128 sampling frequency points, the number of frames is 100, the parameter d(k) in the Gerschgorin disk criterion is 0.7, and the signal-to-noise ratio varies from −5 dB to 5 dB in steps of 1 dB, then fig. 1C compares how the detection probability varies with the signal-to-noise ratio for the method of constructing the focused covariance matrix provided by the embodiment of the present invention and for the existing CSM (Coherent Signal Subspace Method)-GDE (Gerschgorin Disk Estimator) method. As can be seen from fig. 1C, the CSM-GDE method reaches a detection probability of 0.9 at a signal-to-noise ratio of 0 dB and a detection probability of 1 at 4 dB. Compared with the CSM-GDE method, the scheme provided by the invention greatly improves the probability of correct detection when the signal-to-noise ratio is below 0 dB: its detection probability already reaches 0.9 at a signal-to-noise ratio of −3 dB, and the probability of correct detection reaches 1 as the signal-to-noise ratio increases further.
If the number of speakers is 2, the signal-to-noise ratio is 10 dB, any frame includes 128 sampling frequency points, and the number of frames varies from 5 to 70 in steps of 5, then fig. 1D compares how the detection probability varies with the number of frames for the focused covariance matrix constructed by the method according to the embodiment of the present invention and for the existing CSM-GDE method. As shown in fig. 1D, the CSM-GDE method reaches a detection probability of 0.9 at 40 frames and 1 at 65 frames. When the number of frames is less than 50, the scheme of the present invention greatly improves the detection probability compared with the CSM-GDE method: its detection probability reaches 0.9 at 25 frames and 1 at 50 frames.
Table 1 compares the performance of the method of the present invention, which calculates the number of sound sources by constructing the focused covariance matrix, with the CSM-GDE method for different numbers of speakers. In this experiment, the signal-to-noise ratio was 10 dB, the frame length was 128 points, and the number of frames was 100, while the actual number of speakers varied from 2 to 6. As can be seen from Table 1, when the actual number of speakers is 2 or 3, the detection probabilities of both methods reach 1; when the actual number of speakers exceeds 3, the detection probabilities of both methods gradually decrease as the number of speakers increases, and for the same number of speakers the method of the present invention achieves a higher detection probability than the CSM-GDE method.
TABLE 1 variation of detection probability with actual speaker number
Actual speaker count | 2 | 3 | 4 | 5 | 6 |
CSM-GDE | 1 | 1 | 0.94 | 0.84 | 0.66 |
Scheme of the invention | 1 | 1 | 0.98 | 0.90 | 0.72 |
In the embodiment of the present invention, calculating the number of sound sources from the obtained focusing covariance matrix by using the Gerschgorin disk criterion is a method commonly used in the art, and a detailed description is omitted here.
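Since the Gerschgorin disk criterion is treated as known in the art, the following is only a sketch of one textbook form of the estimator, not the patent's exact procedure: the focused covariance matrix is partitioned, a unitary transform is built from the eigenvectors of the leading block, and the Gerschgorin radii of the transformed matrix are compared against a threshold controlled by the parameter d (playing the role of d(k) above).

```python
import numpy as np

def gde_source_count(R, d=0.7):
    """Estimate the number of sources from an L x L focused covariance
    matrix R with a textbook Gerschgorin disk estimator (GDE)."""
    L = R.shape[0]
    R1 = R[:-1, :-1]                  # leading (L-1) x (L-1) block
    r = R[:-1, -1]                    # cross-covariance column to the last sensor
    _, E = np.linalg.eigh(R1)         # eigenvectors of the leading block
    E = E[:, ::-1]                    # descending eigenvalue order
    radii = np.abs(E.conj().T @ r)    # Gerschgorin radii after the transform
    threshold = d * radii.sum() / (L - 1)
    for n in range(L - 1):
        if radii[n] < threshold:      # first disk that shrinks below the threshold
            return n                  # disks 0..n-1 are attributed to sources
    return L - 1
```

Disks tied to sources stay large after the transform, while noise disks collapse toward zero, so the first radius below the threshold marks the source count.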
For a better understanding of the embodiment of the present invention, a specific application scenario is provided below to further detail the process of constructing a focused covariance matrix based on a speech signal, as shown in fig. 2:
step 200: determining that 100 sampling frequency points are adopted when the microphone array collects the voice signals: sampling frequency point 0, sampling frequency point 1, sampling frequency point 2, … , sampling frequency point 99;
step 210: calculating a first covariance matrix for the sampling frequency point 0 aiming at the sampling frequency point 0;
step 220: determining a focusing frequency point of 100 sampling frequency points;
step 230: calculating a second covariance matrix of the voice signals collected by the microphone array at the focusing frequency point;
step 240: decomposing the eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
step 250: decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
step 260: taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as a focusing transformation matrix, and performing conjugate transpose on the focusing transformation matrix to obtain a conjugate transpose matrix of the focusing transformation matrix;
step 270: taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as the focusing covariance matrix of the voice signals collected at the sampling frequency point 0;
step 280: and calculating the focus covariance matrixes of other sampling frequency points in a mode of calculating the focus covariance matrix aiming at the sampling frequency point 0, and taking the sum of the focus covariance matrixes aiming at each sampling frequency point as the focus covariance matrix of the voice signals collected by the microphone array.
Based on the above technical solution of the corresponding method, referring to fig. 3A, an embodiment of the present invention provides an apparatus for constructing a focus covariance matrix based on a speech signal, the apparatus includes a determining unit 30, a first calculating unit 31, and a second calculating unit 32, wherein:
the determining unit 30 is configured to determine sampling frequency points adopted when the microphone array collects voice signals;
the first calculating unit 31 is configured to calculate, for any one of the determined sampling frequency points, a first covariance matrix, a focus transform matrix, and a conjugate transpose matrix of the focus transform matrix of the voice signal acquired at the any one sampling frequency point, and take a product of the first covariance matrix, the focus transform matrix, and the conjugate transpose matrix of the focus transform matrix as a focus covariance matrix of the voice signal acquired at the any one sampling frequency point;
and the second calculating unit 32 is configured to use the calculated sum of the focus covariance matrices of the speech signals acquired at each sampling frequency point as the focus covariance matrix of the speech signals acquired by the microphone array.
Optionally, when the first calculating unit 31 calculates the first covariance matrix, specifically:
the first covariance matrix is calculated as follows:

R̂(k) = (1/P)·∑_{i=0}^{P−1} X_i(k)X_i^H(k), k = 0, 1, …, N−1

wherein R̂(k) represents the first covariance matrix, k represents any sampling frequency point, P represents the number of frames of the voice signal collected by the microphone array, X_i(k) represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the k-th sampling frequency point, X_i^H(k) represents the conjugate transpose matrix of X_i(k), and N represents the number of sampling frequency points included in any frame; any two different frames include the same number of sampling frequency points.
Further, the determining unit 30 is further configured to determine a focusing frequency point of a sampling frequency point adopted when the microphone array collects the voice signal;
the first calculating unit 31 is further configured to calculate a second covariance matrix of the speech signals collected by the microphone array at the focused frequency point;
when the first calculating unit 31 calculates the focus transformation matrix, it specifically includes:
decomposing the eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as a focusing transformation matrix.
Optionally, when the first calculating unit 31 calculates the second covariance matrix, specifically:
the second covariance matrix is calculated as follows:

R̂(k₀) = (1/P)·∑_{i=0}^{P−1} X_i(k₀)X_i^H(k₀)

wherein R̂(k₀) represents the second covariance matrix, k₀ represents the focusing frequency point, P represents the number of frames of the voice signal collected by the microphone array, X_i(k₀) represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and X_i^H(k₀) represents the conjugate transpose matrix of X_i(k₀).
Optionally, when the first calculating unit 31 decomposes the eigenvalue of the first covariance matrix, specifically:
decomposing eigenvalues for the first covariance matrix in the following way:

R̂(k) = U(k) Λ U^H(k)

wherein R̂(k) represents the first covariance matrix, U(k) represents the first eigenvector matrix of R̂(k), Λ represents the diagonal matrix formed by arranging the eigenvalues of R̂(k) in descending order, and U^H(k) denotes the conjugate transpose matrix of U(k).
Optionally, when the first calculating unit 31 decomposes the eigenvalue of the second covariance matrix, specifically:
decomposing the eigenvalues of the second covariance matrix in the following way:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

wherein R̂(k₀) represents the second covariance matrix, U(k₀) represents the second eigenvector matrix of R̂(k₀), Λ₀ represents the diagonal matrix formed by arranging the eigenvalues of R̂(k₀) in descending order, and U^H(k₀) represents the conjugate transpose matrix of U(k₀).
Optionally, Xi(k) The form is as follows:
Xi(k)=[Xi1(k),Xi2(k),......,XiL(k)]T,i=0,1,2,......,P-1
wherein: xi1(k) The DFT value and X of the 1 st array element of the microphone array at the ith frame and the kth sampling frequency point are representedi2(k) Indication wheatDFT value … …, X of 2 nd array element of wind array at ith frame and kth sampling frequency pointiL(k) And the DFT value of the Lth array element of the microphone array in the ith frame and the kth sampling frequency point is represented, and L is the number of the array elements included by the microphone array.
As shown in fig. 3B, another schematic structural diagram of an apparatus for constructing a focus covariance matrix based on a speech signal according to an embodiment of the present invention includes at least one processor 301, a communication bus 302, a memory 303, and at least one communication interface 304.
The communication bus 302 is used for realizing connection and communication among the above components, and the communication interface 304 is used for connecting and communicating with an external device.
The memory 303 is used for storing executable program codes, and the processor 301 executes the program codes to:
determining sampling frequency points adopted when a microphone array collects voice signals;
aiming at any one of the determined sampling frequency points, calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of a focusing transformation matrix of the voice signal acquired at any one of the sampling frequency points, and taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as a focusing covariance matrix of the voice signal acquired at any one of the sampling frequency points;
and taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
Optionally, when the processor 301 calculates the first covariance matrix, specifically:
the first covariance matrix is calculated as follows:

R̂(k) = (1/P)·∑_{i=0}^{P−1} X_i(k)X_i^H(k), k = 0, 1, …, N−1

wherein R̂(k) represents the first covariance matrix, k represents any sampling frequency point, P represents the number of frames of the voice signal collected by the microphone array, X_i(k) represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the k-th sampling frequency point, X_i^H(k) represents the conjugate transpose matrix of X_i(k), and N represents the number of sampling frequency points included in any frame; any two different frames include the same number of sampling frequency points.
Further, before the processor 301 calculates the focus transformation matrix, the method further includes:
determining a focusing frequency point of sampling frequency points adopted when a microphone array collects voice signals;
calculating a second covariance matrix of the voice signals collected by the microphone array at the focusing frequency point;
calculating a focus transformation matrix, specifically comprising:
decomposing the eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as a focusing transformation matrix.
Optionally, when the processor 301 calculates the second covariance matrix, specifically:
the second covariance matrix is calculated as follows:

R̂(k₀) = (1/P)·∑_{i=0}^{P−1} X_i(k₀)X_i^H(k₀)

wherein R̂(k₀) represents the second covariance matrix, k₀ represents the focusing frequency point, P represents the number of frames of the voice signal collected by the microphone array, X_i(k₀) represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and X_i^H(k₀) represents the conjugate transpose matrix of X_i(k₀).
Optionally, when the processor 301 decomposes the eigenvalue of the first covariance matrix, specifically:
decomposing eigenvalues for the first covariance matrix in the following way:

R̂(k) = U(k) Λ U^H(k)

wherein R̂(k) represents the first covariance matrix, U(k) represents the first eigenvector matrix of R̂(k), Λ represents the diagonal matrix formed by arranging the eigenvalues of R̂(k) in descending order, and U^H(k) denotes the conjugate transpose matrix of U(k).
Optionally, when the processor 301 decomposes the eigenvalue of the second covariance matrix, specifically:
decomposing the eigenvalues of the second covariance matrix in the following way:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

wherein R̂(k₀) represents the second covariance matrix, U(k₀) represents the second eigenvector matrix of R̂(k₀), Λ₀ represents the diagonal matrix formed by arranging the eigenvalues of R̂(k₀) in descending order, and U^H(k₀) represents the conjugate transpose matrix of U(k₀).
In the embodiment of the invention, optionally, Xi(k) The form is as follows:
Xi(k)=[Xi1(k),Xi2(k),......,XiL(k)]T,i=0,1,2,......,P-1
wherein: xi1(k) The DFT value and X of the 1 st array element of the microphone array at the ith frame and the kth sampling frequency point are representedi2(k) DFT values … …, X of 2 nd array element of microphone array at ith frame and kth sampling frequency pointiL(k) Representing the Lth array element of the microphone array at the ith frame and the kth sampling frequency pointThe DFT value, L, is the number of array elements included in the microphone array.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.
Claims (14)
1. A method for constructing a focus covariance matrix based on a speech signal, comprising:
determining sampling frequency points adopted when a microphone array collects voice signals;
aiming at any one of the determined sampling frequency points, calculating a first covariance matrix, a focusing transformation matrix and a conjugate transpose matrix of the focusing transformation matrix of the voice signal acquired at the any one sampling frequency point, and taking the product of the first covariance matrix, the focusing transformation matrix and the conjugate transpose matrix of the focusing transformation matrix as the focusing covariance matrix of the voice signal acquired at the any one sampling frequency point;
and taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
2. The method of claim 1, wherein computing the first covariance matrix specifically comprises:
calculating the first covariance matrix as follows:

R̂(k) = (1/P)·∑_{i=0}^{P−1} X_i(k)X_i^H(k), k = 0, 1, …, N−1

wherein the R̂(k) represents the first covariance matrix, the k represents the arbitrary sampling frequency point, the P represents the number of frames of the microphone array collecting the speech signal, the X_i(k) represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the arbitrary sampling frequency point, the X_i^H(k) represents the conjugate transpose matrix of the X_i(k), the N represents the number of sampling frequency points included in any one frame, and the number of sampling frequency points included in any two different frames is the same.
3. The method of claim 1 or 2, wherein prior to computing the focus transform matrix, further comprising:
determining a focusing frequency point of sampling frequency points adopted when the microphone array collects voice signals;
calculating a second covariance matrix of the voice signals collected by the microphone array at the focusing frequency point;
calculating the focus transformation matrix specifically comprises:
decomposing an eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transformation matrix.
4. The method of claim 3, wherein computing the second covariance matrix specifically comprises:
calculating the second covariance matrix as follows:

R̂(k₀) = (1/P)·∑_{i=0}^{P−1} X_i(k₀)X_i^H(k₀)

wherein the R̂(k₀) represents the second covariance matrix, the k₀ represents the focusing frequency point, the P represents the number of frames of the speech signal collected by the microphone array, the X_i(k₀) represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and the X_i^H(k₀) represents the conjugate transpose matrix of the X_i(k₀).
5. The method of claim 3 or 4, wherein decomposing eigenvalues for the first covariance matrix specifically comprises:
decomposing eigenvalues for the first covariance matrix in the following manner:

R̂(k) = U(k) Λ U^H(k)

wherein the R̂(k) represents the first covariance matrix, the U(k) represents the first eigenvector matrix of the R̂(k), the Λ represents the diagonal matrix formed by arranging the eigenvalues of the R̂(k) in descending order, and the U^H(k) represents the conjugate transpose matrix of the U(k).
6. The method of any one of claims 3-5, wherein decomposing eigenvalues for the second covariance matrix specifically comprises:
decomposing eigenvalues for the second covariance matrix in the following manner:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

wherein the R̂(k₀) represents the second covariance matrix, the U(k₀) represents the second eigenvector matrix of the R̂(k₀), the Λ₀ represents the diagonal matrix formed by arranging the eigenvalues of the R̂(k₀) in descending order, and the U^H(k₀) represents the conjugate transpose matrix of the U(k₀).
7. The method of any one of claims 2 to 6, wherein X isi(k) The form is as follows:
Xi(k)=[Xi1(k),Xi2(k),......,XiL(k)]T,i=0,1,2,......,P-1
wherein: xi1(k) The DFT value and X of the 1 st array element of the microphone array at the ith frame and the kth sampling frequency point are representedi2(k) The DFT value and X of the 2 nd array element of the microphone array at the ith frame and the kth sampling frequency point are representediL(k) And the DFT value of the Lth array element of the microphone array at the ith frame and the kth sampling frequency point is represented, and L is the number of array elements included by the microphone array.
8. An apparatus for constructing a focus covariance matrix based on a speech signal, comprising:
the determining unit is used for determining sampling frequency points adopted when the microphone array collects voice signals;
the first calculation unit is used for calculating, for any one of the determined sampling frequency points, a first covariance matrix, a focus transformation matrix and a conjugate transpose matrix of the focus transformation matrix of the voice signal acquired at the any one sampling frequency point, and taking the product of the first covariance matrix, the focus transformation matrix and the conjugate transpose matrix of the focus transformation matrix as a focus covariance matrix of the voice signal acquired at the any one sampling frequency point;
and the second calculation unit is used for taking the sum of the focus covariance matrixes of the voice signals acquired at the sampling frequency points as the focus covariance matrix of the voice signals acquired by the microphone array.
9. The apparatus of claim 8, wherein the first computing unit, when computing the first covariance matrix, is specifically:
calculating the first covariance matrix as follows:

R̂(k) = (1/P)·∑_{i=0}^{P−1} X_i(k)X_i^H(k), k = 0, 1, …, N−1

wherein the R̂(k) represents the first covariance matrix, the k represents the arbitrary sampling frequency point, the P represents the number of frames of the microphone array collecting the speech signal, the X_i(k) represents the Discrete Fourier Transform (DFT) value of the microphone array at the i-th frame and the arbitrary sampling frequency point, the X_i^H(k) represents the conjugate transpose matrix of the X_i(k), the N represents the number of sampling frequency points included in any one frame, and the number of sampling frequency points included in any two different frames is the same.
10. The device according to claim 8 or 9, wherein the determining unit is further configured to determine a focusing frequency point of sampling frequency points adopted when the microphone array collects the voice signals;
the first calculation unit is further configured to calculate a second covariance matrix of the voice signals collected by the microphone array at the focused frequency point;
when the first calculating unit calculates the focus transformation matrix, the method specifically includes:
decomposing an eigenvalue of the first covariance matrix to obtain a first eigenvector matrix, and performing conjugate transpose on the first eigenvector matrix to obtain a conjugate transpose matrix of the first eigenvector matrix;
decomposing the eigenvalue of the second covariance matrix to obtain a second eigenvector matrix;
and taking the product of the conjugate transpose matrix of the first eigenvector matrix and the second eigenvector matrix as the focusing transformation matrix.
11. The apparatus according to claim 10, wherein the first calculating unit, when calculating the second covariance matrix, is specifically:
calculating the second covariance matrix as follows:

R̂(k₀) = (1/P)·∑_{i=0}^{P−1} X_i(k₀)X_i^H(k₀)

wherein the R̂(k₀) represents the second covariance matrix, the k₀ represents the focusing frequency point, the P represents the number of frames of the speech signal collected by the microphone array, the X_i(k₀) represents the DFT value of the microphone array at the i-th frame and the focusing frequency point, and the X_i^H(k₀) represents the conjugate transpose matrix of the X_i(k₀).
12. The apparatus according to claim 10 or 11, wherein the first computing unit, when decomposing eigenvalues for the first covariance matrix, is specifically:
decomposing eigenvalues for the first covariance matrix in the following manner:

R̂(k) = U(k) Λ U^H(k)

wherein the R̂(k) represents the first covariance matrix, the U(k) represents the first eigenvector matrix of the R̂(k), the Λ represents the diagonal matrix formed by arranging the eigenvalues of the R̂(k) in descending order, and the U^H(k) represents the conjugate transpose matrix of the U(k).
13. The apparatus according to any of claims 10-12, wherein the first computing unit, when decomposing eigenvalues for the second covariance matrix, is specifically:
decomposing eigenvalues for the second covariance matrix in the following manner:

R̂(k₀) = U(k₀) Λ₀ U^H(k₀)

wherein the R̂(k₀) represents the second covariance matrix, the U(k₀) represents the second eigenvector matrix of the R̂(k₀), the Λ₀ represents the diagonal matrix formed by arranging the eigenvalues of the R̂(k₀) in descending order, and the U^H(k₀) represents the conjugate transpose matrix of the U(k₀).
14. The apparatus of any one of claims 9-13, wherein X isi(k) The form is as follows:
Xi(k)=[Xi1(k),Xi2(k),......,XiL(k)]T,i=0,1,2,......,P-1
wherein: xi1(k) The DFT value and X of the 1 st array element of the microphone array at the ith frame and the kth sampling frequency point are representedi2(k) The DFT value and X of the 2 nd array element of the microphone array at the ith frame and the kth sampling frequency point are representediL(k) And the DFT value of the Lth array element of the microphone array at the ith frame and the kth sampling frequency point is represented, and L is the number of array elements included by the microphone array.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510052368.7A CN104599679A (en) | 2015-01-30 | 2015-01-30 | Speech signal based focus covariance matrix construction method and device |
PCT/CN2015/082571 WO2016119388A1 (en) | 2015-01-30 | 2015-06-26 | Method and device for constructing focus covariance matrix on the basis of voice signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510052368.7A CN104599679A (en) | 2015-01-30 | 2015-01-30 | Speech signal based focus covariance matrix construction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104599679A true CN104599679A (en) | 2015-05-06 |
Family
ID=53125412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510052368.7A Pending CN104599679A (en) | 2015-01-30 | 2015-01-30 | Speech signal based focus covariance matrix construction method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104599679A (en) |
WO (1) | WO2016119388A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016119388A1 (en) * | 2015-01-30 | 2016-08-04 | 华为技术有限公司 | Method and device for constructing focus covariance matrix on the basis of voice signal |
CN108538306A (en) * | 2017-12-29 | 2018-09-14 | 北京声智科技有限公司 | Method and device for improving DOA estimation of voice equipment |
CN110992977A (en) * | 2019-12-03 | 2020-04-10 | 北京声智科技有限公司 | Method and device for extracting target sound source |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110501727B (en) * | 2019-08-13 | 2023-10-20 | 中国航空工业集团公司西安飞行自动控制研究所 | Satellite navigation anti-interference method based on space-frequency adaptive filtering |
CN111696570B (en) * | 2020-08-17 | 2020-11-24 | 北京声智科技有限公司 | Voice signal processing method, device, equipment and storage medium |
CN113409804B (en) * | 2020-12-22 | 2024-08-09 | 声耕智能科技(西安)研究院有限公司 | Multichannel frequency domain voice enhancement algorithm based on variable expansion into generalized subspace |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040220800A1 (en) * | 2003-05-02 | 2004-11-04 | Samsung Electronics Co., Ltd | Microphone array method and system, and speech recognition method and system using the same |
CN102568493A (en) * | 2012-02-24 | 2012-07-11 | 大连理工大学 | Underdetermined blind source separation (UBSS) method based on maximum matrix diagonal rate |
CN102664666A (en) * | 2012-04-09 | 2012-09-12 | 电子科技大学 | Efficient robust self-adapting beam forming method of broadband |
CN104166120A (en) * | 2014-07-04 | 2014-11-26 | 哈尔滨工程大学 | Acoustic vector circular matrix steady broadband MVDR orientation estimation method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102621527B (en) * | 2012-03-20 | 2014-06-11 | 哈尔滨工程大学 | Broad band coherent source azimuth estimating method based on data reconstruction |
CN104599679A (en) * | 2015-01-30 | 2015-05-06 | 华为技术有限公司 | Speech signal based focus covariance matrix construction method and device |
- 2015-01-30: CN application CN201510052368.7A (publication CN104599679A/en), status: active, Pending
- 2015-06-26: WO application PCT/CN2015/082571 (publication WO2016119388A1/en), status: active, Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016119388A1 (en) * | 2015-01-30 | 2016-08-04 | 华为技术有限公司 | Method and device for constructing focus covariance matrix on the basis of voice signal |
CN108538306A (en) * | 2017-12-29 | 2018-09-14 | 北京声智科技有限公司 | Method and device for improving DOA estimation of voice equipment |
CN108538306B (en) * | 2017-12-29 | 2020-05-26 | 北京声智科技有限公司 | Method and device for improving DOA estimation of voice equipment |
CN110992977A (en) * | 2019-12-03 | 2020-04-10 | 北京声智科技有限公司 | Method and device for extracting target sound source |
CN110992977B (en) * | 2019-12-03 | 2021-06-22 | 北京声智科技有限公司 | Method and device for extracting target sound source |
Also Published As
Publication number | Publication date |
---|---|
WO2016119388A1 (en) | 2016-08-04 |
Similar Documents
Publication | Title
---|---
US10602267B2 (en) | Sound signal processing apparatus and method for enhancing a sound signal
JP7011075B2 (en) | Target voice acquisition method and device based on microphone array
CN104599679A (en) | Speech signal based focus covariance matrix construction method and device
US8223988B2 (en) | Enhanced blind source separation algorithm for highly correlated mixtures
CN108417224B (en) | Training and recognition method and system of bidirectional neural network model
CN106558315B (en) | Heterogeneous microphone automatic gain calibration method and system
US10818302B2 (en) | Audio source separation
US9378754B1 (en) | Adaptive spatial classifier for multi-microphone systems
CN110610718B (en) | Method and device for extracting expected sound source voice signal
US10839820B2 (en) | Voice processing method, apparatus, device and storage medium
Zhang et al. | Multi-channel multi-frame ADL-MVDR for target speech separation
Niwa et al. | Post-filter design for speech enhancement in various noisy environments
Wang et al. | Recurrent deep stacking networks for supervised speech separation
CN110164468B (en) | Speech enhancement method and device based on double microphones
CN103180752B (en) | Apparatus and method for resolving ambiguity in direction-of-arrival estimation
CN104898086A (en) | Sound-source orientation method based on sound intensity estimation for a miniature microphone array
CN114242104A (en) | Method, device and equipment for voice noise reduction and storage medium
CN112997249B (en) | Voice processing method, device, storage medium and electronic equipment
Kim et al. | Multi-microphone target signal enhancement using generalized sidelobe canceller controlled by phase error filter
US20060256978A1 (en) | Sparse signal mixing model and application to noisy blind source separation
US20190355374A1 (en) | Method and apparatus for reducing noise of mixed signal
CN117782625A (en) | Vehicle fault acoustic detection method, system, control device and storage medium
CN115831145A (en) | Double-microphone speech enhancement method and system
CN111048096B (en) | Voice signal processing method and device and terminal
Ji et al. | Coherence-Based Dual-Channel Noise Reduction Algorithm in a Complex Noisy Environment
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 2015-05-06