WO2011042808A1 - Signal separation system and signal separation method - Google Patents
Signal separation system and signal separation method
- Publication number
- WO2011042808A1 (PCT/IB2010/002660)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Definitions
- the separated signal Q1(f, t) (an example of the internal noise separated signal), which is obtained by multiplying the transformed signal R1(f, t) containing only the power source noise S3(f, t) by the coefficient W33(f), is generated.
- ICA adaptively learns the separating matrix W(f) so that the separated signal Q1(f, t) is independent of the separated signals Y1(f, t) and Y2(f, t), so the separated signals Y1(f, t) and Y2(f, t) that do not contain the power source noise S3(f, t) are extracted (semi-blind signal separation).
- the separated signals Y1(f, t) and Y2(f, t) each correspond to a component other than the power source noise S3(f, t), that is, either the user voice S1(f, t) or the ambient noise S2(f, t).
- the gain correction unit 330 executes a gain correction process on the separating matrix W(f) at each frequency calculated by the independent component analysis unit 320.
- the permutation solving unit 340 executes a process for solving the permutation problem.
- FIG 3 is a block diagram of the permutation solving unit 340.
- among the separated signals Y1(f, t), Y2(f, t) and Q1(f, t) that are separated by the independent component analysis unit 320, Q1(f, t) is already identified as the power source noise S3(f, t); the remaining separated signals each correspond to a component other than the power source noise, that is, either the user voice S1(f, t) or the ambient noise S2(f, t).
- the objects of permutation solving are therefore the separated signals Y1(f, t) and Y2(f, t).
- the separated signals Y1(f, t) and Y2(f, t) are input to the permutation solving unit 340, and the separated signal Q1(f, t) is directly input to the subsequent inverse discrete Fourier transform unit 350.
- the permutation solving according to the present embodiment utilizes the fact that the probability density distribution of the user voice S1(f, t) has a spikier shape than the probability density distribution of the ambient noise S2(f, t). Furthermore, in order to estimate the spikedness (degree of peakedness) of the probability density distribution, the scale parameter ai(f) of a Laplacian distribution is used. Here, when the scale parameter ai(f) of the Laplacian distribution is estimated, the expected value of the absolute value of the separated signal Y(f, t) is utilized. Hereinafter, the description will be made sequentially.
- the permutation solving unit 340 includes a spikedness calculation unit 341 and a clustering determination unit 342.
- the spikedness calculation unit 341 calculates the spikedness of the probability density distribution (degree of peakedness of the distribution) of each of the separated signals Y1(f, t) and Y2(f, t).
- the scale parameter ai(f) of the Laplacian distribution obtained when the separated signal Yi(f, t) is fitted with a Laplacian distribution is used as the spikedness.
- the scale parameter ai(f) may be calculated through the following mathematical expression (5) using a maximum likelihood method: ai(f) = ⟨|Yi(f, t)|⟩.
- the separated signal Yi(f, t) is a complex spectrum, so |Yi(f, t)| denotes its magnitude, and ⟨·⟩ denotes the average of |Yi(f, t)| over time.
- FIG 4 is a schematic view that shows the flow of calculating the spikednesses (scale parameters ai(f)) from the observed signals x1(t), x2(t) and r1(t).
- a voice signal collected by the first external microphone 111 is the observed signal x1(t),
- a voice signal collected by the second external microphone 112 is the observed signal x2(t), and
- a signal detected by the internal sensor 120 is the observed signal r1(t).
- These observed signals are subjected to a discrete Fourier transform for each frame of a predetermined duration, and the results are the transformed signals X1(f, t), X2(f, t) and R1(f, t).
- the clustering determination unit 342 uses the thus calculated spikednesses (scale parameters ai(fk)) to label the separated signals Y1(fk, t) and Y2(fk, t) and, where necessary, interchanges the separated signals Y1(fk, t) and Y2(fk, t).
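The per-bin labeling and interchange described above can be sketched as follows. This is an illustrative reading, not the patent's implementation: the function name `align_bins`, the array shapes, and the spikedness scores are assumptions made for the example.

```python
import numpy as np

def align_bins(Y, spike):
    """For each frequency bin, put the spikier separated signal first,
    interchanging Y1(f, t) and Y2(f, t) where the labels come out reversed.

    Y:     (n_bins, 2, n_frames) separated spectra
    spike: (n_bins, 2) spikedness score per bin and per output
    """
    out = Y.copy()
    for f in range(Y.shape[0]):
        if spike[f, 1] > spike[f, 0]:   # voice landed in slot 1 at this bin
            out[f] = Y[f, ::-1]         # read from Y, write to out: no aliasing
    return out

# Toy data: 2 bins, 2 outputs, 3 frames.
Y = np.zeros((2, 2, 3))
Y[0, 0], Y[0, 1] = 5.0, 1.0             # bin 0: voice already first
Y[1, 0], Y[1, 1] = 1.0, 5.0             # bin 1: voice in slot 1
spike = np.array([[3.0, 0.1],
                  [0.1, 3.0]])
aligned = align_bins(Y, spike)
assert np.all(aligned[:, 0] == 5.0)     # voice occupies slot 0 in every bin
```

After this alignment, output index 0 refers to the same source at every frequency bin, which is exactly what the inverse discrete Fourier transform stage requires.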
Abstract
A signal separation system includes: an external microphone; an internal sensor that detects only an internal noise from an internal noise source present inside the system; a discrete Fourier transform unit (310) that performs a discrete Fourier transform on signals from the external microphone and the internal sensor; an independent component analysis unit (320) that performs independent component analysis on transformed signals that have been subjected to a discrete Fourier transform so that an internal noise separated signal that contains only the internal noise is extracted on the basis of the transformed signal of the signal detected by the internal sensor, and external separated signals that are independent of the internal noise separated signal and that do not contain the internal noise are extracted; and a permutation solving unit (340) that executes permutation solving on the external separated signals to extract a specific voice.
Description
SIGNAL SEPARATION SYSTEM AND SIGNAL SEPARATION METHOD
BACKGROUND OF THE INVENTION

1. Field of the Invention

[0001] The invention relates to a signal separation system and signal separation method that extract a specific signal in a state where a plurality of signals are mixed in a space and, more particularly, to a technique for solving a permutation problem.

2. Description of the Related Art
[0002] There is known independent component analysis (ICA), which separates and decodes a plurality of original signals on the basis of statistical independence when the plurality of original signals are linearly mixed with unknown coefficients (see Japanese Patent Application Publication No. 2004-145172 (JP-A-2004-145172)).
[0003] Where observed signals that are obtained by observing a plurality of original signals (sound sources) s(t) with a plurality of microphones are x(t), the observed signals x(t) are expressed by the mathematical expression (1).
[0004] In the ICA, signals S(f, t) are estimated through independent component analysis in the frequency domain by using signals X(f, t). The signals X(f, t) are obtained by transforming the observed signals x(t) into signals in the time-frequency domain through a short-time discrete Fourier transform. Here, the signals S(f, t) and X(f, t) are respectively obtained by performing a short-time discrete Fourier transform on the original signals s(t) and the observed signals x(t). The following mathematical expression (2), Y(f, t) = W(f)X(f, t), is considered to estimate the signals S(f, t) in the time-frequency domain. In the mathematical expression (2), Y(f, t) represents a column vector that has the kth output Yk(f, t) as its elements, and W(f) represents an n-by-n matrix (separating matrix) having Wij(f) as its elements.
[0005] Subsequently, the separating matrix W(f) by which the outputs Y1(f, t) to Yn(f, t) become statistically independent of one another (in practice, by which independence becomes maximum) when time t is varied while the frequency bin f is fixed is calculated. After statistically independent outputs Y1(f, t) to Yn(f, t) are obtained for all frequency bins f on the basis of the thus calculated separating matrix W(f), these outputs are subjected to an inverse Fourier transform, which makes it possible to obtain the separated signals y(t) in the time domain.
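As a rough sketch of how expression (2) is applied bin by bin, the following NumPy snippet multiplies each frequency bin's observed spectra by that bin's separating matrix. The function name, array layout, and identity-matrix demo are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def separate_per_bin(X, W):
    """Apply a separating matrix W(f) to the observed STFT X(f, t)
    independently for each frequency bin f.

    X: array (n_bins, n_sources, n_frames) of observed spectra X(f, t)
    W: array (n_bins, n_sources, n_sources) of separating matrices W(f)
    returns Y: array (n_bins, n_sources, n_frames) with Y(f, t) = W(f) X(f, t)
    """
    return np.einsum('fij,fjt->fit', W, X)

# Tiny demo: 2 bins, 2 sources, 4 frames, identity separation leaves X unchanged.
X = np.arange(16, dtype=float).reshape(2, 2, 4)
W = np.stack([np.eye(2), np.eye(2)])
Y = separate_per_bin(X, W)
assert np.allclose(Y, X)
```

In a real system the matrices W(f) would come from the ICA learning step; here they are identities only to keep the demo self-checking.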
[0006] However, in independent component analysis in the time-frequency domain, the signal separating process is performed for each frequency bin, and the relationship among the frequency bins is not considered. Therefore, even when separation of the signals is successful, there is a possibility that inconsistency of the separation destination occurs among the frequency bins. The inconsistency of the separation destination indicates a phenomenon in which, for example, the signal Y1 originates in the signal S1 at frequency bin f = 1, whereas the signal Y1 originates in the signal S2 at frequency bin f = 2; this is called the permutation problem.
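The permutation problem can be made concrete with a toy example (the numbers are hypothetical): per-bin ICA fixes each W(f) independently, so the row order of the recovered sources can differ between bins.

```python
import numpy as np

# Source spectra at some frequency bin: row 0 is source S1, row 1 is S2.
S = np.array([[1.0, 2.0, 3.0],
              [9.0, 8.0, 7.0]])
perm_bin1 = np.array([[1, 0], [0, 1]])  # bin f=1: rows come out in order
perm_bin2 = np.array([[0, 1], [1, 0]])  # bin f=2: rows come out swapped
Y1 = perm_bin1 @ S
Y2 = perm_bin2 @ S
assert np.array_equal(Y1[0], S[0])      # at bin 1, output 0 is source S1
assert np.array_equal(Y2[0], S[1])      # at bin 2, output 0 is source S2
```

Stitching these bins back together without relabeling would mix S1 and S2 into the same time-domain output, which is why a permutation solver is needed.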
[0007] JP-A-2004-145172 describes a method of solving the permutation problem in such a manner that the incoming directions of signals are estimated and then the signals are labeled on the basis of the directional information of each signal. However, actually, not all sound sources are simple sound sources, so it is not always possible to correctly estimate the incoming directions of the signals. For example, in the case of a diffusive noise, the direction of the noise cannot be identified, and, therefore, wrong labeling occurs.
[0008] In addition, WO/2009/113192 and the following non-patent document describe a method in which the joint probability density distribution of each of the separated signals is calculated and then the separated signals are divided into voice and noise on the basis of the shape of the joint probability density distribution. In this method, for example, a signal of which the joint probability density distribution is a non-Gaussian distribution is determined as a specific voice signal, and a signal of which the joint probability density distribution is a Gaussian distribution is determined as a noise signal. According to this method, even a diffusive noise is accurately labeled, so it is possible to determine the separation destination of a signal with high precision. The non-patent document is "An Improved permutation solver for blind signal separation based front-ends in robot audition" (Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano), IEEE/RSJ International Conference on Intelligent Robotics and Systems (IROS2008), Nice, France, pp. 2172-2177, September 2008.
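A minimal sketch of this style of Gaussian-versus-non-Gaussian labeling follows. The cited documents use the shape of the joint probability density distribution; excess kurtosis is used here only as an illustrative proxy for non-Gaussianity, and the generated signals are synthetic stand-ins:

```python
import numpy as np

def excess_kurtosis(y):
    """Excess kurtosis: approximately 0 for Gaussian data, large and
    positive for spiky, voice-like data."""
    y = y - y.mean()
    return (y**4).mean() / (y**2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
noise = rng.normal(size=100_000)    # Gaussian stand-in for diffuse noise
voice = rng.laplace(size=100_000)   # heavy-tailed stand-in for voice
scores = {'noise': excess_kurtosis(noise), 'voice': excess_kurtosis(voice)}
# Label the most non-Gaussian separated signal as the voice.
label = max(scores, key=scores.get)
assert label == 'voice'
```

A Laplacian has excess kurtosis of about 3 while a Gaussian sits near 0, so the heavy-tailed signal is reliably picked out; the patent's point in [0010] below is that a spiky internal noise defeats exactly this kind of test.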
[0009] Here, the following case is assumed as an environment in which a signal separation system is actually used. FIG. 5 is a view that shows a robot 10 having a voice recognition function. The robot 10 includes a microphone array 12 formed of a plurality of microphones 11 and a signal separation device 20 that processes observed signals from the microphone array 12. An ambient noise S2 enters the microphone array 12 together with a user voice S1. Furthermore, the robot itself becomes a noise generating source.
That is, the robot 10 includes a power source 30, such as a motor, so a noise sound S3 from the power source 30 also enters the microphones 11.
[0010] Thus, the observed signals x(t) contain the noise S3 from the power source 30. The signal separation device 20 performs independent component analysis on signals that contain the user voice S1(f, t), the ambient noise S2(f, t) and the power source noise S3(f, t) to calculate statistically independent separated signals Y1(f, t) to Yn(f, t), and then labels the separated signals Y1(f, t) to Yn(f, t). However, if the signal of which the joint probability density distribution is a non-Gaussian distribution is simply determined as a user voice as described above, there is a possibility that wrong labeling occurs. This is because the noise S3 of the power source 30 also has a non-Gaussian joint probability density distribution having a high kurtosis.
[0011] In this way, when the existing method described in WO/2009/113192 and the above non-patent document is applied to an actual environment, there is a possibility that the labeling of a separated signal is wrong. Furthermore, the computation load for calculating a joint probability density distribution is high to begin with; if the shape of the joint probability density distribution of a power source noise must also be calculated in addition to the shapes of the joint probability density distributions of a user voice and an ambient noise, the computation load becomes excessively high.
SUMMARY OF INVENTION
[0012] A first aspect of the invention provides a signal separation system that separates an observed signal in the time domain, which mixedly contains a plurality of signals, into the plurality of signals using independent component analysis, and that extracts a specific voice from the separated signals. The signal separation system includes: an external microphone that is oriented outside of the signal separation system; an internal sensor that detects only an internal noise from an internal noise source present inside the signal separation system; a discrete Fourier transform unit that performs a discrete Fourier transform on signals from the external microphone and the internal sensor; an independent component analysis unit that performs independent component
analysis on transformed signals that have been subjected to a discrete Fourier transform by the discrete Fourier transform unit so that an internal noise separated signal that contains only the internal noise is extracted on the basis of the transformed signal of the signal detected by the internal sensor, and external separated signals that are independent of the internal noise separated signal and that do not contain the internal noise are extracted; and a permutation solving unit that executes permutation solving on the external separated signals to extract the specific voice.
[0013] In the first aspect of the invention, the permutation solving unit may include a spikedness calculation unit that calculates a spikedness, which is a degree of peakedness of probability density distribution of each of the external separated signals; and a clustering unit that labels the external separated signals as the specific voice or an ambient noise on the basis of the spikedness. In the above configuration, the spikedness calculation unit may calculate a scale parameter, as the spikedness, of Laplacian distribution when each of the external separated signals is subjected to fitting with Laplacian distribution.
[0014] In the above configuration, the clustering unit may determine the external separated signal having the largest spikedness as the specific voice.
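The spikedness calculation and labeling rule described in the two paragraphs above can be sketched as follows. This is a hedged reading, assuming (consistent with the maximum-likelihood fitting described later in the document) that the Laplacian scale estimate reduces to the time-averaged magnitude of the separated spectrum; the function name, sample sizes, and scale values are illustrative:

```python
import numpy as np

def laplace_scale(Y_f):
    """Maximum-likelihood scale of a zero-mean Laplacian fit: for
    p(y) = exp(-|y| / a) / (2a), the ML estimate of a is the time
    average of the magnitudes |Y(f, t)|."""
    return np.mean(np.abs(Y_f))

rng = np.random.default_rng(0)
# Two separated signals at one frequency bin (real-valued stand-ins
# for the complex spectra in the document).
Ya = rng.laplace(scale=2.0, size=100_000)   # larger scale: voice-like
Yb = rng.laplace(scale=0.5, size=100_000)   # smaller scale: noise-like
scales = [laplace_scale(Ya), laplace_scale(Yb)]
voice_index = int(np.argmax(scales))        # largest spikedness -> voice
assert voice_index == 0
```

The clustering unit's rule, determining the external separated signal with the largest spikedness as the specific voice, then reduces to a single `argmax` over these per-signal scale estimates.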
[0015] A second aspect of the invention provides a signal separation method. The signal separation method separates an observed signal in the time domain, which mixedly contains a plurality of signals and is observed in a system that includes an external microphone that is oriented outside of the system and an internal sensor that detects only an internal noise from an internal noise source present inside the system, into the plurality of signals using independent component analysis, and extracts a specific voice from the separated signals. The signal separation method includes: performing a discrete Fourier transform on signals from the external microphone and the internal sensor; performing independent component analysis on transformed signals that have been subjected to a discrete Fourier transform so that an internal noise separated signal that contains only the internal noise is extracted on the basis of the transformed signal of the signal detected by the internal sensor and external separated signals that are
independent of the internal noise separated signal and that do not contain the internal noise are extracted; and executing permutation solving on the external separated signals to extract the specific voice.
BRIEF DESCRIPTION OF DRAWINGS
[0016] The features, advantages, and technical and industrial significance of this invention will be described below with reference to the accompanying drawings, in which like numerals denote like elements, and wherein:
FIG 1 is a view that shows a robot equipped with a signal separation device 200 according to an embodiment of the invention;
FIG 2 is a block diagram of the signal separation device 200;
FIG 3 is a block diagram of a permutation solving unit 340;
FIG 4 is a schematic view of the flow of calculating a spikedness (scale parameter ai(f)) from each of observed signals x1(t) and x2(t); and
FIG 5 is a view that shows a robot 10 having a voice recognition function.
DETAILED DESCRIPTION OF EMBODIMENTS
[0017] Embodiments of the invention are illustrated in the drawings, and will be described with reference to the reference numerals assigned to components in the drawings. FIG 1 is a view that shows a robot equipped with a signal separation device according to the first embodiment. The robot 100 includes external microphones 110, an internal sensor 120 and a signal separation device 200.
[0018] The external microphones 110 are each a sound collecting microphone provided on the body surface of the robot 100. Here, for the sake of description, it is assumed that a first external microphone 111 and a second external microphone 112 are provided. At this time, the external microphones 110 receive a voice S1 from a user and a noise S2 from around the external microphones 110. In addition, the external microphones 110 also receive a noise S3 from a power source 30.
[0019] The internal sensor 120 exclusively detects the noise S3 from the power
source 30. The internal sensor 120 detects the noise from the power source 30, but the internal sensor 120 does not detect a sound signal (Si or S2) from the outside. It is desirable that the internal sensor 120 is, for example, arranged at a location in proximity to the external microphones 110, such as the back side of the external microphones 110. Such a sensor that exclusively detects the noise S3 from the power source 30 may be, for example, an acceleration sensor or a microphone having high directivity.
[0020] Note that the number of the external microphones 110 and the number of the internal sensors 120 are not limited, and are increased or reduced where necessary. For example, when a plurality of the external microphones 110 are provided, the internal sensor 120 may be provided in one-to-one correspondence with each external microphone 110.
[0021] Here, the user voice is denoted by S1(f, t), the ambient noise is denoted by S2(f, t) and the power source noise is denoted by S3(f, t). In addition, a signal observed by the first external microphone 111 is denoted by X1(f, t), a signal observed by the second external microphone 112 is denoted by X2(f, t) and a signal observed by the internal sensor 120 is denoted by R1(f, t). At this time, the relationship between the original signals and the observed signals may be expressed by the following mathematical expression (3) using an unknown coefficient matrix A(f): [X1(f, t), X2(f, t), R1(f, t)]^T = A(f) [S1(f, t), S2(f, t), S3(f, t)]^T.
[0022] Here, the first external microphone 111 and the second external microphone 112 receive the user voice S1(f, t), the ambient noise S2(f, t) and the power source noise S3(f, t), so the components (A11(f), A12(f), A13(f), A21(f), A22(f) and A23(f)) of the coefficient matrix A(f) corresponding to the observed signals X1(f, t) and X2(f, t) are not 0. In contrast, the internal sensor 120 does not receive the user voice S1(f, t) or the ambient noise S2(f, t), so the components of the coefficient matrix A(f) corresponding to the observed signal R1(f, t) are 0 except the coefficient A33(f) corresponding to the power source noise S3(f, t).
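The zero pattern of A(f) described above can be illustrated with a short numerical sketch (assuming NumPy; the matrix values here are arbitrary illustrations, not taken from the embodiment): because A31(f) = A32(f) = 0, the internal sensor observation R1 depends on the power source noise S3 alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 mixing matrix A(f) for a single frequency bin.
# Rows: observations X1, X2 (external microphones) and R1 (internal sensor);
# columns: sources S1 (user voice), S2 (ambient noise), S3 (power source noise).
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
A[2, 0] = 0.0  # internal sensor does not pick up the user voice S1
A[2, 1] = 0.0  # ...nor the ambient noise S2; only A33 stays nonzero

S = rng.standard_normal((3, 100)) + 1j * rng.standard_normal((3, 100))
X = A @ S  # observed signals per expression (3): X1, X2 and R1

# R1 is a scaled copy of S3 alone.
assert np.allclose(X[2], A[2, 2] * S[2])
```

It is this structural zero in the last row that the semi-blind separation below exploits.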
[0023] FIG 2 is a block diagram of the signal separation device 200. The signal separation device 200 includes an analog/digital (A/D) conversion unit 210, a noise suppressing unit 300 and a voice recognition unit 220.
[0024] The A/D conversion unit 210 converts respective signals input from the external microphones 110 and the internal sensor 120 into digital signals and then outputs the digital signals to the noise suppressing unit 300.
[0025] The noise suppressing unit 300 executes a process of suppressing the noise contained in the input digital signals. The noise suppressing unit 300 includes a short-time discrete Fourier transform unit 310, an independent component analysis unit 320, a gain correction unit 330, a permutation solving unit 340 and an inverse discrete Fourier transform unit 350.
[0026] The short-time discrete Fourier transform unit 310 performs a short-time discrete Fourier transform on pieces of digital data input from the A/D conversion unit 210.
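As a rough sketch of what the short-time discrete Fourier transform unit 310 computes (assuming NumPy; the Hann window and the frame and hop lengths are our arbitrary choices, not fixed by the embodiment):

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Short-time discrete Fourier transform: one DFT per windowed frame.

    Returns an array of shape (n_bins, n_frames), i.e. indexed as X[f, t].
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # rows: frequency bins, columns: frames

x = np.random.default_rng(1).standard_normal(4096)
X = stft(x)  # X[f, t]: the time-frequency representation used below
```

Each column of the result is one frame t and each row one frequency bin f, matching the X(f, t) indexing used in the following paragraphs.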
[0027] The independent component analysis unit 320 performs independent component analysis (ICA) on the observed signals expressed in the time-frequency domain, obtained by the short-time discrete Fourier transform unit 310, and then calculates a separating matrix for each frequency bin. The detailed process of independent component analysis is, for example, described in JP-A-2004-145172.
[0028] Here, the observed signals x1(t), x2(t) and r1(t) are each subjected to a short-time discrete Fourier transform, and the obtained signals (hereinafter also referred to as "transformed signals") are denoted by X1(f, t), X2(f, t) and R1(f, t). Then, statistically independent separated signals (hereinafter also referred to as "separated signals") Y1(f, t), Y2(f, t) and Q1(f, t) are extracted on the basis of the following mathematical expression (4) using the separating matrix W(f):

(Y1(f, t), Y2(f, t), Q1(f, t))^T = W(f) (X1(f, t), X2(f, t), R1(f, t))^T ... (4)
[0029] In the present embodiment, the separated signal Q1(f, t) (an example of an internal noise separated signal), which is obtained by multiplying the transformed signal R1(f, t) containing only the power source noise S3(f, t) by the coefficient W33(f), is generated. ICA adaptively learns the separating matrix W(f) so that the separated signal Q1(f, t) is independent of the separated signals Y1(f, t) and Y2(f, t), so the separated signals Y1(f, t) and Y2(f, t) that do not contain the power source noise S3(f, t) are extracted (semi-blind signal separation). That is, the separated signals Y1(f, t) and Y2(f, t) are each a component corresponding to something other than the power source noise S3(f, t), that is, one of the user voice S1(f, t) and the ambient noise S2(f, t).
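A minimal sketch of the semi-blind learning described above, for a single frequency bin (assuming NumPy; the natural-gradient update rule, step size and score function are our illustrative choices, and `semi_blind_ica_bin` is a hypothetical name). The key point is that the last row of W(f) is re-constrained to W31 = W32 = 0 after every update, so the output Q1 is always formed from R1 alone:

```python
import numpy as np

def semi_blind_ica_bin(X, n_iter=200, mu=0.1):
    """Natural-gradient ICA for one frequency bin (illustrative sketch).

    X: (3, T) complex observations (X1, X2, R1) for this bin.
    Returns a 3x3 separating matrix W with W[2, 0] = W[2, 1] = 0,
    i.e. the internal-noise output Q1 is formed from R1 alone.
    """
    n, T = X.shape
    W = np.eye(n, dtype=complex)
    for _ in range(n_iter):
        Y = W @ X
        # Score function for super-Gaussian (Laplacian-like) sources.
        phi = Y / (np.abs(Y) + 1e-9)
        dW = (np.eye(n) - (phi @ Y.conj().T) / T) @ W
        W = W + mu * dW
        # Re-impose the semi-blind constraint after each update.
        W[2, 0] = 0.0
        W[2, 1] = 0.0
    return W
```

Because the constrained row keeps Q1 tied to the internal sensor, the learning pressure toward mutual independence pushes the internal-noise component out of Y1 and Y2, as described in paragraph [0029].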
[0030] The gain correction unit 330 executes gain correction process on a separating matrix W(f) at each frequency calculated by the independent component analysis unit 320.
[0031] The permutation solving unit 340 executes a process for solving the permutation problem. FIG 3 is a block diagram of the permutation solving unit 340. Here, in the present embodiment, among the separated signals Y1(f, t), Y2(f, t) and Q1(f, t) that are separated by the independent component analysis unit 320, the separated signals Y1(f, t) and Y2(f, t) are already identified as corresponding to components other than the power source noise S3(f, t), that is, one of the user voice S1(f, t) and the ambient noise S2(f, t). Thus, the object of permutation solving is the separated signals Y1(f, t) and Y2(f, t). The separated signals Y1(f, t) and Y2(f, t) are input to the permutation solving unit 340, and the separated signal Q1(f, t) is directly input to the subsequent inverse discrete Fourier transform unit 350.
[0032] Then, the permutation solving according to the present embodiment utilizes the fact that the probability density distribution of the user voice S1(f, t) has a shape that is spikier than the probability density distribution of the ambient noise S2(f, t). Furthermore, in order to estimate the spikedness (degree of peakedness) of the probability density distribution, the scale parameter αi(f) of a Laplacian distribution is used. Here, when the scale parameter αi(f) of the Laplacian distribution is estimated, the expected value of the absolute value of the separated signal Yi(f, t) is utilized. Hereinafter, the description will be made sequentially.
[0033] The permutation solving unit 340 includes a spikedness calculation unit 341 and a clustering determination unit 342.
[0034] The spikedness calculation unit 341 calculates the spikedness of the probability density distribution (degree of peakedness of the distribution) of each of the separated signals Y1(f, t) and Y2(f, t). The scale parameter αi(f) of a Laplacian distribution when the separated signal Yi(f, t) is subjected to fitting with a Laplacian distribution is used as the spikedness. Then, the scale parameter αi(f) may be calculated through the following mathematical expression (5) using a maximum likelihood method:

αi(f) = Et{|Yi(f, t)|} ... (5)
[0035] Here, the separated signal Yi(f, t) is a complex spectrum, so |Yi(f, t)| means the absolute value of a complex number. In addition, Et{|Yi(f, t)|} means the average of |Yi(f, t)| over a predetermined number of frames.
[0036] Here, FIG 4 is a schematic view that shows the flow of calculating the spikednesses (scale parameters αi(f)) from the observed signals x1(t), x2(t) and r1(t). A voice signal collected by the first external microphone 111 is the observed signal x1(t), a voice signal collected by the second external microphone 112 is the observed signal x2(t) and a signal detected by the internal sensor 120 is the observed signal r1(t). These observed signals are subjected to a discrete Fourier transform for each frame of a predetermined duration, and the results are the transformed signals X1(f, t), X2(f, t) and R1(f, t). The results of independent component analysis on the transformed signals X1(f, t), X2(f, t) and R1(f, t) are the separated signals Y1(f, t), Y2(f, t) and Q1(f, t). At this time, the spikedness (scale parameter αi(fk)) for the frequency bin f = fk is, for example, expressed by the following mathematical expression (6) using the frames t0 to t2:

αi(fk) = (1 / (t2 − t0 + 1)) Σ_{t=t0}^{t2} |Yi(fk, t)| ... (6)
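Expressions (5) and (6) reduce to a time average of magnitudes, which can be sketched as follows (assuming NumPy; `laplacian_scale` is a hypothetical name, and the real-valued Laplacian sample is used only to sanity-check the estimator):

```python
import numpy as np

def laplacian_scale(Y, t0, t2):
    """ML estimate of the Laplacian scale parameter for one frequency bin:
    the average of |Y(fk, t)| over the frames t0..t2 (expressions (5), (6))."""
    return np.mean(np.abs(Y[t0:t2 + 1]))

# Sanity check: for samples drawn from a Laplacian distribution with
# scale 1.0, the estimate should come out close to 1.0.
rng = np.random.default_rng(2)
samples = rng.laplace(scale=1.0, size=5000)
a_est = laplacian_scale(samples, 0, 4999)
```

In the embodiment this is evaluated per frequency bin on the complex separated signals, where `np.abs` takes the magnitude of a complex number as stated in paragraph [0035].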
[0037] The clustering determination unit 342 uses the thus calculated spikednesses (scale parameters αi(fk)) to label the separated signals Y1(fk, t) and Y2(fk, t), and, where necessary, interchanges the separated signals Y1(fk, t) and Y2(fk, t). That is, one of the separated signals Y1(fk, t) and Y2(fk, t) is determined to correspond to the user voice S1(f, t), the other one is determined to correspond to the ambient noise S2(f, t), and then the sorting destinations of the user voice S1(f, t) and the ambient noise S2(f, t) are standardized at every frequency bin. Specifically, the one that has the largest spikedness (scale parameter αi(fk)) is determined to correspond to the user voice.
[0038] For example, when the user voice is sorted to index number 1 and the ambient noise is sorted to index number 2, the process will be as follows.
(Case 1) The case where α1(fk) ≥ α2(fk) is assumed as case 1. In this case, it may be determined that the separated signal Y1(fk, t) corresponds to the user voice S1(fk, t) and the separated signal Y2(fk, t) corresponds to the ambient noise S2(fk, t). In this case, interchanging is not required.
[0039] (Case 2) The case where α1(fk) < α2(fk) is assumed as case 2. In this case, it may be determined that the separated signal Y2(fk, t) corresponds to the user voice S1(fk, t) and the separated signal Y1(fk, t) corresponds to the ambient noise S2(fk, t). In this case, at the frequency bin fk, the separated signals Y1(fk, t) and Y2(fk, t) are interchanged.
[0040] Such clustering is executed at every frequency bin.
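The per-bin decision in cases 1 and 2 amounts to a conditional swap across frequency bins, which could look like this (assuming NumPy; the array shapes and the name `solve_permutation` are ours, not the patent's):

```python
import numpy as np

def solve_permutation(Y1, Y2, alpha1, alpha2):
    """Standardize the sorting destinations of the two external separated
    signals at every frequency bin.

    Y1, Y2: (n_bins, n_frames) separated signals; alpha1, alpha2: (n_bins,)
    spikednesses. On return, row index 1 holds the larger-spikedness signal
    (user voice) and index 2 the other (ambient noise) at every bin.
    """
    out1, out2 = Y1.copy(), Y2.copy()
    swap = alpha1 < alpha2            # case 2: interchange at these bins
    out1[swap], out2[swap] = Y2[swap], Y1[swap]
    return out1, out2
```

The boolean mask applies case 1 (no interchange) and case 2 (interchange) bin by bin in a single vectorized step.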
[0041] Lastly, the inverse discrete Fourier transform unit 350 performs an inverse discrete Fourier transform, transforms the data Y1(f, t), Y2(f, t) and Q1(f, t) in the time-frequency domain into data in the time domain, and then outputs the data.
[0042] With the above configuration, the following advantageous effects may be obtained.
(1) The internal sensor 120 that exclusively detects only the noise from the internal noise source (power source) 30 is provided. Then, independent component analysis optimizes the separated signal Q1(f, t) for estimating the internal noise and the other separated signals Y1(f, t) and Y2(f, t) so as to be independent of each other. The separated signal Q1(f, t) is generated from only the transformed signal R1(f, t) from the internal sensor 120, so the internal noise is definitely output to the separated signal Q1(f, t). If the internal noise is contained in the separated signals Y1(f, t) and Y2(f, t), correlation occurs, so those components are removed through optimization in ICA. Thus, the internal noise is output to only the separated signal Q1(f, t). By so doing, one of the separated signals Y1(f, t) and Y2(f, t) other than the separated signal Q1(f, t) corresponds to the user voice. That is, it is only necessary to solve the permutation problem for the separated signals Y1(f, t) and Y2(f, t) other than the separated signal Q1(f, t). Thus, it is possible to reduce the calculation load for permutation solving.
[0043] (2) The noise from the internal noise source (power source) 30 is similar to the user voice in the high degree of peakedness of its probability density distribution, or the like, and, therefore, it may be difficult to solve the permutation problem between the internal noise and the user voice. In the present embodiment, the sensor that detects only the internal noise is utilized, and the components W31(f) and W32(f) of the separating matrix W(f) are modeled as 0. By so doing, the internal noise is concentrated in the separated signal Q1(f, t), and is not contained in the remaining separated signals Y1(f, t) and Y2(f, t). Thus, it is possible to improve the accuracy of separating and extracting the user voice.
[0044] (3) In the present embodiment, in labeling, the spikedness of the probability density distribution (degree of peakedness of the distribution) of each of the separated signals Y1(f, t) and Y2(f, t) is used, and, in addition, the scale parameter αi(f) of a Laplacian distribution when the separated signal Yi(f, t) is subjected to fitting with a Laplacian distribution is used as the spikedness. With the above method, it is possible to remarkably reduce the calculation load.
[0045] Note that the aspect of the invention is not limited to the embodiment described above; it may be appropriately modified without departing from the scope of the invention. For example, in the above embodiment, the robot 100 is equipped with the signal separation device 200; instead, the aspect of the invention may be applied to a voice recognition system of an automobile, a telephone, or the like.
Claims
1. A signal separation system that separates an observed signal in the time domain, which mixedly contains a plurality of signals, into the plurality of signals using independent component analysis, and that extracts a specific voice from the separated signals, the signal separation system comprising:
an external microphone that is oriented outside of the signal separation system; an internal sensor that detects only an internal noise from an internal noise source present inside the signal separation system;
a discrete Fourier transform unit that performs a discrete Fourier transform on signals from the external microphone and the internal sensor;
an independent component analysis unit that performs independent component analysis on transformed signals that have been subjected to a discrete Fourier transform by the discrete Fourier transform unit so that an internal noise separated signal that contains only the internal noise is extracted on the basis of the transformed signal of the signal detected by the internal sensor and external separated signals that are independent of the internal noise separated signal and that do not contain the internal noise are extracted; and
a permutation solving unit that executes permutation solving on the external separated signals to extract the specific voice.
2. The signal separation system according to claim 1, wherein the permutation solving unit includes
a spikedness calculation unit that calculates a spikedness, which is a degree of peakedness of probability density distribution of each of the external separated signals; and
a clustering unit that labels the external separated signals as the specific voice or an ambient noise on the basis of the spikedness.
3. The signal separation system according to claim 2, wherein the spikedness calculation unit calculates a scale parameter, as the spikedness, of Laplacian distribution when each of the external separated signals is subjected to fitting with Laplacian distribution.
4. The signal separation system according to claim 3, wherein
the spikedness calculation unit calculates an expected value of an absolute value of each of the external separated signals as a maximum likelihood value of the scale parameter.
5. The signal separation system according to claim 3 or 4, wherein
the spikedness calculation unit calculates the scale parameter αi(f) using the following mathematical expression when a separated signal is denoted by Yi(f, t):
αi(f) = Et{|Yi(f, t)|}
where Et{|Yi(f, t)|} represents an average of |Yi(f, t)| in a predetermined number of frames.
6. The signal separation system according to any one of claims 2 to 5, wherein the clustering unit labels the external separated signal having the largest spikedness as the specific voice.
7. The signal separation system according to claim 1, wherein
the independent component analysis unit performs independent component analysis using a separating matrix in which a component corresponding to the internal noise, among internal components that are components corresponding to the transformed signal of the signal detected by the internal sensor, is not zero and the other internal components are zero.
8. The signal separation system according to claim 7, wherein the independent component analysis unit adaptively learns the separating matrix so that the internal noise separated signal is independent of the external separated signals.
9. The signal separation system according to claim 1, wherein
the internal sensor is arranged on a back side of the external microphone.
10. A signal separation method that separates an observed signal in the time domain, which mixedly contains a plurality of signals and is observed in a system that includes an external microphone that is oriented outside of the system and an internal sensor that detects only an internal noise from an internal noise source present inside the system, into the plurality of signals using independent component analysis, and that extracts a specific voice from the separated signals, the signal separation method comprising:
performing a discrete Fourier transform on signals from the external microphone and the internal sensor;
performing independent component analysis on transformed signals that have been subjected to a discrete Fourier transform so that an internal noise separated signal that contains only the internal noise is extracted on the basis of the transformed signal of the signal detected by the internal sensor and external separated signals that are independent of the internal noise separated signal and that do not contain the internal noise are extracted; and
executing permutation solving on the external separated signals to extract the specific voice.
Citations
- JP 2004-145172 A (Nippon Telegraph and Telephone Corporation (NTT); priority 2002-10-28, published 2004-05-20) — Method, apparatus and program for blind signal separation, and recording medium where the program is recorded
- WO 2009/113192 A1 (Toyota Motor Corporation; priority 2008-03-11, published 2009-09-17) — Signal separating apparatus and signal separating method