KR20050063987A

KR20050063987A - Robust speech recognition based on noise and channel space model

Info

Publication number: KR20050063987A
Application number: KR1020030095246A
Authority: KR
Inventors: 김동국; 김승희
Original assignee: 한국전자통신연구원
Priority date: 2003-12-23
Filing date: 2003-12-23
Publication date: 2005-06-29

Abstract

The present invention relates to a method for improving the performance of a speech recognition system by compensating for the degradation caused by the mismatch between the training and recognition environments that noise and channel effects introduce.

The present invention proposes a new noise and channel space model for modeling noise and channel effects, and estimates the parameters of that space model. Recognizer performance can be improved by using the estimated parameters to estimate clean speech from which the noise or channel effects have been removed.

Description

Robust Speech Recognition based on Noise and Channel Space Model

The present invention relates to speech recognition technology, and more particularly to a speech recognition system and method based on a noise and channel space model, in which the parameters of the space model are estimated and clean speech, with the noise or channel effects removed, is then estimated from those parameters.

In general, speech recognition technology is now widely deployed and used in real environments, but recognition performance degrades considerably under conditions harsher than the laboratory, that is, in the presence of channel distortion or noise. To overcome this, considerable effort is being devoted to robust speech recognition, which improves performance by compensating for environmental factors such as channel and noise.

Environment compensation techniques for channel and noise estimate the noise or channel characteristics from the utterance currently being recognized and subtract them from the noisy speech, thereby removing the noise or channel characteristics.

Estimating noise or a channel in general requires assuming a model for it, and estimation techniques based on such models have been developed and are in use. Conventionally, two main kinds of model have been used for noise and channel: an unknown constant value, and a random variable with a normal distribution.

Techniques using the first model include spectral subtraction and cepstral mean subtraction (CMS); techniques using the second model include stochastic matching (SM) and vector Taylor series (VTS). Compared with the present invention, however, these methods yield speech recognition performance that is less robust to noise and channel effects.
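As an illustration of the first kind of model (an unknown constant), cepstral mean subtraction can be sketched in a few lines. This is a minimal example on synthetic data, not code from the patent; the array shapes, the 13-coefficient feature size, and the offset values are assumptions chosen for the demonstration.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean of each cepstral coefficient.

    cepstra has shape (num_frames, num_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Synthetic "clean" cepstra plus a constant channel offset (hypothetical values).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_offset = np.linspace(-1.0, 1.0, 13)
noisy = clean + channel_offset

# A constant offset in the cepstral domain is removed entirely by CMS,
# so the compensated noisy features equal the mean-normalized clean ones.
compensated = cepstral_mean_subtraction(noisy)
reference = cepstral_mean_subtraction(clean)
```

The sketch also shows the limitation the text points to: CMS removes only a bias that stays constant over the utterance.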

The present invention is intended to solve the above problems of the prior art. Specifically, its object is to provide a speech recognition system and method based on a noise and channel space model that can improve the performance of a speech recognizer by estimating the noise and channel characteristics from the utterance currently being recognized and compensating for them.

The speech recognition system and method of the present invention, based on a noise and channel space model, propose a new model for noise and channel, estimate parameters based on this model, and use the estimated parameters.

To achieve the above object, the present invention provides a speech recognition method based on a noise and channel space model, comprising: modeling noise and channel using a noise and channel space model in the various domains used for speech recognition feature extraction; estimating the parameters of the noise and channel space model from given noisy speech; and estimating clean speech based on the parameters.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings.

The feature vector most widely used in speech recognition today is the MFCC (Mel Frequency Cepstral Coefficients). MFCCs are obtained by applying a filterbank spaced on the mel frequency scale to the FFT (Fast Fourier Transform) of the speech, taking the log energy of each filter output, and then applying an orthonormal transform. In the course of computing MFCCs, the feature vectors are thus processed in several different domains.
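The MFCC pipeline described above can be sketched for a single frame as follows. This is an illustrative implementation, not the patent's front end: the 16 kHz sample rate, 512-point FFT, 24 filters, 13 coefficients, Hamming window, the common mel formula 2595 log10(1 + f/700), and the choice of the orthonormal DCT-II as the orthonormal transform are all assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (num_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, freqs.size))
    for i in range(num_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(num_ceps, num_filters):
    """First num_ceps rows of the orthonormal DCT-II matrix."""
    k = np.arange(num_ceps)[:, None]
    n = np.arange(num_filters)[None, :]
    mat = np.sqrt(2.0 / num_filters) * np.cos(np.pi * k * (2 * n + 1) / (2 * num_filters))
    mat[0] /= np.sqrt(2.0)
    return mat

def mfcc(frame, sample_rate=16000, n_fft=512, num_filters=24, num_ceps=13):
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2               # linear spectral domain
    energies = mel_filterbank(num_filters, n_fft, sample_rate) @ power
    log_energies = np.log(np.maximum(energies, 1e-10))                # log spectral domain
    return dct_matrix(num_ceps, num_filters) @ log_energies           # cepstral domain

# Usage: one 25 ms frame of a 440 Hz tone at 16 kHz.
frame = np.sin(2 * np.pi * 440.0 * np.arange(400) / 16000.0)
coeffs = mfcc(frame)
```

The three commented stages correspond to the linear spectral, log spectral, and cepstral domains in which the text later expresses the noise and channel relationships.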

In general, the relationship in the time domain between clean speech, noise, and channel is represented as shown in FIG. 1, which involves four time-domain signals: the clean speech, the additive noise, the channel characteristic, and the noisy speech.

Assuming that the speech and the noise are uncorrelated at each time t, this relationship can be expressed as Equation 1 below.

Expressing Equation 1 in the linear spectral domain, the log spectral domain, and the cepstral domain gives the following equations.

In the linear spectral domain, the relationship can be expressed as Equation 2 below.

Here, each term is the DFT of the corresponding signal.

In the log spectral domain, the relationship can be expressed as Equation 3 below, together with the definition of its log-spectral terms.

In the cepstral domain, the relationship can be expressed as Equation 4 below.

Here the terms denote the signals in the cepstral domain, together with the DCT (discrete cosine transform) and the inverse DCT.
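As a sketch, the relations referred to as Equations 1 to 4 commonly take the following forms in the robust speech recognition literature. The symbol names are assumptions introduced here and are not the patent's own notation: x, n, h, and y denote the clean speech, additive noise, channel, and noisy speech; capital letters denote their DFTs; the superscripts l and c mark log power spectra (for example, y^l = log |Y|^2) and cepstra; C and C^{-1} denote the DCT and its inverse. The second line relies on the assumption that speech and noise are uncorrelated.

```latex
\begin{align}
y[t] &= x[t] * h[t] + n[t] \\
|Y(\omega)|^2 &= |X(\omega)|^2 \, |H(\omega)|^2 + |N(\omega)|^2 \\
y^{l} &= x^{l} + h^{l} + \log\!\bigl(1 + e^{\,n^{l} - x^{l} - h^{l}}\bigr) \\
y^{c} &= x^{c} + h^{c} + C \log\!\bigl(1 + e^{\,C^{-1}(n^{c} - x^{c} - h^{c})}\bigr)
\end{align}
```

Each line follows from the previous one: taking the DFT turns the convolution into a product, taking the logarithm turns the product into a sum plus a noise-dependent correction term, and applying the DCT carries the same expression into the cepstral domain.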

First, assuming there is no distortion due to the channel characteristic and only additive noise is present, the noisy signal in the linear spectral domain is expressed as Equation 5 below.

Next, assuming there is no noise and only distortion due to the channel characteristic is present, the noisy signal in the cepstral domain is expressed as Equation 6 below.

That is, when only one of noise or channel characteristic affects the clean speech, each appears as a term added to the clean speech, but in a different domain. This can be expressed generically as Equation 7 below.
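The point that noise and channel each become an additive term, but in different domains, can be checked numerically. The sketch below uses synthetic signals and a hypothetical three-tap channel; it works with the complex DFT for the noise-only case (with power spectra the sum holds only on average, under the uncorrelated assumption) and uses circular convolution so that the channel-only identity is exact.

```python
import numpy as np

def dct_matrix(size):
    """Square orthonormal DCT-II matrix."""
    k = np.arange(size)[:, None]
    n = np.arange(size)[None, :]
    mat = np.sqrt(2.0 / size) * np.cos(np.pi * k * (2 * n + 1) / (2 * size))
    mat[0] /= np.sqrt(2.0)
    return mat

def cepstrum(spectrum):
    """DCT of the log power spectrum."""
    return dct_matrix(spectrum.size) @ np.log(np.abs(spectrum) ** 2)

rng = np.random.default_rng(0)
n_fft = 256
x = rng.normal(size=n_fft)               # stand-in for a clean speech frame
noise = 0.3 * rng.normal(size=n_fft)     # additive noise
h = np.zeros(n_fft)
h[:3] = [1.0, 0.5, 0.25]                 # hypothetical channel impulse response

X, N, H = np.fft.rfft(x), np.fft.rfft(noise), np.fft.rfft(h)

# Noise only: the DFT is linear, so the noise is an additive term in the linear spectrum.
Y_noise_only = np.fft.rfft(x + noise)

# Channel only: circular convolution multiplies the spectra, so the channel
# becomes an additive bias in the log spectrum and therefore in the cepstrum.
y_channel_only = np.fft.irfft(X * H, n=n_fft)
Y_channel_only = np.fft.rfft(y_channel_only)
channel_bias = cepstrum(H)
```

In both cases the corrupted feature equals the clean feature plus a bias term, which is the generic form the text goes on to model.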

Here the added term is a bias representing either the additive noise in the linear spectral domain or the channel distortion in the cepstral domain. In general, the probability density function of the clean speech feature vector is given as a mixture of normal distributions, as in Equation 8 below.

In Equation 8, the parameters are the weight, mean, and covariance of each normal component of the mixture.
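A mixture-of-normals density of this kind can be evaluated as in the following sketch. The function name and argument layout are hypothetical; the density it computes is the standard form p(x) = sum over k of w_k N(x; mu_k, Sigma_k), evaluated in the log domain for numerical stability.

```python
import numpy as np

def gmm_log_density(x, weights, means, covariances):
    """Return log p(x) for a Gaussian mixture with full covariance matrices."""
    x = np.asarray(x, dtype=float)
    dim = x.size
    log_terms = []
    for w, mu, cov in zip(weights, means, covariances):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        log_terms.append(np.log(w) - 0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha))
    log_terms = np.array(log_terms)
    peak = log_terms.max()
    # log-sum-exp over the mixture components
    return peak + np.log(np.exp(log_terms - peak).sum())

# Usage: a two-component mixture in two dimensions (hypothetical parameters).
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, -1.0])]
covariances = [np.eye(2), 2.0 * np.eye(2)]
value = gmm_log_density([0.5, 0.0], weights, means, covariances)
```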

In the noisy speech model of Equation 8, modeling the bias is very important. Conventionally the bias has been modeled either as an unknown fixed value or as a single normal distribution. Such models, however, become increasingly inaccurate, or fail to model the environment properly, as the bias term grows relative to the clean speech.

In particular, when the bias term varies over time, modeling it as a simple fixed value or a normal distribution has clear limits. The present invention therefore uses a hidden variable model, such as factor analysis (FA) or probabilistic principal component analysis (PPCA), for the noise or channel, that is, for the bias. The noisy speech and the bias are then modeled as in Equation 9 below.

Here the terms of the bias model are, in order: the mean of the bias vector; a hidden variable vector; a matrix representing the noise space or channel space; and normally distributed random noise that is independent of the hidden variable. In the FA model, the noise is defined as normally distributed with a diagonal covariance matrix. In PPCA, by contrast, the noise covariance matrix is defined to have identical diagonal elements.

Only the PPCA model is considered here. For the PPCA model, the parameters of Equation 9 are the bias mean, the noise or channel space matrix, and the noise variance.
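A PPCA bias model of this kind can be sketched as b = mu + Lambda z + e, with z drawn from a standard normal and e from an isotropic normal with variance sigma^2, so that the bias has covariance Lambda Lambda^T + sigma^2 I. The symbols, the dimensions (4 features, 2 hidden variables), and the parameter values below are assumptions for illustration, not values from the patent.

```python
import numpy as np

def ppca_covariance(loading, sigma2):
    """Marginal covariance of the bias: Lambda Lambda^T + sigma^2 I."""
    return loading @ loading.T + sigma2 * np.eye(loading.shape[0])

def sample_bias(mean, loading, sigma2, num_samples, rng):
    """Draw bias vectors b = mean + loading @ z + e, one per row."""
    dim, latent_dim = loading.shape
    z = rng.normal(size=(num_samples, latent_dim))              # hidden variables
    e = np.sqrt(sigma2) * rng.normal(size=(num_samples, dim))   # isotropic noise
    return mean + z @ loading.T + e

# Hypothetical parameters: the bias varies within a 2-dimensional subspace.
mean = np.array([1.0, -0.5, 0.0, 2.0])
loading = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.5, 0.5],
                    [0.0, 0.0]])
sigma2 = 0.1

rng = np.random.default_rng(0)
samples = sample_bias(mean, loading, sigma2, 20000, rng)
model_cov = ppca_covariance(loading, sigma2)
sample_cov = np.cov(samples, rowvar=False)
```

Unlike a fixed value or a single unstructured normal distribution, this model lets the bias move over time within the low-dimensional space spanned by the columns of the matrix.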

Robust speech recognition amounts to inferring the sequence of clean speech feature vectors, by means of an environment compensation technique, given a sequence of feature vectors corrupted by noise or channel characteristics. Consider the inferred value of the clean speech feature vector at a given time; it can be inferred from the observed corrupted feature vector as in Equation 10 below.

Here the parameters used are those of the noise and channel space model as estimated from the corrupted noisy speech. Therefore, to infer the clean speech, the model parameters of the noise and channel space model must be estimated first.

The parameters are estimated under the maximum likelihood (ML) criterion, so as to maximize the likelihood of the observed vector sequence, as in Equation 11 below.

However, because a direct solution under this criterion is very difficult to obtain, the EM (Expectation Maximization) technique, which estimates the parameters iteratively, is used. The EM technique estimates the parameters using an auxiliary function such as Equation 12 below.
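In generic form, the ML criterion and the EM auxiliary function can be written as below. This is the textbook formulation rather than the patent's exact equations, and the notation is assumed: Y is the observed noisy feature sequence, theta the model parameters, theta-bar the candidate parameters being optimized, and H stands for all hidden sequences taken together.

```latex
\begin{align}
\hat{\theta} &= \arg\max_{\theta} \; p(Y \mid \theta) \\
Q(\theta, \bar{\theta}) &= E\bigl[\, \log p(Y, \mathcal{H} \mid \bar{\theta}) \;\big|\; Y, \theta \,\bigr]
\end{align}
```

Each EM iteration computes the expectation under the current parameters and then picks the candidate that maximizes Q, which never decreases the likelihood in the first line.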

The auxiliary function involves the clean speech feature vector sequence and the corrupted noisy speech vector sequence, as well as the bias feature vector sequence, the mixture component sequence, and the hidden variable vector sequence. The parameters are estimated so as to maximize this function, which the EM technique does by applying the E-step and the M-step repeatedly. The E-step and M-step for the noise and channel space model can be derived in the forms of Equation 13 and Equation 14, respectively.

[E-step]

[M-step]

Here the term in question is the posterior probability of the mixture component given the observation vector.

Therefore, by repeating the above process several times, the optimal parameters of the noise and channel space model can be estimated.
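To make the E-step and M-step iteration concrete, the sketch below runs the standard EM updates for plain PPCA fitted to directly observed vectors. It is a simplified stand-in and not the patent's Equations 13 and 14, which additionally involve the clean speech mixture model; the function names, the synthetic data, and the iteration count are assumptions.

```python
import numpy as np

def ppca_log_likelihood(data, mean, loading, sigma2):
    n, d = data.shape
    centered = data - mean
    scatter = centered.T @ centered / n
    cov = loading @ loading.T + sigma2 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet
                       + np.trace(np.linalg.solve(cov, scatter)))

def ppca_em(data, latent_dim, num_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = data.shape
    mean = data.mean(axis=0)                 # ML estimate of the mean
    centered = data - mean
    loading = rng.normal(size=(d, latent_dim))
    sigma2 = 1.0
    history = []
    for _ in range(num_iters):
        # E-step: posterior moments of the hidden variables.
        m_inv = np.linalg.inv(loading.T @ loading + sigma2 * np.eye(latent_dim))
        ez = centered @ loading @ m_inv               # rows are E[z_t]
        sum_ezz = n * sigma2 * m_inv + ez.T @ ez      # sum over t of E[z_t z_t^T]
        # M-step: re-estimate the subspace matrix and the noise variance.
        loading = (centered.T @ ez) @ np.linalg.inv(sum_ezz)
        sigma2 = (np.sum(centered ** 2)
                  - 2.0 * np.sum(ez * (centered @ loading))
                  + np.trace(sum_ezz @ loading.T @ loading)) / (n * d)
        history.append(ppca_log_likelihood(data, mean, loading, sigma2))
    return mean, loading, sigma2, history

# Synthetic data from a 2-dimensional subspace in 5 dimensions (hypothetical).
rng = np.random.default_rng(1)
true_loading = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.3, -0.2], [0.0, 0.0]])
z = rng.normal(size=(500, 2))
data = 0.5 + z @ true_loading.T + np.sqrt(0.05) * rng.normal(size=(500, 5))
mean, loading, sigma2, history = ppca_em(data, latent_dim=2)
```

The `history` list holds the log-likelihood after each iteration; it should never decrease, which is the property that justifies repeating the two steps until the parameters settle.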

Based on the estimated optimal model parameters, the present invention can then estimate the clean speech from the corrupted noisy speech as follows.
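One plausible way to carry out this final step is a minimum mean square error (MMSE) estimate, sketched below. It assumes the noisy feature is the clean feature plus a bias, that the clean feature follows a Gaussian mixture, and that the bias is Gaussian with mean mu_b and the PPCA covariance Lambda Lambda^T + sigma^2 I. The function and argument names are hypothetical, and this is an illustration under those assumptions rather than the patent's own estimation equation.

```python
import numpy as np

def gaussian_log_pdf(x, mean, cov):
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def estimate_clean(y, weights, means, covs, bias_mean, loading, sigma2):
    """MMSE estimate E[x | y] for y = x + b, x ~ GMM, b ~ N(bias_mean, bias_cov)."""
    y = np.asarray(y, dtype=float)
    bias_cov = loading @ loading.T + sigma2 * np.eye(y.size)
    # Posterior probability of each mixture component given the noisy vector.
    log_post = np.array([np.log(w) + gaussian_log_pdf(y, m + bias_mean, c + bias_cov)
                         for w, m, c in zip(weights, means, covs)])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Weighted sum of the per-component conditional means of the clean vector.
    estimate = np.zeros_like(y)
    for g, m, c in zip(post, means, covs):
        estimate += g * (m + c @ np.linalg.solve(c + bias_cov, y - m - bias_mean))
    return estimate

# Usage with a single zero-mean, unit-variance component and unit-variance bias:
# the estimate splits the observation evenly between speech and bias.
x_hat = estimate_clean(np.array([2.0, -4.0]), [1.0], [np.zeros(2)], [np.eye(2)],
                       bias_mean=np.zeros(2), loading=np.zeros((2, 1)), sigma2=1.0)
```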

FIG. 2 shows the compensation technique of the present invention, based on the noise and channel space model, for compensating clean speech that has been corrupted by noise or channel as in FIG. 1.

By using the algorithm of the present invention when building a speech recognition system, recognition performance can be improved in environments with noise or channel distortion, and the noise and channel characteristics can be estimated more accurately. Because the model parameters of the environment are estimated adaptively for the recognition environment and so reflect the characteristics best suited to it, clean speech is modeled more accurately and recognition becomes more robust to noise. The technique is also applicable to real environments, making practical product applications possible.

FIG. 1 is a block diagram showing noisy speech, that is, clean speech corrupted by noise and channel, for the present invention.

FIG. 2 is a block diagram of the noise and channel space model parameter estimation of the present invention and the compensation technique using it.

Claims

A speech recognition method based on a noise and channel space model, comprising:

modeling noise and channel using a noise and channel space model in the various domains used for speech recognition feature extraction;

estimating the parameters of the noise and channel space model from given noisy speech; and

estimating clean speech based on the parameters.

The method of claim 1, wherein modeling the noise and channel comprises modeling them by the following equation.

Here the terms are, in order: the mean of the bias vector; a hidden variable vector; a matrix representing the noise space or channel space; and normally distributed random noise that is independent of the hidden variable.

The method of claim 1, wherein estimating the noise and channel space model parameters comprises estimating the parameters by an E-step and an M-step given by the following equations.

[E-step]

[M-step]

Here the term in question is the posterior probability of the mixture component given the observation vector.

The method of claim 1, wherein estimating clean speech based on the parameters comprises estimating the speech by the following equation.