JP2001067094A

JP2001067094A - Voice recognizing device and its method

Info

Publication number: JP2001067094A
Application number: JP24285699A
Authority: JP
Inventors: Tomohiro Narita; 知宏成田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-08-30
Filing date: 1999-08-30
Publication date: 2001-03-16

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognizing device and method capable of reducing the deterioration in recognition performance caused by changes in the distance between the input terminal of a voice signal and a noise source, and by variations in environmental noise. SOLUTION: This voice recognizing device is equipped with a spectrum computing means 101 for obtaining a noise-superimposed voice spectrum time series; an average spectrum computing means 102 for obtaining a noise spectrum by estimating the spectrum of the superimposed noise from non-speech sections; a noise-removed spectrum group computing means 201 for obtaining a plurality of kinds of noise-removed voice spectrum time series by changing the scaling factor applied to the noise spectrum; a feature vector group computing means 202 for converting the plurality of kinds of noise-removed voice spectrum time series into a plurality of kinds of feature vector time series; a collation model memory 205 for storing a noise-free voice pattern and a model representing transitions between the kinds of feature vectors; and a three-dimensional collation means 203 for collating against the noise-free voice pattern and the model representing transitions between the kinds of feature vectors in a three-dimensional space spanned by three axes: time, state, and kind of feature vector.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition apparatus and method for speech that is uttered in a noisy environment and has noise superimposed on it.

[0002]

2. Description of the Related Art

Background noise is superimposed on speech uttered in a noisy environment, which degrades the speech recognition rate. Spectral subtraction is widely used as a simple and effective method for removing this superimposed noise. As an example, a conventional speech recognition apparatus using the spectral subtraction method described in "Acoustic Engineering Course 7, Speech (Revised)" (edited by the Acoustical Society of Japan; Kazuo Nakada, Corona, pp. 130-131) is described here.

[0003] FIG. 8 is a block diagram showing the configuration of a conventional speech recognition apparatus. In FIG. 8, 101 is a spectrum calculating means that performs spectrum analysis on a noise-superimposed speech input and extracts a noise-superimposed speech spectrum time series; 102 is an average spectrum calculating means that averages the spectra of non-speech sections and outputs the result as a noise spectrum; 103 is a noise removal spectrum calculating means that subtracts the noise spectrum from the noise-superimposed speech spectrum time series and outputs a noise-removed spectrum time series; 104 is a feature vector calculating means that obtains a feature vector time series from the noise-removed spectrum time series; 105 is a matching model memory that stores noise-free speech patterns for matching; and 106 is a matching means that matches the feature vector time series against the noise-free speech patterns stored in the matching model memory 105 and outputs the recognition result that gives the maximum likelihood.

[0004] The operation of the conventional speech recognition apparatus is described below. The spectrum calculating means 101 computes a power spectrum of the noise-superimposed speech input by Fourier transform at fixed time intervals and outputs the result as a noise-superimposed speech spectrum time series. The average spectrum calculating means 102 averages, for each frequency, the noise-superimposed speech spectra of several frames extracted from a non-speech section of that time series, for example the section immediately before the speech section or a pause during the utterance, and outputs the average as the noise spectrum. The noise removal spectrum calculating means 103 subtracts the noise spectrum from each noise-superimposed speech spectrum in the time series.
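The per-frequency averaging performed by the average spectrum calculating means 102 can be illustrated with a minimal sketch. The function name, the list-of-lists representation of the power spectra, and the sample values are assumptions made for illustration; the source does not specify a data layout or the number of frames.

```python
from typing import List, Sequence


def average_noise_spectrum(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average non-speech power spectra frequency bin by frequency bin."""
    if not frames:
        raise ValueError("need at least one non-speech frame")
    n_bins = len(frames[0])
    if any(len(frame) != n_bins for frame in frames):
        raise ValueError("all frames must have the same number of bins")
    return [sum(frame[w] for frame in frames) / len(frames) for w in range(n_bins)]


# Hypothetical 3-bin power spectra taken from two non-speech frames.
noise_frames = [[1.0, 2.0, 4.0], [3.0, 2.0, 0.0]]
noise_spectrum = average_noise_spectrum(noise_frames)  # N(w) for each bin
```

In practice the frames would come from the section just before the utterance or from pauses, as the paragraph above describes.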

[0005] Equation (1) gives the relationship among the power S(ω) of the noise-removed speech spectrum at frequency ω, the power X(ω) of the noise-superimposed speech spectrum at frequency ω, and the power N(ω) of the estimated noise spectrum at frequency ω.

[0006]

[Equation 1]

[0007] Here, α is a parameter called the subtraction coefficient. It represents the degree to which the noise component is removed and is usually tuned to maximize recognition accuracy. max{} is a function that returns the largest of the elements inside the braces.
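The image for equation (1) is not reproduced in this text. A common form of spectral subtraction that is consistent with the definitions of S(ω), X(ω), N(ω), α, and max{} above is S(ω) = max{X(ω) − αN(ω), 0}. The sketch below assumes that form; in particular, the floor value of 0 is an assumption, and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def spectral_subtraction(x: Sequence[float], n: Sequence[float],
                         alpha: float = 1.0) -> List[float]:
    """Return S(w) = max(X(w) - alpha * N(w), 0) for each frequency bin.

    The zero floor is an assumed reading of equation (1).
    """
    if len(x) != len(n):
        raise ValueError("spectra must have the same number of bins")
    return [max(xw - alpha * nw, 0.0) for xw, nw in zip(x, n)]


noisy = [5.0, 3.0, 1.0]  # hypothetical X(w)
noise = [2.0, 2.0, 2.0]  # hypothetical N(w)
clean = spectral_subtraction(noisy, noise, alpha=1.0)
stronger = spectral_subtraction(noisy, noise, alpha=2.0)
```

Raising α removes more of the estimated noise, at the cost of flooring more bins to zero.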

[0008] The feature vector calculating means 104 converts the noise-removed speech spectrum time series output by the noise removal spectrum calculating means 103 into vectors that represent acoustic features for speech recognition, such as LPC (Linear Predictive Coding) cepstra.

[0009] The matching means 106 matches the feature vector time series output by the feature vector calculating means 104 against the noise-free speech patterns stored in the matching model memory 105, and outputs the recognition candidate that gives the maximum likelihood as the recognition result. As an example of the matching means, the following describes the method of computing the maximum likelihood by Viterbi search in a speech recognition apparatus using hidden Markov models (hereinafter, HMMs), as described in "Fundamentals of Speech Recognition (Vol. 2)" (Lawrence Rabiner and Biing-Hwang Juang, NTT Advanced Technology Corporation, pp. 125-128).

[0010] Specifically, the Viterbi search that finds the single optimal state sequence q = (q_1, q_2, ..., q_T) with the maximum likelihood for the feature vector time series Y = (y_1, y_2, ..., y_T) over times 1 to T consists of the following four steps.

[0011] STEP 1 (Initialization)

[0012]

[Equation 2]

[0013]

[Equation 3]

[0014] STEP 2 (Iteration)

[0015]

[Equation 4]

[0016]

[Equation 5]

[0017] STEP 3 (Termination)

[0018]

[Equation 6]

[0019]

[Equation 7]

[0020] STEP 4 (Backtracking)

[0021]

[Equation 8]

[0022] Here, δ_t(i) is the maximum likelihood, at time t, along a single path, and is given by the following equation.

[0023]

[Equation 9]

[0024] In equations (2) to (8), Ψ_t(j) is an array that stores, for each time t and each state j, the argument of the path that maximizes equation (9). In addition, a_ij is the transition probability from state i to state j, b_i(y_t) is the output probability of feature vector y_t in state i, π_i is the probability of being in state i in the initial state, and λ denotes the speech model for matching; each of these is learned from speech data uttered in a noise-free environment.
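The images for equations (2) to (9) are not reproduced in this text. Given the symbol definitions above and the textbook algorithm the source cites, they presumably take the standard Viterbi form. The block below is a reconstruction under that assumption, not the patent's exact notation; the symbol N for the number of states and the assignment of equation numbers to lines are assumptions.

```latex
\begin{align}
\delta_1(i) &= \pi_i\, b_i(y_1), \quad 1 \le i \le N \tag{2}\\
\Psi_1(i) &= 0 \tag{3}\\
\delta_t(j) &= \max_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(y_t), \quad 2 \le t \le T \tag{4}\\
\Psi_t(j) &= \operatorname*{arg\,max}_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr] \tag{5}\\
P^{*} &= \max_{1 \le i \le N} \delta_T(i) \tag{6}\\
q_T^{*} &= \operatorname*{arg\,max}_{1 \le i \le N} \delta_T(i) \tag{7}\\
q_t^{*} &= \Psi_{t+1}\bigl(q_{t+1}^{*}\bigr), \quad t = T-1, \dots, 1 \tag{8}\\
\delta_t(i) &= \max_{q_1, \dots, q_{t-1}} P\bigl(q_1, \dots, q_{t-1},\, q_t = i,\, y_1, \dots, y_t \mid \lambda\bigr) \tag{9}
\end{align}
```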

[0025] In a typical speech recognition apparatus, the state transitions of the matching speech pattern are represented by a left-to-right HMM with constrained state transitions, as shown in FIG. 9. Here, b_i(y) is the output probability of feature vector y in state i.

[0026] FIG. 10 shows the Viterbi search when the state transitions of the matching speech pattern are represented by a left-to-right HMM. FIG. 10 shows that the maximum likelihood δ_t(j) at time t and state j is computed by selecting the path that maximizes the likelihood from the maximum likelihood δ_{t-1}(j) at time t-1 and state j and the maximum likelihood δ_{t-1}(j-1) at time t-1 and state j-1.
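The four-step Viterbi search and its left-to-right special case can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it follows the standard textbook formulation, works on plain probabilities rather than log probabilities, and takes precomputed output probabilities b[t][i] instead of evaluating an acoustic model. The function name and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple


def viterbi(pi: Sequence[float], a: Sequence[Sequence[float]],
            b: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Viterbi search over an HMM.

    pi[i]   : probability of starting in state i
    a[i][j] : transition probability from state i to state j
    b[t][i] : output probability of the frame at time t in state i
    Returns the likelihood of the best path and the best state sequence.
    """
    n_states, n_frames = len(pi), len(b)
    delta = [[0.0] * n_states for _ in range(n_frames)]
    psi = [[0] * n_states for _ in range(n_frames)]
    # STEP 1: initialization
    for i in range(n_states):
        delta[0][i] = pi[i] * b[0][i]
    # STEP 2: iteration over time
    for t in range(1, n_frames):
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: delta[t - 1][i] * a[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * a[best_i][j] * b[t][j]
    # STEP 3: termination
    last = max(range(n_states), key=lambda i: delta[-1][i])
    best = delta[-1][last]
    # STEP 4: backtracking
    path = [last]
    for t in range(n_frames - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return best, path


# Hypothetical 2-state left-to-right model: state 0 -> {0, 1}, state 1 -> {1}.
pi = [1.0, 0.0]
a = [[0.5, 0.5],
     [0.0, 1.0]]
# Hypothetical output probabilities b[t][i] for three frames.
b = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]
likelihood, states = viterbi(pi, a, b)
```

Because the zero entries of `a` forbid backward transitions, each δ_t(j) is effectively chosen from δ_{t-1}(j) and δ_{t-1}(j-1), which matches the description of FIG. 10.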

[0027] Through the above operation, the average spectrum of the noise in the non-speech section is regarded as being superimposed on the spectrum time series of the input noise-superimposed speech signal; the noise component is removed on the power spectrum, matching against the noise-free matching model is performed, and the recognition result is obtained.

[0028]

PROBLEMS TO BE SOLVED BY THE INVENTION

Because the conventional speech recognition apparatus for noisy environments using spectral subtraction is configured as described above, it works relatively well when the difference between the average noise spectrum, taken for example immediately before the utterance, and the noise spectrum actually superimposed on the speech section is small, that is, when the environmental noise varies little. However, when the noise source is a moving object and the distance from the input terminal of the speech signal to the noise source changes, or when the environmental noise is non-stationary and varies widely, the error between the estimated noise spectrum and the noise spectrum actually superimposed on the speech becomes large, and recognition performance deteriorates.

[0029] The present invention has been made to solve the above problems. Its object is to obtain a speech recognition apparatus and method capable of reducing the degradation of recognition performance caused by changes in the distance between the input terminal of the speech signal and the noise source. A further object is to obtain a speech recognition apparatus and method capable of reducing the degradation of recognition performance caused by variations in environmental noise.

[0030]

MEANS FOR SOLVING THE PROBLEMS

A speech recognition apparatus according to the present invention performs spectrum analysis on a noise-superimposed input speech signal that includes non-speech sections, obtains spectral feature parameters, and performs speech recognition processing. The apparatus comprises: spectrum calculating means for performing spectrum analysis on the noise-superimposed input speech signal and outputting a noise-superimposed speech spectrum time series; average spectrum calculating means for estimating the spectrum of the superimposed noise from the non-speech sections of the noise-superimposed speech spectrum time series output by the spectrum calculating means and outputting it as a noise spectrum; noise removal spectrum group calculating means for outputting a plurality of types of noise-removed speech spectrum time series by varying the scaling factor applied to the noise spectrum when the noise spectrum output by the average spectrum calculating means is subtracted from the noise-superimposed speech spectrum time series output by the spectrum calculating means; feature vector group calculating means for converting the plurality of types of noise-removed speech spectrum time series output by the noise removal spectrum group calculating means into a plurality of types of feature vector time series; a matching model memory storing noise-free speech patterns learned from speech data uttered in a noise-free environment and a model representing transitions between feature vector types; and three-dimensional matching means for matching the plurality of types of noise-removed speech feature vector time series output by the feature vector group calculating means against the noise-free speech patterns and the model representing transitions between feature vector types stored in the matching model memory, in a three-dimensional space spanned by the three axes of time, state, and feature vector type, and outputting a recognition result.

[0031] The apparatus further comprises a noise spectrum memory that stores the noise spectrum output by the average spectrum calculating means and a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering technique. The noise removal spectrum calculating means obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series output by the spectrum calculating means, by combining a plurality of types of scaling factors for the noise vector with the plurality of types of noise spectrum patterns stored in the noise spectrum memory.

[0032] The matching model memory stores, as the model representing transitions between feature vector types, a model that places no constraints on those transitions.

[0033] The matching model memory stores, as the model that places no constraints on transitions between feature vector types, an ergodic hidden Markov model in which transitions to all types are possible.

[0034] The matching model memory stores, as the model representing transitions between feature vector types, a model that places constraints on those transitions.

[0035] The matching model memory stores, as the model that places constraints on transitions between feature vector types, a hidden Markov model in which transitions are possible only between adjacent feature vector types.

[0036] A speech recognition method according to the present invention performs spectrum analysis on a noise-superimposed input speech signal that includes non-speech sections, obtains spectral feature parameters, and performs speech recognition processing. The method comprises: a spectrum calculating step of performing spectrum analysis on the noise-superimposed input speech to obtain a noise-superimposed speech spectrum time series; an average spectrum calculating step of estimating the spectrum of the superimposed noise from the non-speech sections of the noise-superimposed speech spectrum time series obtained in the spectrum calculating step and obtaining it as a noise spectrum; a noise removal spectrum group calculating step of obtaining a plurality of types of noise-removed speech spectrum time series by varying the scaling factor applied to the noise spectrum when the noise spectrum obtained in the average spectrum calculating step is subtracted from the noise-superimposed speech spectrum time series obtained in the spectrum calculating step; a feature vector group calculating step of converting the plurality of types of noise-removed speech spectrum time series obtained in the noise removal spectrum group calculating step into a plurality of types of feature vector time series; and a three-dimensional matching step of matching the plurality of types of noise-removed speech feature vector time series obtained in the feature vector group calculating step against noise-free speech patterns learned from speech data uttered in a noise-free environment and a model representing transitions between feature vector types, in a three-dimensional space spanned by the three axes of time, state, and feature vector type, and obtaining the recognition result.

[0037] The noise removal spectrum calculating step obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series obtained in the spectrum calculating step, by combining a plurality of types of scaling factors for the noise vector with a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering technique.

[0038] The three-dimensional matching step uses, as the model representing transitions between feature vector types, a model that places no constraints on those transitions.

[0039] The three-dimensional matching step uses, as the model that places no constraints on transitions between feature vector types, an ergodic hidden Markov model in which transitions to all types are possible.

[0040] The three-dimensional matching step uses, as the model representing transitions between feature vector types, a model that places constraints on those transitions.

[0041] Furthermore, the three-dimensional matching step uses, as the model that places constraints on transitions between feature vector types, a hidden Markov model in which transitions are possible only between adjacent feature vector types.

[0042]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiment 1. FIG. 1 is a block diagram showing a configuration for explaining the speech recognition apparatus and method according to Embodiment 1 of the present invention. In FIG. 1, parts identical to those of the conventional example shown in FIG. 8 are given the same reference numerals: 101 is a spectrum calculating means that performs spectrum analysis on a noise-superimposed speech input and extracts a noise-superimposed speech spectrum time series, and 102 is an average spectrum calculating means that averages the spectra of the non-speech sections in the noise-superimposed speech spectrum time series output by the spectrum calculating means 101 and outputs the result as a noise spectrum.

[0043] As new reference numerals, 201 is a noise removal spectrum group calculating means that subtracts the noise spectrum output by the average spectrum calculating means 102 from the noise-superimposed speech spectrum time series output by the spectrum calculating means 101 while varying the scaling factor applied to the noise spectrum, and outputs a plurality of types of noise-removed spectrum time series; 202 is a feature vector group calculating means that converts the plurality of types of noise-removed spectrum time series into a plurality of types of feature vector time series; 203 is a three-dimensional matching means that matches the plurality of types of noise-removed speech feature vector time series output by the feature vector group calculating means 202 against the noise-free speech patterns and the model representing transitions between feature vector types stored in the matching model memory 205 described below, in a three-dimensional space spanned by the three axes of time, state, and feature vector type, and outputs a recognition result; and 205 is a matching model memory that stores noise-free speech patterns learned from speech data uttered in a noise-free environment and a model representing transitions between feature vector types.

[0044] The speech recognition apparatus according to Embodiment 1 is configured as shown in the block diagram of FIG. 1. The corresponding speech recognition method comprises the following steps.
a. A spectrum calculating step of performing spectrum analysis on the noise-superimposed input speech to obtain a noise-superimposed speech spectrum time series.
b. An average spectrum calculating step of estimating the spectrum of the superimposed noise from the non-speech sections of the noise-superimposed speech spectrum time series obtained in the spectrum calculating step and obtaining it as a noise spectrum.
c. A noise removal spectrum group calculating step of obtaining a plurality of types of noise-removed speech spectrum time series by varying the scaling factor applied to the noise spectrum when the noise spectrum obtained in the average spectrum calculating step is subtracted from the noise-superimposed speech spectrum time series obtained in the spectrum calculating step.
d. A feature vector group calculating step of converting the plurality of types of noise-removed speech spectrum time series obtained in the noise removal spectrum group calculating step into a plurality of types of feature vector time series.
e. A three-dimensional matching step of matching the plurality of types of noise-removed speech feature vector time series obtained in the feature vector group calculating step against noise-free speech patterns learned from speech data uttered in a noise-free environment and a model representing transitions between feature vector types, in a three-dimensional space spanned by the three axes of time, state, and feature vector type, and obtaining the recognition result.

[0045] Next, the operation of Embodiment 1 with the above configuration is described. The operations of the spectrum calculating means 101 and the average spectrum calculating means 102 are the same as in the conventional example, so their description is omitted here. The noise removal spectrum group calculating means 201 subtracts the noise spectrum from each noise-superimposed speech spectrum in the time series using V types (a plurality of types) of subtraction coefficients α^(k) (1 ≤ k ≤ V), and obtains V types of noise-removed speech spectra S^(k)(ω). Here, the values of α^(k) are set in steps of 0.5 as follows.

[0046]

[Equation 10]

[0047] Here, S^(k)(ω) is the power of the k-th type of noise-removed speech spectrum at frequency ω, and X(ω) is the power of the noise-superimposed speech spectrum at frequency ω. In this way, V types of noise-removed speech spectrum time series S^(1)(ω), S^(2)(ω), ..., S^(V)(ω) are obtained, where S^(k)(ω) = (S_1^(k)(ω), S_2^(k)(ω), ..., S_T^(k)(ω)).
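The operation of the noise removal spectrum group calculating means 201 on a single frame can be sketched as follows. The image for equation (10) is not reproduced in this text, so the concrete coefficient values used here (0.5, 1.0, ..., in steps of 0.5 starting at 0.5), the zero floor, the value V = 4, and the function name are all assumptions for illustration only.

```python
from typing import List, Sequence


def noise_removed_spectrum_group(x: Sequence[float], n: Sequence[float],
                                 alphas: Sequence[float]) -> List[List[float]]:
    """Return one noise-removed spectrum S^(k)(w) per coefficient alpha^(k).

    Each spectrum is max(X(w) - alpha^(k) * N(w), 0); the zero floor is assumed.
    """
    return [[max(xw - alpha * nw, 0.0) for xw, nw in zip(x, n)]
            for alpha in alphas]


V = 4                                        # hypothetical number of types
alphas = [0.5 * k for k in range(1, V + 1)]  # assumed: 0.5, 1.0, 1.5, 2.0
noisy = [5.0, 3.0, 1.0]                      # hypothetical X(w) for one frame
noise = [2.0, 2.0, 2.0]                      # hypothetical N(w)
group = noise_removed_spectrum_group(noisy, noise, alphas)
```

Applying the same function to every frame of the time series yields the V spectrum time series S^(1)(ω), ..., S^(V)(ω) that are passed to feature extraction.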

[0048] As in the conventional example, the feature vector group calculating means 202 converts the V types of noise-removed speech spectrum time series S^(1)(ω), S^(2)(ω), ..., S^(V)(ω) output by the noise removal spectrum group calculating means 201 into V types of feature vector time series Y^(1), Y^(2), ..., Y^(V), where Y^(k) = (y_1^(k), y_2^(k), ..., y_T^(k)), that represent acoustic features for speech recognition, such as LPC cepstra.

[0049] The three-dimensional matching means 203 matches the V types of feature vector time series Y^(1), Y^(2), ..., Y^(V) output by the feature vector group calculating means 202 in a three-dimensional space spanned by the three axes of time, state, and feature vector type, and outputs the recognition candidate that gives the maximum likelihood as the recognition result.

[0050] Transitions between feature vector types are represented by the ergodic HMM shown in FIG. 2. In FIG. 2, c_kl is the transition probability from feature vector type k to feature vector type l, and the states are connected by null transitions that output no observation event. An ergodic HMM is used because Embodiment 1 places no constraints on transitions between feature vector types.
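The difference between an unconstrained (ergodic) type-transition model and the adjacent-only model mentioned in paragraph [0035] can be illustrated by the shape of the matrix c_kl. The sketch below is hypothetical: the uniform probabilities are placeholders (the source does not state how c_kl is set), and treating a self-transition as allowed in the adjacent-only model is an assumption.

```python
from typing import List


def ergodic_transitions(v: int) -> List[List[float]]:
    """c[k][l] for a model in which every type can follow every type."""
    return [[1.0 / v] * v for _ in range(v)]


def adjacent_transitions(v: int) -> List[List[float]]:
    """c[k][l] that is nonzero only for l in {k-1, k, k+1} (assumed reading)."""
    rows = []
    for k in range(v):
        allowed = [l for l in (k - 1, k, k + 1) if 0 <= l < v]
        rows.append([1.0 / len(allowed) if l in allowed else 0.0
                     for l in range(v)])
    return rows


c_ergodic = ergodic_transitions(3)
c_adjacent = adjacent_transitions(3)
```

With the adjacent-only matrix, the subtraction coefficient selected along a path can change by at most one step between consecutive frames.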

[0051] To find the single optimal sequence of state and feature-vector-type pairs (q, v) = ((q_1, v_1), (q_2, v_2), ..., (q_T, v_T)) with the maximum likelihood, a Viterbi search extended to three dimensions is executed, consisting of the following four steps.

[0052] STEP 1 (Initialization)

[0053]

[Equation 11]

[0054]

[Equation 12]

[0055] STEP 2 (Iteration)

[0056]

[Equation 13]

[0057]

[Equation 14]

[0058] STEP 3 (Termination)

[0059]

[Equation 15]

[0060]

[Equation 16]

[0061] STEP 4 (Backtracking)

[0062]

[Equation 17]

[0063] Here, δ_t(i, k) is the maximum likelihood, at time t, state i, and feature vector type k, along a single path in the three-dimensional space spanned by the three axes of time, state, and feature vector type, and is given by the following equation.

[0064]

[Equation 18]

[0065] In equations (11) to (17), Ψ_t(j, l) is a two-dimensional array that stores, for each time t, state j, and feature vector type l, the argument of the path that maximizes equation (18). b_i(y_t^(k)) is the output probability of feature vector y_t^(k) in state i, c_kl is the transition probability from feature vector type k to feature vector type l, and ρ_k is the probability that the feature vector type is k in the initial state.
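The equation images for (11) to (18) are not reproduced in this text. As a hedged sketch only, a recursion consistent with the symbol definitions in this section could take the following form. The state transition probability a_{ij}, the initial state probability \pi_i, the number of states N, and the model symbol \lambda are assumed notation, and the exact form in the original equations may differ.

```latex
\begin{align*}
\text{STEP 1:}\quad & \delta_1(i,k) = \pi_i\,\rho_k\,b_i\!\left(y_1^{(k)}\right),
  \qquad \Psi_1(i,k) = 0 \\
\text{STEP 2:}\quad & \delta_t(j,l) = \max_{1\le i\le N,\ 1\le k\le V}
  \left[\delta_{t-1}(i,k)\,a_{ij}\,c_{kl}\right] b_j\!\left(y_t^{(l)}\right) \\
& \Psi_t(j,l) = \operatorname*{arg\,max}_{1\le i\le N,\ 1\le k\le V}
  \left[\delta_{t-1}(i,k)\,a_{ij}\,c_{kl}\right] \\
\text{STEP 3:}\quad & P^{*} = \max_{i,k}\ \delta_T(i,k),
  \qquad (q_T, v_T) = \operatorname*{arg\,max}_{i,k}\ \delta_T(i,k) \\
\text{STEP 4:}\quad & (q_t, v_t) = \Psi_{t+1}(q_{t+1}, v_{t+1}),
  \qquad t = T-1, \ldots, 1 \\
\text{where}\quad & \delta_t(i,k) = \max_{q_1 \ldots q_{t-1},\ v_1 \ldots v_{t-1}}
  P\!\left(q_1 v_1 \cdots q_t = i,\ v_t = k,\ y_1 \cdots y_t \mid \lambda\right)
\end{align*}
```

Under a left-to-right state model, a_{ij} is nonzero only for j = i and j = i + 1, so the maximization in STEP 2 only has to consider predecessor states j and j - 1.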

[0066] FIG. 3 illustrates the three-dimensional Viterbi search when the state transitions of the matching speech pattern are represented by a left-to-right HMM and the transitions of the feature vector type are represented by an ergodic HMM.

[0067] FIG. 4 extracts the range from time t−1 to time t in FIG. 3. It shows that the maximum likelihood δ_t(j, l) at time t, state j, and feature vector type l is computed by selecting the path of maximum likelihood from among the maximum likelihoods δ_{t−1}(j, k) (1 ≤ k ≤ V) at time t−1, state j, and feature vector type k, and the maximum likelihoods δ_{t−1}(j−1, k) (1 ≤ k ≤ V) at time t−1, state j−1, and feature vector type k.
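The four-step search over time, state, and feature vector type can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it works in the log domain, the function and argument names (`viterbi_3d`, `log_b`, `log_a`, `log_c`, `log_pi`, `log_rho`, `log_or_neginf`) are hypothetical, the state transition and initial state probabilities are assumed inputs, and termination is taken over all states rather than a designated final state.

```python
import math

NEG_INF = float("-inf")


def log_or_neginf(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_3d(log_b, log_a, log_c, log_pi, log_rho):
    """Viterbi search over (time, state, feature vector type).

    log_b[t][i][k]: log output probability of type-k feature vector at time t in state i
    log_a[i][j]:    log state transition probability (assumed input)
    log_c[k][l]:    log transition probability from type k to type l
    log_pi[i]:      log initial state probability (assumed input)
    log_rho[k]:     log probability that the initial type is k
    Returns (best log-likelihood, [(state, type), ...] of length T).
    """
    T, N, V = len(log_b), len(log_pi), len(log_rho)
    delta = [[[NEG_INF] * V for _ in range(N)] for _ in range(T)]
    psi = [[[None] * V for _ in range(N)] for _ in range(T)]

    # STEP 1: initialization
    for i in range(N):
        for k in range(V):
            delta[0][i][k] = log_pi[i] + log_rho[k] + log_b[0][i][k]

    # STEP 2: iteration; keep the best predecessor (i, k) for every (j, l)
    for t in range(1, T):
        for j in range(N):
            for l in range(V):
                best, arg = NEG_INF, None
                for i in range(N):
                    for k in range(V):
                        score = delta[t - 1][i][k] + log_a[i][j] + log_c[k][l]
                        if score > best:
                            best, arg = score, (i, k)
                if arg is not None:
                    delta[t][j][l] = best + log_b[t][j][l]
                    psi[t][j][l] = arg

    # STEP 3: termination; pick the best final (state, type)
    best, last = NEG_INF, None
    for i in range(N):
        for k in range(V):
            if delta[T - 1][i][k] > best:
                best, last = delta[T - 1][i][k], (i, k)
    if last is None:
        raise ValueError("no path with nonzero likelihood")

    # STEP 4: backtracking through the stored predecessors
    path = [last]
    for t in range(T - 1, 0, -1):
        i, k = path[-1]
        path.append(psi[t][i][k])
    path.reverse()
    return best, path
```

The inner double loop is the selection shown in FIG. 4. With a left-to-right state model, `log_a[i][j]` is `-inf` except for `i == j` and `i == j - 1`, so only those two predecessor states contribute.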

[0068] The effects of Embodiment 1 are described below. A conventional speech recognition apparatus for noisy environments assumes that the noise spectrum estimated from the non-speech section is uniformly superimposed on the entire speech section, and uses a single value of the subtraction coefficient α, tuned to maximize recognition performance on evaluation data. However, when the distance between the noise source and the speech input terminal varies with time, the power of the noise spectrum superimposed on the speech at a given time differs from the power of the noise spectrum at the time of noise estimation. The noise spectrum is then subtracted too much or too little, and an accurate noise-removed speech spectrum cannot be obtained. As a result, a mismatch with the noise-free speech pattern arises and the recognition rate deteriorates.

[0069] In the document "Speech recognition under non-stationary noise by the parallel HMM method and spectral subtraction" (Ryuji Mine, IEICE Transactions (D-II), Vol. J-78-D-II, No. 7, pp. 1021-1027, 1995), the noise HMM is represented by an ergodic HMM, and matching is performed on the noise-removed speech feature vectors after spectral subtraction in a three-dimensional space of time, speech model state, and noise model state, which improves recognition performance in non-stationary noise environments. However, that document does not describe the value of the subtraction coefficient, and Embodiment 1 models transitions of the feature vector type rather than a noise model, so the two are distinct techniques.

[0070] In the speech recognition apparatus and method according to Embodiment 1, V feature vector candidates, computed using V subtraction coefficients α^(k), exist at each time t. Because the feature vector type k at each time t is selected so as to maximize the likelihood, subtracting too much or too little of the noise spectrum is prevented even when the distance between the noise source and the speech input terminal varies, and deterioration of the recognition rate is suppressed.

[0071] The speech recognition apparatus and method according to Embodiment 1 use, as the model representing transitions of the feature vector type, an ergodic HMM that places no restriction on those transitions and allows a transition to every type. Alternatively, as a model that does restrict the transitions, the HMM shown in FIG. 5 may be used, in which transitions are possible only between feature vector types whose subtraction coefficients α^(k) used in noise removal are adjacent in value. This makes it possible to appropriately model temporal changes in the superimposed noise power.
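The difference between the unrestricted and the restricted type-transition model can be illustrated by the shape of the transition matrix c_kl. The sketch below is only an assumption-laden illustration: the function names are hypothetical, the probabilities are set uniformly rather than learned, and the restricted model is assumed to allow staying in the same type as well as moving to a neighbouring type, which the text does not state explicitly.

```python
def ergodic_transitions(num_types):
    """Unrestricted model: every type can transition to every type.

    Uniform probabilities are an assumption made for illustration.
    """
    return [[1.0 / num_types] * num_types for _ in range(num_types)]


def adjacent_transitions(num_types, stay=0.5):
    """Restricted model: type k may only stay or move to k-1 or k+1.

    Types are assumed to be ordered by subtraction coefficient value,
    so neighbouring indices correspond to adjacent coefficient values.
    The self-transition and the value of `stay` are assumptions.
    """
    c = [[0.0] * num_types for _ in range(num_types)]
    for k in range(num_types):
        neighbours = [l for l in (k - 1, k + 1) if 0 <= l < num_types]
        c[k][k] = stay if neighbours else 1.0
        for l in neighbours:
            c[k][l] = (1.0 - stay) / len(neighbours)
    return c
```

With four types, `adjacent_transitions(4)` forbids a jump from the smallest coefficient directly to the largest, so the selected coefficient can only change gradually from frame to frame.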

[0072] Embodiment 2. FIG. 6 is a block diagram showing a configuration for explaining a speech recognition apparatus and method according to Embodiment 2 of the present invention. In FIG. 6, parts identical to those of Embodiment 1 shown in FIG. 1 are given the same reference numerals, and their description is omitted. As a new reference numeral, 204 denotes a noise spectrum memory that stores the noise spectrum output from the average spectrum calculation means 102 and a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering method. The noise-removed spectrum calculation means 201 obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series output from the spectrum calculation means 101, by combining a plurality of scaling factors for the noise vector with the plurality of types of noise spectrum patterns stored in the noise spectrum memory 204.

[0073] The speech recognition apparatus according to Embodiment 2 is configured as shown in the block diagram of FIG. 6 described above. The steps of the corresponding speech recognition method differ from those of Embodiment 1 only in that the noise-removed spectrum calculation step obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series obtained in the spectrum calculation step, by combining a plurality of scaling factors for the noise vector with a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering method.

[0074] Next, the operation of Embodiment 2 with the above configuration is described. The operations of the spectrum calculation means 101 and the average spectrum calculation means 102 are the same as in the conventional example, so their description is omitted here. The noise spectrum memory 204 stores the noise spectrum output by the average spectrum calculation means 102 and V₂ representative noise spectrum patterns learned in advance from a large amount of noise data using a clustering method.

[0075] The noise-removed spectrum group calculation means 201 combines V₁ subtraction coefficients α^(k1) (1 ≤ k₁ ≤ V₁) with V₂ noise spectrum patterns N_k2(ω) (1 ≤ k₂ ≤ V₂) and obtains, from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series, a total of V = V₁V₂ noise-removed speech spectra S^(k)(ω) (1 ≤ k ≤ V). Here, the values of α^(k1) are set in steps of 0.5 as follows.

[0076]

[Equation 19]

[0077] Here, S^(k)(ω) is the power at frequency ω of the k-th noise-removed speech spectrum, X(ω) is the power at frequency ω of the noise-superimposed speech spectrum, and N(ω) is the power at frequency ω of the estimated noise spectrum. In this way, V noise-removed speech spectrum time series S^(1)(ω), S^(2)(ω), ..., S^(V)(ω) are obtained, where S^(k)(ω) = (S₁^(k)(ω), S₂^(k)(ω), ..., S_T^(k)(ω)).
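As a rough illustration of how the V = V₁V₂ candidates for one frame could be produced, the following Python sketch applies every combination of subtraction coefficient and noise spectrum pattern to a noisy power spectrum. Equation 19 is not reproduced in this text, so the subtraction form X(ω) − α·N(ω), the flooring at zero, the index ordering, and all names (`noise_removed_spectra`, `alphas`, `noise_patterns`, `floor`) are assumptions; the coefficient values in the usage note are likewise hypothetical examples spaced 0.5 apart.

```python
def noise_removed_spectra(x, alphas, noise_patterns, floor=0.0):
    """Return V1 * V2 noise-removed power spectra for one frame.

    x:              noisy power spectrum, one value per frequency bin
    alphas:         V1 subtraction coefficients
    noise_patterns: V2 noise power spectra, each the same length as x
    floor:          lower bound applied after subtraction (assumption)

    The candidate at index k1 * V2 + k2 uses alphas[k1] and noise_patterns[k2].
    """
    candidates = []
    for alpha in alphas:
        for noise in noise_patterns:
            # Subtract the scaled noise pattern bin by bin, clamping at the floor.
            candidates.append(
                [max(xw - alpha * nw, floor) for xw, nw in zip(x, noise)]
            )
    return candidates
```

For example, `noise_removed_spectra([4.0, 2.0], [0.5, 1.0, 1.5], [[2.0, 2.0], [1.0, 4.0]])` yields six candidate spectra for the frame, one per combination, each of which would then be converted to a feature vector and passed to the three-dimensional matching.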

[0078] The operations of the feature vector group calculation means 202 and the three-dimensional matching means 203 are the same as in Embodiment 1, so their description is omitted here.

[0079] The effects of the speech recognition apparatus and method according to Embodiment 2 are described below. A conventional speech recognition apparatus for noisy environments assumes that the noise spectrum estimated from the non-speech section is uniformly superimposed on the entire speech section. However, when the pattern of the spectrum superimposed on the speech varies with time, as in a non-stationary noise environment such as the inside of a moving automobile, the pattern of the noise spectrum superimposed on the speech at a given time differs from the pattern of the noise spectrum at the time of average spectrum calculation, so an accurate noise-removed speech spectrum cannot be obtained. As a result, a mismatch with the noise-free speech pattern arises and the recognition rate deteriorates.

[0080] The speech recognition apparatus and method of Embodiment 1 can cope with variations in spectral power, but because they use only a single noise spectrum pattern, they cannot cope with variations in the spectral pattern. In the speech recognition apparatus and method according to Embodiment 2, V = V₁V₂ feature vector candidates, computed using V₁ subtraction coefficients α^(k1) and V₂ noise spectrum patterns N_k2(ω), exist at each time t. Because the feature vector type k at each time t is selected so as to maximize the likelihood, deterioration of the recognition rate is suppressed even when the distance between the noise source and the speech input terminal, or the noise spectrum pattern superimposed on the speech, varies.

[0081] The speech recognition apparatus and method according to Embodiment 2 use, as the model representing transitions of the feature vector type, an ergodic HMM that places no restriction on those transitions and allows a transition to every type. Alternatively, as a model that does restrict the transitions, the HMM shown in FIG. 7 may be used, in which transitions are possible only between feature vector types whose noise spectrum patterns N_k2(ω) used in noise removal are similar, or whose subtraction coefficients α^(k) used in noise removal are adjacent in value. This makes it possible to appropriately model temporal changes in the noise spectrum and in the superimposed noise power.

[0082]

[Effects of the Invention] As described above, according to the present invention, a plurality of types of feature vector candidates, computed using a plurality of subtraction coefficients, exist at each time, and the feature vector type at each time is selected so as to maximize the likelihood. Subtracting too much or too little of the noise spectrum is therefore prevented even when the distance between the noise source and the speech input terminal varies, deterioration of the recognition rate is suppressed, and the loss of recognition performance caused by changes in the distance between the input terminal of the speech signal and the noise source is reduced.

[0083] Furthermore, deterioration of the recognition rate is suppressed even when the noise spectrum pattern superimposed on the speech varies, so the loss of recognition performance caused by variations in environmental noise is reduced.

[0084] Furthermore, using a model that places no restriction on transitions of the feature vector type as the model representing those transitions suppresses deterioration of the recognition rate.

[0085] Furthermore, using an ergodic HMM that allows a transition to every type as the model that places no restriction on transitions of the feature vector type suppresses deterioration of the recognition rate.

[0086] Furthermore, using a model that restricts transitions of the feature vector type as the model representing those transitions makes it possible to appropriately model temporal changes in the superimposed noise power.

[0087] Furthermore, using an HMM in which transitions are possible only between adjacent feature vector types as the model that restricts transitions of the feature vector type makes it possible to appropriately model temporal changes in the superimposed noise power.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a configuration for explaining a speech recognition apparatus and method according to Embodiment 1 of the present invention.

FIG. 2 is an explanatory diagram of the ergodic HMM representing transitions of the feature vector type, for explaining the speech recognition apparatus and method according to Embodiment 1 of the present invention.

FIG. 3 is an explanatory diagram showing the three-dimensional Viterbi search when the state transitions of the matching speech pattern are represented by a left-to-right HMM and the transitions of the feature vector type are represented by an ergodic HMM, for explaining the speech recognition apparatus and method according to Embodiment 1 of the present invention.

FIG. 4 is an explanatory diagram in which the range from time t−1 to time t in FIG. 3 is extracted.

FIG. 5 is an explanatory diagram of an HMM in which transitions are possible only between adjacent feature vector types, for explaining the speech recognition apparatus and method according to Embodiment 1 of the present invention.

FIG. 6 is a block diagram showing a configuration for explaining a speech recognition apparatus and method according to Embodiment 2 of the present invention.

FIG. 7 is an explanatory diagram of an HMM in which transitions are possible only between adjacent feature vector types, for explaining the speech recognition apparatus and method according to Embodiment 2 of the present invention.

FIG. 8 is a block diagram showing the configuration of a conventional speech recognition apparatus.

FIG. 9 is an explanatory diagram in which the state transitions of the matching speech pattern of the conventional example are represented by a left-to-right HMM with constrained state transitions.

FIG. 10 is an explanatory diagram showing the Viterbi search when the state transitions of the matching speech pattern are represented by a left-to-right HMM.

[Explanation of symbols]

101: spectrum calculation means; 102: average spectrum calculation means; 201: noise-removed spectrum group calculation means; 202: feature vector group calculation means; 203: three-dimensional matching means; 204: noise spectrum memory; 205: matching model memory.

Claims

[Claims]

1. A speech recognition apparatus that performs spectrum analysis on a noise-superimposed input speech signal including a non-speech section to obtain spectral feature parameters and performs speech recognition processing, the apparatus comprising: spectrum calculation means for performing spectrum analysis on the noise-superimposed input speech and outputting a noise-superimposed speech spectrum time series; average spectrum calculation means for estimating the spectrum of the superimposed noise from a non-speech section in the noise-superimposed speech spectrum time series output from the spectrum calculation means and outputting it as a noise spectrum; noise-removed spectrum group calculation means for outputting a plurality of types of noise-removed speech spectrum time series by varying the scaling factor applied to the noise spectrum when subtracting the noise spectrum output from the average spectrum calculation means from the noise-superimposed speech spectrum time series output from the spectrum calculation means; feature vector group calculation means for converting the plurality of types of noise-removed speech spectrum time series output from the noise-removed spectrum group calculation means into a plurality of types of feature vector time series; a matching model memory storing a noise-free speech pattern learned using speech data uttered in a noise-free environment and a model representing transitions of the feature vector type; and three-dimensional matching means for matching the plurality of types of noise-removed speech feature vector time series output from the feature vector group calculation means, in a three-dimensional space formed by the three axes of time, state, and feature vector type, against the noise-free speech pattern and the model representing transitions of the feature vector type stored in the matching model memory, and outputting the recognition result.

2. The speech recognition apparatus according to claim 1, further comprising a noise spectrum memory that stores the noise spectrum output from the average spectrum calculation means and a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering method, wherein the noise-removed spectrum calculation means obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series output from the spectrum calculation means, by combining a plurality of scaling factors for the noise vector with the plurality of types of noise spectrum patterns stored in the noise spectrum memory.

3. The speech recognition apparatus according to claim 1, wherein the matching model memory stores, as the model representing transitions of the feature vector type, a model that places no restriction on transitions of the feature vector type.

4. The speech recognition apparatus according to claim 3, wherein the matching model memory stores, as the model that places no restriction on transitions of the feature vector type, an ergodic hidden Markov model that allows a transition to every type.

5. The speech recognition apparatus according to claim 1, wherein the matching model memory stores, as the model representing transitions of the feature vector type, a model that restricts transitions of the feature vector type.

6. The speech recognition apparatus according to claim 5, wherein the matching model memory stores, as the model that restricts transitions of the feature vector type, a hidden Markov model in which transitions are possible only between adjacent feature vector types.

7. A speech recognition method that performs spectrum analysis on a noise-superimposed input speech signal including a non-speech section to obtain spectral feature parameters and performs speech recognition processing, the method comprising: a spectrum calculation step of performing spectrum analysis on the noise-superimposed input speech to obtain a noise-superimposed speech spectrum time series; an average spectrum calculation step of estimating the spectrum of the superimposed noise from a non-speech section in the noise-superimposed speech spectrum time series obtained in the spectrum calculation step and obtaining it as a noise spectrum; a noise-removed spectrum group calculation step of obtaining a plurality of types of noise-removed speech spectrum time series by varying the scaling factor applied to the noise spectrum when subtracting the noise spectrum obtained in the average spectrum calculation step from the noise-superimposed speech spectrum time series; a feature vector group calculation step of converting the plurality of types of noise-removed speech spectrum time series obtained in the noise-removed spectrum group calculation step into a plurality of types of feature vector time series; and a three-dimensional matching step of matching the plurality of types of noise-removed speech feature vector time series obtained in the feature vector group calculation step, in a three-dimensional space formed by the three axes of time, state, and feature vector type, against a noise-free speech pattern learned using speech data uttered in a noise-free environment and a model representing transitions of the feature vector type, and obtaining the recognition result.

8. The speech recognition method according to claim 7, wherein the noise-removed spectrum calculation step obtains a plurality of types of noise-removed speech spectra from each noise-superimposed speech spectrum of the noise-superimposed speech spectrum time series obtained in the spectrum calculation step, by combining a plurality of scaling factors for the noise vector with a plurality of types of noise spectrum patterns learned in advance from a large amount of noise data using a clustering method.

9. The speech recognition method according to claim 7, wherein the three-dimensional matching step uses, as the model representing transitions of the feature vector type, a model that places no restriction on transitions of the feature vector type.

10. The speech recognition method according to claim 9, wherein the three-dimensional matching step uses, as the model that places no restriction on transitions of the feature vector type, an ergodic hidden Markov model that allows a transition to every type.

11. The speech recognition method according to claim 7, wherein the three-dimensional matching step uses, as the model representing transitions of the feature vector type, a model that restricts transitions of the feature vector type.

12. The speech recognition method according to claim 11, wherein the three-dimensional matching step uses, as the model that restricts transitions of the feature vector type, a hidden Markov model in which transitions are possible only between adjacent feature vector types.