JP3046029B2

JP3046029B2 - Apparatus and method for selectively adding noise to a template used in a speech recognition system

Info

Publication number: JP3046029B2
Application number: JP1048418A
Authority: JP
Inventors: ジャック・エリオット・ポーター
Original assignee: インターナショナル・スタンダード・エレクトリック・コーポレイション
Priority date: 1988-02-29
Filing date: 1989-02-28
Publication date: 2000-05-29
Anticipated expiration: 2015-05-29
Also published as: GB2216320A; GB2216320B; JPH01255000A; FR2627887B1; GB8902475D0; FR2627887A1

Description

[Industrial field of application]

The present invention relates generally to speech recognition systems, and more particularly to a speech recognition system that uses templates, each of which is generated by the selective addition of noise so as to increase the probability of correct speech recognition.

(Prior art)

Speech recognition methods have advanced considerably in recent years and are used in many forms. The idea behind speech recognition is that information obtained from an utterance is used directly to drive a computer or other device. In the prior art, the key element in recognizing the information in an utterance is the distribution of energy over frequency. Formant frequencies are frequencies at which energy peaks are particularly prominent. They are acoustic resonances of the oral cavity and are controlled by the tongue, jaw, and lips. For a listener, determining the first two or three formant frequencies is usually sufficient to identify a vowel. Prior art machine recognition therefore includes some means of determining the amplitude or power spectrum of the incoming speech signal. The first step in speech recognition is preprocessing, which converts the speech signal into recognition features or parameters and reduces the data flow to a manageable rate. One means of performing this step is to measure the zero-crossing rate of the signal in several broad frequency bands, which gives an estimate of the formant frequency in each band.
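The zero-crossing idea in the preceding paragraph can be illustrated with a minimal sketch. It assumes the signal has already been band-limited so that one resonance dominates; the function names and the synthetic 500 Hz test tone are illustrative and do not come from the patent.

```python
import math

def zero_crossing_rate(samples, sample_rate):
    """Return the number of sign changes per second in a list of samples."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)
    )
    duration = (len(samples) - 1) / sample_rate
    return crossings / duration

def dominant_frequency(samples, sample_rate):
    """A sinusoid crosses zero twice per cycle, so frequency ~ rate / 2."""
    return zero_crossing_rate(samples, sample_rate) / 2.0

# Synthetic stand-in for one band of speech: a 500 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * n / sample_rate + 0.3) for n in range(800)]
estimate = dominant_frequency(tone, sample_rate)
```

On real speech the estimate is only as good as the band splitting that precedes it, which is why the text speaks of an estimate per band rather than a single value.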

Another means is to represent the speech signal by the parameters of a filter whose spectrum best matches that of the input speech signal. This method is known as linear predictive coding (LPC). LPC is notable for its efficiency, accuracy, and simplicity. The recognition features extracted from speech are typically averaged over 10 to 40 milliseconds and sampled 50 to 100 times per second.
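As a rough illustration of LPC, the sketch below fits predictor coefficients to one frame using the autocorrelation method and the Levinson-Durbin recursion. This is one common textbook formulation, chosen here as an assumption; the patent does not say which LPC procedure the prior art systems used, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of a single frame of samples."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, error) where x[n] is predicted as
    sum(a[k] * x[n - 1 - k] for k in range(order)).
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        # Reflection coefficient for the step from order i to order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        error *= (1.0 - k * k)
    return a, error

# A decaying exponential obeys x[n] = 0.9 * x[n - 1], so the first
# coefficient should come out close to 0.9 and the second close to 0.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```

The filter described by `coeffs` is the one whose spectrum best matches the frame in the least-squares sense, which is the property the paragraph above relies on.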

The parameters used to represent and recognize speech are directly or indirectly related to the amplitude or power spectrum. Formant frequencies and linear prediction filter coefficients are examples of parameters that are indirectly related to the speech spectrum. Other examples are cepstral parameters and log area ratio parameters.

[Problems to be solved by the invention]

These and many other speech parameters used for recognition can be derived from spectral parameters. The present invention concerns the selective addition of noise to the spectral parameters from which the speech recognition parameters are generated. The invention applies to all forms of speech recognition that use speech parameters obtained, or obtainable, from spectral parameters.

In any case, many common speech recognition methods of the past perform matching against templates. In such methods a word is usually represented as a sequence of parameters. Recognition is performed by comparing an unknown template token with stored templates using a predefined similarity measure. In many cases a time alignment algorithm is used to account for variability in the rate at which words are produced. Template matching systems can therefore achieve high performance with small sets of acoustically distinct words. Some researchers question the ability of such systems to ultimately make fine phonetic distinctions across a wide range of speakers. See the paper by J. S. Perkell and D. H. Klatt, "Achieving Fine Phonetic Distinctions: Templates versus Features" ("Variability and Invariance in Speech Processes", Hillsdale, New Jersey, Lawrence Erlbaum Associates, 1985, issued by R. A. Cole, R. M. Stern, and M. J. Lasry).

As an alternative, many have therefore proposed feature-based approaches to speech recognition, in which a set of acoustic features that capture the phonetically relevant information in the speech signal is first identified. With this knowledge, algorithms can be constructed to extract the features from the speech signal. A classifier is then used to combine the features and arrive at a recognition decision. It has been argued that feature-based systems are better at making fine phonetic distinctions than template matching techniques and are therefore superior. In any case, template matching is a widely used method of pattern recognition in which an unknown is compared with prototypes to determine which one it most closely resembles.
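The comparison of an unknown with prototypes can be sketched in a few lines. The squared Euclidean distance, the two-element feature vectors, and the labels below are assumptions made for illustration only; the text leaves the distance measure open.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_prototype(unknown, prototypes):
    """Return the label of the stored prototype closest to the unknown."""
    return min(
        prototypes,
        key=lambda label: squared_distance(unknown, prototypes[label]),
    )

# Hypothetical prototypes: rough first and second formant values in Hz.
prototypes = {
    "ah": [700.0, 1200.0],
    "ee": [300.0, 2300.0],
}
decision = nearest_prototype([320.0, 2200.0], prototypes)
```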

By this definition, feature-based speech recognition that uses a multivariate Gaussian model for classification also performs template matching. In that case the statistical classifier simply uses feature vectors as the patterns. Likewise, if spectral amplitudes and LPC coefficients are regarded as features, spectrum-based techniques are also feature-based methods.

In practice, template matching and feature-based systems represent different points along a continuum. One of the most significant problems with the template matching approach is the difficulty of defining a distance measure that is sensitive enough to make fine phonetic distinctions yet insensitive to irrelevant spectral variation.

One manifestation of this problem is the excessive weight given to unimportant frame-to-frame variations in the spectrum of a long steady-state vowel. Aware of this problem, the prior art has proposed a number of distance metrics intended to be sensitive to phonetic distinctions and insensitive to irrelevant acoustic differences. See, for example, the paper by D. H. Klatt, "Prediction of Perceived Phonetic Distance from Critical-Band Spectra", in the Proceedings of ICASSP-82 (IEEE Catalog No. CH1746-7, pp. 1278-1281, 1982).

For a better understanding of speech communication systems, reference is made to the Proceedings of the IEEE (November 1985, Vol. 73, No. 11, pp. 1537-1696). This issue presents a variety of papers on man-machine speech communication systems and gives a broader view of the particular problems involved. As will be understood from it, an important aspect of any speech recognition system is how well it performs its task, that is, its ability to recognize speech in all types of environment.

As noted above, many speech recognition systems use templates. In such a system the utterance is converted into a sequence of parameters and stored in a computer. The sound waves travel from the speaker's mouth through a microphone to an analog-to-digital converter, where they are filtered and digitized together with any background noise that may be present. The digitized signal is then filtered further and converted into recognition parameters, and in this form it is compared with the stored speech templates to select the most likely of the words spoken. For a further example of such a method, see the article "Performing Speech Recognition" by T. Walch in IEEE Spectrum (April 1977, Vol. 14, No. 4, pp. 55-57).

As that article shows, the applications of speech recognition systems are steadily expanding and, as it also points out, many models are already available for various applications. The formation of templates is likewise well known in the prior art. Such templates are used in many different types of speech recognition system. One example is the keyword recognition system described by J. S. Bridle in the paper "An Efficient Elastic-Template Method for Detecting Given Words in Running Speech" (British Acoustical Society Spring Meeting, April 1973, pp. 1-4). In that paper the author discusses deriving elastic templates from parametric representations of spoken examples of the keywords to be detected. A similar parametric representation of the incoming speech is compared continuously with these templates to measure the similarity between the speech and the keywords from which the templates were derived.

If a segment of the incoming speech is sufficiently similar to the corresponding template, the recognizer decides that the word has been spoken. The word templates are called "elastic" because they can be stretched and compressed in time to accommodate variations in speaking rate and in the pronunciation of words.
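The stretching and compressing of a template in time is commonly implemented with dynamic time warping (DTW). The sketch below is a generic DTW, offered as an assumption about how such elastic matching might look rather than as the method of the cited paper; the frame distance and the toy one-dimensional frames are illustrative.

```python
def frame_distance(u, v):
    """City-block distance between two parameter frames."""
    return sum(abs(a - b) for a, b in zip(u, v))

def dtw_distance(template, utterance):
    """Accumulated distance along the best time alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(template), len(utterance)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(template[i - 1], utterance[j - 1])
            # A frame may repeat (stretch), be skipped over (compress),
            # or advance in step with the other sequence.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]

template = [[1.0], [2.0], [3.0]]
slow_version = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]  # same word, spoken slowly
other_word = [[3.0], [2.0], [1.0]]

same_word_cost = dtw_distance(template, slow_version)
other_word_cost = dtw_distance(template, other_word)
```

A recognizer of this kind would declare the word spoken when the accumulated distance falls below a threshold, which corresponds to the "sufficiently similar" decision described above.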

Keyword recognition is similar to conventional speech recognition. In the former, templates are stored only for the "key" words that are to be recognized within an arbitrary word or sound context, whereas in the latter, templates are stored for all of the speech that is expected to be spoken. All such systems, whether keyword recognition systems or conventional template-based speech recognition systems, run into the same problem: the system is unable to recognize words uttered by different individuals, or by the same individual under different conditions.

It is therefore an object of the present invention to provide an improved apparatus and method for automatic speech recognition systems.

It is a further object of the present invention to provide a speech recognition system that adapts automatically to a noisy environment.

[Means for solving the problem]

As will be seen from this specification, the performance of many speech recognition systems degrades in the presence of noise. This is particularly marked when the templates were derived from speech with little or no noise, or when noise of a different character is present at the time recognition is performed. Conventional methods of reducing this difficulty require new templates to be generated in the presence of the new noise, which in turn requires new speech and noise to be collected. In the system of the present invention, noise is added to the templates analytically. This improves the probability of recognition and substantially improves the performance of the system, without any need to collect new speech in order to generate the templates.

The system of the present invention is a speech recognition system of the type comprising: a storage device that stores initial templates representing spectral values of recognizable speech in a noise-free condition; a spectrum analyzer that outputs spectral values of an utterance of an input signal representing speech in the presence of noise; an apparatus for generating operating templates; and a recognition module that compares the operating templates with the spectral values output by the spectrum analyzer and produces an output when a good match indicates that recognized speech is present in the utterance. The apparatus for generating the operating templates comprises first means, coupled to the spectrum analyzer, for outputting an estimated noise signal representing the noise in the input signal, and second means, coupled to the first means and responsive to the estimated noise signal, for generating operating templates by modifying the initial templates on the basis of the estimated noise signal. The first means comprises speech and noise level tracking means made up of a speech tracker and a noise tracker. The speech tracker detects the speech signal in the presence of noise during the utterance and outputs a first signal, a scalar speech level value related to the speech signal, that indicates the average power of the speech signal in the presence of noise. The noise tracker detects the noise during the utterance and outputs a second signal, a vector of spectral values giving an estimate of the noise, that indicates the average power of the noise over a predetermined time period of the utterance. The second means generates the operating templates by adjusting the speech level of the initial templates on the basis of the first signal and by adding the spectral values of the noise estimate to the initial templates on the basis of the second signal. The operating templates are characterized in that, so as to improve the speech recognition performance of the recognition module, they contain the estimated noise of the utterance of the input signal and have the same signal-to-noise ratio as that utterance.
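The two tracker outputs described above, a scalar speech level and a vector of noise spectral values, can be sketched as follows. The text at this point does not say how speech frames are told apart from noise-only frames, so the fixed total-power threshold used here is purely an assumption, as are the function and variable names.

```python
def track_levels(frames, threshold):
    """Estimate (speech_level, noise_vector) from frames of channel powers.

    frames    -- list of frames, each a list of per-channel power values
    threshold -- ASSUMED decision rule: frames whose total power reaches
                 this value are treated as speech, the rest as noise only
    """
    speech = [f for f in frames if sum(f) >= threshold]
    noise = [f for f in frames if sum(f) < threshold]
    if not speech or not noise:
        raise ValueError("need at least one speech frame and one noise frame")
    channels = len(frames[0])
    # Second signal: average noise power in each spectral channel.
    noise_vector = [sum(f[c] for f in noise) / len(noise) for c in range(channels)]
    # First signal: scalar average power of speech in the presence of noise.
    speech_level = sum(sum(f) for f in speech) / len(speech)
    return speech_level, noise_vector

frames = [[1.0, 1.0], [1.0, 3.0], [10.0, 20.0], [30.0, 20.0]]
speech_level, noise_vector = track_levels(frames, threshold=10.0)
```

A practical tracker would more likely update these estimates continuously as frames arrive; batch averaging is used here only to keep the sketch short.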

The method of the present invention is a method of forming operating templates for use in a speech recognition system that recognizes speech in the presence of noise during an utterance of an input signal by comparing spectral values of the input signal with the operating templates. In the method, an estimated noise signal is output by detecting the speech signal in the presence of noise in the input signal and outputting a first signal, a scalar speech level value related to the speech signal, that indicates the average power of the speech signal in the presence of noise, and by detecting the noise during the utterance and outputting a second signal, a vector of spectral values giving an estimate of the noise, that indicates the average power of the noise over a predetermined time period of the utterance. The method is characterized by a step of modifying initial templates, which represent recognizable speech in the absence of noise, by adjusting the speech level of the initial templates on the basis of the first signal and adding the spectral values of the noise estimate to the initial templates on the basis of the second signal. The operating templates so formed contain the estimated noise of the utterance of the input signal and have the same signal-to-noise ratio as that utterance, which yields improved speech recognition.
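The modification step can be sketched in the power-spectrum domain, where independent speech and noise powers add. The sketch assumes each template frame is a list of per-channel powers, and it assumes the clean speech power can be approximated as the measured speech level minus the total noise power; that particular gain rule is an illustration, not a formula taken from this text.

```python
def make_operating_template(initial_template, speech_level, noise_vector):
    """Scale a noise-free template to the input speech level, then add noise.

    initial_template -- list of frames of per-channel powers (noise-free)
    speech_level     -- scalar average power of speech plus noise (first signal)
    noise_vector     -- per-channel average noise power (second signal)
    """
    noise_total = sum(noise_vector)
    template_level = (
        sum(sum(frame) for frame in initial_template) / len(initial_template)
    )
    # ASSUMPTION: clean speech power = measured level minus noise power.
    clean_level = max(speech_level - noise_total, 0.0)
    gain = clean_level / template_level
    return [
        [gain * power + noise for power, noise in zip(frame, noise_vector)]
        for frame in initial_template
    ]

initial = [[2.0, 2.0], [4.0, 8.0]]
operating = make_operating_template(initial, speech_level=19.0, noise_vector=[1.0, 2.0])
average_level = sum(sum(frame) for frame in operating) / len(operating)
```

With these numbers the operating template ends up with the same average level as the measured input (19.0) and carries the same noise floor, which is the sense in which it shares the signal-to-noise ratio of the utterance.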

〔Example〕

As will be understood, the present invention applies to all recognition systems that use parameters that are spectral in nature or that are derived from parameters that are spectral in nature. In the latter case the templates must be stored in two forms: spectral templates, for the analytic addition of noise, and operating templates.

Referring to FIG. 1A, there is shown a block diagram of a speech recognition system according to the present invention that uses recognition parameters derived from the spectrum.

A microphone 10 is shown, into which a speaker using the system speaks. The microphone 10 converts sound waves into an electrical signal, which is amplified by an amplifier 11. The output of the amplifier 11 is coupled to a spectrum analyzer 12. The spectrum analyzer 12 is a wideband or narrowband spectrum analyzer capable of short-term analysis. The function and construction of spectrum analyzers are well known, and they can be implemented in many ways.

The spectrum analyzer 12 divides the speech into short frames and provides a parametric representation of each frame at its output. The particular type of acoustic analysis performed by the spectrum analyzer 12 is not critical to the invention, and many known acoustic or spectrum analyzers can be used. Examples are described in U.S. Patent Application No. 439018 (filed November 3, 1982, G. Vensko et al.) and No. 473422 (filed March 9, 1983, G. Vensko et al.). Both applications are assigned to ITT Corporation, the assignee of the present invention, and are incorporated herein by reference.

U.S. Patent Application No. 655958 (filed September 28, 1984, inventors A. L. Higgins et al., entitled "Keyword Recognition System and Method Using Template-Concatenation Model") is also referenced.

The spectrum analyzer 12 includes a 14-channel bandpass filter array, and the frame size used is 20 milliseconds or more. These spectral parameters are processed as shown in FIG. 1A. As illustrated, the output of the spectrum analyzer 12 is coupled to a switch 13, which can be operated in a recognize mode, a form-template mode, or a modify-template mode.
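The division of speech into frames can be sketched as below. Only the 20 ms frame size comes from the text; the 8 kHz sampling rate, the non-overlapping framing, and the function names are assumptions, and the 14-channel filter bank itself is not modeled.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

# 50 ms of a constant-amplitude signal at an assumed 8 kHz sampling rate.
signal = [0.5] * 400
frames = split_into_frames(signal, sample_rate=8000)
powers = [frame_power(f) for f in frames]
```

In the system described, each frame would yield 14 values (one per bandpass channel) rather than the single power computed here.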

When the switch 13 is placed in the form-template mode, the output of the spectrum analyzer 12 is coupled to a module 14, labeled "templates in spectral form". The purpose of module 14 is to assist in forming templates from the output of the spectrum analyzer 12. The templates formed in module 14 are templates in spectral form, and many methods of forming such templates are well known. In the form-template mode, the output of the spectrum analyzer 12 is processed by module 14, which outputs templates for the utterances made by the speaker through the microphone 10. The speaker is prompted to speak the words to be recognized, and templates representing the spoken words are generated. These templates are used by module 15, which derives recognition parameters from the spectral-form templates to generate the final templates for low-noise or noise-free conditions, as indicated by module 16. The noise-free templates are then stored, as indicated by module 16; they represent particular utterances, such as words or phrases, spoken by a particular speaker.

The stored templates are coupled through a switch 100 to a processor 160, which executes the recognition algorithm. In the recognize mode, the processor 160 thus compares unknown speech with the templates generated under noise-free conditions and stored in module 16. As shown in FIG. 1A, in the form-template mode templates in spectral form are produced in order to obtain template parameters, and those template parameters are then used to form the templates for noise-free or low-noise conditions. As explained later, the processor 160 can operate on the templates stored in module 16 for low-noise or noise-free conditions. The function of the processor 160 is also well known; it produces a match on the basis of various distance measures or other algorithms. When such a match is made, the system indicates that this is the correct word, and that word or sound becomes the output of the system.

When the switch 13 is placed in the recognize mode, it couples the output of the spectrum analyzer 12 to a derive-parameters module 161. Module 161 derives from the spectrum analyzer output the parameters that are compared with the stored templates, for example those stored in module 16 as described above.

As shown in FIG. 1A, the switch 13 can also be set to a center position. The center position is the modify-template mode, in which the output of the spectrum analyzer 12 is fed to an estimate-noise-statistics module 162. The function of module 162 is to analyze or process the noise so as to estimate its statistics. This is a principal feature of the invention: noise is selectively added to form templates, and speech recognition performed in the presence of that noise is thereby improved.

The function of the estimate-noise-statistics module 162, explained further below, is thus to modify the spectral templates, which is done in a module 164 that is coupled to and receives information from module 14. In module 165, recognition parameters are derived from the output of module 164, and these parameters are used to form the templates for use in noisy conditions or conditions with low-level noise, as indicated by module 166. By switching the switch 100, the system of FIG. 1A can therefore perform recognition with either the noisy-condition templates or the templates for very low-noise or noise-free conditions.

As briefly indicated above, in the recognize mode the spectral parameter output of the spectrum analyzer 12 is applied through the derive-parameters module 161 to the input of the processor 160. The processor 160 executes a conventional algorithm, which again is not critical to the invention. It determines the sequence of stored templates that provides the best match to the incoming speech to be recognized. The output of the processor is thus a string of template labels, each label representing one template in the best-matching template sequence.

For example, each template is assigned a number and a label, and the label may be a multi-bit representation of that number. This output is supplied to a template search system within the processor 160 which, in the case of a multi-bit representation, may be a comparator with storage for the template labels. The processor 160 thus compares each incoming template label with the stored templates. The processor 160, acting as this subsystem, then indicates that a particular word or phrase has been spoken and which word or phrase it was.

In either the form-template mode or the modify-template mode, the user speaks various words, and recognition parameters are derived from the spectral output of the spectrum analyzer 12. In the modify-template mode, the system generates the various templates used by the system in the recognize mode; these templates are modified by the selective addition of noise by the estimate-noise-statistics module 162, as described above. This selective addition of noise results in more reliable system operation, as explained further below.

Referring to FIG. 1B, there is shown a recognition system that uses recognition parameters that are spectral in nature. In FIG. 1B, components having the same function are denoted by the same reference numerals as in FIG. 1A. As shown, the microphone 10 is coupled to the input of the amplifier 11, whose output is coupled to the input of the spectrum analyzer 12. The output of the spectrum analyzer 12 is again coupled to the switch 13, which can be operated in the form-template mode, the modify-template mode, or the recognize mode.

As shown in FIG. 1B, in the form-template mode a module 170 forms the templates for low-noise or noise-free conditions. Module 170 forms templates that directly provide recognition parameters that are spectral in nature. The templates so formed are stored, and module 170 is coupled to a module 171. Module 171 modifies the spectral templates obtained from module 170 under the control of an estimate-noise-statistics generator 172, which functions in essentially the same way as the noise module 162. The output of the modify-spectral-templates module 171 is coupled to a module 173, which stores the templates for use in noisy conditions. A processor 177 is also shown; it can operate on either the templates stored in module 170 or those stored in module 173.

いずれの場合もさらに処理する前に、先行技術に従っ
てどのようにテンプレートを生成するかが知られてい
る。テンプレートの生成にはいくつかの方法がある。テ
ンプレート生成の作業を実行する方法は自動的であり、
通常は多段階あるいは二段階工程を用いている。このよ
うな方法の１つでは訓練発声からのスピーチデータ（テ
ンプレートモード）がセグメントに分割される。次にこ
れらのセグメントが統計クラスタ分析のための入力とし
て与えられ、統計クラスタ分析は、セグメント間の距離
の測定値に基づいて数学的な関数を最大にするセグメン
トのサブセットを選択する。選択されたサブセットに属
するセグメントがテンプレートとして用いられる。
In either case, before proceeding further, it should be noted that the prior art teaches how to generate templates. There are several ways to generate a template. Methods that perform template generation are automatic and usually employ a multi-stage or two-stage procedure. In one such method, speech data from the training utterances (template mode) is divided into segments. These segments are then supplied as input to a statistical clustering analysis, which selects a subset of the segments that maximizes a mathematical function based on a measure of the distance between segments. The segments belonging to the selected subset are used as templates.

このような技術は上記の米国特許出願第655958号明細
書に記載されている。いずれにしても距離を測定するた
めのいろいろな方法は、発明の背景に引用したいくつか
の参考文献に記載されているようによく知られている。
距離を計測する方法で広く使用されているものの１つ
は、マハラノビス距離計算というものである。
Such a technique is described in the above-mentioned U.S. Patent Application No. 655958. In any case, the various methods of measuring distance are well known, as described in several of the references cited in the Background of the Invention. One widely used distance measure is the Mahalanobis distance computation.

この方法の例は米国特許出願第003971号明細書（発明
の名称“多重パラメータ話者認識システム及び方法”、
1987年１月16日、レンチ等に譲渡されている）に記載さ
れている。この明細書には話者認識システムに用いられ
る他の種々の技術例が示されており、このようなシステ
ムとともに用いられるアルゴリズムのいくつかが詳細に
記載されている。いずれにしても第１図を参照すると本
発明の主要な特徴が第１図に示された音声認識システム
と関係しており、音声認識システムは、入ってくるスピ
ーチとの比較にテンプレートを用いており、それによっ
てどのワードが話されたかを決定する。この方法はキー
ワード認識システム、音声認識システム、話者認識シス
テム、話者確認システム、言語認識システム、あるいは
テンプレートまたは各種テンプレートの組合せを用いる
ことにより話された音に関しての決定を行なうようなシ
ステムならどのようなシステムにも用いることができ
る。
An example of this method is described in U.S. Patent Application No. 003971, entitled "Multi-Parameter Speaker Recognition System and Method" (January 16, 1987, Wrench et al.). That application gives various other examples of techniques used in speaker recognition systems and describes in detail some of the algorithms used with such systems. In any event, referring to FIG. 1, the principal features of the present invention relate to the speech recognition system shown there, which uses templates for comparison with incoming speech and thereby determines which word was spoken. The method can be used in a keyword recognition system, a speech recognition system, a speaker recognition system, a speaker verification system, a language recognition system, or any other system that makes decisions about spoken sounds using a template or a combination of templates.

本発明の構成及び方法の説明の前に、発明の原埋及び
考え方を説明する。
Before the structure and method of the present invention are described, the underlying principles and rationale of the invention are explained.

本発明者は、テンプレートのS/N比が未知のあるいは
発声されたスピーチと同じである時は、それよりも雑音
が大きかったり小さかったりするテンプレートを用いる
よりも認識性能が良いことを認識した。従って音声信号
のS/N比が予想できると考えられる場合は、入ってくる
未知のスピーチと同じS/N比のスピーチからテンプレー
トが生成された“かのように”、使用される前にテンプ
レートを修正することによって認識性能を最適化するこ
とができる。
The inventor has recognized that recognition performance is better when the signal-to-noise (S/N) ratio of the template is the same as that of the unknown or uttered speech than when the template is either noisier or less noisy. Therefore, if the S/N ratio of the audio signal can be predicted, recognition performance can be optimized by modifying each template before use so that it is "as if" it had been generated from speech with the same S/N ratio as the incoming unknown speech.

従って本発明を実用化するには以下のような考慮をし
なければならない。第１に入ってくるスピーチのS/N比
を予想し、第２にテンプレートを“かのように”要求を
満たすように修正することである。予想は理論と経験の
両方に基づいて行なう。多くの場合、比較的一定のレベ
ルで、あるいは低レベルの雑音かまたは一定の雑音の場
合において絶対的に、この雑音よりも大きな比較的一定
のレベルで、話者が話すことを期待することができる。
そうするとスピーチ及び雑音レベルを用いて未知のスピ
ーチのS/N比を予想することができる。以下に説明する
ように、これはスピーチ及び雑音レベル追跡モジュール
を用いることによって行なわれる。ある例では各々のフ
ィルタチャネルにおける話すレベルと雑音レベルの両方
が、現在値が近い将来の値の有効な推定値となるように
十分にゆっくりと変化すると仮定される。
Putting the invention into practice therefore requires two things: first, predicting the S/N ratio of the incoming speech, and second, modifying the templates to meet the "as if" requirement. The prediction rests on both theory and experience. In many cases the speaker can be expected to speak at a relatively constant level: either at a constant absolute level, when the noise is low-level or constant, or at a relatively constant level above the noise. The speech and noise levels can then be used to predict the S/N ratio of the unknown speech. As explained below, this is done with a speech and noise level tracking module. In one example, both the speech level and the noise level in each filter channel are assumed to change slowly enough that their current values are valid estimates of their values in the near future.

雑音がないか雑音が比較的ないテンプレートを修正す
ることによって、テンプレートがより雑音のあるスピー
チから作られた“かのよう”にすることは、経験と理論
的な考慮の両方に基づいている。
Modifying a noise-free or relatively noise-free template so that it is "as if" it had been made from noisier speech is based on both experience and theoretical considerations.

研究の結果、個々のフィルタバンクチャネルのそれぞ
れにおいて雑音及びスピーチのパワーが付加されると仮
定すると非常によい近似であることが決定された。より
正確な近似値は、スピーチ及び雑音の組み合せが、フィ
ルタバンクチャネル帯域幅に関して多くの自由度を持つ
非心x²分布を有する場合である。上記のまた別の考慮か
ら、既知の統計特性の雑音と既知のスピーチパワーの組
合せの予想値からなるさらに正確な推定値を作ることが
できる。このようにして得られた“雑音の付加”による
精度の増加は生成されるテンプレートの精度を増加させ
るが、“パワー付加”規則を用いて得られる改善よりも
認識の精度を顕著に増加させることはない。従ってスピ
ーチ及び雑音パワーの組合せの予想値を推定する別の方
法に代替させることによってプロセスはより理論上さら
に正確にすることはできるが、以下はパワー付加規則に
ついて述べる。この代替によって本発明の意図あるいは
実体が変化を受けることはない。
Studies have established that it is a very good approximation to assume that noise power and speech power add in each individual filter bank channel. A more exact approximation treats the combination of speech and noise as having a non-central chi-square (χ²) distribution, with a number of degrees of freedom related to the filter bank channel bandwidth. From this further consideration, a still more accurate estimate can be formed as the expected value of the combination of noise with known statistical properties and speech of known power. The more precise "noise addition" obtained in this way increases the accuracy of the generated templates, but it does not significantly increase recognition accuracy beyond the improvement obtained with the "power addition" rule. Accordingly, although the process could in theory be made more exact by substituting another method of estimating the expected value of the combined speech and noise power, the description below uses the power addition rule. Such a substitution would not change the intent or substance of the invention.
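The "power addition" rule can be illustrated with a minimal Python sketch. The function name and the three-channel values are hypothetical and chosen only for illustration; the sketch assumes linear (not logarithmic) power values, one per filter bank channel.

```python
def add_noise_power(speech_power, noise_power):
    """Per-channel "power addition": noisy power = speech power + noise power."""
    if len(speech_power) != len(noise_power):
        raise ValueError("one value per filter bank channel is expected")
    return [s + n for s, n in zip(speech_power, noise_power)]


# Hypothetical three-channel example in linear power units.
clean = [100.0, 40.0, 4.0]
noise = [10.0, 10.0, 10.0]
noisy = add_noise_power(clean, noise)
```

In this example the strongest channel changes by only 10 percent, while the weakest channel becomes dominated by the noise, which is the effect the modified template is meant to reproduce.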

さらに内部電子雑音及び量子雑音の両者が“パワー付
加”規則に関して音響雑音及び信号と結合することが観
察された。これらの雑音は対象とする音響雑音よりは小
さいが適用は可能である。従って、さまざまなモデルを
構成する際に“パワー付加”の結果を使用することがで
きるので、有効モデルから導出される多くのものを使用
する継続的な努力を通して研究作業の応用が明らかにな
る。これは以下に説明する。
It has further been observed that both internal electronic noise and quantization noise combine with the acoustic noise and signal according to the "power addition" rule. These noises are smaller than the acoustic noise of interest, but the rule still applies to them. Because the "power addition" result can be used in constructing the various models, the research findings carry over to continued work that relies on what is derived from the validated model, as explained below.

その平均値に等しい雑音パワーから生じるテンプレー
トが信頼できる認識出力の生成に関して非常に良好に機
能することが示された。従って雑音パワーのフレーム間
の可変性を予想することは必要ではなく、平均値を用い
ることで十分である。要求されるテンプレートパラメー
タは、現在の平均雑音パワーと基本形態テンプレートに
おいて有効に結合されるものと同じスピーチパワーから
生成されるパラメータである。
It has been shown that templates generated with the noise power set equal to its average value work very well in producing reliable recognition output. It is therefore unnecessary to predict the frame-to-frame variability of the noise power; using the average value is sufficient. The required template parameters are those that would be generated from the same speech power that is effectively present in the base form template, combined with the current average noise power.

システムからのチャネル雑音パワー値は雑音パワーの
推定値であり、数学的に決められることができる平均雑
音パワーに関連するように取ることができる。従って本
過程及び正当性を完全に理解するために、以下説明す
る。
The channel noise power values from the system are estimates of the noise power and can be taken to be related to the average noise power in a way that can be determined mathematically. The following explanation is given so that the process and its justification can be fully understood.

まず指摘されるのは、加算的なゼロ平均ガウス雑音に
よって損なわれたスピーチ信号の単一の離散フーリエ変
換（DFT）の出力の確率分布は容易に計算することがで
きることである。バンドパスフィルタバンクの各チャネ
ルに適用可能にするためにどのようにスピーチ及び雑音
を結合するかというモデルを拡張するために重要な次に
考えるべき要因は、チャネルの帯域幅が単一DFTチャネ
ルよりもかなり大きいか、大きくすることができるとい
うことである。従って雑音パワーパラメータ及び寄与チ
ャネル数は、スピーチがなく雑音がある状態でのバンド
パスフィルタの出力を観察することによって指定するこ
とができる。
First, the probability distribution of the output of a single discrete Fourier transform (DFT) channel for a speech signal corrupted by additive zero-mean Gaussian noise is easily computed. The next important factor in extending the model of how speech and noise combine, so that it applies to each channel of a bandpass filter bank, is that the bandwidth of a channel is, or can be made, considerably larger than that of a single DFT channel. The noise power parameter and the number of contributing channels can therefore be determined by observing the output of the bandpass filter when noise is present and speech is absent.

次のステップは、雑音がない状態で形成されたスピー
チ認識テンプレートを雑音のある状態での予想される値
に等しくなるように修正することによって雑音のある状
態で用いられるように改善することである。従って用い
られる方法は、雑音のないテンプレートにおいて表わさ
れた各スピーチサンプル及び各バンドパスフィルタチャ
ネルに対して、現在の雑音の存在によって修正されるよ
うな雑音のないテンプレートの予想値に置換することで
ある。
The next step is to improve speech recognition templates formed under noise-free conditions for use under noisy conditions, by modifying them to equal their expected values under the noisy conditions. The method used is therefore to replace, for each speech sample and each bandpass filter channel represented in the noise-free template, the noise-free value with its expected value as modified by the presence of the current noise.

そのためバンドパスフィルタチャネルの出力における
平均及び変動を測定することによって、ガウス雑音の通
し方によりチャネルの特性を推定することができる。基
本的には上記から理解できるように（そして上記事項の
大方は数学的にも証明されている）、本発明の実行は理
論上及び経験上の両方に基づいている。基本的にはこの
ように本発明の特性は、音声認識システムの信頼性を増
大させるように機能するテンプレートを形成するため
に、雑音を分析的に付加することである。
Thus, by measuring the mean and variance at the output of a bandpass filter channel, the characteristics of the channel can be estimated from the way it passes Gaussian noise. As will be understood from the above (most of which has also been proved mathematically), the practice of the invention rests on both theory and experience. In essence, the invention adds noise analytically in order to form templates that increase the reliability of a speech recognition system.
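One way to estimate the channel characteristics from the measured mean and variance is a method-of-moments fit, sketched below in Python. This is an illustrative assumption, not a procedure stated in the text: with noise alone, the channel power output is modeled as the sum of k independent, exponentially distributed DFT-bin powers of mean p, so that the mean is k·p and the variance is k·p². The function name and the synthetic data are hypothetical.

```python
import random
import statistics


def estimate_noise_channel(power_samples):
    """Return (per-bin noise power, number of contributing bins).

    Assumes the noise-only channel output has mean k*p and variance k*p**2,
    where k is the number of contributing bins and p the per-bin power.
    """
    mean = statistics.fmean(power_samples)
    var = statistics.pvariance(power_samples, mu=mean)
    bins = mean * mean / var       # k = mean^2 / variance
    per_bin_power = var / mean     # p = variance / mean
    return per_bin_power, bins


# Synthetic noise-only observations: 8 contributing bins, per-bin power 2.5.
rng = random.Random(0)
samples = [rng.gammavariate(8.0, 2.5) for _ in range(20000)]
per_bin_power, bins = estimate_noise_channel(samples)
```

The estimate of k need not be an integer; under this model it simply expresses how much wider the bandpass channel is than a single DFT channel.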

雑音のない環境で集められたテンプレートデータに雑
音を付加し、それによって雑音のある環境で用いる新し
いテンプレートを形成するには２つの方法がある。厳密
な方法では各テンプレートトークンに雑音を付加し、そ
れから結果を平均する。近似的な方法では雑音のないト
ークンを平均して基本形態データを形成し、“パワー付
加”あるいは他の便利なまたはより正確な規則を用いて
現在の状態に適切な雑音を付加することによってデータ
を修正する。厳密な方法はすべてのテンプレート及び周
囲のトークンを維持することが必要であり、また過剰な
記憶が必要である。近似的な方法は基本的に同じテンプ
レート及び認識結果を提供する。実行の際に絶対的な仮
定がある。すなわち、テンプレートデータが用いられる
環境と比較してテンプレートデータに雑音がないという
ことである。
There are two ways to add noise to template data collected in a noise-free environment and thereby form new templates for use in a noisy environment. The exact method adds noise to each template token and then averages the results. The approximate method averages the noise-free tokens to form base form data and then modifies that data by adding noise appropriate to the current conditions, using the "power addition" rule or another convenient or more exact rule. The exact method requires that all tokens of all templates be retained, which demands excessive storage. The approximate method provides essentially the same templates and recognition results. One assumption underlies the implementation: the template data is noise-free compared with the environment in which it will be used.
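The difference between the two methods can be seen in a short Python sketch for a single channel. The function names and token values are hypothetical, natural logarithms of power are assumed for simplicity, and noise is added with the "power addition" rule.

```python
import math


def noisy_log(log_power, noise_power):
    """Log of (speech power + noise power) for one channel."""
    return math.log(math.exp(log_power) + noise_power)


def exact_template(token_logs, noise_power):
    """Exact method: add noise to every token, then average."""
    return sum(noisy_log(t, noise_power) for t in token_logs) / len(token_logs)


def approximate_template(token_logs, noise_power):
    """Approximate method: average the clean tokens, then add noise once."""
    base_form = sum(token_logs) / len(token_logs)
    return noisy_log(base_form, noise_power)


tokens = [2.0, 2.2, 1.9]   # hypothetical clean log-power tokens of one word
noise = 3.0                # hypothetical current noise power
exact = exact_template(tokens, noise)
approx = approximate_template(tokens, noise)
```

Only the approximate method can be run from the stored base form alone, which is why it avoids retaining every token. For tokens that are close to each other, as here, the two results differ only slightly.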

第２図を参照すると、基本形態テンプレートに雑音を
付加することによって使用されるテンプレート形成の詳
細なブロック図が示されている。基本形態テンプレート
はそれ自体１組のワード“トークン”に対して形成され
た平均である。各トークンは所定のワードの１つの発声
から取ったパターンから成る。１つあるいはそれ以上の
トークンが配列されて基本形態テンプレートが形成され
る。基本形態テンプレートは静かな状態で形成され、第
1A図に示されたモジュール16か、あるいは第1B図に示さ
れたモジュール170に記憶される。第３図は第２図に示
された各値を定義している表である。第２図には再びマ
イクロホン10が示されており、このマイクロホンに話者
が発声する。マイクロホンの出力は増幅器11の入力に結
合され、増幅器11の出力はBPF、すなわちパンドパスフ
ィルタバンク12として図示されているスペクトル分析器
12に結合される。スイッチ13は修正テンプレート位置に
ある。スペクトル分析器12からの出力はバンドパスフィ
ルタスペクトルの大きさの値のベクトルであってモジュ
ール20に与えられ、このモジュール20はフレーム対を平
均化する。
Referring to FIG. 2, there is shown a detailed block diagram of template formation in which noise is added to a base form template. The base form template is itself an average formed over a set of word "tokens". Each token consists of a pattern taken from one utterance of a given word. One or more tokens are combined to form the base form template. The base form template is formed under quiet conditions and is stored in module 16 shown in FIG. 1A or in module 170 shown in FIG. 1B. FIG. 3 is a table defining each of the values shown in FIG. 2. FIG. 2 again shows microphone 10, into which the speaker speaks. The output of the microphone is coupled to the input of amplifier 11, whose output is coupled to spectrum analyzer 12, depicted as a BPF, that is, a bandpass filter bank. Switch 13 is in the modify template position. The output of spectrum analyzer 12 is a vector of bandpass filter spectral magnitude values and is applied to module 20, which averages frame pairs.

フレーム対の平均化は良く知られた技術であり、基本
的に多くの既知の回路によって実行される。モジュール
20の出力はスペクトル分析器12からの入力の連続対を平
均化した結果であり、モジュール20は有効なフレーム速
度を半分にする。モジュール20の出力はスケールビット
モジュール21及び２乗成分モジュール22に与えられる。
２乗成分モジュール22はベクトル出力を与え、このベク
トル出力は基本的に平均フレーム対モジュール20の出力
のパワー値である２乗された大きさに等しい。
Frame pair averaging is a well-known technique and can be carried out by many known circuits. The output of module 20 is the result of averaging successive pairs of inputs from spectrum analyzer 12, so module 20 halves the effective frame rate. The output of module 20 is applied to scale bits module 21 and to square components module 22. Square components module 22 provides a vector output equal to the squared magnitudes, that is, the power values, of the output of average frame pairs module 20.
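The frame pair averaging and squaring steps can be sketched in Python as follows. The function names and the two-channel frames are hypothetical; an odd trailing frame, if present, is simply dropped in this sketch.

```python
def average_frame_pairs(frames):
    """Average successive pairs of spectral-magnitude frames (halves the frame rate)."""
    pairs = zip(frames[0::2], frames[1::2])
    return [[(a + b) / 2 for a, b in zip(first, second)] for first, second in pairs]


def square_components(frame):
    """Convert spectral magnitudes to power values, channel by channel."""
    return [m * m for m in frame]


# Four hypothetical two-channel frames from the spectrum analyzer.
frames = [[2.0, 4.0], [4.0, 8.0], [1.0, 1.0], [3.0, 5.0]]
averaged = average_frame_pairs(frames)
powers = [square_components(f) for f in averaged]
```

Four input frames yield two averaged frames, matching the halving of the effective frame rate described above.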

スケールビットモジュール21の出力は、基本的に一連
のシフトによって実行される連続対の平均を２倍にする
作用をし、ベクトル最大成分を７ビットスケールに適合
させることを可能にする。そのためにモジュール21はシ
フトレジスタであり、このレジスタは基本的に多数の右
シフトを行ない先に説明した動作を実行する。スケール
ビットモジュール21からの出力は対数変換器23に向けら
れ、この変換器23はその出力にスケール対数スペクトル
パラメータベクトルを生成する。次にこのパラメータベ
クトルはモジュール24によって所定の１組のテンプレー
トークンに対して平均化され、基本的に基本形態テンプ
レートの１つのパラメータを与えるスケール対数スペク
トルパラメータが出力において生成される。２乗成分モ
ジュール22からの出力は相対化エネルギモジュールであ
るモジュール25の入力とスピーチ及び雑音レベル追跡器
であるモジュール26の荷に向けられる。
Scale bits module 21 serves essentially to double the average of successive pairs and, through a series of shifts, to fit the largest component of the vector to a 7-bit scale. To this end, module 21 is a shift register that performs a number of right shifts to carry out the operation just described. The output of scale bits module 21 is directed to logarithmic converter 23, which produces at its output a scaled log spectral parameter vector. This parameter vector is then averaged by module 24 over a given set of template tokens, producing at its output the scaled log spectral parameters that constitute one parameter of the base form template. The output of square components module 22 is directed to the input of module 25, a relative energy module, and to the input of module 26, a speech and noise level tracker.

相対化エネルギモジュール25の出力は、例えば２乗成
分モジュール22の出力からのエネルギを平均化すること
によって決められる相対エネルギを示すパラメータであ
る。これはモジュール36によってテンプレートトークン
に対して平均化され、別の基本形態データ値を与えるの
に必要な相対エネルギパラメータである出力ベクトルを
示す平均が生成される。スピーチ及び雑音レベル追跡器
26からの出力はエネルギレベルを示し、このエネルギレ
ベルは後に述べるように、モジュール27によって再び平
均化されて、このモジュール27の出力でさらに別の基本
形態特性のエネルギレベルが生成される。スピーチ及び
雑音レベル追跡器26からはさらに述べられるように２つ
の付加出力が与えられ、この内の１つはワード時間およ
びチャネルにわたって平均化された発声レベルの対数表
示であり、これはワードに結び付けられたスケーラーで
ある。他のものはチャネルに関してではなく、時間に対
して平均化された各チャネルにおける雑音レベルのベク
トルである。これもまたワード認識ユニットに結び付け
られたベクトルである。従ってモジュール27からの出力
は第１の加算器モジュール30に与えられ、このモジュー
ル30はスピーチ及び雑音レベル追跡器26から付加的な出
力を受ける。加算器30の出力は加算器31の入力の１つに
与えられ、この加算器は31はその他方の入力においてス
ケールビットモジュール21から得られた出力を受ける。
スケールビットモジュール21の出力はモジュール32によ
り係数Ｋだけ乗算され、Ｋは18.172に等しくさらに第３
図に示されている。次にこの値はモジュール33によって
平均化され、その出力において加算器31の他方の入力に
与えられる対数値の基本形態値が生成される。加算器31
の出力は加算器32に与えられる。加算器32はもう１つの
入力としてスピーチ及び雑音レベル追跡器26からの出力
を受け、これは再び各チャネルにおける雑音レベルのベ
クトルである。この出力は機能モジュール40の１つの入
力に与えられ、モジュール40は他の入力においてモジュ
ール24からの出力を受ける。機能モジュール40からの出
力は雑音付加テンプレートに対するスケール対数スペク
トルパラメータベクトルである。これは機能モジュール
41に与えられ、その出力において特定の発声に対するメ
ル−コサイン変換マトリックスである認識パラメータベ
クトルが生成される。従ってモジュール41からの出力及
びモジュール36からの出力が用いられて動作テンプレー
トデータが生成される。
The output of relative energy module 25 is a parameter indicating relative energy, determined for example by averaging the energy from the output of square components module 22. It is averaged over the template tokens by module 36 to produce an average relative energy parameter, which provides another base form data value. The output of speech and noise level tracker 26 indicates an energy level which, as described later, is likewise averaged by module 27, so that the output of module 27 is the energy level that forms a further base form value.

Speech and noise level tracker 26 also provides two additional outputs, described further below. One is a logarithmic representation of the utterance level averaged over the word duration and over channels; it is a scalar associated with the word. The other is a vector of the noise level in each channel, averaged over time but not over channels; it too is associated with the word recognition unit.

The output of module 27 is applied to a first adder 30, which also receives an additional output from speech and noise level tracker 26. The output of adder 30 is applied to one input of adder 31, whose other input receives a value derived from the output of scale bits module 21. The output of scale bits module 21 is multiplied in module 32 by the coefficient K, where K equals 18.172 and is further identified in FIG. 3. This value is then averaged by module 33, whose output is the logarithmic base form value applied to the other input of adder 31. The output of adder 31 is applied to adder 32. Adder 32 receives as its other input the output of speech and noise level tracker 26 that is the vector of noise levels in each channel. The resulting output is applied to one input of function module 40, whose other input receives the output of module 24. The output of function module 40 is the scaled log spectral parameter vector of the noise-added template. It is applied to function module 41, which applies the mel-cosine transform matrix and produces at its output the recognition parameter vector for the particular utterance. The outputs of module 41 and of module 36 are thus used to produce the operating template data.

上記のように第２図のブロック図と関連するすべての
出力は第３図に示されている。第３図からわかるよう
に、第２図から得られる基本形態テンプレートに対する
実効的なスペクトルの大きさは基本的に次の式によって
与えられる。
As noted above, all of the outputs associated with the block diagram of FIG. 2 are defined in FIG. 3. As can be seen from FIG. 3, the effective spectral magnitude of the base form template obtained from FIG. 2 is given by the following equation.

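$$\hat{M}_B = 2^{S_B}\,\exp_b\!\left(\hat{L}_B\right)$$

実効的なパワーは次の式によって与えられる。
The effective power is given by:

$$\hat{P}_B = \hat{M}_B^{\,2} = 2^{2S_B}\,\exp_b\!\left(2\hat{L}_B\right)$$

定義は第３図を参照されたい。
Here $\exp_b(x)$ denotes $b^x$, $\hat{M}_B$ is the effective spectral magnitude and $\hat{P}_B$ the effective power of the base form template, $S_B$ is its scale-bit count, and $\hat{L}_B$ is its scaled log spectral parameter. See FIG. 3 for the definitions.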

第２図のモジュール27の出力において示されるテンプ
レートの平均発声レベルが、加算器30の入力に与えられ
るスピーチ及び雑音レベル追跡器26の出力によって示さ
れる現在の発声レベルと同じであるように雑音を付加す
る前に各フレームのパワーが修正される。その値は認識
ユニット（0.331デシベル）中にあるため、モジュール2
6の出力で示される基本形態における実効パワーは変更
される。これに対して、現在の雑音レベルが付加される
ことから、雑音付加テンプレートの実効パワーレベルが
得られるので、雑音付加テンプレートの実効的な大きさ
がモジュール41の出力として示されている。
Before noise is added, the power of each frame is modified so that the average utterance level of the template, indicated at the output of module 27 in FIG. 2, is the same as the current utterance level indicated by the output of speech and noise level tracker 26 that is applied to the input of adder 30. Because these values are in recognition units (0.331 dB), the effective power in the base form, indicated by the output of module 26, is changed accordingly. The current noise level is then added to give the effective power level of the noise-added template, and the effective magnitude of the noise-added template is indicated as the output of module 41.
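The level matching and noise addition described above can be sketched in Python under stated assumptions. All names are hypothetical. The sketch assumes that template values, utterance levels, and noise levels are all log-magnitude values in the same recognition units (logarithms to the base b = 1.03888 given elsewhere in the text), and that noise is combined by the "power addition" rule.

```python
import math

B = 1.03888                      # logarithm base of the recognition units
UNIT_DB = 20 * math.log10(B)     # size of one log-magnitude unit, about 0.331 dB


def add_noise_to_template(template_log_mag, template_level, current_level, noise_log_mag):
    """Shift the template to the current utterance level, then add noise power."""
    shift = current_level - template_level
    noisy = []
    for log_mag, noise in zip(template_log_mag, noise_log_mag):
        speech_power = B ** (2.0 * (log_mag + shift))   # power = magnitude squared
        noise_power = B ** (2.0 * noise)
        noisy.append(0.5 * math.log(speech_power + noise_power, B))
    return noisy


# Equal speech and noise in a channel raises the log magnitude by half of log_b(2).
equal_case = add_noise_to_template([100.0], 50.0, 50.0, [100.0])
# Negligible noise leaves only the 10-unit level shift.
quiet_case = add_noise_to_template([100.0], 40.0, 50.0, [0.0])
```

Working in the power domain and converting back to logarithms keeps the result in the same units as the stored base form, so it could be passed on to the later transform stage unchanged.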

従ってすべての動作認識パターンは、対数スペクトル
パラメータのメル−コサイン変換であり、相対的なエネ
ルギの尺度である。第３図の定義と共に第２図を見れば
当業者にとって上述のことは明らかであり、数学的にも
明白である。
Thus all of the operating recognition patterns consist of the mel-cosine transform of the log spectral parameters and a measure of relative energy. The foregoing will be clear to those skilled in the art, and mathematically evident, from FIG. 2 taken together with the definitions of FIG. 3.

従って同じ正確な技術を用いることによって、テンプ
レートトークンに雑音を付加し、次に平均化することに
よりテンプレートを形成することができる。基本的にこ
れを行うプロセスは第２図に示されたものと同じであ
り、それによって機能ユニット40の後に平均化が行なわ
れること以外は第２図に示されたものと同じ正確な出力
が与えられる。
The same technique can therefore be used to form templates by adding noise to the template tokens and then averaging. The process is essentially the same as that shown in FIG. 2 and yields the same outputs as shown there, except that the averaging is performed after function unit 40.

第４図には、先に示したようなテンプレート形成技術
を用いる典型的なシステムの詳細なブロック図が示され
ている。第４図では、同じ機能の部品を示すのに同じ参
照番号が用いられている。第４図でわかるように、コー
ダ／デコーダ（CODEC）モジュール及び線形回路47に加
算器46の出力が結合された状態で、加算器46の１つの入
力に結合されたAGC、すなわち自動利得制御モジュール4
5が配置されている。コーダ／デコーダモジュールは基
本的にアナログ／デジタル変換器であり、これにデジタ
ル／アナログ変換器が続いている。CODECモジュールの
出力はシンセサイザ、またはバンドパスフィルタバン
ク、すなわちスペクトル分析器12に与えられる。
FIG. 4 shows a detailed block diagram of a typical system that uses the template formation technique described above. In FIG. 4, the same reference numerals denote parts having the same function. As can be seen from FIG. 4, an AGC, that is, automatic gain control module 45, is coupled to one input of adder 46, and the output of adder 46 is coupled to a coder/decoder (CODEC) module and linear circuit 47. The coder/decoder module is essentially an analog-to-digital converter followed by a digital-to-analog converter. The output of the CODEC module is applied to the synthesizer or bandpass filter bank, that is, spectrum analyzer 12.

スペクトル分析器12からの出力は平均フレーム対モジ
ュール20に送られ、このモジュール20は再び後に述べる
スケールモジュール21及びスピーチ及び雑音追跡器モジ
ュール26と関連する。第４図の右側に示された出力ライ
ンからはいろいろな動作テンプレートデータ値が与えら
れ、これらは雑音のあるテンプレートを形成するのに用
いられる。
The output of spectrum analyzer 12 is sent to average frame pairs module 20, which is again associated with scale module 21 and with speech and noise tracker module 26, described below. The output lines shown on the right side of FIG. 4 provide the various operating template data values, which are used to form the noisy templates.

主要機能モジュールはスピーチ雑音追跡器26でありこ
れはさらに後述する。また第４図にはマイクロホン10へ
の入力に記号Nc及びScが付けられ、これは重要な信号源
及び雑音源である。下付きの“c"によって、これらの表
現がスペクトル分析器12を形成するフィルタバンクチャ
ネルの各々の通過帯域に対する平均のスペクトルの大き
さを表わすことを示している。この下付き“c"には14の
値があり、フィルタバンクの各々のフィルタに対して１
つである。従ってScは音響スピーチ信号のチャネルＣに
おけるスペクトルの大きさであり.Ncはこのチャネルに
対する音響雑音の２乗平均スペクトルの大きさである。
加算器50及び46からの出力は電子雑音のスペクトルの大
きさであり、これはAGC利得制御モジュール45の前また
は後に注入される。CODEC47からの出力にはCODECによっ
て導入される量子化雑音のスペクトルの大きさが含まれ
る。いずれにせよ、スペクトル分析器12の出力はバンド
パスフィルタのスペクトルの大きさの値のベクトルであ
り，平均フレーム対モジュール20の出力はスペクトルの
大きさの値の連続対を平均化した結果である。
The principal functional module is speech and noise tracker 26, which is described further below. In FIG. 4 the inputs to microphone 10 are labeled Nc and Sc; these are the significant noise and signal sources. The subscript "c" indicates that each quantity represents the average spectral magnitude over the passband of one of the filter bank channels forming spectrum analyzer 12. The subscript "c" takes 14 values, one for each filter of the filter bank. Thus Sc is the spectral magnitude of the acoustic speech signal in channel c, and Nc is the root-mean-square spectral magnitude of the acoustic noise for that channel. The quantities shown at adders 50 and 46 are the spectral magnitudes of the electronic noise, which is injected before or after AGC gain control module 45. The output of CODEC 47 includes the spectral magnitude of the quantization noise introduced by the CODEC. In any case, the output of spectrum analyzer 12 is a vector of bandpass filter spectral magnitude values, and the output of average frame pairs module 20 is the result of averaging successive pairs of those values.

スペクトル分析器12の実効的な出力信号は、フィルタ
バンクの通過帯域に対するフィルタバンク入力における
信号のスペクトルの大きさの推定値であり、これはフィ
ルタバンク内の各チャネルに対して示されている。これ
らの値の連続対は平均化されて50/秒の速度でモジュー
ル20から出力が生成される。
The effective output signal of spectrum analyzer 12 is an estimate, for each channel of the filter bank, of the spectral magnitude of the signal at the filter bank input over that channel's passband. Successive pairs of these values are averaged, so that module 20 produces output at a rate of 50 per second.

基本的に14すべてのチャネルに対する１組のすべての
値はすべてモジュール21において同じ数Ｓだけ右にシフ
トされるので、それによって最大のものが７ビットある
いはそれ以下を占有し、その結果の値は見出し表によっ
て対数に比例する数に変換される。見出し表は127の入
力に対して127戻すことから、結果は入力の自然対数の2
6.2倍と、すなわち底ｂに対する対数と考えることがで
きる（ｂは1.03888である）。20ミリ秒のフレーム値は
追跡器26によっても用いられてピークスピーチエネルギ
の尺度と各チャネルに対する平均雑音エネルギの推定値
とが生成される。発声レベルはマイクロホン10における
総スピーチエネルギの底ｂに対する対数の推定値に任意
の定数を加えたものである。
All of the values in a set for all 14 channels are shifted right in module 21 by the same number of bits S, so that the largest occupies 7 bits or fewer, and the resulting values are converted by table lookup to numbers proportional to their logarithms. Because the lookup table returns 127 for an input of 127, the result can be regarded as 26.2 times the natural logarithm of the input, that is, as the logarithm to the base b, where b is 1.03888. The 20-millisecond frame values are also used by tracker 26 to produce a measure of peak speech energy and an estimate of the average noise energy for each channel. The utterance level is an estimate of the logarithm to the base b of the total speech energy at microphone 10, plus an arbitrary constant.
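The shift-and-lookup scheme and its constants can be checked with a short Python sketch. The names are hypothetical, the table entry for an input of 0 is an assumption (the text does not define it), and rounding to the nearest integer is assumed for the table. The sketch also suggests that the coefficient K = 18.172 applied to the scale bits appears to be the base-b logarithm of 2, so that each right shift is restored as K log units.

```python
import math

LOG_SCALE = 127 / math.log(127)   # about 26.2, chosen so that the table maps 127 to 127
B = math.exp(1 / LOG_SCALE)       # logarithm base, about 1.03888
K = LOG_SCALE * math.log(2)       # about 18.172 log units per right shift

# Lookup table for 7-bit inputs; entry 0 is an assumption.
LOG_TABLE = [0] + [round(LOG_SCALE * math.log(v)) for v in range(1, 128)]


def scale_bits(values):
    """Right-shift all channel values by the same count until the largest fits in 7 bits."""
    shifts = 0
    while max(values) >> shifts > 127:
        shifts += 1
    return [v >> shifts for v in values], shifts


def scaled_log(values):
    """Return the scaled log spectral values and the shift count S."""
    scaled, shifts = scale_bits(values)
    return [LOG_TABLE[v] for v in scaled], shifts


logs, shift_count = scaled_log([1000, 37, 5])
# Unscaled log magnitude of the first channel: table value plus K per shift.
restored = logs[0] + K * shift_count
```

Small channel values lose precision when a large channel forces several shifts, which is the cost of using one common scale for all 14 channels.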

AGC利得の効果は基本的に除去されるためスペクトル
値ではない。例えばこの利得の効果はフィルタバンク全
体の通過帯域の総エネルギに関係する。発声レベルの推
定値もワードかフレーズ関連である。その時定数は短い
発声がなされる時のレベルの尺度のようなものである。
従って各テンプレートあるいはテンプレート期間の未知
のセグメントに関連するレベル値がただ１つしかない。
追跡器26からの雑音推定値の時間的な制約は、発声され
ている時間の長さに対して各チャネルに割り当てられる
雑音レベル推定値がただ１つでなければならないことで
ある。そのため第４図の対数回路54に結合しているスピ
ーチ及び雑音追跡器26からの出力値は、フィルタバンク
の出力に対する平均エネルギ推定値である。従ってこれ
らの値はAGC利得によって影響を受け、対数変換を行な
わずに平均スペクトルエネルギに正比例する。
The utterance level is not a spectral value, because the effect of the AGC gain is essentially removed from it; the effect of that gain relates, for example, to the total energy over the passband of the entire filter bank. The utterance level estimate is also word- or phrase-related: its time constants are such that it measures the level at which a short utterance is spoken. There is therefore only one level value associated with each template or with each template-length segment of unknown speech. The time constraint on the noise estimates from tracker 26 is likewise that only one noise level estimate is assigned to each channel for the duration of an utterance. The output values from speech and noise tracker 26 that are coupled to logarithmic circuit 54 of FIG. 4 are average energy estimates at the output of the filter bank. These values are therefore affected by the AGC gain and, before logarithmic conversion, are directly proportional to the average spectral energy.

信号源及び種々の雑音源は統計上は独立しており、そ
のエネルギは平均に加算されると仮定する。これは内部
ノイズ源を決定するのに都合がよいだけでななく、音響
雑音源及び信号源の両方に対して選れた近似であること
が実証されている。さらにマイクロホンにおける等価雑
音パワーとして呼ぶことができる雑音値があると仮定す
る。これらの値には音響雑音パワー及び他のシステム雑
音パワーが含まれ、このうちの一部はAGC45の利得によ
って減少される。
The signal source and the various noise sources are assumed to be statistically independent, so that their energies add on average. This is not only convenient for treating the internal noise sources; it has also been shown to be an excellent approximation for both the acoustic noise source and the signal source. It is further assumed that there are noise values that can be referred to as the equivalent noise power at the microphone. These values include the acoustic noise power and the other system noise powers, some of which are reduced by the gain of AGC 45.

従って第４図より導出され第２図及び第３図に示され
ているスケール係数が雑音関連テンプレートを生成する
ために与えられる。このため、テンプレート平均化プロ
セスを使用することによって、同じ発声レベル及びS/N
比におけるすべてのトークンの対数スペクトルパラメー
タを平均化することによって得られるのと同じまたは等
価な平均テンプレートを生成することができる。したが
って、全体的な問題を簡単にするために、すべてのテン
プレートならびにすべてのテンプレートトークンにおけ
るS/N比が同じであると仮定する。これは等しくすべき
すべてのトークンにおける発声レベルを調節することに
よって実行することができるため、同一のS/N比の結果
は全トークンにおける等しい雑音値である。この仮定の
下、雑音の同等値を平均化するすべての形態を作ること
ができる。
The scale factors derived from FIG. 4 and shown in FIGS. 2 and 3 are therefore provided for generating the noise-related templates. By using the template averaging process, an average template can be produced that is the same as, or equivalent to, the one obtained by averaging the log spectral parameters of all tokens at the same utterance level and S/N ratio. To simplify the overall problem, it is therefore assumed that the S/N ratio is the same in all templates and in all template tokens. Since the utterance levels of all tokens can be adjusted to be equal, the consequence of identical S/N ratios is equal noise values in all tokens. Under this assumption, all of the averaging can be performed with equal noise values.

上記のようにテンプレートのS/N比が未知のスピーチ
と同じ場合は、認識性能は雑音がそれよりも大きかった
り小さかったりするテンプレートの場合よりも良好であ
ることが研究からわかっている。従って上記の技術に基
くと、オーディオ信号のS/Nを予想することができ、そ
れによって、テンプレートを使用する前に、入ってくる
未知のスピーチと同じS/N比のスピーチからテンプレー
トが生成された“かのよう”にテンプレートを修正する
ことにより認識性能を最適化することができることが示
された。
As noted above, studies show that recognition performance is better when the S/N ratio of the template is the same as that of the unknown speech than when the template is noisier or less noisy. Based on the technique described above, the S/N ratio of the audio signal can be predicted, and recognition performance can therefore be optimized by modifying each template before use so that it is "as if" it had been generated from speech with the same S/N ratio as the incoming unknown speech.

従って２つのステップが用いられる。１つは入ってく
るスピーチのS/N比を予想し、この要求に合うようにテ
ンプレートを修正することである。そのため以下に説明
するようにスピーチ及び雑音追跡器26は、各チャネルに
おけるスピーチパワーの推定値が各々の音声内容によっ
てワード間で変化することから、スピーチパワーの推定
値を形成しない。そのためどのようなワードが発声され
るか予想することはできないので、データには予想力は
ない。重要なことは通常の手順に対して、各チャネルの
S/N比の推定値を持たないということである。従って上
記のようにテンプレート修正手順では、チャネルごとに
特定のS/N比を使用することを避ける。そのためその平
均値に等しい雑音パワーから得られるテンプレートは、
認識システムにおいて非常に良好に動作する。
Two steps are therefore used: predicting the S/N ratio of the incoming speech, and modifying the templates to match it. As explained below, speech and noise tracker 26 does not form an estimate of the speech power in each channel, because that power varies from word to word with the phonetic content of each word. Since it cannot be predicted which word will be spoken, such data would have no predictive power. The important point is that, in normal operation, no estimate of the S/N ratio of each channel is available. The template modification procedure described above therefore avoids using channel-specific S/N ratios. Templates generated with the noise power set equal to its average value nevertheless work very well in the recognition system.

すなわち、平均値を使用することで十分であることか
ら、雑音パワーのフレーム間の可変性を考える必要はな
い。そしてテンプレートパラメータは、現在の平均雑音
パワーと結合される“基本形態”テンプレートに有効に
存在するのと同じスピーチパワーから生成されるもので
ある。基本的には上記のように、スピーチ及び雑音追跡
器26はデジタル信号処理（DSP）回路である。このDSP回
路は、付加的な音響雑音が存在するスピーチ信号のパワ
ーレベルの尺度と任意の形態のバンドパスフィルタバン
クチャネルにおける平均雑音パワーの尺度とを与えるア
ルゴリズムを実行するように動作する。見出された発声
レベルの尺度は、音声認識のためにS/N比を調節するの
に適切な話者の会話レベルを示す。発声レベルの他の尺
度は速く変化し、および／あるいは話されたスピーチ内
の有声音及び無声音の発声の相対頻度で変化する。スピ
ーチ及び雑音追跡器によって見出された尺度は、母音核
中のわずかになめらかなピークパワーを検出することに
よってこれらの問題を回避している。
That is, because using the average value is sufficient, the frame-to-frame variability of the noise power need not be considered. The template parameters are those that would be generated from the same speech power that is effectively present in the "base form" template, combined with the current average noise power. As noted above, speech and noise tracker 26 is a digital signal processing (DSP) circuit. The DSP circuit executes an algorithm that provides a measure of the power level of a speech signal in the presence of additive acoustic noise, and a measure of the average noise power in each channel of a bandpass filter bank of any form. The utterance level measure it finds indicates the speaker's conversational level in a form suitable for adjusting the S/N ratio for speech recognition. Other measures of utterance level vary rapidly and/or vary with the relative frequency of voiced and unvoiced sounds in the speech. The measure found by the speech and noise tracker avoids these problems by detecting the slightly smoothed peak power in vowel nuclei.

More specifically, the tracker follows the slightly smoothed peak power in the more energetic vowel nuclei. By ignoring power peaks during unstressed syllable nuclei and during speech intervals that are not vowel nuclei, the measure gives a continuous indication of the general speech level. The tracker is intended for use in the presence of additive noise that is uncorrelated with the speech, where the total noise power normally changes slowly compared with the rate at which vowel nuclei occur in speech (typically 5 to 15 per second). The tracker also recovers from faster changes in the noise level. The speech and noise tracker 26 uses a logarithmic or compression technique, which yields a measure of the total speech power over the frequency range of interest. This measure is first processed by a slow-rise, fast-fall filter. The rise and fall times of the filter are chosen so that a large positive difference exists between the instantaneous signal power and the filtered value during the first few milliseconds of a vowel nucleus, while no large negative difference occurs.
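The slow-rise, fast-fall filter described above can be illustrated with a minimal Python sketch. It assumes a one-pole recursive smoother with two different per-frame coefficients; the function names and the coefficient values `rise_coeff` and `fall_coeff` are hypothetical, since the text gives no numeric values or filter structure.

```python
def asymmetric_filter(log_energy, rise_coeff=0.05, fall_coeff=0.5):
    """Slow-rise, fast-fall smoothing of a per-frame log-energy sequence.

    rise_coeff and fall_coeff are assumed smoothing weights in (0, 1];
    they are illustrative only and do not come from the patent text.
    """
    if not log_energy:
        return []
    smoothed = [float(log_energy[0])]
    for x in log_energy[1:]:
        prev = smoothed[-1]
        # Follow a rising input slowly and a falling input quickly.
        coeff = rise_coeff if x > prev else fall_coeff
        smoothed.append(prev + coeff * (x - prev))
    return smoothed


def rise_above_filter(log_energy, **kwargs):
    """Difference between each frame's log energy and its filtered value."""
    smoothed = asymmetric_filter(log_energy, **kwargs)
    return [x - s for x, s in zip(log_energy, smoothed)]
```

With these assumed coefficients, a sudden onset produces a large positive difference for several frames, while a sudden drop produces only a small negative difference, which is the behavior the paragraph asks of the filter.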

Unstressed vowel nuclei are normally skipped. A non-linear function of the difference between the instantaneous signal power and the value filtered with the fast-fall, slow-rise times is fed to a moving boxcar integration process of suitable duration, so that the resulting value rises above a suitable threshold only during normal vowel nuclei in speech intervals, or only during stressed vowel nuclei. Crossings of this threshold are used to identify intervals of high signal power due to speech nuclei. Only intervals identified in this way are used for tracking the utterance level. Values from the boxcar integration process that exceed a second threshold, lower than the speech nucleus threshold, are used to identify intervals that contain speech power as well as noise power. Only intervals in which the boxcar integral is below the second (lower) threshold, and in which the instantaneous power does not exceed a third threshold lying above the value produced by the fast-fall, slow-rise filter, are used as input to the noise power tracking function.
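The threshold logic above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program of FIGS. 5A to 5C: the non-linear function is assumed to be half-wave rectification, the boxcar is a moving sum over `window` frames, and `nucleus_thresh`, `lower_thresh`, and `margin` (the third threshold, expressed as an offset above the filtered value) are hypothetical values. The input `r` is the per-frame difference between instantaneous log power and its asymmetrically filtered value.

```python
from collections import deque


def boxcar(values, window):
    """Moving sum over the most recent `window` values."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf))
    return out


def classify_frames(r, window=3, nucleus_thresh=12.0,
                    lower_thresh=3.0, margin=1.0):
    """Label each frame by how it may be used by the trackers.

    "nucleus":      boxcar value above the speech nucleus threshold;
                    usable for utterance level tracking.
    "speech+noise": boxcar value above the lower threshold only;
                    used by neither tracker.
    "noise":        boxcar value below the lower threshold and power
                    no more than `margin` above the filtered value.
    "ignored":      everything else.
    """
    rectified = [max(x, 0.0) for x in r]  # assumed non-linear function
    labels = []
    for x, b in zip(r, boxcar(rectified, window)):
        if b > nucleus_thresh:
            labels.append("nucleus")
        elif b > lower_thresh:
            labels.append("speech+noise")
        elif x <= margin:
            labels.append("noise")
        else:
            labels.append("ignored")
    return labels
```

Only frames labeled "nucleus" would feed the utterance level estimate, and only frames labeled "noise" would feed the noise power estimate.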

The noise power tracking module basically comprises a digital signal processor implemented as an integrated circuit chip. Many such chips are available. They are basically programmable and can execute many types of algorithm. The algorithm associated with the noise and signal tracking function determines both the signal energy content and the noise energy content, and operates as follows.

First, a numerical value representing the channel energy is obtained. This is done for every frame. Next, the total energy is calculated. This allows the system to adjust for automatic gain control changes. Once the energy has been calculated, the result is smoothed over a predetermined period. After the smoothed energy value is obtained, the logarithm of the total energy is calculated. A boxcar integration, or averaging, is then performed on the speech level estimate at the input to the bandpass filter array. In the next step an asymmetric filter is applied to the log energy, so that speech can be detected by monitoring the rise time of the speech signal. The term "speech signal" is used here in a general sense: the incoming signal may be noise or an artifact signal. An artifact signal is neither noise nor speech. It arises from heavy breathing or from some other characteristic of the speaker's voice that is basically neither information nor noise. In any case, it too is a true speech signal.

To decide this, the instantaneous value of the log energy is monitored relative to the smoothed energy. The algorithm divides the time associated with the rise and fall of the signal into predetermined intervals. Depending on whether the rise is positive or negative, certain decisions are made about the nature of the incoming signal to be recognized. As indicated above, these decisions determine whether the signal is speech, an artifact, or pure noise. For example, in a period where the rise is negative, the signal is unconditionally assumed to be noise if the rise remains continuously negative. When a noise signal is received, the system smooths the noise values and uses them to contribute to the average noise energy, tracking the signal continuously by applying the calculated value to the noise estimate. This estimate is then used to form the templates. Handling positive transitions is more difficult.

A positive transition may represent noise, an artifact, or speech. To decide which, a non-linear function is integrated. By comparing the integral with fixed thresholds, it can be determined whether the positive rise represents speech, noise, or an artifact. The values produced by the speech and noise tracker module in this way represent true speech values. FIGS. 5A to 5C show the complete program of the speech and noise tracker.

FIG. 6 gives the definitions of the engineering parameters needed to understand the programming format shown in FIGS. 5A to 5C. The procedure is executed once per frame and operates as follows. As shown in FIG. 5A, the first part of the procedure obtains the energy in each channel together with the total energy (steps 1 and 2). Then, as shown in steps 3 and 4, the energy in each channel is filtered, taking automatic gain control (AGC) scale changes into account. The next steps smooth the energy value and obtain the smoothed logarithm of the energy, corrected for AGC (steps 5, 6, and 7). Step 8 takes the boxcar average of the speech level estimate. Then, as shown in steps 9 and 10, the asymmetrically filtered value of the energy is obtained, together with the rise of the current energy above the filtered value.
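The energy computation at the start of the per-frame procedure can be sketched in Python. This is a rough illustration only: the flowchart itself is not reproduced in the text, so the function name `frame_log_energy`, the treatment of `agc_gain` as a power gain that is divided out, the one-pole smoother, and the value of `alpha` are all assumptions.

```python
import math


def frame_log_energy(channel_energies, agc_gain, prev_smoothed, alpha=0.5):
    """Return (smoothed_energy, log_energy) for one frame.

    channel_energies: per-channel energies from the filter bank.
    agc_gain:         assumed power gain applied by the AGC this frame.
    prev_smoothed:    smoothed energy carried over from the last frame.
    alpha:            assumed smoothing weight in (0, 1].
    """
    total = sum(channel_energies)        # total energy over all channels
    corrected = total / agc_gain         # undo the AGC scale change
    smoothed = prev_smoothed + alpha * (corrected - prev_smoothed)
    # Floor the value so the logarithm is defined for silent frames.
    log_energy = math.log10(max(smoothed, 1e-12))
    return smoothed, log_energy
```

Dividing out the gain before smoothing means a change in AGC setting does not look like a change in speech or noise level, which is the stated purpose of computing the total energy.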

The program then moves to FIG. 5B. The variable r shown in step 10 of FIG. 5A is the amount by which the current log energy exceeds its asymmetrically smoothed value. During a vowel nucleus, r goes positive and stays positive for a considerable interval.

Because r has particular significance during its positive and negative periods, special processing is required when it first goes positive or first goes negative. This is shown in detail in FIG. 5B. When r first goes positive, the frame number is recorded as the possible start of a speech nucleus. The value P, which is used to decide whether the event is speech, is reset, and noise tracking is suspended. While r remains positive, P is accumulated, and the artifact and speech flags are set when P exceeds particular thresholds. These operations are shown on the left side of FIG. 5B. When r first goes negative, the noise tracker is reset to the last known noise value and, if speech or an artifact was detected, noise tracking is resumed after a predetermined delay, while it is verified that the assumed speech level is sufficiently far above the noise level. If speech was detected during this rise, the frame number is recorded as the end of the known speech interval.
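The handling of r across its positive and negative periods can be sketched as a small state machine in Python. This is a simplified illustration, not the program of FIG. 5B. It assumes that P accumulates r directly (the text speaks of integrating a non-linear function), that the artifact threshold lies below the speech threshold, and that a rise reaching neither threshold is treated as noise; the function name and both threshold values are hypothetical. It also takes the end of a rise to be the first frame at which r is no longer positive.

```python
def track_rise_events(r, artifact_thresh=5.0, speech_thresh=15.0):
    """Return (start_frame, end_frame, kind) for each completed rise of r.

    kind is "speech", "artifact", or "noise". A rise still in progress
    at the end of the sequence is not reported.
    """
    events = []
    p = 0.0
    start = None
    is_speech = is_artifact = False
    prev_positive = False
    for frame, x in enumerate(r):
        positive = x > 0
        if positive and not prev_positive:
            # r first goes positive: record the frame and reset P.
            start, p = frame, 0.0
            is_speech = is_artifact = False
        if positive:
            p += x  # assumed accumulation rule
            if p > artifact_thresh:
                is_artifact = True
            if p > speech_thresh:
                is_speech = True
        elif prev_positive:
            # r first goes non-positive: close the interval.
            if is_speech:
                kind = "speech"
            elif is_artifact:
                kind = "artifact"
            else:
                kind = "noise"
            events.append((start, frame, kind))
        prev_positive = positive
    return events
```

In a fuller implementation, the start of each rise would also suspend noise tracking, and the end of a speech or artifact rise would schedule its delayed restart.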

While r remains negative, noise continues to be tracked after the predetermined delay. All of this is shown in the flowcharts, which set out the various operations clearly.

FIG. 5C basically shows the generation of the output variables used to provide the operating templates shown, for example, in FIGS. 2 and 4. As the above shows, the central idea of the system is to generate templates to which noise has been added in the correct, anticipated manner, so that each template has an associated expected S/N ratio. The noise level associated with a template represents an estimate of the noise present in the incoming signal. In this way the recognition probability of the speech recognition system is substantially increased.
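The idea of adding noise to a template so that it carries the expected S/N ratio can be illustrated with a minimal Python sketch. It assumes that templates are held as linear per-channel power values, that speech and noise are uncorrelated so their average powers add, and that the speech level is adjusted by a single scalar gain; the function and parameter names are hypothetical and the sketch omits the conversion to the recognizer's own parameters.

```python
def make_operating_template(clean_power, template_level_db,
                            speech_level_db, noise_power):
    """Build an operating template from a noise-free template.

    clean_power:       list of frames, each a list of channel powers.
    template_level_db: speech level at which the template was formed.
    speech_level_db:   current scalar speech level estimate.
    noise_power:       current average noise power per channel.
    """
    # Scalar gain that moves the template to the current speech level.
    gain = 10 ** ((speech_level_db - template_level_db) / 10.0)
    # Average powers of uncorrelated signals add channel by channel.
    return [
        [gain * s + n for s, n in zip(frame, noise_power)]
        for frame in clean_power
    ]
```

For example, with a 10 dB increase in speech level and a noise estimate of `[0.5, 1.0]`, the clean frame `[1.0, 4.0]` becomes `[10.5, 41.0]`.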

Generating such templates by adding noise as described above can be used in any speech recognition system that compares templates with an incoming signal to decide whether that signal is actually speech, an artifact, or noise. The system thus provides speech recognition templates that are first formed in the absence of noise and then improved for use in noisy conditions by modifying them to equal their expected values in the presence of noise.

[Brief description of the drawings]

FIG. 1A is a block diagram of a speech recognition system that employs the present invention and uses recognition parameters derived from the spectrum.
FIG. 1B is a block diagram of another speech recognition system according to the present invention, which uses recognition parameters that are spectral in nature.
FIG. 2 is a detailed block diagram showing the technique according to the present invention for forming operating template data.
FIG. 3 is a table of definitions of the various outputs shown in FIG. 2.
FIG. 4 is a detailed block diagram of another embodiment of the present invention.
FIGS. 5A to 5C are detailed flowcharts showing the operation of the speech and noise tracker according to the present invention.
FIG. 6 is a table of definitions of the engineering parameters used in FIGS. 5A to 5C.

10: microphone; 11: amplifier; 12: spectrum analyzer; 13, 100: switches; 14, 15, 16, 20, 21, 25, 27, 40, 162, 166: modules; 26: tracker; 160: processor; 31, 32: adders; 54: logarithmic circuit.

Continuation of the front page

(56) References: JP-A-59-137999 (JP, A); JP-A-62-65088 (JP, A); JP-A-62-42198 (JP, A); JP-B-63-67197 (JP, B2); JP-B-61-2960 (JP, B2); JP-B-5-56519 (JP, B2); U.S. Patent 4933973 (US, A); British Patent Application Publication 2216320 (GB, A); Acoustical Society of Japan, Proceedings of the 1976 Spring Meeting, 4-2-10, "Effects of External Noise on Machine Recognition Systems for Speech", pp. 527-528 (issued May 25, 1976).

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/06; G10L 15/20; G10L 21/02; JICST file (JOIS); WPI (DIALOG).

Claims

(57) [Claims]

1. A speech recognition system of the type comprising a storage device for storing an initial template representing spectral values of recognizable speech in the absence of noise, a spectrum analyzer for outputting spectral values of an utterance of an input signal representing speech in the presence of noise, a device for generating an operating template, and a recognition module for comparing the operating template with the spectral values output by the spectrum analyzer and outputting a favorable comparison result indicating that recognized speech is present in the utterance, the system comprising: first means, coupled to the spectrum analyzer, for outputting an estimated noise signal indicative of the noise in the input signal; and means, coupled to the first means and responsive to the estimated noise signal, for generating the modified operating template from the initial template based on the estimated noise signal; wherein the first means comprises a speech and noise level tracking unit having a speech tracking unit that detects the speech signal in the presence of noise during the utterance and outputs a first signal, which is a scalar speech level value related to the speech signal and indicates the average power of the speech signal in the presence of noise, and a noise tracking unit that detects noise during the utterance and outputs a second signal, which is a vector of spectral values representing the noise estimate and indicates the average power of the noise over a predetermined period of the utterance; and wherein the operating template is generated by adjusting the speech level of the initial template based on the first signal and adding the spectral values of the noise estimate to the initial template based on the second signal, so that, in order to improve the speech recognition performance of the recognition module, the operating template has the same signal-to-noise ratio as the utterance of the input signal with its estimated noise.

2. The speech recognition system according to claim 1, wherein the spectrum analyzer comprises a plurality of bandpass filters arranged in a filter bank array, each filter configured to pass a predetermined spectral component according to the band of the filter.

3. The speech recognition system according to claim 2, wherein said first means comprises means for measuring an average and a fluctuation of said bandpass filter and outputting an estimated value of a noise pass characteristic of each filter.

4. The speech recognition system according to claim 3, wherein said noise estimate is estimated based on said filter response to Gaussian noise.

5. The speech recognition system according to claim 1, wherein the template generated in the absence of noise is formed from noise-free tokens, the system further comprising means, responsive to the tokens, for forming an average value and outputting basic form data, and means for modifying the basic form data based on a signal indicating the current expected noise.

6. The speech recognition system according to claim 1, comprising processing means coupled to said spectrum analyzer for generating said operating template for storage by modifying said initial template based on an estimated noise signal indicative of the presence of noise.

7. The speech recognition system according to claim 6, wherein the expected value calculated by said processing means indicates the presence of Gaussian noise.

8. The speech recognition system according to claim 6, comprising means for averaging noise-free templates to obtain a basic form data output, and means for modifying the basic form data output by adding calculated noise data to the basic form data.

9. The speech recognition system according to claim 6, wherein the processing means comprises averaging means for outputting the average of successive pairs of the spectral magnitude values output by the analyzer, scaling means, coupled to an output of the averaging means, for outputting a field signal of predetermined length, and means for converting the field signal of predetermined length into a logarithmic signal and outputting it as one of the basic form data outputs.

10. The speech recognition system according to claim 9, further comprising squaring means, coupled to said averaging means, for outputting a vector signal indicating the squared magnitude of the average of said successive pairs, and means, coupled to an output of said squaring means, for outputting another of the basic form data outputs.

11. The speech recognition system according to claim 10, wherein the means coupled to the output of said squaring means comprises relative energy forming means for outputting a basic form energy parameter in response to said vector signal, and speech and noise level tracking means for outputting a basic form parameter indicating both the speech and noise power levels.

12. A method of forming an operating template for use in a speech recognition system that recognizes speech in the presence of noise in an utterance of an input signal based on a comparison between spectral values of the input signal and the operating template, the method comprising the steps of: detecting the speech signal in the presence of noise in the input signal and outputting a first signal, which is a scalar speech level value related to the speech signal and indicates the average power of the speech signal in the presence of noise; detecting noise during the utterance and outputting a second signal, which is a vector of spectral values representing the noise estimate and indicates the average power of the noise over a predetermined period of the utterance, thereby outputting an estimated noise signal; and modifying an initial template representing recognizable speech in the absence of noise by adjusting the speech level of the initial template based on the first signal and adding the spectral values of the noise estimate to the initial template based on the second signal, thereby forming the operating template with the same signal-to-noise ratio as the utterance of the input signal and obtaining improved speech recognition performance.

13. The method of claim 12, wherein outputting the estimated noise signal comprises measuring the response of a predetermined speech processing channel to noise and estimating the signal to be output based on the measurement.

14. The method of claim 12, wherein said modifying step comprises first forming a relatively noise-free basic form template and modifying the basic form template based on the signal indicating the expected noise level.

15. The method of claim 12, wherein the modifying step includes the steps of forming a relatively noise-free basic form template, adding noise to each template, and averaging the noise-added template data to form a new template based on the analysis data.