JPH0511798A

JPH0511798A - Voice recognition device

Info

Publication number: JPH0511798A
Application number: JP3227452A
Authority: JP
Inventors: Yasuki Yamashita; 泰樹山下; Yoichi Takebayashi; 洋一竹林
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1990-09-07
Filing date: 1991-09-06
Publication date: 1993-01-22

Abstract

PURPOSE: To eliminate the need to compute all high-order parameters in real time, and with it the need for special hardware that performs high-order parameter analysis at high speed. CONSTITUTION: A voice recognition device comprises a low-order parameter analysis unit 2 that obtains the low-order parameter time series of input speech; a start/end detection unit 3 that detects the start and end of the input speech using the parameter time series obtained by the low-order parameter analysis unit 2; a high-order parameter analysis unit 5 that obtains high-order parameters at preset sample points within the input speech corresponding to the detected start and end; a pattern matching unit 7 that matches feature parameters based on the obtained high-order parameters against previously registered reference parameters to calculate similarity; and a recognition result output unit 8 that outputs the recognition result for the input speech according to the calculated similarity.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a voice recognition device that recognizes speech by extracting parameters from input speech.

[0002]

[Prior Art] In general speech recognition methods, the mainstream approach extracts both low-order parameters and high-order parameters from the input speech and uses both to recognize the input speech accurately. In this approach, low-order parameter analysis and high-order parameter analysis are performed on the input speech simultaneously and in real time. High-order parameter analysis, however, requires the DFT spectrum, filter bank outputs, LPC parameters, and the like to be analyzed in real time every 8 msec, so special high-speed signal processing hardware such as a DSP (digital signal processor) has been required.

[0003] For example, in the conventional speech recognition system described in "Recognition of Telephone Speech by a Hybrid Structure Matching Method" (Asada et al.), Acoustical Society, March 1982, low-order parameter analysis and high-order parameter analysis are first performed in parallel on the input speech (the output of the voice input unit). The start and end of the input speech are detected from the result of the low-order parameter analysis, and resample frames are determined, according to a preset number of frames, from the range of input speech corresponding to that start and end. According to these resample frame numbers, a fixed-dimension speech feature vector is extracted from the high-order parameter analysis results (already computed) for that range of input speech, and is matched against the standard patterns of the registered word dictionary to calculate similarity. The recognition result is output according to the similarity.

[0004]

[Problems to Be Solved by the Invention] In the conventional speech recognition method described above, high-order parameter analysis is first performed on the entire input speech, and a word feature vector is then extracted using only the high-order parameters that are actually needed. In other words, only a few frames of the high-order parameter analysis results for the input speech are used, so most of the high-order parameters that were computed go to waste. Put differently, computation that is unnecessary for recognition is being performed.

[0005] As described above, the conventional speech recognition method requires special hardware in order to analyze high-order parameters in real time for the entire input speech, and most of the high-order parameters so computed are wasted.

[0006] An object of the present invention is to provide a voice recognition device that requires no special hardware, by analyzing the high-order parameters of the input speech only for the frames that are needed.

[0007]

[Means for Solving the Problems] According to the present invention, there is provided a voice recognition device comprising: a low-order parameter analysis unit that obtains a low-order parameter time series of input speech; a start/end detection unit that detects the start and end of the input speech using the parameter time series obtained by the low-order parameter analysis unit; a high-order parameter analysis unit that obtains high-order parameters at predetermined sample points from the range of input speech corresponding to the start and end detected by the start/end detection unit; a pattern matching unit that calculates similarity by matching feature parameters based on the parameters obtained by the high-order parameter analysis unit against standard parameters registered in advance; and a recognition result output unit that outputs a recognition result for the input speech according to the similarity calculated by the pattern matching unit.

[0008]

[Operation] In the present invention, not all high-order parameters are computed in real time, so special hardware such as a DSP for performing high-order parameter analysis at high speed is no longer needed. Speech recognition can therefore be achieved even with the computing power of a personal computer or the like.

[0009] Moreover, even where computing power is large, as in a high-speed workstation, and real-time processing of high-order parameters is possible in software alone, if the system is multitasking and other tasks are running in parallel, those other tasks run more efficiently because unnecessary high-order parameters are no longer computed.

[0010]

[Embodiments] Embodiments will be described below with reference to the drawings.

[0011] In the voice recognition device shown in FIG. 1, which is one embodiment of the present invention, the voice input unit 1 converts a speech signal input via a microphone or the like into a digital signal. The voice input unit 1 comprises, for example, a low-pass filter with a cutoff frequency of 5.6 kHz that removes high-frequency noise components contained in the input speech signal, and an A/D converter that converts the input speech (an analog signal) passed through this low-pass filter into a digital signal with, for example, a sampling frequency of 12 kHz and 12 quantization bits.

[0012] The low-order parameter analysis unit 2 computes low-order parameters, such as energy and zero-crossing count, that can be obtained from the speech data with little computation, for example with an analysis window length of 24 msec and an analysis frame period of 8 msec. The low-order parameters obtained are used by the start/end detection unit 3 that follows to detect the start and end of a word.
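The low-order analysis described here can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `low_order_parameters` and the use of the sum of squared samples as "energy" are assumptions. The 12 kHz sampling rate, 24 msec window, and 8 msec frame period are the example values given in the text.

```python
def low_order_parameters(samples, sample_rate=12000, window_ms=24, period_ms=8):
    """Return one (energy, zero_crossings) tuple per analysis frame.

    Energy is taken as the sum of squared samples (an assumption; the
    text does not define it). Frames that do not fit entirely inside
    `samples` are dropped.
    """
    window = sample_rate * window_ms // 1000   # 288 samples at 12 kHz
    hop = sample_rate * period_ms // 1000      # 96 samples at 12 kHz
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        frame = samples[start:start + window]
        energy = sum(s * s for s in frame)
        # Count sign changes between neighbouring samples.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        frames.append((energy, zero_crossings))
    return frames
```

Both quantities need only one pass over the samples of a frame, which is why they are cheap enough to compute for every frame in real time.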

[0013] As shown in FIG. 2, the start/end detection unit 3 sets a threshold on the energy, which is a low-order parameter, and takes the time at which the energy of an analysis frame rises above this threshold as the start of the word and the time at which the energy falls below the threshold as the end of the word. The start and end of a word can be detected in the same way when the zero-crossing count is used as the low-order parameter instead of energy. Various methods for detecting the start and end of speech have been proposed; for example, the start/end detection method disclosed in Y. Takabayashi, H. Shinaoda, H. Asada, T. Nitta, S. Hirai and S. Watanabe, "Telephone speech recognition using a hybrid method", Proc. Seventh IICPR, pp. 1232-1235, 1984 can be used.
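The basic threshold rule of FIG. 2 can be sketched as below. This is a hedged illustration only: `detect_endpoints` is a hypothetical name, the input is assumed to be a list of per-frame energies, and returning the buffer length when the energy never falls back below the threshold is a choice made for the sketch.

```python
def detect_endpoints(energies, threshold):
    """Return (start_frame, end_frame), or None if no frame exceeds the threshold.

    start_frame: first frame whose energy is above the threshold.
    end_frame:   first later frame whose energy is below the threshold.
    """
    start = None
    for i, energy in enumerate(energies):
        if start is None:
            if energy > threshold:
                start = i
        elif energy < threshold:
            return start, i
    if start is None:
        return None
    # Speech ran to the end of the buffer (handling chosen for this sketch).
    return start, len(energies)
```

The same function works on zero-crossing counts, since it depends only on comparing a per-frame value with a threshold.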

[0014] The resample frame determination unit 4 determines the resample frames, that is, the frames on which high-order parameter analysis is to be performed, according to the detected start and end. For example, as shown in FIG. 2, the determination unit 4 selects, for example, 10 equally spaced frames from among the analysis frames between the detected start and end. The number of frames is determined in advance. The resample frames need not be equally spaced, of course, and may be placed at unequal intervals in view of the data and the like. When 10 frames are required, the actual frame number of each of these frames is obtained.
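Choosing equally spaced resample frames can be sketched as follows. The function name `resample_frames` is hypothetical, and including both the start frame and the end frame in the selection is an assumption; the text only says that the frames are equally spaced between the detected start and end.

```python
def resample_frames(start, end, count=10):
    """Return `count` frame numbers spread evenly from `start` to `end`.

    Both endpoints are included (an assumption of this sketch).
    """
    if count == 1:
        return [start]
    step = (end - start) / (count - 1)
    return [start + round(k * step) for k in range(count)]
```

For a word detected between frames 20 and 110, `resample_frames(20, 110)` selects every tenth frame from 20 to 110, and only those 10 frames would be passed on to high-order analysis.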

[0015] The high-order parameter analysis unit 5 obtains high-order parameters, which require a large amount of computation, such as filter analysis outputs, the DFT spectrum, the cepstrum, and LPC parameters, from the speech signal supplied by the voice input unit 1, but only for the resample frames that have been determined. The speech feature vector extraction unit 6 computes word feature parameters from the high-order parameters obtained. For example, when 12th-order high-order parameters are used, a 12 × 10 = 120-dimensional word feature vector is obtained as follows.

[0016] When the speech feature vector is obtained by FFT (fast Fourier transform), a 128-point FFT over, for example, a 24 msec time length yields a frequency spectrum (DFT spectrum) Xk with a resolution of 128 points. The power |Xk|² of this frequency spectrum Xk is smoothed in the frequency direction and divided into 12 bands along the frequency axis, which gives the 12-channel (12-dimensional) filter bank equivalent outputs Zi (i = 1, 2, ..., 12). Specifically, to obtain the 12-channel filter bank equivalent outputs Zi (i = 1, 2, ..., 12),

[0017]

[Equation 1]

the frequency spectrum is smoothed in the frequency direction as given by [Equation 1]. The filter bank equivalent outputs Zi (i = 1, 2, ..., 12) obtained in this way are converted to logarithms, Gi = 10 log Zi (i = 1, 2, ..., 12), which gives the 12-dimensional high-order parameters. The same approach applies when the word feature vector is obtained by a digital filter bank, LPC analysis, or cepstrum analysis.
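The FFT-based path can be sketched as follows. Several points are assumptions, because [Equation 1] is not reproduced in this text: the smoothing is approximated by averaging |Xk|² over 12 contiguous, equal-width bands of the non-redundant half of the spectrum; the logarithm is taken as base 10; and each frame is assumed to arrive as exactly 128 samples (the text does not say how the 24 msec window is mapped onto 128 points). The function names are hypothetical, and a direct DFT is used instead of an FFT to keep the sketch dependency-free.

```python
import cmath
import math


def filter_bank_parameters(frame, channels=12):
    """Return `channels` log band powers G_i = 10*log10(Z_i) for one frame."""
    n = len(frame)          # 128 in the embodiment
    half = n // 2           # bins 0..half-1; the rest mirror these for real input
    power = []
    for k in range(half):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(x_k) ** 2)
    params = []
    for i in range(channels):
        lo = i * half // channels
        hi = (i + 1) * half // channels
        z_i = sum(power[lo:hi]) / (hi - lo)             # assumed band averaging
        params.append(10 * math.log10(z_i + 1e-12))     # small floor avoids log(0)
    return params


def word_feature_vector(frames):
    """Concatenate the per-frame parameters of the resample frames."""
    vector = []
    for frame in frames:
        vector.extend(filter_bank_parameters(frame))
    return vector
```

With 10 resample frames and 12 channels per frame, `word_feature_vector` returns the 12 × 10 = 120-dimensional vector mentioned in the text.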

[0018] The pattern matching unit 7 calculates the similarity between the word feature vector obtained and the standard word feature vectors of the registered speech word dictionary. For this similarity calculation, matching by statistical pattern recognition, such as the Mahalanobis distance or a neural network, can be used in addition to, for example, the multiple similarity method based on fixed-dimension feature vectors. Of the similarities obtained in this way, the word with the highest similarity to the input speech is output from the recognition result output unit 8 as the recognition result.
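The matching and output steps can be sketched as below. Note the substitution: this sketch uses a simple similarity (squared cosine between the input vector and a single reference vector per word) as a stand-in, not the multiple similarity method, the Mahalanobis distance, or a neural network named in the text. The names `simple_similarity` and `recognize` and the dictionary layout are hypothetical.

```python
import math


def simple_similarity(x, ref):
    """Squared cosine between two vectors of equal length (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, ref))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in ref))
    return (dot / norm) ** 2 if norm else 0.0


def recognize(feature, dictionary):
    """Return (word, similarity) for the dictionary entry most similar to `feature`.

    `dictionary` maps each registered word to its reference feature vector.
    """
    scored = ((word, simple_similarity(feature, ref)) for word, ref in dictionary.items())
    return max(scored, key=lambda pair: pair[1])
```

Because every word feature vector has the same fixed dimension, each dictionary entry can be scored with a single pass over the vector, and the recognition result is simply the highest-scoring word.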

[0019] According to the above embodiment, unnecessary high-order parameter computation is eliminated, so the overall amount of computation is reduced. However, as shown in FIG. 3, in the method of the present invention high-order parameter analysis is performed after start/end detection, so a slight delay arises before the next step, similarity calculation. Nevertheless, because the number of sample points needed for pattern matching can be reduced along the time axis (for example, to 10 points), the time needed to compute the high-order parameters does not grow much, and the method is particularly effective.
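The size of the saving can be illustrated with a small calculation. The 8 msec frame period and the 10 resample frames come from the text; the utterance length passed in is purely an illustrative assumption, and `high_order_workload` is a hypothetical helper.

```python
def high_order_workload(utterance_ms, period_ms=8, resample_count=10):
    """Return (frames_analyzed, frames_total, fraction_analyzed)."""
    total = utterance_ms // period_ms
    if total == 0:
        return 0, 0, 0.0
    analyzed = min(resample_count, total)
    return analyzed, total, analyzed / total
```

For an assumed 800 msec word there are 100 analysis frames, so analyzing only 10 of them means high-order analysis runs on one tenth of the frames that the conventional method would process.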

[0020] Normally, when the start of speech is determined, if the input speech energy level exceeds the threshold E0 at time t0 and a fixed time then elapses, time t0 is determined to be the start. When the end of speech is determined, if the speech energy level falls to or below the threshold E0 and stays at or below the threshold E0 for a preset period TQ, the time at which the speech energy level reached the threshold E0 is determined to be the end, as shown in FIG. 4. As a method of determining the end, the method disclosed in, for example, J. G. Wilpon, L. R. Rabiner, & T. Mortion: "An Improved Word-Detection Algorithm for Telephone-Quality Speech Incorporating Both Syntactic and Semantic Constraints", AT & T Bell Technical T.63.3, 479 (1984) can be used.
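The start and end rules with their waiting periods can be sketched as below. This is one reading of the text, not the cited algorithm: "a fixed time elapses" is interpreted as the energy staying above the threshold for `min_speech` frames, and TQ is expressed as `hangover` frames. All names are hypothetical.

```python
def detect_with_hangover(energies, threshold, min_speech, hangover):
    """Return (start, end) frame indices, or None if no complete word is found.

    start: first frame that begins a run of `min_speech` frames above threshold.
    end:   first later frame that begins a run of `hangover` frames
           at or below threshold.
    """
    n = len(energies)
    start = None
    for i in range(n - min_speech + 1):
        if all(e > threshold for e in energies[i:i + min_speech]):
            start = i
            break
    if start is None:
        return None
    for i in range(start + min_speech, n - hangover + 1):
        if all(e <= threshold for e in energies[i:i + hangover]):
            return start, i
    return None
```

The end is only known `hangover` frames after it occurred, which is the fixed delay TQ that the embodiment of FIG. 5 is designed to hide.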

[0021] If, as in the usual method, the end of the speech is determined first and high-order parameter analysis is performed only afterwards, the high-order parameter analysis is always delayed by a fixed time (TQ). FIG. 5 shows an embodiment in which this delay is not included in the time needed to obtain the recognition result.

[0022] In this embodiment, the output terminal of the low-order parameter analysis unit 2a, which performs low-order parameter analysis on the speech output of the voice input unit 1, is connected to the first input terminal of the start/end candidate detection unit 3a. The start/end candidate detection unit 3a receives the output signal from the low-order parameter analysis unit 2a and detects candidates for the start and end of a word. The output terminal of the start/end candidate detection unit 3a is connected via the resample candidate determination unit 4a to the first input terminal of the high-order parameter analysis unit 5a. The second input terminal of the high-order parameter analysis unit 5a is connected to the output terminal of the voice input unit 1. The output terminal of the high-order parameter analysis unit 5a and the second output terminal of the low-order parameter analysis unit 2a are connected to the first and second input terminals of the start/end determination unit 3b. The first output terminal of the start/end determination unit 3b is connected to the input terminal of the word feature vector extraction unit 6. The second output terminal of the start/end determination unit 3b is connected to the second input terminal of the start/end candidate detection unit 3a. The word feature vector extraction unit 6 is connected via the pattern matching unit 7 to the recognition result output unit 8. Next, the operation of the voice recognition device shown in FIG. 5 will be described with reference to the flowchart of FIG. 6.

[0023] When a speech signal corresponding to, for example, the utterance "six" is input from the voice input unit 1 to the low-order parameter analysis unit 2a, the low-order parameter analysis unit 2a performs low-order parameter analysis on the input speech signal, as in the embodiment of FIG. 1. The analyzed speech signal is input to the start/end candidate detection unit 3a, which first detects a start candidate for the utterance ("six"). In this start candidate detection, when the input speech level exceeds the threshold Eo at time t0 and a fixed time TQ, for example 0.2 seconds, has elapsed, time t0 is detected as the start candidate, as shown in FIG. 7.

[0024] Next, an end candidate is detected. In this end candidate detection, when the speech energy falls to or below the threshold Eo and does not exceed the threshold Eo for the fixed time TQ, the time at which the speech energy reached the threshold Eo is detected as the end candidate, as shown in FIG. 7.

[0025] When the start and end candidates detected by the start/end candidate detection unit 3a are input to the resample candidate determination unit 4a, the resample candidate determination unit 4a determines the sample points within the range defined by the start and end candidates. When the output signal of the resample candidate determination unit 4a is input to the high-order parameter analysis unit 5a, the high-order parameter analysis unit 5a computes the high-order parameters at the sample points that have been determined.

[0026] The start/end determination unit 3b receives the output signals from the low-order parameter analysis unit 2a and the high-order parameter analysis unit 5a and determines whether the end candidate is the true end. This can be done, as shown in FIG. 8, by monitoring whether the speech energy after the end candidate point stays at or below the threshold Eo and does not exceed that threshold for the fixed time length TQ. When "six" is uttered as shown in FIG. 8 and the speech energy has not remained at or below the threshold Eo for the fixed time length TQ or longer, the start/end determination unit 3b determines that the input end candidate is not the end. In that case, the start/end determination unit 3b outputs a signal to the start/end candidate detection unit 3a, the resample candidate determination unit 4a, and the high-order parameter analysis unit 5a. The high-order parameter analysis unit 5a interrupts the computation of high-order parameters and the like, and the start/end candidate detection unit 3a obtains start and end candidates afresh. Resample candidates are then determined again, and high-order parameter analysis is performed again.
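The candidate, analyze, and confirm-or-discard flow can be sketched as a frame-by-frame simulation. This is a simplified illustration under stated assumptions: the start is taken as the first frame above the threshold (without the waiting period), an end candidate is raised as soon as the energy falls to or below the threshold, TQ is expressed as `hangover` frames, and the `analyze` callback stands in for resampling plus high-order analysis. In a real device the analysis would run concurrently during TQ rather than inside the loop. All names are hypothetical.

```python
def recognize_speculatively(energies, threshold, hangover, analyze):
    """Start analysis at every end candidate and keep it only if confirmed.

    Returns ((start, end), result, discarded), where `discarded` counts the
    end candidates rejected because speech resumed within `hangover` frames.
    Returns (None, None, discarded) if no end is confirmed.
    """
    start = None
    candidate = None    # frame index of the current end candidate
    result = None
    discarded = 0
    for i, energy in enumerate(energies):
        if start is None:
            if energy > threshold:
                start = i
        elif candidate is None:
            if energy <= threshold:
                candidate = i
                result = analyze(start, candidate)   # begin analysis immediately
        elif energy > threshold:
            candidate, result = None, None           # speech resumed: discard
            discarded += 1
        elif i - candidate + 1 >= hangover:
            return (start, candidate), result, discarded
    return None, None, discarded
```

For an utterance such as "six", the silent gap inside the word raises a first end candidate that is later discarded; the second candidate is confirmed, and by then its analysis result is already available.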

[0027] When the start/end determination unit 3b determines that the end candidate is the end, it sends the speech energy in the range from start to end to the word feature vector extraction unit 6. The word feature vector extraction unit 6 extracts a word feature vector from the speech energy it receives. The extracted word feature vector is input to the pattern matching unit 7 and matched against the reference word feature vectors. When the similarity obtained by this matching is sent to the recognition result output unit 8, the recognition result output unit 8 outputs the recognition result.

[0028] In the embodiment of FIG. 1, start/end detection always introduces a delay of the fixed time length TQ. In the embodiment of FIG. 5, analysis starts immediately after an end candidate is detected, so the fixed delay needed to detect the end of the word is no longer included in the time taken to obtain the recognition result. Faster recognition processing is therefore achieved while recognition performance is maintained.

[0029]

[Effects of the Invention] As described above, in the present invention not all high-order parameters are computed in real time, so special hardware such as a DSP for performing high-order parameter analysis at high speed is no longer needed. Speech recognition can therefore be achieved even with the computing power of a small computer such as a personal computer.

[0030] In addition, when using a computer system such as a high-speed workstation, which has large computing power, runs an operating system such as UNIX, and can process high-order parameters in real time in software alone, low-order parameter analysis, high-order parameter analysis, and so on can be run in parallel by multitasking. In this case, because the present invention does not compute unnecessary high-order parameters, other tasks can run in the time saved, and the computational efficiency of speech recognition is improved. In this way, computer resources that were previously wasted on unnecessary speech recognition processing can be made available to other tasks, which is efficient.

[Brief description of drawings]

[FIG. 1] A block diagram of a voice recognition device according to one embodiment of the present invention.

[FIG. 2] A diagram for explaining an example of start/end detection and speech feature vector extraction according to the present invention.

[FIG. 3] A diagram showing the processing time after start/end detection in the voice recognition device of the present invention.

[FIG. 4] A diagram explaining start/end detection.

[FIG. 5] A block diagram of a voice recognition device according to another embodiment.

[FIG. 6] A flowchart for explaining the operation of the voice recognition device of FIG. 5.

[FIG. 7] A diagram for explaining an example of detecting start and end candidates.

[FIG. 8] A diagram explaining an example of start/end detection and speech feature spectrum extraction.

[Explanation of symbols]

1 ... voice input unit, 2 ... low-order parameter analysis unit, 3 ... start/end detection unit, 4 ... resample frame determination unit, 5 ... high-order parameter analysis unit, 6 ... word feature vector extraction unit, 7 ... pattern matching unit, 8 ... recognition result output unit.

Claims

[Claims]

1. A voice recognition device comprising: low-order parameter analysis means for obtaining a low-order parameter time series from input speech; start/end detection means for detecting the start and end of the input speech using the parameter time series obtained by the low-order parameter analysis means; high-order parameter analysis means for obtaining high-order parameters at predetermined sample points from the range of the input speech corresponding to the start and end detected by the start/end detection means; and pattern matching means for identifying the input speech by matching a feature pattern based on the parameters obtained by the high-order parameter analysis means against standard patterns registered in advance.

2. The voice recognition device according to claim 1, wherein the low-order parameters include analysis results that require little computation, such as speech power and zero-crossing count.

3. The voice recognition device according to claim 1, wherein the high-order parameters include analysis results that require a large amount of computation, such as the DFT spectrum, filter bank analysis, and LPC parameters.

4. A voice recognition device comprising: low-order parameter analysis means for obtaining a low-order parameter time series from input speech and outputting speech energy; start/end candidate detection means for detecting start and end candidates of the input speech from the relation between the level of the speech energy output from the low-order parameter analysis means and a threshold; high-order parameter analysis means for obtaining high-order parameters at predetermined sample points from the range of the input speech corresponding to the start and end candidates detected by the start/end candidate detection means; start/end determination means for monitoring the relation between the level of the speech energy and the threshold and determining the start and end candidates to be the start and end; and identification means for identifying the input speech, in response to the determination of the start and end, on the basis of the parameters obtained by the high-order parameter analysis means.