JPH11252595A

JPH11252595A - Voice recognition system having push signal reception function and device realizing the system

Info

Publication number: JPH11252595A
Application number: JP10049724A
Authority: JP
Inventors: Yoshio Nakadai; 芳夫中台; Yoshitake Suzuki; 義武鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-03-02
Filing date: 1998-03-02
Publication date: 1999-09-17

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of hardware by implementing a push signal reception function together with speech recognition in the same hardware and algorithm, and to handle speech recognition and push signal reception as a single function. SOLUTION: A signal from the telephone network is supplied to a power spectrum conversion section 4, which calculates its power spectrum and feeds it to a push signal detection section 5 and a feature analysis section 7. The push signal detection section 5 determines whether the received voice signal contains a push signal component, while the feature analysis section 7 extracts speech features. A result determination section 10 sends the symbol string of the push signals to a result output section 12 when such a symbol string has been stored. When a voice section has been detected instead, it starts a speech recognition section 11, and the recognition result is sent to the result output section 12.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to speech recognition over the telephone and to the reception of push signals.

[0002]

[Prior art] Voice message services by telephone are in use. When a caller dials a specific number, an automatically played voice message such as breaking news, a weather forecast, or a prize quiz can be heard. The voice messages are organized as a menu: when the caller sends menu selection information, for example a push signal (also called dial tone or DTMF), a different voice message corresponding to that signal is played.

[0003] Recently, speech recognition performance has reached a nearly practical level even for telephone speech, and its application to the voice message services described above is being studied. For example, suppose the push signal code that calls up the "Professional Baseball News" voice message is "2689#", a wordplay on "puro yakyu" (professional baseball). A caller who cannot make the association from "professional baseball" to "2689#" cannot use this voice message service. With speech recognition, however, it becomes possible to build a system in which the caller phones the voice message service center, says the single phrase "professional baseball", and immediately hears the desired professional baseball news message.

[0004] With current speech recognition technology, however, very short utterances are difficult to recognize. The most typical example is spoken digits. If the digits 0 through 9 to be recognized are named "zero", "ichi", "ni", "san", "yon", "go", "roku", "nana", "hachi", and "kyu", then pairs such as "ichi" and "hachi", or "san" and "yon", have similar acoustic features and are hard to distinguish. In addition, utterances such as "ni" and "go" are so short that they are prone to not being processed for recognition at all, even when spoken. In this respect, when digits must be entered over a telephone-grade voice band, conventional push signal input is advantageous because it produces fewer errors. For this reason, some hardware for telephone speech recognition adds a push signal reception function. One example is the speech recognition board sold by NTT Intelligent Technology Co., Ltd. under the name "Parrot Board".

[0005] In speech recognition boards of this kind, the hardware portion for speech recognition and the hardware portion for push signal reception are generally separate. The reasons include the following: the analysis processing used in speech recognition (for example, LPC analysis) is not suited to analyzing signal waveforms such as push signals, which have sharp spectral components only in specific frequency bands; and LSIs dedicated to push signal reception are commercially available at low cost.

[0006] Conventionally, however, speech recognition hardware is built from a digital signal processing LSI (DSP: Digital Signal Processor) and requires a digital signal input, whereas a push signal receiving LSI has an analog input for the sake of versatility, even when it uses a DSP internally. Consequently, separate waveform conversion circuits and a voice branching circuit are needed to feed the audio from the telephone line into each piece of hardware, and the total amount of hardware increases. In particular, in hardware for computer telephony integration (CTI), a new form of telephony, audio is processed as digital signals, so a push signal receiving LSI, which requires an extra D/A conversion circuit, is difficult to mount on CTI hardware.

[0007]

[Problem to be solved by the invention] As described above, speech recognition hardware and push signal receiving hardware have conventionally required different signal input means, so mounting both in the same equipment increased the hardware scale. Recently, however, DSPs for speech recognition have become faster, and real-time processing is now possible even for computations that demand considerable processing power, such as the fast Fourier transform. The fast Fourier transform is one example of a method for analyzing the speech spectrum in fine frequency bands, and it can accurately detect even a push signal, which has sharp spectral components only in specific frequency bands. Therefore, if push signals can be analyzed and detected within the speech analysis stage used for recognition, no additional push signal receiving hardware is needed, and from the viewpoint of an application that uses speech recognition, speech recognition and push signal reception can be handled as the same function.

[0008] The object of the present invention is, as described above, to implement a push signal reception function together with speech recognition in the same hardware and algorithm, thereby reducing the amount of hardware and allowing speech recognition and push signal reception to be handled as the same function.

[0009]

[Means for solving the problem] According to the present invention, input speech from the voice input means is converted into a digital waveform by the conversion means; the power spectrum of the digital waveform is obtained by the power spectrum conversion means; a push signal component is detected from the power spectrum by the push signal detecting means; and the detected push signal is converted into a symbol string and stored by the push signal storage means. At the same time, the feature analysis means derives speech features from the power spectrum, and the speech features are stored in the analysis result storage means. The determination means determines, from the detected push signal symbol string and the detected speech features, which is the valid input result; the recognition means performs speech recognition using the stored speech features; and the output means outputs the push signal symbol string or the speech recognition result.

[Operation] In the present invention, a voice signal is received from a telephone line and short-time spectral analysis is performed as the speech analysis method for recognition. If a push signal component is detected in the spectral analysis result, the push signal is converted into a symbol and stored. If no push signal is observed and speech features are detected, recognition processing is performed on the detected voice section.

[0010]

[Embodiments of the invention] FIG. 1 shows a speech recognition device that is an embodiment of the present invention. The line input terminal 1 connects to a telephone line 22 via a public network 21. The line interface unit 2 performs call control for the connected telephone line 22 and supplies the telephone audio to the voice waveform conversion unit 3. The voice waveform conversion unit 3 converts the telephone audio into a digital waveform that is easy to handle in the subsequent signal processing. The power spectrum conversion unit 4 divides the converted waveform into short time sections, obtains the power spectrum of each, and supplies it to the push signal detection unit 5 and the feature analysis unit 7. The push signal detection unit 5 detects push signal components present in the power spectrum. The push signal storage unit 6 stores the push signals, detected one digit at a time, as a digit string. The feature analysis unit 7 extracts from the power spectrum the speech features needed for recognition and supplies them to the voice section detection unit 8 and the analysis result storage unit 9. The voice section detection unit 8 identifies the voice section needed for recognition, and the extracted speech features for that section are stored in the analysis result storage unit 9. The result determination unit 10 determines whether the speech features or the push signal should be treated as valid for the analysis result. If the speech features are determined to be valid, the speech recognition unit 11 performs speech recognition. The result output unit 12 outputs the speech recognition result or the push signal reception result.

[0011] The operation of the device shown in FIG. 1 is described below. When a caller places a call from a telephone 24 to the speech recognition device 23 of the present invention, the call is connected to the line interface unit 2 via the line input terminal 1, and its voice path is connected to the voice waveform conversion unit 3. The caller's voice, or a push signal produced by the caller's key operation, is converted into a digital waveform by the voice waveform conversion unit 3. If the telephone line connected to the line input terminal 1 belongs to an analog network, the operation of the voice waveform conversion unit 3 corresponds to A/D conversion; if the line belongs to an ISDN network, the operation is a conversion from a nonlinear digital waveform to a linear digital waveform. The sampling frequency of the digital waveform is, for example, 8 kHz. The digital waveform output from the voice waveform conversion unit 3 is divided into overlapping sections (called frames) 32 msec long, taken for example every 8 msec, and input to the power spectrum conversion unit 4, which obtains the power spectrum information of the audio. The power spectrum conversion unit 4 may use, for example, the spectrum calculation method well known as the fast Fourier transform, or, as another example, the autocorrelation function, a feature physically equivalent to the power spectrum. The result of this calculation is sent to the push signal detection unit 5, which determines whether the input audio signal contains a push signal component. A push signal is identified as a continuous waveform combining pure tones of two frequencies, a high-group frequency F1 [Hz] and a low-group frequency F2 [Hz], shown in FIG. 2. The determination must satisfy the following conditions: (1) a continuous signal with F1 and F2 components is accepted only if it persists with a certain power for at least a certain time; (2) even a continuous signal with F1 and F2 components is rejected if it also has power above a certain level in other bands (for example, a human voice); and (3) a very brief change in the power of F1 or F2 is not mistaken for the input of another push signal (a phenomenon generally called chattering).

FIG. 3 shows an example flowchart of a specific determination algorithm. The push signal detection unit 5 first receives the short-time power spectrum (step 1) and observes the signal power Pow of the short time section (step 2). Next, among the combinations of the high-group frequency F1 and the low-group frequency F2 shown in FIG. 2, it estimates the combination of F1 and F2 with the greatest power, and derives the maximum power M as the sum of the signal powers in only those F1 and F2 bands (step 3). It also observes the maximum power Q in the frequency band from 300 Hz to 600 Hz (step 4). This band lies between the lower limit of the telephone voice band (300 Hz to 3.4 kHz) and the lower limit of the push signal frequencies F1 in FIG. 2. It is the band in which the second and third harmonics of the voice pitch frequency are observed, so when speech rather than a push signal is observed, Q is greater than or equal to M. The push signal is therefore accepted on the condition that the signal power Pow is at least a threshold A and the power ratio M/Q is at least a threshold B, where B > 1 (step 5).
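Steps 1 to 5 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: FIG. 2 is not reproduced in this text, so the standard DTMF frequency assignment is assumed in its place; the thresholds A and B are arbitrary placeholder values; a naive DFT stands in for the fast Fourier transform; and all function names, including the `tone` helper, are hypothetical.

```python
import math

SAMPLE_RATE = 8000   # Hz, the example sampling frequency in the text
FRAME_LEN = 256      # 32 msec at 8 kHz
FRAME_SHIFT = 64     # 8 msec at 8 kHz

# Assumption: standard DTMF assignment, standing in for FIG. 2.
LOW_GROUP = (697, 770, 852, 941)
HIGH_GROUP = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")  # rows: low group, columns: high group


def split_frames(samples):
    """Return overlapping 32 msec frames taken every 8 msec."""
    last_start = len(samples) - FRAME_LEN
    return [samples[s:s + FRAME_LEN] for s in range(0, last_start + 1, FRAME_SHIFT)]


def power_spectrum(frame):
    """Power at DFT bins 0..N/2 (naive DFT; an FFT yields the same values)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        w = 2.0 * math.pi * k / n
        re = sum(x * math.cos(w * i) for i, x in enumerate(frame))
        im = sum(x * math.sin(w * i) for i, x in enumerate(frame))
        spectrum.append((re * re + im * im) / n)
    return spectrum


def band_power(spectrum, freq_hz):
    """Power in the bin nearest freq_hz plus its two neighbours."""
    n = 2 * (len(spectrum) - 1)
    k = round(freq_hz * n / SAMPLE_RATE)
    return sum(spectrum[k - 1:k + 2])


def detect_push_signal(spectrum, threshold_a=1.0, threshold_b=10.0):
    """Steps 2 to 5 for one frame: return (symbol or None, M)."""
    n = 2 * (len(spectrum) - 1)
    pow_total = sum(spectrum)                                   # step 2: Pow
    row = max(range(4), key=lambda r: band_power(spectrum, LOW_GROUP[r]))
    col = max(range(4), key=lambda c: band_power(spectrum, HIGH_GROUP[c]))
    m = (band_power(spectrum, LOW_GROUP[row])
         + band_power(spectrum, HIGH_GROUP[col]))               # step 3: M
    lo = math.ceil(300 * n / SAMPLE_RATE)
    hi = math.floor(600 * n / SAMPLE_RATE)
    q = max(spectrum[lo:hi + 1])                                # step 4: Q
    # Step 5: Pow >= A and M / Q >= B, written without a division by Q.
    accepted = pow_total >= threshold_a and m >= threshold_b * q
    return (KEYPAD[row][col] if accepted else None), m


def tone(freqs, n=FRAME_LEN):
    """Hypothetical test helper: a sum of unit-amplitude sine waves."""
    return [sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in freqs)
            for i in range(n)]
```

Picking the strongest low-group and high-group bands independently gives the same pair as searching all combinations for the largest summed power. A signal with strong energy between 300 Hz and 600 Hz, as speech would have, drives Q up and fails the M/Q test even if it also has energy near a push signal frequency.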

[0012] Next, it is determined whether the accepted push signal is the same as the push signal observed immediately before (step 6). If the combination of F1 and F2 differs, the signal is treated as the input of a new digit and its information is sent to the push signal storage unit 6 (step 7). If it is the same, the same push signal is regarded as still being sent (that is, a push button is being held down without the finger leaving it) and is ignored. If the signal was not accepted as a push signal in step 5, it is determined whether the signal power M is below a lower threshold C (step 8). If it is, the push signal input is regarded as completely absent (that is, the finger has left the push button) and the previously input push signal information is discarded (step 9). If it is not, the push signal is judged to be still in the middle of being input, and the frame is ignored. This prevents chattering, as noted above. When these determinations are complete, the push signal detection unit 5 returns to the state of waiting for the signal power of the next short-time frame (step 10).
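The frame-to-frame logic of steps 6 to 10 amounts to a small state machine. The sketch below is one possible reading of that logic; the class name, method name, and the threshold value used in the usage note are hypothetical and do not come from the patent.

```python
class PushSignalTracker:
    """Frame-to-frame logic of steps 6 to 10 (names are illustrative)."""

    def __init__(self, threshold_c):
        self.threshold_c = threshold_c  # lower threshold C on the power M
        self.previous = None            # symbol currently held, if any

    def update(self, symbol, m):
        """Process one frame; return a symbol only when a new digit starts."""
        if symbol is not None:              # accepted as a push signal in step 5
            if symbol != self.previous:     # step 6: differs from the previous one
                self.previous = symbol
                return symbol               # step 7: report a new digit
            return None                     # same key is still held down
        if m < self.threshold_c:            # step 8: power M below threshold C
            self.previous = None            # step 9: key released, discard
        return None                         # otherwise: still mid-input, ignore
```

Because the held symbol is cleared only when M falls below C, a single frame that fails the step 5 test while the tone is still present does not make the same key register twice, which is the chattering protection the text describes. Pressing the same key twice is registered only after a release has been seen in between.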

[0013] Inputs judged by the above algorithm to be new push signal digits are stored in the push signal storage unit 6 as a string of the symbols listed in FIG. 2. A symbol corresponding to end of input (called the delimiter) is reserved in the push signal storage unit 6. When the push signal detection unit 5 sends the delimiter, the symbol string stored up to the delimiter is sent to the result determination unit 10. The push signal storage unit 6 also measures the time elapsed since the last push signal was input (the inter-digit time). When the inter-digit time reaches a predetermined value, the push signal input is regarded as finished and the symbol string stored so far is sent to the result determination unit 10.

[0014] Meanwhile, the calculation result of the power spectrum conversion unit 4 is also branched to the feature analysis unit 7, where features for speech recognition are extracted. The feature extraction method is, for example, log power, or, as another example, the well-known LPC cepstrum. Features that readily reflect voice section information, such as log power, are input to the voice section detection unit 8 and used to detect section information. One section detection method is, for example, to regard every section whose power is at or above a certain value as speech. Alternatively, depending on the algorithm of the speech recognition unit 11, the voice section detection unit 8 may detect only the start of speech and regard a fixed length of time after it as the voice section. For sections that the voice section detection unit 8 judges to be speech, the speech features output by the feature analysis unit 7 are stored in the analysis result storage unit 9.

[0015] The result determination unit 10 receives the stored push signal information from the push signal storage unit 6 and the section detection information from the voice section detection unit 8, and determines which result to treat as the valid output. Like speech, a push signal has a certain signal power and duration, so if a push signal has been stored, it has very likely also been misjudged as a voice section. It is therefore reasonable to give priority to the push signal output: if a push signal symbol string has been stored, the result determination unit 10 sends it to the result output unit 12; if not, it starts the speech recognition unit 11. The result determination unit 10 can also output only speech recognition results or only push signal results. The recognition method used in the speech recognition unit 11 is, for example, template matching, or, as another example, the statistical matching method called HMM. Speech recognition generally requires more computation than speech analysis, so determining in advance whether the input is speech or a push signal avoids unnecessary computation. The speech recognition result is sent to the result output unit 12, which sends the recognized vocabulary item, or the push signal symbol string, to the external equipment that uses the speech recognition device 23.
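The priority rule of the result determination unit can be written as a short function. This is a sketch under assumptions: the function name, the `mode` values, and the shape of the return value are hypothetical, and `recognize` stands for whatever recognizer (template matching, HMM, or other) the device uses.

```python
def determine_result(push_symbols, speech_features, recognize, mode="both"):
    """Pick the valid input: push signal first, speech recognition otherwise.

    mode is "both", "push_only", or "speech_only" (hypothetical names).
    Returns ("push", symbols), ("speech", recognition result), or None.
    """
    if mode != "speech_only" and push_symbols:
        # A stored symbol string wins; the recognizer is never started.
        return ("push", push_symbols)
    if mode != "push_only" and speech_features:
        # Recognition runs only when no push signal result is available.
        return ("speech", recognize(speech_features))
    return None
```

Because the recognizer is called only on the second branch, the costly recognition step is skipped whenever a push signal symbol string is available, which matches the point in the text about avoiding unnecessary computation.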

[0016] As indicated by the control unit 14 in FIG. 1, the speech recognition device of the present invention may be built around, for example, a microprocessor that controls each unit in sequence and reads from and writes to the various storage units. In other words, the device can also be operated by a computer that interprets and executes a program.

[0017]

[Effects of the invention] As described above, compared with conventional speech recognition devices, the present invention has the following advantages: (1) the hardware needed for push signal reception is subsumed in the speech recognition hardware, which prevents an increase in the amount of hardware; and (2) because the push signal reception algorithm is subsumed in the speech recognition algorithm, an application can invoke speech recognition and push signal reception through a common procedure and can use either one alone, or both, as its means of information input.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the functional configuration of an embodiment of the speech recognition device according to the present invention.

FIG. 2 is a diagram showing the relationship between PB signal symbols and their frequencies.

FIG. 3 is a flowchart showing an example of the operating procedure of the push signal detection unit 5 and the storage unit 6 in FIG. 1.

Claims

1. A speech recognition device having a push signal receiving function, comprising: voice input means; conversion means for converting input speech from the voice input means into a digital waveform; power spectrum conversion means for obtaining the power spectrum of the digital waveform; push signal detecting means for detecting a push signal component from the power spectrum; push signal storage means for converting the detected push signal into a symbol string and storing it; feature analysis means for deriving speech features from the power spectrum; analysis result storage means for storing the speech features; determination means for determining, from the detected push signal symbol string and the detected speech features, which is the valid input result; recognition means for performing speech recognition using the speech features in the analysis result storage means; and output means for outputting the push signal symbol string or the speech recognition result in accordance with the determination result of the determination means.

2. A recording medium on which is recorded a program for causing a computer to execute: a step of obtaining a power spectrum from an input digital waveform; a step of detecting a push signal component from the power spectrum; a step of converting the push signal into a symbol string and storing it in push signal storage means; a feature analysis step of deriving speech features from the power spectrum; a step of storing the speech features in analysis result storage means; a determination step of determining, from the detected push signal symbol string and the detected speech features, which is the valid input result; a step of outputting the push signal symbol string when the determination finds the push signal symbol string to be valid; and a recognition step of, when the determination finds the speech features to be valid, performing speech recognition using the speech features stored in the analysis result storage means and outputting the result.

3. The recording medium according to claim 2, wherein the program includes executing the recognition step when it is determined that the signal is not a push signal.

4. The recording medium according to claim 2, wherein the program includes executing a voice section detection step of detecting a voice section from the speech features and using the detected voice section information in the determination step to determine validity on the basis of the speech features.