JP3588975B2

JP3588975B2 - Voice input device

Info

Publication number: JP3588975B2
Application number: JP13579397A
Authority: JP
Inventors: 健大野; 則政岸
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 1997-05-12
Filing date: 1997-05-12
Publication date: 2004-11-17
Anticipated expiration: 2017-05-12
Also published as: JPH10312193A

Description

【０００１】
【発明の属する技術分野】
この発明は、少ない入力回数で大語彙認識を行なうことができる音声入力装置に関する。
【０００２】
【従来の技術】
音声を識別して情報を読み取る音声入力装置は、手を介さずに指令やデータなどの入力ができるため、障害者用ワープロ、車載ナビゲーションといったような手操作が困難な場面に採用されつつある。
典型的な音声入力装置は、図１３に示すように、音声を取り込み、電気信号に変えて音声信号を出力するマイク５００と、アナログの音声信号をＡ／Ｄ変換してディジタル情報に変換させるＡ／Ｄコンバータ５０１と、音声信号から単語認識をするのに用いられる単語辞書を記憶した外部記憶装置５０３と、ＣＰＵとメモリを持ち単語辞書をメモリに読み込ませたうえでＣＰＵが音声信号と辞書内の単語との一致度判定により音声認識を行なう信号処理装置５０２とを有する。
スイッチ５０５は装置使用者に操作され音声信号の取り込みタイミングを信号処理装置５０２に与える。モニタ５０４は認識された単語を表示する。
【０００３】
一般に音声による入力は、認識対象の語数が多くなる程認識率が低下するため、大語彙認識の場合階層化した辞書が使用される。入力の際、辞書の分類に従って複数回音声入力を繰り返して階層を進み、認識対象語数を絞り込んだ状態で認識し、誤認識率を低下させるようにしている。
【０００４】
上記従来の装置における音声認識の詳細について、図１４のフローチャートに従って説明する。
ここでは、ＪＲの駅名である「桜木町」を音声で入力する場合を示す。このような駅名入力は、自動車のナビゲーション装置の目的地設定などに用いられている。外部記憶装置５０３には例えば図２のような階層化した単語辞書を記憶させてある。
【０００５】
まず、ステップ５２１において、音声入力の初期状態で、信号処理装置５０２は階層化された単語辞書の、階層の最上位の部分辞書を最初の認識対象とし設定する。ここでは図２中の部分辞書ａが設定される。
次に、装置使用者はスイッチ５０５を操作してマイク５００に「施設」を発話する。これは「桜木町」が「施設」という範疇に含まれることを使用者が認識しているためである。
スイッチ５０５が操作されたことは、ステップ５２２において、検出される。これによって音声認識処理が開始される。
【０００６】
ステップ５２３では、信号処理装置５０２内のＣＰＵが設定された階層の部分辞書を外部記憶装置５０３からメモリに読み込ませる。初期では部分辞書ａが読み込まれる。
信号処理装置５０２は、スイッチ５０５が操作されるまでの音声の平均パワーを演算しており、スイッチ５０５が操作された後、音声の平均パワーに比べて音声信号の瞬間パワーが所定値以上大きくなったとき、使用者が発話開始したと判断し、ステップ５２４において音声取り込みを開始する。
【０００７】
ステップ５２５では、ＣＰＵが入力された音声とメモリに読み込んだ部分辞書内の単語との一致度を演算する。初期では「住所」、「施設」それぞれとの一致度を演算する。
【０００８】
そして、音声信号の瞬間パワーが所定値以下になった時、使用者の発話が終了したと判断し、ステップ５２６において、音声の取り込みを終了する。
入力音声と部分辞書内の単語との一致度の演算を終了すると、ステップ５２７において、ＣＰＵが最も一致度高い単語を選択する。ここでは「施設」の方が一致度が高いので選択される。
【０００９】
ステップ５２８では、選択された一致度の高い単語が認識の結果としてモニタ５０４において表示される。
ステップ５２９では、選択された単語が下の階層の部分辞書名であるかどうかを判定し、辞書名である場合ステップ５２３へ、そうでない場合ステップ５３０へ進む。ここでは単語「施設」は図２の単語辞書に示されるように下の階層の部分辞書ｅの辞書名となっているので、ステップ５２３に戻る。
【００１０】
ステップ５２３では、ＣＰＵが外部記憶装置５０３から新たに部分辞書ｅをメモリに読み込む。このように、処理を繰り返すたび階層が進む。選択された単語の下に部分辞書がなくなると、ステップ５３０で、認識された単語は目的語となり出力される。ここでは「桜木町」が出力される。
【００１１】
【発明が解決しようとする課題】
上記のように従来の音声入力装置においては、一つの単語を伝えるのに、その単語の属する最下層までの部分辞書名を階層毎に入力しなければならない。その階層は単語数が多いほど複雑になるので、大語彙の中からの認識になるほど使用者にとって入力負担が大きいという問題があった。また使用者が最も伝えたいのは目的語であるにも係わらず、目的語の入力が最後で感覚的に違和感を抱かせ、使い勝手がよくないという問題があった。
本発明は、上記従来の問題点に鑑み、少ない入力回数で正確な音声入力を実現することを目的としている。
【００１２】
【課題を解決するための手段】
発話音声を情報化して入力する音声入力手段と、
複数の単語を含む部分辞書からなるとともに階層構造を持った単語辞書と、
使用する部分辞書を決定する部分辞書決定手段と、
前記音声入力手段からの入力音声と決定された部分辞書内の単語との一致度を演算する演算手段と、
演算された単語のうち最も一致度の高い単語を選択して出力する単語選択手段とを有し、
前記部分辞書決定手段は、初めに所定の最下層の部分辞書を決定し、上位階層への変更指示を受けた場合、指示された上位階層の部分辞書を決定し、入力により単語選択手段が選択した単語が示す階層に変更し、該階層下の所定の最下層部分辞書を決定するものとした。
【００１３】
前記部分辞書決定手段により初めに決定される部分辞書は前回使用された最下層の部分辞書であることが望ましい。
また前回使用された最下層の部分辞書の代わりに使用頻度の高い最下層の部分辞書を用いても可能である。
【００１４】
音声入力装置には表示手段を接続し、単語選択手段が選択した単語及び上位の階層構造を表示するのが望ましい。
前記表示手段はタッチパネルを合わせ持ち、使用者によってタッチされた階層に対応した階層変更指令を前記部分辞書決定部に出力することも可能である。
また、タッチパネルに対するタッチ入力があったときに、変更可能な階層を表示手段上で表示するのが望ましい。
【００１５】
前記部分辞書決定手段は、目的語に続き入力が行なわれたときに、上位階層への変更指示を受けたとし、設定されている最下層部分辞書より上位の全ての部分辞書を決定し、単語選択手段が選択した単語が示す階層に階層変更を行なうことができる。
【００１６】
前記部分辞書決定手段は、目的語に続き入力が行なわれたときに、設定されている最下層部分辞書及びそれより上位の全ての部分辞書を決定し、単語選択手段が選択した単語が上位部分辞書内の単語であれば、上位階層への変更指示を受けたとし、前記単語が示す階層に階層変更を行なうこともできる。
【００１７】
また、最初に入力された音声を記憶する記憶手段を合わせ持ち、階層が変更されたとき、前記単語選択手段は前記記憶手段に記憶された音声を用いて、決定された最下層の部分辞書内の単語との一致度を演算し、出力することができる。
【００１８】
【作用】
階層化した単語辞書を使用し、部分辞書を設定する際に、初めに所定の最下層の部分辞書を設定するので、大語彙認識ができるとともに、１回目の単語を目的語とすることもできる。そして一回の入力で認識できなかった場合、階層変更指令で階層変更を行なうので、例えば現在の部分辞書と目的語のある部分辞書と共通する上位階層に変更して最下層の部分辞書を決定する段取りをとることができ、最上位の部分辞書からの入力が必要でなくなり、効率のよい入力が図られる。また、最初から目的語で入力し、辞書の階層変更が認識できなかったときに行なうので、感覚に合致した音声入力が行なえる。
【００１９】
部分辞書決定手段は、上位階層への変更指示を受けた場合、上位階層の部分辞書を決定し、入力により単語選択手段が選択した単語が示す階層に変更し、該階層下の所定の最下層部分辞書を決定するようにした場合、階層変更があっても、目的語の部分辞書と上位階層との間の部分辞書に対する認識が不要で、階層が上位に変更されても、すぐ目的語の入力となるので、装置使用者を焦らせることなく、入力できる効果が得られる。
また部分辞書決定手段が初めに決定する部分辞書は前回使用された最下層の部分辞書となれば、辞書構成を理解し、階層変更が円滑に行なえる。
さらに前回使用された最下層の部分辞書の代わりに使用頻度の高い最下層の部分辞書を用いても同じ効果が得られる。
【００２０】
音声入力装置に表示手段を接続し、単語選択手段が選択した単語及び上位の階層構造を表示手段で表示するようにすると、階層変更が必要のときに、表示された階層構造を頼りに階層変更ができるので、入力負担が軽減される。
表示手段はタッチパネルを合わせ持ち、使用者によってタッチされた階層に対応した階層変更指令を前記部分辞書決定部に出力するようにすると、対話式な入力となり一層の入力負担軽減が図られる。
また、タッチ入力があったときに、音声入力によって変更可能な階層を前記表示手段上で表示した場合、辞書の構造を表示画面から知ることができ、辞書の構成などの予備知識が無くても階層変更ができる効果が得られる。
【００２１】
前記部分辞書決定手段は、目的語に続き入力が行なわれたときに、上位階層への変更指示を受けたとし、設定されている最下層部分辞書より上位の全ての部分辞書を決定し、単語選択手段が選択した単語が示す階層に階層変更を行なうようにすると、上位階層の単語のみで、階層変更ができ入力負担がさらに軽減される。
【００２２】
前記部分辞書決定手段は、目的語に続き入力が行なわれたときに、設定されている最下層部分辞書及びそれより上位の全ての部分辞書を決定し、単語選択手段が選択した単語が上位部分辞書内の単語であれば、上位階層への変更指示を受けたとし、前記単語が示す階層に階層変更を行なった場合、部分辞書が正確でありながら、単語の誤認識による階層変更が防止される。
【００２３】
最初に入力された音声を記憶する記憶手段を合わせ持ち、階層が変更されたとき、前記単語選択手段は前記記憶手段に記憶された音声を用いて、決定された最下層の部分辞書内の単語との一致度を演算し、出力することで、階層を変更し、最下層の部分辞書が変わっても、目的語の音声を二度と入力する必要がなく、音声の入力回数がさらに減らされる。
【００２４】
【発明の実施の形態】
次に、本発明の実施の形態を実施例により説明する。
図１は、本発明の第１の実施例の構成を示すブロック図である。
マイク５００により取り込まれた音声信号はＡ／Ｄコンバータ５０１でディジタル情報に変えられて信号処理装置５に入力される。スイッチ５０５は音声入力用に装置使用者が発話する直前に操作され、信号処理装置５に発話音声の検出タイミングを与えている。外部記憶装置５０３には単語辞書を記憶させてある。モニタ６０はタッチパネル付きモニタで、信号処理装置５の処理結果を表示するとともに、タッチパネルで部分辞書の変更が行なえるようになっている。
【００２５】
単語辞書は図２に示すような階層化した単語辞書が用いられる。図においては部分辞書間の連線が上下層間の繋がり関係を表示し、枠内に突出したのは下の階層を代表する単語についての表示である。
「住所」と「施設」の二つの単語で構成する部分辞書ａは単語辞書の最上層にあり、各単語に下位階層辞書として「住所」からは政令指定都市に従った分類で都道府県名を記した部分辞書ｂを設け、各都道府県について、市区名を記した部分辞書ｃと区村名を記した部分辞書ｄが順次に設けられている。
【００２６】
部分辞書ａの単語「施設」の下位には「駅」、「デパート」、「ホテル」のような「施設」の種類名を記した単語の部分辞書ｅが続いている。「駅」の下位に都道府県名を記した部分辞書ｆが設けられている。部分辞書ｆからは交通会社名を記した部分辞書ｇと、部分辞書ｇに繋がっている駅名の部分辞書ｈが都道府県毎に設けられている。
【００２７】
また、単語「デパート」も都道府県名を記した下層の部分辞書ｉと、市名を記した部分辞書ｊと、「デパート」名を記した部分辞書ｋを持ち、都道府県別に市区の部分辞書と区村の部分辞書に繋がっている。
部分辞書ｄ、部分辞書ｈ、部分辞書ｋは最下層部分辞書であり、目的語が登録され、認識したい単語はここで照合され認識される。
【００２８】
信号処理装置５は、音声入力部５３においてスイッチ５０５が操作されるまでマイク５００からディジタル化された音声信号の平均パワーを演算している。その平均パワー値より瞬間パワーが大きくなったときに、発話が開始したと判断し、音声信号の取り込みを開始する。その音声信号は単語選択部５２において単語認識される。この際最初に使われる単語辞書は部分辞書決定部５１が外部記憶装置５０３から読み込んだ最下位階層にある部分辞書である。
【００２９】
単語出力部５４はその認識の結果と単語の上位階層の名称をモニタ６０に一時出力するが、装置としての出力は保留する。この間モニタ６０上に表示された上位階層の名称をタッチ入力すれば部分辞書の変更ができる。所定時間内にタッチパネルへのタッチが無ければ、単語出力部５４は認識の結果を装置の出力として出力する。
【００３０】
部分辞書の変更があった場合は、部分辞書決定部５１はタッチされた名称の部分辞書を外部記憶装置５０３から読み込み、装置使用者はスイッチ５０５を操作して下位辞書を決定するための発話をする。部分辞書決定部５１はそれに係わる所定の最下位部分辞書を決定し、外部記憶装置５０３から読み込む。その後再び音声入力をし単語選択部５２において単語認識される。
最初に入力するあるいは階層変更後に入力される最下層の部分辞書は前回使用した部分辞書あるいは使用頻度の最も高い部分辞書である。
【００３１】
次に、図３のフローチャートに従って装置の作動の流れを説明する。
まず、ステップ１０１において、音声入力の初期状態で、階層の最下位の部分辞書を最初の認識対象として部分辞書決定部５１において設定する。初期設定として最も使用頻度の高い部分辞書あるいは前回使用された最下層の部分辞書が用いられるが、ここでは例えば図２中の部分辞書ｈが設定されたとする。
【００３２】
装置使用者はスイッチ５０５を操作して、マイク５００に発話をする。
ステップ１０２では、音声入力部５３がスイッチ５０５が操作されたか否かを検出し、操作された場合、マイク５００が動作してからの音声の平均パワー値を算出してステップ１０３へ進む。
ステップ１０３では、決定された部分辞書を前記外部記憶装置５０３から部分辞書決定部５１に読み込む。ここでは部分辞書ｈが読み込まれる。
【００３３】
ステップ１０４では、使用者の発話音声取り込みを開始する。すなわち入力した音声信号を絶えず算出した音声の平均パワーと比べ、瞬間パワーが所定値以上大きくなったとき、使用者の発話開始と判断し、発話音声の取り込みを開始する。
ここでは、例えば使用者は目的語として「桜木町」を発話したとする。
ステップ１０５では、部分辞書決定部５１に読み込まれた部分辞書ｈ内の単語と発話音声との一致度を単語選択部５２において演算する。なお、ここでの処理は音声信号をとりながら所定時間で区切った音声区間部分と各単語との比較が行なわれており、音声信号の取り込みはそれに同時に進行されている。
【００３４】
ステップ１０６で、音声信号の瞬間パワーが平均パワー以下になったとき、使用者が発話完了と判断し、音声取り込み終了してステップ１０７へ進む。
ステップ１０７では、単語選択部５２で演算された一致度の最も高い単語を認識の結果として単語出力部５４に出力する。
【００３５】
ステップ１０８では、単語出力部５４はその認識の結果と単語の上位階層の名称をモニタ６０に一時出力し表示させるが、装置としての出力は保留する。図４はモニタ６０の表示画面である。ｈが選択された単語の辞書で、ａ、ｅ、ｆ、ｇはその単語の上位辞書である。ｈの枠から表示されているのは目的語であり、その上の枠に並べられているのが目的語に関連する上位辞書内の単語で、右から左へ階層順位の増加を表示している。階層を変更する場合に、単語が映っているところをタッチすれば、階層の変更ができる。
そしてステップ１０９で、所定時間内にタッチパネルへのタッチ入力が無ければ、ステップ１１０において「桜木町」という単語の認識結果を装置の出力として出力する。
この場合、階層を変更する必要がなく、発話回数が一回で終了する。
【００３６】
もし、使用者がステップ１０４で「そごう」と発話していたとすれば、初期の部分辞書がｈであったため、他の単語が表示されてしまう。その場合、上位の階層を変更する必要が生じるため、使用者は図４中のｅをタッチ入力する。
【００３７】
そのタッチ入力がステップ１０９で検出されると処理がステップ１１１へ進む。
ステップ１１１では、タッチされた階層をどの階層に変更可能かモニタ６０に表示する。ここでは「駅」、「デパート」、「ホテル」を表示する。図５はその表示様子を示している。
ステップ１１２では、部分辞書決定部５１が新たに部分辞書ｅを外部記憶装置５０３から読み込む。
【００３８】
ステップ１１３では、タッチして表示された上位階層の中から選択して音声入力する。「そごう」の場合は使用者はそれに関連のある「デパート」を音声で入力し、単語選択部５２で認識される。本ステップはステップ１０２からステップ１０７までと同様の処理を行なってもよい。
ステップ１１４では、部分辞書決定部５１が、認識された単語「デパート」に繋がる最下層の部分辞書ｋを決定する。
【００３９】
ステップ１１５では、再度音声入力やり直しを使用者に報知するため、モニタ６０に再度入力の指示マークを表示して、ステップ１０２に戻る。
その後使用者が「そごう」を発話してステップ１０２からステップ１０８までの処理が再び行なわれて、ステップ１１０において、「そごう」という単語が出力される。
【００４０】
マイク５００、Ａ／Ｄコンバータ５０１、音声入力部５３は音声入力手段を構成している。部分辞書決定部５１は辞書決定手段を構成している。単語選択部５２は単語選択手段を構成している。モニタ６０は表示手段を構成している。
【００４１】
本実施例は以上のように構成され、第１の音声の入力が行なわれた場合、音声信号と所定の最下層の部分辞書の単語との一致度が演算されるので、第１の単語を目的語とすることができ、使用者の入力負担が軽減されることになる。また、部分辞書の変更が必要なとき、上位の階層の下に複数の階層が存在しても、その最下層の部分辞書を設定して使用するので、階層変更による入力負担増加が少ない。
そして、最初に使用する部分辞書を使用頻度の高いものあるいは前回使用したものとしているので、使用者は出力結果を容易に理解し、階層変更が円滑に行なえる。
【００４２】
また、モニタが単語を表示するときに単語の上位階層構造を表示するので、どの階層を変更したいかを容易に判断できる。
タッチ入力があったとき、音声入力によってどのような階層に変更可能かを表示することで、辞書の構成を知らなくても、どの階層に変更するかの操作ができる。
【００４３】
図６は、第２の実施例の構成を示すブロック図である。
この実施例は、図１に示す第１の実施例における信号処理装置５の代わりに音声記憶部６６を設けた信号処理装置６を用いる。音声記憶部６６は音声入力部６３の音声信号を単語選択部６２に出力するとともに、一回目の音声信号は音声記憶部６６が記憶する。部分辞書変更があった場合、音声記憶部６６に記憶された音声信号を用いて部分辞書との一致度を判定し、目的語を認識する。その他は第１の実施例と同様である。
【００４４】
次に、図７のフローチャートに従って装置の作動の流れを説明する。
ステップ２０１〜ステップ２１５まではステップ２０７を除いて第１の実施例における図３のフローチャートのステップ１０１〜ステップ１１４と同様の処理を行なう。
すなわちまず、階層の最下位の部分辞書を最初の認識対象として設定し、外部記憶装置５０３から読み込む。使用者の発話音声信号を取り込みながら設定された部分辞書内の単語と一致度を演算する。そして音声信号の取り込みが終了すると、ステップ２０７において、音声記憶部６６が音声信号を記憶する。
一致度の最も高い単語が認識の結果として選択され、上位階層を含めてモニタ６０において表示される。
【００４５】
そして、所定時間内にモニタ６０のタッチパネルに対するタッチ入力が無ければ、選択された単語を装置の出力として出力するが、タッチ入力があった場合、タッチされた階層に部分辞書の変更を行なう。階層変更された部分辞書が新たに外部記憶装置５０３から読み込まれる。その後階層変更のための発話をし、音声入力によって階層を示す単語が特定されると、その階層下の最下位部分辞書を外部記憶装置５０３から読み込む。
【００４６】
その後、第１の実施例では再度音声入力やり直しを使用者に報知するが、本実施例では、ステップ２１６において、音声記憶部６６から記憶した音声信号を用い、最下層の部分辞書内の単語との一致度を演算する。各単語から一致度の最も高い単語がステップ２０８において選択される。そしてステップ２１０にモニタ６０へのタッチ入力がないと判定されると、ステップ２１１で認識した単語を装置の出力として出力する。
【００４７】
本実施例によっても、第１の実施例と同様の効果が得られるとともに、部分辞書の変更後、再度最初に目的語を発話し音声入力する必要がないので、入力回数が減少されることになる。
【００４８】
次に、第３の実施例について説明する。
図８は、第３の実施例の構成を示すブロック図である。
この実施例は、図６に示す第２の実施例における信号処理装置６の代わりに信号処理装置７を用いる。音声記憶部６６は音声入力部７３の音声信号を単語選択部６２に出力するとともに、一回目の音声信号は音声記憶部６６が記憶する。部分辞書変更がある場合、音声記憶部６６に記憶された音声信号を用いて認識する。
【００４９】
第１あるいは第２の実施例では階層の変更はモニタ６０のタッチパネルで行なったが、本実施例は音声による階層変更を行なう。すなわち第一回目の音声入力を目的語に対応させ、以後の音声入力は階層変更に対応させている点が第１あるいは第２の実施例と異なる。モニタ６０の代わりに表示のみのモニタ７０を使用する。その他は第１の実施例と同様である。
【００５０】
次に、図９のフローチャートに従って装置の作動の流れを説明する。
ステップ２０１〜ステップ２１６まではステップ３１０、ステップ３１３を除いて第２の実施例における図７のフローチャートと同様の処理を行なう。
すなわちまず、階層の最下位の部分辞書を最初の認識対象として設定され、外部記憶装置５０３から読み込まれる。使用者の発話音声信号を取り込みながら設定された部分辞書内の単語と一致度を演算する。そして音声信号の取り込みが終了すると、音声記憶部６６が音声信号を記憶する。
そしてステップ２０８において一致度の最も高い単語が認識の結果として選択され、ステップ２０９において上位階層を含めてモニタ７０において表示されるとステップ３１０へ進む。
【００５１】
ステップ３１０では、音声入力部７３がスイッチ５０５が操作されたか否かを検出し、操作された場合、音声入力があるため、ステップ３１３へ進む。
ステップ３１３では現在使用中の部分辞書の上位全ての部分辞書を外部記憶装置５０３から読み込む。
その後第１、第２の実施例と同様に上位階層の音声入力がなされ、階層を示す単語が特定されると、その階層下の最下位部分辞書を外部記憶装置５０３から読み込む。
【００５２】
ステップ２１６において、音声記憶部６６から記憶した音声信号を用い、部分辞書内の単語との一致度を演算する。各単語から一致度の最も高い単語がステップ２０８において選択される。
ステップ３１０においてスイッチ５０５に対する操作がないことが検出されると、ステップ２１１で認識した単語を装置の出力として出力する。
このように、例えば最初に部分辞書ｈが設定されていて、入力する単語は辞書にない「そごう」の場合、使用者が「デパート」を発話すれば、図１０に示すように「そごう」の上位辞書を含むハッチングした部分辞書が自動的に設定されるので、階層変更があっても煩わしい操作はない。
【００５３】
すなわち、最初の発話に対し、装置が使用者の指定したい階層と異なる階層の単語を出力する場合、使用者が指定したい階層名称を発話するという手順をとることで、発話、返答、詳しい説明という自然な対話を実現でき、最も自然な感覚で音声入力ができる効果が得られる。
各階層の部分辞書が設定されるので、音声による認識率の低下が懸念されるが、一般に上位の部分辞書内の単語は、分類のための単語であり、単語数は最下層の部分辞書内の単語数に比較して極めて少ないため、認識率は十分に使用に堪える。
【００５４】
次に、第３の実施例の変形例について説明する。
前記各実施例では、最下層の部分辞書が正確に設定されても、騒音などの影響で単語を誤認識することがある。この場合すべての上位階層の部分辞書を設定し直しても辿り着いたのはもとの部分辞書であり、余計な入力回数を作る結果となる。この変形例はそれに対処するためのもので、誤認識があっても、同じ部分辞書内の単語なら、階層変更せずに処理できるようにしている。
【００５５】
図１１は装置作動の流れを示すフローチャートである。
このフローチャートはステップ４１２、ステップ４１３、ステップ４１４を図９の第３の実施例におけるステップ３１３、ステップ２１４に置き換えて構成される。
すなわちまず、階層の最下位の部分辞書を最初の認識対象として設定され、外部記憶装置５０３から読み込まれる。使用者の発話音声信号を取りながら設定された部分辞書内の単語と一致度を演算する。そして音声信号の取り込みが終了すると、音声記憶部６６が音声信号を記憶する。
一致度の最も高い単語が認識の結果として選択され、ステップ２０９において上位階層を含めてモニタ７０において表示される。
【００５６】
そして、ステップ３１０において、音声入力部７３がスイッチ５０５が操作されたか否かを検出し、操作された場合、音声入力があるため、ステップ４１２へ進む。
ステップ４１２では、現在使用中の部分辞書およびその上位全ての部分辞書を読み込む。
ステップ４１３では、使用者が発話した音声を入力し、読み込まれた部分辞書内の単語との一致度を演算する。
【００５７】
ステップ４１４では一致度の最も高い単語が上位階層の部分辞書内にあるか否かを判定する。単語が上位階層の部分辞書内にある場合、階層変更するためのものであり、ステップ２１５へ進み、第３の実施例と同様の処理を行なう。単語が最下層の部分辞書内にある場合、既に認識された目的語であるためステップ２０９へ進み、ステップ３１０でスイッチに対する新たな操作が行なわれていないかの判定を経てステップ２１１において単語が出力される。
【００５８】
このように、もし最初に部分辞書ｈが設定されていて、入力したい単語は「桜木町」の場合、騒音などによる「桜木町」の音声信号に歪みが生じ、「関内」と誤って表示される可能性がある。また「桜木町」を発話するつもりで「関内」と誤って発話することが考えられる。このような部分辞書が正しい場合でも、図１２にハッチングで示すように部分辞書ｈを含む全ての上位階層の部分辞書が自動的に設定されるので、「桜木町」と発話し直しすれば、階層変更をせずに音声入力ができる。これによって階層の変更は必要なときのみ行なわれることになり、音声入力負担がさらに軽減される。このほか第３の実施例のように発話、返答、詳しい説明という自然な対話形式で音声入力ができる効果も得られる。
【００５９】
【発明の効果】
以上説明したように、本発明によれば、１回目の単語を目的語とすることができるので、まず目的語で入力し、正確でなかったら階層を変更して目的語を再度入力して認識し、自然な感覚で音声入力ができる。また辞書を変更する際、例えば現在の部分辞書から遡り、目的語と共通する内容の上位部分辞書に変更して最下層の部分辞書を決定する段取りをとるので、目的語が認識されるまでの入力回数が少なく、入力負担が軽減される。
【００６０】
部分辞書決定手段は、上位階層への変更指示を受けた場合、上位階層の部分辞書を決定し、音声入力により単語選択手段が選択した単語が示す階層に変更し、該階層下の所定の最下層部分辞書を決定するようにした場合、階層変更があっても、目的語の入っている部分辞書に至るまでの部分辞書に対する認識が不要で、大きな入力負担とならない効果が得られる。
また部分辞書決定手段が初めに決定する部分辞書は前回使用された最下層の部分辞書となれば、辞書構成を理解し、階層変更が円滑に行なえる。
さらに前回使用された最下層の部分辞書の代わりに使用頻度の高い最下層の部分辞書を用いても同じ効果が得られる。
【００６１】
音声入力装置に表示手段を接続し、単語選択手段が選択した単語及び上位の階層構造を表示手段で表示するようにすると、階層変更が必要のときに、表示された階層構造を頼りに階層変更ができるので、入力負担がさらに軽減される。
前記表示手段はタッチパネルを合わせ持ち、使用者によってタッチされた階層に対応した階層変更指令を前記部分辞書決定部に出力するようにすると、対話式な入力となり一層の入力負担軽減が図られる。
また、タッチ入力があったときに、音声入力によって変更可能な階層を前記表示手段上で表示した場合、辞書の構造を表示画面から知ることができ辞書の構成に対する予備知識が無くても階層変更ができる効果が得られる。
【００６２】
前記部分辞書決定手段は、目的語に続き音声入力が行なわれたときに、上位階層への変更指示を受けたとし、設定されている最下層部分辞書より上位の全ての部分辞書を決定し、単語選択手段が選択した単語が示す階層に階層変更を行なうようにすると、階層変更に対する操作が不要で入力負担が一層軽減される。
【００６３】
前記部分辞書決定手段は、目的語に続き音声入力が行なわれたときに、設定されている最下層部分辞書及びそれより上位の全ての部分辞書を決定し、単語選択手段が選択した単語が上位部分辞書内の単語であれば、上位階層への変更指示を受けたとし、前記単語が示す階層に階層変更を行なった場合、部分辞書が正確でありながら、単語の誤認識による階層変更が防止され、無駄な入力を生じさせない効果が得られる。
【００６４】
最初に入力された音声を記憶する記憶手段を合わせ持ち、階層が変更されたとき、前記単語選択手段は前記記憶手段に記憶された音声を用いて、決定された最下層の部分辞書内の単語との一致度を演算し、出力することで、階層を変更し、最下層の部分辞書が変わっても、目的語の音声入力を二度と行なう必要がなく、音声の入力回数がさらに減らされる。また、最初の発話に対し、装置が使用者の指定したい階層と異なる階層の単語を出力する場合、第２の実施例と同様に使用者が指定したい階層名称を発話するという手順をとることで、発話、返答、詳しい説明という自然な対話を実現でき、最も自然な感覚で音声入力ができる効果が得られる。
【図面の簡単な説明】
【図１】第１の実施例の構成を示すブロック図である。
【図２】階層化した単語辞書の構成を示す図である。
【図３】第１の実施例のフローチャートである。
【図４】モニタ表示画面を示すブロック図である。
【図５】部分辞書を変更時のモニタ表示画面を示す図である。
【図６】第２の実施例の構成を示すブロック図である。
【図７】第２の実施例のフローチャートである。
【図８】第３の実施例の構成を示すブロック図である。
【図９】第３の実施例のフローチャートである。
【図１０】最下位部分辞書を変更時に仮設定された上位部分辞書を示す図である。
【図１１】第３の実施例の変形例を示すフローチャートである。
【図１２】最下位部分辞書を変更時に仮設定された部分辞書を示す図である。
【図１３】従来例の構成を示すブロック図である。
【図１４】従来例のフローチャートである。
【符号の説明】
５、６、７信号処理装置
５１、７１部分辞書決定部
５２、６２単語選択部
５３、６３、７３音声入力部
５４、６４単語出力部
６０、７０、５０４モニタ
６６音声記憶部
５００マイク
５０１Ａ／Ｄコンバータ
５０２信号処理装置
５０３外部記憶装置
５０５スイッチ
ａ、ｂ、ｃ、ｄ、ｅ、ｆ部分辞書
ｇ、ｈ、ｉ、ｊ、ｋ部分辞書

[0001]
[Technical Field of the Invention]
The present invention relates to a speech input device capable of performing large vocabulary recognition with a small number of inputs.
[0002]
[Prior art]
Voice input devices, which recognize speech and extract information from it, allow commands and data to be entered without using the hands. They are therefore being adopted in situations where manual operation is difficult, such as word processors for people with disabilities and in-vehicle navigation systems.
As shown in FIG. 13, a typical voice input device has a microphone 500 that captures speech, converts it into an electric signal, and outputs a voice signal; an A/D converter 501 that converts the analog voice signal into digital information; an external storage device 503 that stores a word dictionary used for recognizing words from the voice signal; and a signal processing device 502 that has a CPU and a memory and performs speech recognition by loading the word dictionary into the memory and having the CPU judge the degree of coincidence between the voice signal and the words in the dictionary.
The switch 505 is operated by the user of the apparatus and gives the timing of capturing the audio signal to the signal processing apparatus 502. The monitor 504 displays the recognized word.
[0003]
In general, in the case of speech input, the recognition rate decreases as the number of words to be recognized increases, and a hierarchical dictionary is used for large vocabulary recognition. At the time of input, voice input is repeated a plurality of times in accordance with the dictionary classification, the hierarchy is advanced, recognition is performed in a state where the number of words to be recognized is narrowed down, and the erroneous recognition rate is reduced.
[0004]
Details of the speech recognition in the above-described conventional device will be described with reference to the flowchart of FIG. 14.
Here, a case where "Sakuragicho" which is the station name of JR is input by voice is shown. Such a station name input is used for setting a destination of a navigation device of an automobile. The external storage device 503 stores a hierarchical word dictionary as shown in FIG. 2, for example.
[0005]
First, in step 521, in the initial state of voice input, the signal processing device 502 sets a partial dictionary at the highest level of the hierarchical word dictionary as a first recognition target. Here, the partial dictionary a in FIG. 2 is set.
Next, the device user operates the switch 505 to speak “facility” to the microphone 500. This is because the user has recognized that “Sakuragicho” is included in the category of “facility”.
The operation of the switch 505 is detected in step 522. Thereby, the voice recognition processing is started.
[0006]
In step 523, the CPU in the signal processing device 502 reads the partial dictionary of the set hierarchy from the external storage device 503 into the memory. Initially, the partial dictionary a is read.
The signal processing device 502 calculates the average power of the sound up to the moment the switch 505 is operated. After the switch 505 is operated, when the instantaneous power of the voice signal exceeds that average power by a predetermined value or more, the device judges that the user has started speaking and starts capturing the voice in step 524.
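The start-of-utterance test described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not define how power is computed, so the sketch takes the squared sample value as the instantaneous power, and the function name `find_utterance_start` and the parameter `margin` (standing in for the "predetermined value") are hypothetical.

```python
from typing import Optional, Sequence


def find_utterance_start(
    pre_switch: Sequence[float],
    post_switch: Sequence[float],
    margin: float,
) -> Optional[int]:
    """Return the index in post_switch where speech is judged to start, or None.

    pre_switch  : samples captured before the switch was operated
    post_switch : samples captured after the switch was operated
    margin      : assumed stand-in for the "predetermined value"
    """
    if not pre_switch:
        raise ValueError("need at least one sample from before the switch")
    # Average power of the sound up to the moment the switch is operated.
    average_power = sum(s * s for s in pre_switch) / len(pre_switch)
    for index, sample in enumerate(post_switch):
        instantaneous_power = sample * sample  # assumption: power = sample squared
        if instantaneous_power - average_power >= margin:
            return index
    return None


background = [0.1, -0.1, 0.1, -0.1]
speech = [0.1, 0.2, 0.9, 0.8]
print(find_utterance_start(background, speech, margin=0.5))
```

A practical detector would more likely compare frame energies than single samples; the per-sample form is used here only to keep the comparison visible.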
[0007]
In step 525, the CPU calculates the degree of coincidence between the input voice and the word in the partial dictionary read into the memory. Initially, the degree of coincidence with each of “address” and “facility” is calculated.
[0008]
When the instantaneous power of the audio signal becomes equal to or less than the predetermined value, it is determined that the utterance of the user has ended, and in step 526, the capturing of the audio is ended.
When the calculation of the degree of matching between the input voice and the word in the partial dictionary is completed, in step 527, the CPU selects the word having the highest degree of matching. Here, "facility" is selected because the degree of coincidence is higher.
[0009]
In step 528, the selected word having a high degree of matching is displayed on the monitor 504 as a result of the recognition.
In step 529, it is determined whether or not the selected word is a partial dictionary name of a lower hierarchy. If the selected word is a dictionary name, the process proceeds to step 523; otherwise, the process proceeds to step 530. Here, since the word “facility” is the dictionary name of the partial dictionary e in the lower hierarchy as shown in the word dictionary of FIG. 2, the process returns to step 523.
[0010]
In step 523, the CPU newly reads the partial dictionary e from the external storage device 503 into the memory. In this way, the hierarchy advances by one level each time the processing is repeated. When there is no partial dictionary under the selected word, the recognized word becomes the object word and is output in step 530. Here, "Sakuragicho" is output.
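The conventional top-down procedure of steps 521 to 530 can be sketched as a loop over utterances. The sketch below is a simplification under assumptions: `best_match` uses string similarity from Python's `difflib` as a stand-in for the acoustic coincidence calculation, the nested-dict layout is hypothetical, and "Kanagawa" is an illustrative prefecture entry that is not taken from FIG. 2.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple

# Each partial dictionary maps a word to the partial dictionary below it.
# None marks an object word in a lowest-level partial dictionary.
DICT_H = {"Sakuragicho": None, "Kannai": None}        # station names
DICT_G = {"JR": DICT_H}                               # transportation companies
DICT_F = {"Kanagawa": DICT_G}                         # prefectures (illustrative)
DICT_E = {"station": DICT_F}                          # facility types (others omitted)
DICT_A = {"address": {"(branch omitted)": None}, "facility": DICT_E}


def best_match(utterance: str, words) -> str:
    """Pick the word with the highest similarity (stand-in for step 525 to 527)."""
    return max(words, key=lambda w: SequenceMatcher(None, utterance, w).ratio())


def conventional_input(top: dict, utterances: Sequence[str]) -> Tuple[str, int]:
    """Walk down from the top-level dictionary; return (object word, utterance count)."""
    current: Optional[dict] = top
    for count, utterance in enumerate(utterances, start=1):
        word = best_match(utterance, current)
        below = current[word]
        if below is None:          # step 529: no partial dictionary under this word
            return word, count     # step 530: output the object word
        current = below            # step 523: load the next partial dictionary
    raise ValueError("utterances ended before a lowest-level dictionary was reached")


print(conventional_input(
    DICT_A, ["facility", "station", "Kanagawa", "JR", "Sakuragicho"]))
```

With this layout the user needs five utterances to reach "Sakuragicho", which is the input burden the following section addresses.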
[0011]
[Problems to be solved by the invention]
As described above, with the conventional voice input device, conveying a single word requires the user to input, level by level, the names of the partial dictionaries down to the lowest level to which the word belongs. Because the hierarchy becomes more complex as the number of words grows, the larger the vocabulary to be recognized, the greater the input burden on the user. In addition, although the object word is what the user most wants to convey, it is input last, which feels unnatural and makes the device inconvenient to use.
The present invention has been made in view of the above-described conventional problems, and has as its object to realize accurate voice input with a small number of input times.
[0012]
[Means for Solving the Problems]
The device has voice input means for converting an uttered voice into information and inputting it;
a word dictionary having a hierarchical structure and made up of partial dictionaries each containing a plurality of words;
partial dictionary determining means for determining the partial dictionary to be used;
calculating means for calculating the degree of coincidence between the input voice from the voice input means and the words in the determined partial dictionary; and
word selecting means for selecting and outputting the word with the highest degree of coincidence among the calculated words.
The partial dictionary determining means first determines a predetermined lowest-level partial dictionary. When it receives an instruction to change to a higher hierarchy level, it determines the partial dictionary of the indicated higher level, changes to the level indicated by the word that the word selecting means selects from the input, and determines a predetermined lowest-level partial dictionary under that level.
[0013]
It is preferable that the partial dictionary first determined by the partial dictionary determining means is the lowest partial dictionary used last time.
A frequently used lowest-level partial dictionary may also be used in place of the lowest-level partial dictionary used last time.
[0014]
It is desirable that display means is connected to the voice input device to display the word selected by the word selection means and the upper hierarchical structure.
The display unit may have a touch panel and output a hierarchy change command corresponding to the hierarchy touched by the user to the partial dictionary determination unit.
Further, it is desirable to display a changeable hierarchy on the display means when a touch input is made on the touch panel.
[0015]
When an input is made following the object word, the partial dictionary determining means may treat this as an instruction to change to a higher hierarchy level, determine all partial dictionaries above the currently set lowest-level partial dictionary, and change to the level indicated by the word selected by the word selecting means.
[0016]
Alternatively, when an input is made following the object word, the partial dictionary determining means may determine the currently set lowest-level partial dictionary together with all partial dictionaries above it. If the word selected by the word selecting means is a word in one of the upper partial dictionaries, this is treated as an instruction to change to a higher hierarchy level, and the level is changed to the one indicated by that word.
[0017]
The device may also have storage means for storing the first input voice. When the hierarchy level is changed, the word selecting means can use the voice stored in the storage means to calculate the degree of coincidence with the words in the newly determined lowest-level partial dictionary and output the result.
[0018]
[Action]
A hierarchical word dictionary is used, and when a partial dictionary is set, a predetermined lowest-level partial dictionary is set first. Large-vocabulary recognition is therefore possible, and the first word spoken can be the object word. If the word cannot be recognized with a single input, the hierarchy level is changed by a hierarchy change command. For example, the device can change to the upper level shared by the current partial dictionary and the partial dictionary containing the object word and then determine the lowest-level partial dictionary, so input starting from the top-level partial dictionary is no longer needed and input becomes efficient. Moreover, because the user inputs the object word from the start and the dictionary level is changed only when recognition fails, voice input matches the user's intuition.
[0019]
Suppose that the partial dictionary determining means, on receiving an instruction to change to a higher hierarchy level, determines the partial dictionary of that higher level, changes to the level indicated by the word that the word selecting means selects from the input, and determines a predetermined lowest-level partial dictionary under that level. Then, even when the level is changed, no recognition is needed for the partial dictionaries lying between the object word's partial dictionary and the higher level. Even after a change to a higher level, the next input is immediately the object word, so input can be completed without frustrating the user.
Further, if the partial dictionary first determined by the partial dictionary determining means is the lowest-level partial dictionary used last time, the user can understand the dictionary structure and change levels smoothly.
The same effect can be obtained by using a frequently used lowest-level partial dictionary in place of the lowest-level partial dictionary used last time.
[0020]
When display means is connected to the voice input device and displays the word selected by the word selecting means together with its upper hierarchical structure, the user can rely on the displayed structure whenever a level change is needed, which reduces the input burden.
If the display unit has a touch panel and outputs a hierarchy change command corresponding to the hierarchy touched by the user to the partial dictionary determination unit, the input becomes interactive and the input load is further reduced.
Further, if the hierarchy levels that can be selected by voice input are displayed on the display means when a touch input is made, the user can learn the structure of the dictionary from the display screen and can change levels even without prior knowledge of how the dictionary is organized.
[0021]
If the partial dictionary determining means treats an input made following the object word as an instruction to change to a higher hierarchy level, determines all partial dictionaries above the set lowest-level partial dictionary, and changes to the level indicated by the word selected by the word selecting means, the level can be changed with a higher-level word alone, and the input burden is further reduced.
[0022]
If, when an input is made following the object word, the partial dictionary determining means determines the set lowest-level partial dictionary together with all partial dictionaries above it, treats the input as an instruction to change to a higher hierarchy level only when the word selected by the word selecting means is a word in one of the upper partial dictionaries, and then changes to the level indicated by that word, then a level change caused by misrecognition of a word is prevented in cases where the partial dictionary itself is correct.
[0023]
If the device also has storage means for storing the first input voice, and the word selecting means, when the hierarchy level is changed, uses the stored voice to calculate the degree of coincidence with the words in the newly determined lowest-level partial dictionary and outputs the result, then the voice of the object word does not have to be input a second time even though the level and the lowest-level partial dictionary have changed, and the number of voice inputs is further reduced.
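The stored-voice idea can be sketched as below. It is a minimal illustration, not the device's implementation: the class name `StoredUtteranceRecognizer` and its methods are hypothetical, a text string stands in for the stored voice signal, string similarity from `difflib` stands in for the coincidence calculation, and "Example Store" is a placeholder entry.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def best_match(utterance: str, words: List[str]) -> str:
    """Stand-in for the coincidence calculation: highest string similarity wins."""
    return max(words, key=lambda w: SequenceMatcher(None, utterance, w).ratio())


class StoredUtteranceRecognizer:
    """Keeps the first utterance so it can be re-scored after a dictionary change."""

    def __init__(self, initial_words: List[str]) -> None:
        self.words = list(initial_words)
        self.stored: Optional[str] = None   # plays the role of the storage means

    def recognize_first(self, utterance: str) -> str:
        self.stored = utterance             # store the first input voice
        return best_match(utterance, self.words)

    def change_dictionary(self, new_words: List[str]) -> str:
        """Switch the lowest-level dictionary and re-score the stored utterance."""
        if self.stored is None:
            raise RuntimeError("no utterance has been stored yet")
        self.words = list(new_words)
        return best_match(self.stored, self.words)


recognizer = StoredUtteranceRecognizer(["Sakuragicho", "Kannai"])
first_guess = recognizer.recognize_first("Sogo")       # wrong dictionary, wrong word
second_guess = recognizer.change_dictionary(["Sogo", "Example Store"])
print(first_guess, second_guess)
```

The user speaks the object word once; only the level-change word has to be added, rather than repeating the object word after the change.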
[0024]
[Embodiments of the Invention]
Next, embodiments of the present invention will be described with reference to examples.
FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.
The voice signal captured by the microphone 500 is converted into digital information by the A/D converter 501 and input to the signal processing device 5. The switch 505 is operated just before the device user speaks for voice input, and gives the signal processing device 5 the timing for detecting the uttered voice. The external storage device 503 stores a word dictionary. The monitor 60 is a monitor with a touch panel, which displays the processing results of the signal processing device 5 and allows the partial dictionary to be changed using the touch panel.
[0025]
As the word dictionary, a hierarchical word dictionary as shown in FIG. 2 is used. In the figure, the lines connecting the partial dictionaries indicate the relationship between upper and lower levels, and a word shown projecting from a frame is a word that represents the level below it.
The partial dictionary a, composed of the two words "address" and "facility", is located at the top level of the word dictionary, and each of its words has lower-level dictionaries. Under "address" there is a partial dictionary b listing prefecture names, classified according to the government-designated cities. For each prefecture, a partial dictionary c listing city and ward names and a partial dictionary d listing ward and village names are provided in turn.
[0026]
Below the word "facility" in the partial dictionary a is a partial dictionary e of words naming the types of facility, such as "station", "department store", and "hotel". Below "station" is a partial dictionary f listing prefecture names. Below the partial dictionary f, a partial dictionary g listing transportation company names and, connected to the partial dictionary g, a partial dictionary h of station names are provided for each prefecture.
[0027]
The word "department store" likewise has below it a partial dictionary i listing prefecture names, a partial dictionary j listing city names, and a partial dictionary k listing department store names, and these are connected, prefecture by prefecture, to the partial dictionaries of cities and wards and of wards and villages.
The partial dictionary d, the partial dictionary h, and the partial dictionary k are the lowest-level partial dictionaries, in which object words are registered, and words to be recognized are collated and recognized here.
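One way to hold the structure of FIG. 2 in memory is a tree whose nodes know their parent, so that the chain of upper dictionaries for any lowest-level dictionary can be read off directly. The sketch below is an assumption about representation only: the class `PartialDictionary` and its methods are hypothetical, the address branch (b, c, d) is omitted, and "Kanagawa" and "Yokohama" are illustrative entries not taken from the figure.

```python
from typing import Dict, List, Optional


class PartialDictionary:
    """One partial dictionary: a name, its words, and links up and down."""

    def __init__(self, name: str, words: List[str]) -> None:
        self.name = name
        self.words = words
        self.parent: Optional["PartialDictionary"] = None
        self.children: Dict[str, "PartialDictionary"] = {}

    def add_child(self, word: str, child: "PartialDictionary") -> "PartialDictionary":
        """Attach `child` below `word`, which must be one of this dictionary's words."""
        if word not in self.words:
            raise KeyError(word)
        child.parent = self
        self.children[word] = child
        return child

    def is_lowest(self) -> bool:
        """True when no partial dictionary hangs below any word (object words live here)."""
        return not self.children

    def upper_chain(self) -> List[str]:
        """Names of the upper dictionaries, ordered from the top level downward."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]


a = PartialDictionary("a", ["address", "facility"])
e = a.add_child("facility", PartialDictionary("e", ["station", "department store", "hotel"]))
f = e.add_child("station", PartialDictionary("f", ["Kanagawa"]))
g = f.add_child("Kanagawa", PartialDictionary("g", ["JR"]))
h = g.add_child("JR", PartialDictionary("h", ["Sakuragicho", "Kannai"]))
i = e.add_child("department store", PartialDictionary("i", ["Kanagawa"]))
j = i.add_child("Kanagawa", PartialDictionary("j", ["Yokohama"]))
k = j.add_child("Yokohama", PartialDictionary("k", ["Sogo"]))

print(h.upper_chain())  # ['a', 'e', 'f', 'g']
```

Keeping a parent link makes it cheap to show the upper hierarchy for a recognized word without searching the whole tree from the top.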
[0028]
In the signal processing device 5, the voice input unit 53 calculates the average power of the digitized voice signal from the microphone 500 until the switch 505 is operated. When the instantaneous power becomes larger than that average power value, it judges that the utterance has started and begins capturing the voice signal. The voice signal is then subjected to word recognition in the word selection unit 52. The dictionary used first at this point is a partial dictionary in the lowest hierarchy level, read from the external storage device 503 by the partial dictionary determination unit 51.
[0029]
The word output unit 54 provisionally outputs the recognition result and the names of the word's upper levels to the monitor 60, but withholds the result as the output of the device. During this time, the partial dictionary can be changed by touching the name of an upper level displayed on the monitor 60. If the touch panel is not touched within a predetermined time, the word output unit 54 outputs the recognition result as the output of the device.
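The hold-then-output behavior can be sketched with a blocking queue read that times out. This is an illustration under assumptions: the function `finalize_or_change`, the use of a `queue.Queue` to deliver touch events, and the parameter `hold_seconds` (standing in for the "predetermined time") are all hypothetical.

```python
import queue
from typing import Tuple


def finalize_or_change(
    recognized: str,
    touch_events: "queue.Queue[str]",
    hold_seconds: float,
) -> Tuple[str, str]:
    """Wait up to hold_seconds for a touch; otherwise release the recognized word."""
    try:
        # A touch on an upper-level name arrives as that level's name.
        touched_level = touch_events.get(timeout=hold_seconds)
    except queue.Empty:
        # No touch within the predetermined time: output the recognition result.
        return ("output", recognized)
    # A touch arrived: the partial dictionary is to be changed at that level.
    return ("change_level", touched_level)


events: "queue.Queue[str]" = queue.Queue()
print(finalize_or_change("Sakuragicho", events, hold_seconds=0.05))
events.put("e")
print(finalize_or_change("Kannai", events, hold_seconds=0.05))
```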
[0030]
When the partial dictionary is changed, the partial dictionary determination unit 51 reads the partial dictionary with the touched name from the external storage device 503, and the device user operates the switch 505 and makes an utterance to determine the lower dictionary. The partial dictionary determination unit 51 determines the predetermined lowest-level partial dictionary associated with that utterance and reads it from the external storage device 503. Voice input is then performed again, and word recognition is carried out in the word selection unit 52.
The lowest-level partial dictionary used first, or used after a level change, is the partial dictionary used last time or the most frequently used partial dictionary.
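The change procedure can be sketched end to end as follows. It is a simplified illustration, not the device's implementation: string similarity from `difflib` stands in for the coincidence calculation, the nested-dict layout is hypothetical, "Kanagawa" and "Yokohama" are illustrative entries, and `default_lowest` simply follows the first entry at each level as a stand-in for "the partial dictionary used last time or the most frequently used partial dictionary".

```python
from difflib import SequenceMatcher

# word -> partial dictionary below it; None marks an object word.
DICT_K = {"Sogo": None}                                  # department store names
DICT_H = {"Sakuragicho": None, "Kannai": None}           # station names
DICT_E = {                                               # facility types ("hotel" omitted)
    "station": {"Kanagawa": {"JR": DICT_H}},
    "department store": {"Kanagawa": {"Yokohama": DICT_K}},
}


def best_match(utterance: str, words) -> str:
    """Stand-in for the coincidence calculation: highest string similarity wins."""
    return max(words, key=lambda w: SequenceMatcher(None, utterance, w).ratio())


def default_lowest(partial: dict) -> dict:
    """Descend to a predetermined lowest-level dictionary (assumed: first entry each level)."""
    while True:
        first_below = next(iter(partial.values()))
        if first_below is None:      # the words here are object words
            return partial
        partial = first_below


def change_level(upper: dict, utterance: str) -> dict:
    """After a touch on an upper level, pick the branch by voice and jump to its lowest level."""
    word = best_match(utterance, upper)
    return default_lowest(upper[word])   # assumes `word` is not itself an object word


current = DICT_H                                   # initial lowest-level dictionary
wrong = best_match("Sogo", current)                # "Sogo" is not in h, so a wrong word appears
current = change_level(DICT_E, "department store") # user touches e, then says the branch name
right = best_match("Sogo", current)                # user repeats the object word
print(wrong, right)
```

In this sketch "Sogo" is reached with three utterances and one touch, without stepping through the prefecture and city dictionaries that lie between e and k.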
[0031]
Next, the operation flow of the device will be described with reference to the flowchart of FIG. 3.
First, in step 101, in the initial state of voice input, the partial dictionary determination unit 51 sets the lowest partial dictionary in the hierarchy as the first recognition target. The most frequently used partial dictionary or the lowest partial dictionary used last time is used as the initial setting. Here, for example, it is assumed that the partial dictionary h in FIG. 2 is set.
[0032]
The user of the device operates the switch 505 to speak into the microphone 500.
In step 102, the voice input unit 53 detects whether the switch 505 has been operated. If it has, the average power of the audio up to the moment the switch 505 was operated is calculated, and the process proceeds to step 103.
In step 103, the determined partial dictionary is read from the external storage device 503 into the partial dictionary determination unit 51. Here, the partial dictionary h is read.
[0033]
In step 104, capture of the user's utterance is started. That is, when the instantaneous power exceeds the average power of the voice, which is constantly calculated from the input voice signal, by a predetermined value or more, it is determined that the user's utterance has started, and capture of the utterance begins.
Here, for example, it is assumed that the user has spoken "Sakuragicho" as the object.
In step 105, the word selection unit 52 calculates the degree of coincidence between the uttered voice and each word in the partial dictionary h read by the partial dictionary determination unit 51. In this process, the voice signal is divided into sections of a predetermined length, and each section is compared with each word while capture of the voice signal continues in parallel.
[0034]
When the instantaneous power of the audio signal becomes equal to or less than the average power in step 106, it is determined that the user's utterance is completed, voice capture ends, and the process proceeds to step 107.
In step 107, the word having the highest matching degree calculated by the word selection unit 52 is output to the word output unit 54 as a recognition result.
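The start and end decisions of steps 104 and 106 can be illustrated with a small Python sketch. It assumes the audio has already been reduced to one instantaneous power value per frame; the function name, the frame representation, and the `start_margin` parameter (standing in for the "predetermined value") are assumptions made for illustration, not details taken from the specification.

```python
def find_utterance(powers, average_power, start_margin):
    """Return (start, end) frame indices of the utterance, or None.

    powers:        instantaneous power per frame, captured after the
                   switch is operated (hypothetical representation)
    average_power: average power measured before the switch was operated
    start_margin:  how far above the average the power must rise
                   before the utterance is considered to have started
    The returned end index is exclusive.
    """
    start = None
    for i, p in enumerate(powers):
        if start is None:
            # Step 104: utterance starts when power rises well above average
            if p - average_power >= start_margin:
                start = i
        elif p <= average_power:
            # Step 106: utterance ends when power falls back to the average
            return start, i
    # Speech never started, or it ran to the end of the buffer
    return None if start is None else (start, len(powers))
```

With frame powers `[1.0, 1.2, 5.0, 6.0, 4.0, 0.9, 1.0]`, an average of `1.0`, and a margin of `2.0`, the sketch reports the utterance as frames 2 through 4.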
[0035]
In step 108, the word output unit 54 provisionally outputs the recognition result and the names of the word's upper hierarchies to the monitor 60 for display, but withholds the result as the output of the device. FIG. 4 shows a display screen of the monitor 60. Here, h is the dictionary of the selected word, and a, e, f, and g are its upper dictionaries. The word displayed in frame h is the object, and the words arranged in the upper frames are words in the upper dictionaries related to the object, displayed in hierarchical order from right to left. To change the hierarchy, the user touches the place where the corresponding word is displayed.
Then, in step 109, if there is no touch input on the touch panel within a predetermined time, the recognition result "Sakuragicho" is output as the output of the device in step 110.
In this case, there is no need to change the hierarchy, and a single utterance is sufficient.
[0036]
If the user had uttered "Sogo" in step 104, a different word would be displayed, because the initial partial dictionary h does not contain it. In this case the upper hierarchy must be changed, so the user touches e in FIG.
[0037]
When the touch input is detected in step 109, the process proceeds to step 111.
In step 111, the monitor 60 displays the hierarchies that can be selected under the touched hierarchy. Here, "station", "department store", and "hotel" are displayed. FIG. 5 shows the display state.
In step 112, the partial dictionary determination unit 51 newly reads the partial dictionary e from the external storage device 503.
[0038]
In step 113, the user selects one of the hierarchies displayed in response to the touch and inputs it by voice. In the case of "Sogo", the user says the related word "department store", which is recognized by the word selection unit 52. This step may use the same processing as steps 102 to 107.
In step 114, the partial dictionary determination unit 51 determines the lowest partial dictionary k connected to the recognized word "department store".
[0039]
In step 115, an input instruction mark is displayed on the monitor 60 again to prompt the user to repeat the voice input, and the process returns to step 102.
Thereafter, the user utters "Sogo", and the processing from step 102 to step 108 is performed again. In step 110, the word "Sogo" is output.
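The first-embodiment flow of steps 101 to 115 can be sketched in Python as follows. This is an illustrative sketch only: the dictionary contents are a simplified stand-in for FIG. 2 limited to words named in the text, the `LOWEST_UNDER` mapping and all function names are hypothetical, and text similarity from `difflib` replaces the acoustic matching that the device actually performs.

```python
from difflib import SequenceMatcher

# Simplified, hypothetical stand-in for the hierarchical dictionary of FIG. 2.
# Intermediate hierarchies are omitted.
DICTIONARIES = {
    "a": ["address", "facility"],
    "e": ["station", "department store", "hotel"],
    "h": ["Sakuragicho", "Kannai"],
    "k": ["Sogo"],
}
# Assumed mapping from a category word to its predetermined lowest dictionary.
LOWEST_UNDER = {"station": "h", "department store": "k"}


def recognize(utterance, dictionary_name):
    """Return the dictionary word with the highest degree of coincidence.

    String similarity is a toy substitute for acoustic matching.
    """
    words = DICTIONARIES[dictionary_name]
    return max(
        words,
        key=lambda w: SequenceMatcher(None, utterance.lower(), w.lower()).ratio(),
    )


def first_embodiment(utterances, touched=None, start="h"):
    """Simulate one input session.

    utterances: what the user says, in order
    touched:    name of the upper dictionary touched on the monitor,
                or None if the timeout expires without a touch
    """
    spoken = iter(utterances)
    result = recognize(next(spoken), start)      # steps 102 to 107
    if touched is None:                          # step 109: no touch
        return result                            # step 110
    category = recognize(next(spoken), touched)  # step 113
    lowest = LOWEST_UNDER[category]              # step 114
    return recognize(next(spoken), lowest)       # steps 102 to 110 again
```

Calling `first_embodiment(["Sakuragicho"])` finishes with one utterance, while `first_embodiment(["Sogo", "department store", "Sogo"], touched="e")` takes the hierarchy-change path and needs three.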
[0040]
The microphone 500, the A/D converter 501, and the voice input unit 53 constitute the voice input means. The partial dictionary determination unit 51 constitutes the partial dictionary determining means. The word selection unit 52 constitutes the word selecting means. The monitor 60 constitutes the display means.
[0041]
The present embodiment is configured as described above. Because the degree of coincidence between the speech signal and the words in a predetermined lowest partial dictionary is calculated when the first speech is input, the first utterance can be the object itself, and the user's input burden is reduced. Further, when the partial dictionary needs to be changed, the lowest partial dictionary is set and used directly, even if there are several hierarchies below the selected upper hierarchy.
In addition, since the partial dictionary used first is the one used frequently or the one used last time, the user can easily understand the output result and change the hierarchy smoothly.
[0042]
In addition, since the monitor displays the hierarchical structure above the word together with the word itself, the user can easily decide which hierarchy to change.
When there is a touch input, the hierarchies that can be selected by voice input are displayed, so the user can choose the hierarchy to change to without knowing the dictionary configuration.
[0043]
FIG. 6 is a block diagram showing the configuration of the second embodiment.
In this embodiment, a signal processing device 6 provided with a voice storage unit 66 is used instead of the signal processing device 5 in the first embodiment shown in FIG. The voice signal from the voice input unit 63 is output to the word selection unit 62, and the first voice signal is stored in the voice storage unit 66. When the partial dictionary is changed, the degree of coincidence with the words in the new partial dictionary is determined using the voice signal stored in the voice storage unit 66, and the object is recognized. Everything else is the same as in the first embodiment.
[0044]
Next, the operation flow of the apparatus will be described with reference to the flowchart of FIG.
In steps 201 to 215, the same processes as those in steps 101 to 114 of the flowchart of FIG. 3 in the first embodiment are performed except for step 207.
That is, first, the partial dictionary at the lowest level of the hierarchy is set as the first recognition target and is read from the external storage device 503. While the user's voice signal is being captured, the degree of coincidence with each word in the set partial dictionary is calculated. Then, when capture of the voice signal is completed, the voice storage unit 66 stores the voice signal in step 207.
The word having the highest degree of coincidence is selected as the recognition result and is displayed on the monitor 60 together with its upper hierarchies.
[0045]
Then, if there is no touch input on the touch panel of the monitor 60 within a predetermined time, the selected word is output as an output of the apparatus. If there is a touch input, the partial dictionary is changed to the touched hierarchy. The partial dictionary whose hierarchy has been changed is newly read from the external storage device 503. Thereafter, an utterance for changing the hierarchy is made, and when a word indicating the hierarchy is specified by voice input, the lowest partial dictionary under the hierarchy is read from the external storage device 503.
[0046]
At this point, in the first embodiment, the user is prompted to input the voice again. In the present embodiment, however, in step 216, the voice signal stored in the voice storage unit 66 is used to calculate the degree of coincidence with the words in the lowest partial dictionary. The word with the highest degree of coincidence is then selected in step 208. If it is determined in step 210 that there is no touch input on the monitor 60, the recognized word is output as the output of the device in step 211.
[0047]
According to this embodiment, the same effect as that of the first embodiment is obtained, and in addition the object spoken first does not have to be spoken again after the partial dictionary is changed.
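The reuse of the stored voice signal in the second embodiment can be sketched as below. This is a minimal Python illustration under assumptions: the class and method names are hypothetical, strings stand in for voice signals, `difflib` similarity stands in for acoustic matching, and the second entry of dictionary `k` in the usage example is a placeholder rather than a word from the specification.

```python
from difflib import SequenceMatcher


def best_match(signal, words):
    """Toy matcher: return the word most similar to the signal."""
    return max(words, key=lambda w: SequenceMatcher(None, signal, w).ratio())


class StoredVoiceRecognizer:
    """Keeps the first utterance so it can be matched again later."""

    def __init__(self, dictionaries):
        self.dictionaries = dictionaries
        self.stored_signal = None  # plays the role of voice storage unit 66

    def first_input(self, signal, dictionary_name):
        # Corresponds to step 207: remember the signal, then recognize it
        self.stored_signal = signal
        return best_match(signal, self.dictionaries[dictionary_name])

    def rescore(self, new_dictionary_name):
        # Corresponds to step 216: match the stored signal against the
        # newly determined lowest dictionary without a new utterance
        if self.stored_signal is None:
            raise RuntimeError("no stored voice signal")
        return best_match(self.stored_signal,
                          self.dictionaries[new_dictionary_name])


# Usage: "other store" is a placeholder entry, not from the specification.
recognizer = StoredVoiceRecognizer({
    "h": ["sakuragicho", "kannai"],
    "k": ["sogo", "other store"],
})
first_result = recognizer.first_input("sogo", "h")  # wrong dictionary
second_result = recognizer.rescore("k")             # no re-utterance needed
```

After the dictionary change, `second_result` is obtained from the stored signal alone, which is the saving in utterances that this embodiment describes.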
[0048]
Next, a third embodiment will be described.
FIG. 8 is a block diagram showing the configuration of the third embodiment.
In this embodiment, a signal processing device 7 is used instead of the signal processing device 6 in the second embodiment shown in FIG. The voice signal from the voice input unit 73 is output to the word selection unit 62, and the first voice signal is stored in the voice storage unit 66. When the partial dictionary is changed, recognition is performed using the voice signal stored in the voice storage unit 66.
[0049]
In the first and second embodiments, the hierarchy is changed via the touch panel of the monitor 60; in this embodiment, it is changed by voice. That is, this embodiment differs from the first and second embodiments in that the first speech input is treated as the object and subsequent speech inputs are treated as hierarchy changes. Instead of the monitor 60, a display-only monitor 70 is used. Everything else is the same as in the first embodiment.
[0050]
Next, the operation flow of the apparatus will be described with reference to the flowchart of FIG.
In steps 201 to 216, the same processes as those in the flowchart of FIG. 7 in the second embodiment are performed except for steps 310 and 313.
That is, first, the lowest partial dictionary in the hierarchy is set as the first recognition target and is read from the external storage device 503. While the user's voice signal is being captured, the degree of coincidence with each word in the set partial dictionary is calculated. When capture of the voice signal ends, the voice storage unit 66 stores the voice signal.
Then, in step 208, the word having the highest degree of matching is selected as a result of the recognition.
[0051]
In step 310, the voice input unit 73 detects whether or not the switch 505 has been operated. If the switch 505 has been operated, the process proceeds to step 313 because there is a voice input.
In step 313, all the partial dictionaries higher than the currently used partial dictionary are read from the external storage device 503.
After that, as in the first and second embodiments, voice input of the upper hierarchy is performed, and when a word indicating the hierarchy is specified, the lowest partial dictionary under the hierarchy is read from the external storage device 503.
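Step 313, which reads every partial dictionary above the current one, amounts to walking up the hierarchy. The Python sketch below assumes a simple parent map in which the upper dictionaries of h are chained as g, f, e, a; that ordering and the function name are assumptions made for illustration.

```python
# Assumed parent links for the chain of dictionaries above h.
PARENT = {"h": "g", "g": "f", "f": "e", "e": "a", "a": None}


def upper_dictionaries(name, parent=PARENT):
    """List every dictionary above `name`, nearest first."""
    chain = []
    current = parent.get(name)
    while current is not None:
        chain.append(current)
        current = parent.get(current)
    return chain
```

Under these assumptions, `upper_dictionaries("h")` yields the four dictionaries that would be read in step 313, and the topmost dictionary `a` has none above it.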
[0052]
In step 216, the degree of coincidence with the words in the partial dictionary is calculated using the voice signal stored in the voice storage unit 66. The word with the highest degree of coincidence is then selected in step 208.
If it is detected in step 310 that there is no operation on the switch 505, the word recognized in step 211 is output as an output of the device.
In this way, for example, when the partial dictionary h is initially set and the word to be input is "Sogo", which is not in that dictionary, the user only has to say "department store". As shown in FIG., the hatched partial dictionaries, including the upper dictionaries of "Sogo", have already been set automatically, so no troublesome operation is required even when the hierarchy is changed.
[0053]
That is, when the device responds to the first utterance with a word from a hierarchy different from the one the user intended, the user simply utters the name of the intended hierarchy. This realizes a natural dialogue of utterance, reply, and detailed explanation, so that voice can be input in the most natural manner.
Because partial dictionaries of several hierarchies are set at once, there is a concern that the recognition rate for the utterance may fall. However, the words in the upper partial dictionaries are generally classification words, and there are far fewer of them than words in the lowest partial dictionary, so the recognition rate is sufficient for practical use.
[0054]
Next, a modification of the third embodiment will be described.
In each of the above embodiments, even if the lowest partial dictionary is set correctly, words may be misrecognized due to noise or the like. In that case, setting the partial dictionaries of all the upper hierarchies again only leads back to the original partial dictionary, resulting in unnecessary inputs. This modification deals with such a case: even after a misrecognition, words in the same partial dictionary can be recognized without changing the hierarchy.
[0055]
FIG. 11 is a flowchart showing the flow of the operation of the apparatus.
This flowchart is obtained by replacing steps 313 and 214 of the third embodiment in FIG. with steps 412, 413, and 414.
That is, first, the lowest partial dictionary in the hierarchy is set as the first recognition target and is read from the external storage device 503. While the user's voice signal is being captured, the degree of coincidence with each word in the set partial dictionary is calculated. When capture of the voice signal ends, the voice storage unit 66 stores the voice signal.
The word having the highest degree of coincidence is selected as the recognition result and, in step 209, is displayed on the monitor 70 together with its upper hierarchies.
[0056]
Then, in step 310, the voice input unit 73 detects whether or not the switch 505 has been operated. If the switch 505 has been operated, the process proceeds to step 412 because there is a voice input.
In step 412, the currently used partial dictionary and all partial dictionaries above it are read.
In step 413, the voice spoken by the user is input, and the degree of coincidence with the words in the read partial dictionaries is calculated.
[0057]
In step 414, it is determined whether the word having the highest degree of coincidence belongs to a partial dictionary in an upper hierarchy. If it does, the utterance is a hierarchy change, and the process proceeds to step 215 to perform the same processing as in the third embodiment. If the word belongs to the lowest partial dictionary, it is a recognized object, so the process proceeds to step 209, and the word is output in step 211 after it is determined in step 310 whether the switch has been operated again.
[0058]
For example, suppose the partial dictionary h is initially set and the word to be input is "Sakuragicho". The voice signal of "Sakuragicho" may be distorted by noise or the like and misrecognized, so that "Kannai" is displayed. It is also conceivable that the user intends to say "Sakuragicho" but says "Kannai" by mistake. Even in such cases, where the partial dictionary itself is correct, the partial dictionary h is set automatically together with all of its upper partial dictionaries, as indicated by hatching in FIG. 12, so the user can simply re-utter "Sakuragicho" and complete the voice input without changing the hierarchy. As a result, the hierarchy is changed only when necessary, and the voice input burden is further reduced. In addition, as in the third embodiment, voice can be input in a natural dialogue style of utterance, reply, and detailed explanation.
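The decision made in steps 412 to 414 of the modification can be sketched in Python as follows. The function name, the labels "object" and "hierarchy", and the use of `difflib` string similarity in place of acoustic matching are all assumptions for illustration; only the words themselves come from the text.

```python
from difflib import SequenceMatcher


def classify_second_utterance(signal, lowest_words, upper_words):
    """Match a follow-up utterance against the current lowest dictionary
    and its upper dictionaries together, and report which kind of word won.

    Returns (word, kind), where kind is "object" if the best match is in
    the lowest dictionary and "hierarchy" if it is a classification word.
    """
    candidates = [(w, "object") for w in lowest_words]
    candidates += [(w, "hierarchy") for w in upper_words]
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, signal, c[0]).ratio(),
    )
```

With `["sakuragicho", "kannai"]` as the lowest dictionary and `["station", "department store", "hotel"]` as the upper words, re-uttering "sakuragicho" is classified as an object and handled without a hierarchy change, whereas "department store" is classified as a hierarchy change.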
[0059]
[Effects of the Invention]
As described above, according to the present invention, the first word spoken can be the object itself. The user therefore inputs the object first and, only if the result is not correct, changes the hierarchy and inputs the object again to have it recognized, so speech can be input with a natural feeling. Also, when changing dictionaries, it is only necessary to go back from the current partial dictionary to an upper partial dictionary that covers the object and then determine the lowest partial dictionary, so fewer inputs are needed and the input burden is reduced.
[0060]
When the partial dictionary determining means, on receiving an instruction to change to an upper hierarchy, determines a partial dictionary in the upper hierarchy, changes to the hierarchy indicated by the word selected by the word selecting means through voice input, and determines a predetermined lowest partial dictionary under that hierarchy, there is no need to recognize each intermediate partial dictionary down to the one containing the object even when the hierarchy is changed, so a large input load is avoided.
Further, if the partial dictionary determined first by the partial dictionary determining means is the lowest partial dictionary used last time, the user can understand the dictionary configuration and change the hierarchy smoothly.
The same effect is obtained by using a frequently used lowest partial dictionary in place of the lowest partial dictionary used last time.
[0061]
When display means is connected to the voice input device and the word selected by the word selecting means is displayed together with its upper hierarchical structure, a necessary hierarchy change can be made on the basis of the displayed structure, so the input load is further reduced.
If the display means has a touch panel and outputs a hierarchy change command corresponding to the hierarchy touched by the user to the partial dictionary determining means, the input becomes interactive and the input load is further reduced.
In addition, when the hierarchies that can be selected by voice input are displayed on the display means in response to a touch input, the structure of the dictionary can be seen on the display screen, and the hierarchy can be changed without prior knowledge of the dictionary configuration.
[0062]
When the partial dictionary determining means treats a speech input that follows the object as an instruction to change to an upper hierarchy, determines all partial dictionaries above the set lowest partial dictionary, and changes to the hierarchy indicated by the word selected by the word selecting means, no separate operation for changing the hierarchy is needed, and the input load is further reduced.
[0063]
When the partial dictionary determining means, on a speech input that follows the object, determines the set lowest partial dictionary together with all partial dictionaries above it, and treats the input as an instruction to change to an upper hierarchy only if the word selected by the word selecting means belongs to an upper partial dictionary, changing to the hierarchy indicated by that word, a hierarchy change caused by misrecognition of a word is prevented while the partial dictionary is correct, so useless inputs are avoided.
[0064]
When storage means for storing the first input voice signal is provided and, upon a hierarchy change, the word selecting means uses the voice signal stored in the storage means to calculate the degree of coincidence with the words in the lowest partial dictionary determined by the change and outputs the result, the object need not be spoken again even though the hierarchy and the lowest partial dictionary have changed, and the number of voice inputs is further reduced. In addition, when the device outputs a word from a hierarchy different from the one the user intended for the first utterance, the user simply utters the name of the intended hierarchy, as in the second embodiment, so that a natural dialogue of utterance, reply, and detailed explanation is realized, and voice can be input in the most natural manner.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a first embodiment.
FIG. 2 is a diagram showing a configuration of a hierarchical word dictionary.
FIG. 3 is a flowchart of the first embodiment.
FIG. 4 is a diagram showing a monitor display screen.
FIG. 5 is a diagram showing a monitor display screen when a partial dictionary is changed.
FIG. 6 is a block diagram showing a configuration of a second embodiment.
FIG. 7 is a flowchart of the second embodiment.
FIG. 8 is a block diagram illustrating a configuration of a third embodiment.
FIG. 9 is a flowchart of the third embodiment.
FIG. 10 is a diagram illustrating an upper partial dictionary temporarily set when the lowermost partial dictionary is changed.
FIG. 11 is a flowchart showing a modification of the third embodiment.
FIG. 12 is a diagram illustrating a partial dictionary temporarily set when the lowest partial dictionary is changed.
FIG. 13 is a block diagram showing a configuration of a conventional example.
FIG. 14 is a flowchart of a conventional example.
[Explanation of symbols]
5, 6, 7 signal processing device
51, 71 partial dictionary determination unit
52, 62 word selection unit
53, 63, 73 voice input unit
54, 64 word output unit
60, 70, 504 monitor
66 voice storage unit
500 microphone
501 A/D converter
502 signal processing device
503 external storage device
505 switch
a, b, c, d, e, f partial dictionary
g, h, i, j, k partial dictionary

Claims

A voice input device comprising:
voice input means for converting uttered voice into information and inputting the information;
a word dictionary consisting of partial dictionaries each including a plurality of words and having a hierarchical structure;
partial dictionary determining means for determining a partial dictionary to be used;
calculating means for calculating the degree of coincidence between the voice information from the voice input means and the words in the determined partial dictionary; and
word selecting means for selecting and outputting the word having the highest calculated degree of coincidence,
wherein the partial dictionary determining means first determines a predetermined lowest partial dictionary and, upon receiving an instruction to change to an upper hierarchy, determines the designated partial dictionary in the upper hierarchy, changes to the hierarchy indicated by the word selected by voice input, and determines a predetermined lowest partial dictionary under that hierarchy.

The voice input device according to claim 1, wherein the partial dictionary first determined by the partial dictionary determining means is the lowest partial dictionary used last time.

The voice input device according to claim 1, wherein the partial dictionary first determined by the partial dictionary determining means is a frequently used lowest partial dictionary.

The voice input device according to claim 1, wherein display means is connected to the voice input device, and the word selected by the word selecting means and its upper hierarchical structure are displayed.

The voice input device according to claim 4, wherein the display means has a touch panel, and a hierarchy change command corresponding to the hierarchy touched by the user is output to the partial dictionary determining means.

The voice input device according to claim 5, wherein the display means displays the hierarchies that can be changed to when there is a touch input.

The voice input device according to claim 1, wherein, when a voice input is made following the object, the partial dictionary determining means determines that an instruction to change to an upper hierarchy has been received, determines all partial dictionaries above the set lowest partial dictionary, and changes to the hierarchy indicated by the word selected by the word selecting means.

The voice input device according to claim 1, wherein, when a voice input is made following the object, the partial dictionary determining means determines the set lowest partial dictionary and all partial dictionaries above it and, if the word selected by the word selecting means is a word in an upper partial dictionary, determines that an instruction to change to an upper hierarchy has been received and changes to the hierarchy indicated by the word.

The voice input device according to claim 1, 7, or 8, further comprising storage means for storing the first input voice signal, wherein, when the hierarchy is changed, the word selecting means uses the voice signal stored in the storage means to calculate the degree of coincidence with the words in the most recently determined lowest partial dictionary and outputs the result.