JP2004219747A

JP2004219747A - Device and method for speech recognition, and program

Info

Publication number: JP2004219747A
Application number: JP2003007378A
Authority: JP
Inventors: Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-01-15
Filing date: 2003-01-15
Publication date: 2004-08-05
Anticipated expiration: 2023-01-15
Also published as: JP3695448B2

Abstract

PROBLEM TO BE SOLVED: To improve speech recognition performance by referring to both the speech recognition result obtained for the previous or an earlier utterance and the currently obtained speech recognition result, while still obtaining the correct answer even when it does not appear in both results.

SOLUTION: The device comprises: a recognition part 100 that recognizes speech for each utterance and calculates a likelihood indicating how plausible each recognition result candidate is; a recognition result storage part 110 that stores the recognition result candidates of each utterance together with their likelihoods; a reliability calculation part 120 that calculates, for each utterance, a reliability, which is a score normalized from the likelihoods of the recognition result candidates; a result selection part 130 that selects a recognition result, according to the reliability, from among the stored recognition result candidates of all utterances; a misrecognition result storage part 150 that stores recognition results judged to be misrecognitions for utterances up to the previous one; and a result filtering part 160 that removes the stored misrecognition results from the selected recognition results and reselects a recognition result.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

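[0001]
[Technical field of the invention]
The present invention relates to a speech recognition device, a speech recognition method, and a program, and in particular to a speech recognition technique that, when a user's utterance has been misrecognized and the user inputs an utterance of the same content again, uses both the recognition result of the misrecognized utterance and the recognition result of the re-input utterance.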
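[0002]
[Prior art]
Japanese Patent Application Laid-Open No. Hei 10-133684 describes an example of a conventional speech recognition device. That device uses the previously recognized recognition result candidates and the newly recognized recognition result candidates to select the recognition result (excluding misrecognized results) that has the highest probability of matching both utterances. Specifically, to find the highest probability, it detects the recognition result candidates common to both candidate sets and selects using a composite probability obtained by multiplying their probabilities.
[0003]
As a specific example, FIG. 9 shows the recognition result candidates, and the corresponding probabilities, obtained when a speaker utters “make” twice. Referring to FIG. 9, if the decision is based only on the newly recognized result for the second utterance, “Fake” is erroneously selected because its probability of 0.4 is the highest. Even if “Fake”, which was misrecognized on the first utterance, is removed to reflect that outcome, “Mace”, which has the next highest probability of 0.3, is erroneously selected.
[0004]
However, when the misrecognized “Fake” is removed and the composite probability of each recognition result candidate over the previous and current utterances is examined, “Make” is 0.06 (= 0.3 × 0.2), “Mace” is 0.03 (= 0.1 × 0.3), and “Bake” is 0.01 (= 0.1 × 0.1), so “Make”, which has the highest composite probability, can be selected correctly.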
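[0005]
Japanese Patent Application Laid-Open No. 2000-250585 describes determining the voice search key for a database search as follows: the likelihood of each voice search key candidate for the input speech (for example, a municipality name) and the likelihood of related information, obtained by recognizing the spoken response to a question about an attribute item of the voice search key (for example, the prefecture name), are each normalized and then multiplied to give a recognition likelihood. Although this technique normalizes the likelihoods, it is not intended to reconcile the current result with a previous misrecognized result, and it still takes the product of the two likelihoods as the recognition likelihood.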
[0006]
[Patent Document 1]
JP-A-10-133684 (page 5)
[Patent Document 2]
JP 2000-250585 A
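[0007]
[Problems to be solved by the invention]
A speech recognition device that refers to the previous result as described above has the problem that the correct answer cannot be obtained unless it appears in both the previously recognized result and the currently recognized result. That is, when the correct answer appears in only one of them, the multiplication makes its composite probability 0, the minimum value, so it is not selected. This problem is also unsolved in the speech recognition device that multiplies the result for the input speech by the result for the related-information input.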
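The failure mode described above can be reproduced with a minimal sketch of product-based combination. The function name `product_combine`, the candidate labels, and the probabilities are illustrative assumptions and are not taken from the cited publications or from any figure.

```python
def product_combine(first, second):
    """Composite score per candidate as the product of its two probabilities."""
    combined = {}
    for cand in set(first) | set(second):
        # A candidate missing from one utterance is treated as probability 0 there.
        combined[cand] = first.get(cand, 0.0) * second.get(cand, 0.0)
    return combined


# Illustrative probabilities only (assumed values).
first = {"wrong_a": 0.5, "correct": 0.4, "wrong_b": 0.1}
second = {"wrong_a": 0.6, "wrong_b": 0.4}  # the correct answer is absent here

scores = product_combine(first, second)
scores.pop("wrong_a")  # already rejected by the user after the first utterance
best = max(scores, key=scores.get)
print(scores["correct"], best)
```

Because the correct candidate is missing from the second list, its product is 0 and a wrong candidate is chosen even after the rejected one is removed.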
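[0008]
An object of the present invention is to provide a speech recognition device, a speech recognition method, and a program that improve speech recognition performance by referring to both the previously obtained and the currently obtained speech recognition results while still allowing the correct answer to be obtained even when it does not appear in both.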
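[0009]
[Means for solving the problems]
A first speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it selects a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0010]
A second speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selects a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0011]
A third speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selects a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0012]
A fourth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and a result selection unit that selects a recognition result, on the basis of the reliability, from among the stored recognition result candidates of each utterance.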
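[0013]
A fifth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a result selection unit that selects a recognition result, on the basis of the reliability, from among the stored recognition result candidates of each utterance; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects a recognition result.
[0014]
A sixth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; a result filtering unit that removes the misrecognition results from the recognition result candidates of each utterance; and a result selection unit that selects a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results are removed.
[0015]
A seventh speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, further obtains a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate, and selects a recognition result, on the basis of the combined reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0016]
An eighth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; an arithmetic mean calculation unit that obtains a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; a result selection unit that selects a recognition result, on the basis of the combined reliability, from among the stored recognition result candidates of each utterance; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects.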
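[0017]
A first speech recognition method of the present invention is characterized by selecting, when utterances of the same content are input a plurality of times, a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0018]
A second speech recognition method of the present invention is characterized by calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selecting a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0019]
A third speech recognition method of the present invention is characterized by calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selecting a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0020]
A fourth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance.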
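[0021]
A fifth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance; and removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting a recognition result.
[0022]
A sixth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; removing from the stored recognition result candidates the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and selecting a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results are removed.
[0023]
A seventh speech recognition method of the present invention is characterized by: calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; further obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; and selecting a recognition result, on the basis of the combined reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0024]
An eighth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; selecting a recognition result, on the basis of the combined reliability, from among the stored recognition result candidates of each utterance; and removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting.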
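[0025]
A first program of the present invention causes a computer to execute a procedure of selecting, when utterances of the same content are input a plurality of times, a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0026]
A second program of the present invention causes a computer to execute: a procedure of calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; and a procedure of selecting a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0027]
A third program of the present invention causes a computer to execute: a procedure of calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; and a procedure of selecting a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0028]
A fourth program of the present invention causes a computer to execute: a procedure of recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and a procedure of selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance.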
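[0029]
A fifth program of the present invention causes a computer to execute: a procedure of recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance; and a procedure of removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting a recognition result.
[0030]
A sixth program of the present invention causes a computer to execute: a procedure of recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of removing from the stored recognition result candidates the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and a procedure of selecting a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results are removed.
[0031]
A seventh program of the present invention causes a computer to execute: a procedure of calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; a procedure of further obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; and a procedure of selecting a recognition result, on the basis of the combined reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0032]
An eighth program of the present invention causes a computer to execute: a procedure of recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; a procedure of selecting a recognition result, on the basis of the combined reliability, from among the stored recognition result candidates of each utterance; and a procedure of removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting.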
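[0033]
[Embodiments of the invention]
Next, embodiments of the present invention are described in detail with reference to the drawings. FIG. 1 is a block diagram showing a first embodiment of the present invention. Referring to FIG. 1, the speech recognition device 10 of the first embodiment includes a recognition unit 100, a reliability calculation unit 120, a result selection unit 130, a correction unit 140, a result filtering unit 160, and a result presentation unit 170, each of which is implemented by or includes a program, and a recognition result storage unit 110 and a misrecognition result storage unit 150, which are provided in storage means.
[0034]
The recognition unit 100 applies a predetermined method to the speech input by the user's utterance and outputs one or more recognition result candidates together with a likelihood indicating how plausible each candidate is. The predetermined method is not particularly limited. The recognition result storage unit 110 is storage means that holds the recognition result candidates and likelihoods output by the recognition unit 100.
[0035]
The reliability calculation unit 120 normalizes the likelihoods of the recognition result candidates of each user utterance and calculates a score called the reliability. The reliability can be calculated independently for each utterance, and its values are comparable across utterances. The result selection unit 130 pools the recognition result candidates of the multiple utterances and selects results in descending order of reliability.
[0036]
The result presentation unit 170 presents the recognition result of the speech recognition device 10 for the user's utterance to the user. The result may be shown on a display device or announced by voice, for example, but the method is not particularly limited. The correction unit 140 provides the function by which the user judges whether the speech recognition result presented by the result presentation unit 170 is correct and indicates a misrecognition. The user may indicate a misrecognition by, for example, operating input means such as a keyboard or touch panel, pressing a designated button, or saying “yes” or “no”, but the method is not particularly limited.
[0037]
The misrecognition result storage unit 150 is storage means that holds the misrecognition results indicated by the user. The result filtering unit 160 removes the misrecognition results stored in the misrecognition result storage unit 150 from the recognition result candidates that the result selection unit 130 has selected in descending order of reliability.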
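[0038]
Next, the operation of the first embodiment is described in detail with reference to the drawings. FIG. 2 is a flowchart showing the operation of the first embodiment. First, when speech uttered by the user is input (S51), the recognition unit 100 performs speech recognition to obtain recognition result candidates and a likelihood indicating how plausible each candidate is (S52), and stores the candidates and their likelihoods in the recognition result storage unit 110 (S53).
[0039]
Next, the reliability calculation unit 120 calculates a reliability for each recognition result candidate in all of the results stored in the recognition result storage unit 110 (S54). Because the reliability can be calculated per utterance, in practice it can be calculated for each utterance's result separately. The reliability P(candidate k, utterance m) is calculated as follows.
P(candidate k, utterance m) = exp(L(candidate k, utterance m)) / Z(utterance m)
Here, L(candidate k, utterance m) is the likelihood of the k-th recognition result candidate of the m-th utterance, and Z(utterance m) is a normalization coefficient that makes P sum to 1 over all recognition result candidates of the m-th utterance.
[0040]
Z(utterance m) is calculated by the following formula.
Z(utterance m) = Σ exp(L(candidate k, utterance m))
where k ranges over all recognition result candidates of utterance m.
L is the likelihood calculated by the recognition unit 100, but it may also, for example, be normalized by the utterance length, or have added to it, with a suitable weight, a language score that is learned from text data and indicates how readily a recognition result occurs.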
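The reliability formula above can be sketched as a per-utterance normalization. This is a minimal sketch that assumes L is a log-domain score (which the exp in the formula suggests); the function name `reliabilities`, the candidate labels, and the score values are illustrative assumptions.

```python
import math


def reliabilities(log_likelihoods):
    """Map candidate -> L to candidate -> P = exp(L) / Z for one utterance."""
    # Subtracting the maximum before exponentiating leaves P unchanged
    # (the common factor cancels in exp(L) / Z) but avoids underflow.
    peak = max(log_likelihoods.values())
    exps = {cand: math.exp(l - peak) for cand, l in log_likelihoods.items()}
    z = sum(exps.values())  # normalization coefficient Z for this utterance
    return {cand: e / z for cand, e in exps.items()}


# Made-up scores for a single utterance (assumed values).
utterance = {"cand_a": -1200.0, "cand_b": -1200.5, "cand_c": -1203.0}
p = reliabilities(utterance)
print(p)
```

Since each utterance is normalized to sum to 1 on its own, values from different utterances land on the same scale, which is what allows them to be pooled and compared.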
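[0041]
The result selection unit 130 selects recognition result candidates in descending order of the reliability calculated by the reliability calculation unit 120 (S55). Here, all recognition result candidates stored in the recognition result storage unit 110 are considered, but the selection may instead be limited to, for example, the top n candidates of each utterance, the results of only the most recent few utterances, or a combination of these.
[0042]
The result filtering unit 160 removes, from the recognition result candidates selected by the result selection unit 130, those that match a misrecognition result stored in the misrecognition result storage unit 150, and reselects (S56). Finally, one or several recognition result candidates are selected as the recognition result in descending order of reliability and presented to the user by the result presentation unit 170 (S57).
[0043]
The user judges whether the result is correct and inputs the judgment (S58). If the user indicates that every presented result is a misrecognition, the correction unit 140 stores the recognition result candidates presented this time in the misrecognition result storage unit 150 (S59), and processing moves on to the user's next utterance. The presented result is not necessarily a single candidate; when several are presented, several misrecognitions are stored in step S59.
[0044]
If the result is not a misrecognition, that is, if speech recognition has succeeded, the result presentation unit 170 initializes the misrecognition result storage unit 150 and the recognition result storage unit 110 (S60), and speech recognition for the utterance content in question ends. Speech recognition of the user's next utterance content can then begin as needed.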
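The flow from S51 to S60 can be sketched as a small session object. This is only a sketch under stated assumptions: the class name `RepeatSession`, its method names, and the score values are hypothetical, the recognizer itself is replaced by hand-written likelihoods, and L is assumed to be a log-domain score.

```python
import math


class RepeatSession:
    """State for repeated utterances of the same content (hypothetical names)."""

    def __init__(self):
        self.results = []      # plays the role of recognition result storage unit 110
        self.rejected = set()  # plays the role of misrecognition result storage unit 150

    def add_utterance(self, log_likelihoods):
        """S53: store candidate -> likelihood L for one utterance."""
        self.results.append(dict(log_likelihoods))

    def ranked(self):
        """S54 to S56: normalize per utterance, pool, sort, drop rejected results."""
        pooled = []
        for scores in self.results:
            peak = max(scores.values())
            z = sum(math.exp(l - peak) for l in scores.values())
            for cand, l in scores.items():
                pooled.append((math.exp(l - peak) / z, cand))
        pooled.sort(reverse=True)  # S55: descending reliability across utterances
        return [(c, p) for p, c in pooled if c not in self.rejected]  # S56

    def feedback(self, presented, correct):
        """S58 to S60: store rejected results, or reset both stores on success."""
        if correct:
            self.results.clear()
            self.rejected.clear()
        else:
            self.rejected.update(presented)


session = RepeatSession()
session.add_utterance({"a": -10.0, "b": -10.2, "c": -11.0})  # assumed scores
first_choice = session.ranked()[0][0]
session.feedback([first_choice], correct=False)              # user rejects it
session.add_utterance({"a": -9.0, "c": -9.2, "d": -9.4})      # "b" is absent here
second_choice = session.ranked()[0][0]
print(first_choice, second_choice)
```

In this made-up run the candidate "b" appears only in the first utterance, yet it is selected after the second utterance because its normalized score is compared directly with the second utterance's scores instead of being multiplied by them.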
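[0045]
As described above, because the speech recognition device 10 of the first embodiment has the reliability calculation unit 120, it can calculate a reliability for each recognition result candidate of each utterance, and those values can be compared directly with the reliabilities of other utterances' results. An appropriate recognition result can therefore be selected according to reliability even when the correct answer is not contained in every utterance, which yields better recognition accuracy.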
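[0046]
Next, the operation of this embodiment is described using a specific example in which the user utters “Iwai” (いわい) twice. First, when the user utters “Iwai” (S51), the recognition unit 100 performs speech recognition and obtains recognition result candidates and their likelihoods (S52).
[0047]
FIG. 3 shows the recognition result candidates and likelihoods obtained for the first and second utterances. The first-utterance result in FIG. 3 is stored in the recognition result storage unit 110 (S53). The strikethrough only illustrates that the recognition result of the first utterance was a misrecognition and was therefore removed at the second utterance; it is not part of the stored result. In FIG. 3, “Hirai” (ひらい) has the highest likelihood for the first utterance, and the correct answer “Iwai” has the next highest.
[0048]
The reliability calculation unit 120 performs the normalization and obtains the first-utterance result shown in FIG. 4 (S54). FIG. 4 shows, for each recognition result candidate, the reliability obtained by normalizing the likelihood in FIG. 3. FIG. 4 shows values for both the first and second utterances, but at this point only the first-utterance result is calculated.
[0049]
The result selection unit 130 selects recognition result candidates in order of the first-utterance reliability (S55). The result filtering unit 160 refers to the misrecognition result storage unit 150, but because this is the first utterance nothing is stored there, so none of the selected candidates is removed (S56). Since “Hirai” has the highest reliability, the result presentation unit 170 presents “Hirai” to the user as the recognition result (S57).
[0050]
The user sees the presented result, judges it to be a misrecognition, and inputs that judgment. On receiving this input, the correction unit 140 stores “Hirai” in the misrecognition result storage unit 150 (S59) and waits for the user's second utterance (S51).
[0051]
When the user inputs the second utterance, the recognition unit 100 performs speech recognition, obtains the second-utterance result shown in FIG. 3 (S52), and stores it in the recognition result storage unit 110 (S53). The reliability calculation unit 120 normalizes the first and second results stored in the recognition result storage unit 110 and obtains the results in FIG. 4 (S54).
[0052]
The result selection unit 130 selects from all first- and second-utterance results in descending order of reliability (S55). Referring to FIG. 4, “Hirai” from the first utterance has the highest reliability, 0.275, and “Iwai” from the first utterance is second with 0.273. These are followed by “Imai” (いまい) from the first utterance, “Shirai” (しらい) from the first utterance, “Hirai” from the second utterance, and “Shirai” from the second utterance.
[0053]
The result filtering unit 160 removes “Hirai”, which is stored in the misrecognition result storage unit 150, from the selected results (S56). That is, the “Hirai” entries shown with strikethrough in FIG. 4 are removed, and “Iwai” from the first utterance now has the highest reliability. The result presentation unit 170 presents “Iwai” to the user as the recognition result (S57). The user inputs that the presented result is correct, so the result presentation unit 170 clears the recognition results and misrecognition results stored in the recognition result storage unit 110 and the misrecognition result storage unit 150 (S60) and prepares for a user utterance with new content.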
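The worked example can be retraced with the reliabilities that the text quotes from FIG. 4. This sketch uses only those quoted values, so the table is partial: candidates whose values are not quoted (for example いまい, "Imai") are omitted, and the variable names are assumptions.

```python
# Reliabilities quoted in the text for FIG. 4 (partial table).
# Romanized keys: Hirai = ひらい, Iwai = いわい, Shirai = しらい.
fig4 = {
    1: {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214},
    2: {"Hirai": 0.176, "Shirai": 0.126},
}
rejected = {"Hirai"}  # marked as a misrecognition after the first utterance

# Pool every (reliability, candidate, utterance) triple and sort in descending order.
pooled = [(p, name, utt) for utt, cands in fig4.items() for name, p in cands.items()]
pooled.sort(reverse=True)

# Remove the rejected candidate wherever it appears, then take the top entry.
remaining = [(name, utt, p) for p, name, utt in pooled if name not in rejected]
best = remaining[0]

# For comparison: using only the quoted second-utterance values after the removal.
second_only = max((n for n in fig4[2] if n not in rejected), key=fig4[2].get)
print(best, second_only)
```

With the pooled list the first-utterance "Iwai" is chosen, whereas the quoted second-utterance values alone would leave "Shirai" on top.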
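[0054]
In this example the correct answer “Iwai” is contained only in the first utterance's result, so if the first and second results were multiplied as in the conventional method, the composite probability of “Iwai” would be 0 and the correct answer could not be obtained. Likewise, if the first result were not used and the misrecognition result were simply removed from the second recognition result, the correct answer still could not be obtained, because it is not among the top candidates of the second utterance.
[0055]
In the present method, the first and second results are each normalized to calculate reliabilities, the recognition result candidates of the first and second utterances are selected in descending order of reliability, and the misrecognition results are then taken into account, so the correct answer can be obtained. We conducted an actual speech recognition experiment on this method. The recognition task covered most Japanese surnames, and for utterances that were misrecognized the first time, speakers were asked to utter the same content once more. The recognition rate on these repeated utterances was 52.8% with the conventional method, which removes the first misrecognition result from the recognition result of the second utterance, and improved to 56.5% with the present method. The present method thus achieves a marked effect that cannot be obtained by simply combining a conventional reliability with the conventional removal of earlier misrecognition results.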
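[0056]
Next, a second embodiment of the present invention is described. FIG. 5 is a block diagram showing the configuration of a speech recognition device 20 according to the second embodiment. Referring to FIG. 5, the speech recognition device 20 differs from the speech recognition device 10 of the first embodiment in that its result filtering unit 260 and result selection unit 230 are executed in a different order from the result filtering unit 160 and result selection unit 130. Components in FIG. 5 that have the same functions as in FIG. 1 carry the same reference numerals, and their description is omitted.
[0057]
The result filtering unit 260 removes the misrecognition results stored in the misrecognition result storage unit 150 from the recognition result candidates whose reliabilities have been calculated by the reliability calculation unit 120. The result selection unit 230 pools the recognition result candidates that remain after the misrecognition results are removed and selects them in descending order of reliability. The result filtering unit 260 and the result selection unit 230 are normally implemented as programs.
[0058]
FIG. 6 is a flowchart showing the operation of the second embodiment. Comparing the flowchart of FIG. 6 with that of FIG. 2, steps S51 to S54 and S57 to S60 of FIG. 2 correspond to steps S61 to S64 and S67 to S70 of FIG. 6, respectively, and steps S55 and S56 of FIG. 2 are executed in the reverse order as steps S65 and S66 of FIG. 6. That is, FIG. 2 first selects recognition result candidates on the basis of reliability (S55) and then removes the misrecognition results (S56), whereas FIG. 6 first has the result filtering unit 260 remove the misrecognition results from the recognition result candidates (S65) and then has the result selection unit 230 select recognition result candidates (S66).
[0059]
The second embodiment thus obtains the same effect as the first embodiment and, in addition, allows the result selection unit 230 to use the rank information of the recognition result candidates after misrecognitions have been removed, before it makes its selection. For example, the top-ranked candidates of each utterance after misrecognitions are removed can be compared with one another on the basis of reliability.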
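The filter-then-select order of the second embodiment, together with the example of comparing each utterance's top remaining candidate, might look like the following sketch. The function name `filter_then_select` is an assumption, and the input reuses only the FIG. 4 reliabilities quoted in the text (Hirai = ひらい, Iwai = いわい, Shirai = しらい).

```python
def filter_then_select(utterances, rejected):
    """utterances: list of dicts mapping candidate -> reliability, one per utterance."""
    tops = []
    for index, cands in enumerate(utterances, start=1):
        # S65: remove misrecognition results before any selection takes place.
        kept = {c: p for c, p in cands.items() if c not in rejected}
        if kept:
            # Rank 1 within this utterance after filtering.
            best = max(kept, key=kept.get)
            tops.append((kept[best], best, index))
    # S66: compare the per-utterance top candidates by reliability.
    return max(tops) if tops else None


utterances = [
    {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214},  # first utterance (partial)
    {"Hirai": 0.176, "Shirai": 0.126},                 # second utterance (partial)
]
choice = filter_then_select(utterances, rejected={"Hirai"})
print(choice)
```

Filtering first means the rank of each candidate within its own utterance is already free of rejected results when the selection step runs, which is the extra information the second embodiment makes available.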
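[0060]
Next, a third embodiment of the present invention is described. FIG. 7 is a functional block diagram of a speech recognition device 30 according to the third embodiment. Referring to FIG. 7, the speech recognition device 30 adds an arithmetic mean calculation unit 380 to the speech recognition device 10 of the first embodiment. Components with the same functions as in FIG. 1 carry the same reference numerals, and their description is omitted.
[0061]
The arithmetic mean calculation unit 380 assigns to each recognition result candidate whose reliability has been calculated by the reliability calculation unit 120 a new reliability equal to the arithmetic mean of that candidate's per-utterance reliabilities. The arithmetic mean calculation unit 380 is normally implemented as a program and, although not illustrated, controls the arithmetic circuit of the speech recognition device 30 to carry out the calculation.
[0062]
When calculating the arithmetic mean, if some utterance does not contain the candidate in question, the mean may be taken with the reliability for that utterance set to 0, or it may be taken over only the utterances that do contain the candidate. For example, if there are five utterances in total and a candidate is contained in only two of them, the mean reliability can be taken over those two utterances.
[0063]
FIG. 8 is a flowchart showing the operation of the third embodiment. Comparing the flowchart of FIG. 8 with that of FIG. 2, steps S51 to S60 of FIG. 2 correspond to steps S71 to S80 of FIG. 8, respectively, and FIG. 8 adds step S81 between steps S74 and S75. In step S81, the arithmetic mean calculation unit 380 takes the reliabilities of the recognition result candidates of each utterance calculated in step S74, calculates the arithmetic mean of the reliabilities of each identical recognition result candidate, and uses this value as the new reliability.
[0064]
For example, when the arithmetic mean of the reliabilities in FIG. 4 is taken over the number of times a candidate appears, “Hirai” becomes (0.275 + 0.176) / 2 = 0.226, “Iwai” keeps its first-utterance value because it does not appear in the second utterance, and “Shirai” becomes (0.214 + 0.126) / 2 = 0.170. As a result, “Iwai” with 0.273 is the highest, and the correct recognition result is obtained at the second utterance.
[0065]
When the arithmetic mean of the reliabilities in FIG. 4 is instead taken over the number of utterances, “Hirai” becomes (0.275 + 0.176) / 2 = 0.226, “Iwai” becomes (0.273 + 0) / 2 = 0.137, and “Shirai” becomes (0.214 + 0.126) / 2 = 0.170. As a result, “Hirai” with 0.226 is the highest, followed by “Shirai” with 0.170, but “Hirai”, which was a misrecognition in the first result, is removed in step S76, so “Shirai” is selected. In this case the correct recognition result is not obtained at the second utterance, so the user judges “Shirai” to be a new misrecognition, it is stored in the misrecognition result storage unit 150, and processing proceeds to a third utterance.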
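The two averaging variants just described can be compared side by side. This is a minimal sketch: the function name `combined_reliability` and its `by_occurrence` flag are assumptions, and the input is limited to the FIG. 4 reliabilities quoted in the text (Hirai = ひらい, Iwai = いわい, Shirai = しらい).

```python
def combined_reliability(utterances, by_occurrence):
    """Arithmetic mean of each candidate's reliabilities across utterances.

    by_occurrence=True  divides by the number of utterances containing the candidate.
    by_occurrence=False divides by the total number of utterances (absent counts as 0).
    """
    totals, counts = {}, {}
    for cands in utterances:
        for cand, p in cands.items():
            totals[cand] = totals.get(cand, 0.0) + p
            counts[cand] = counts.get(cand, 0) + 1
    n = len(utterances)
    return {c: totals[c] / (counts[c] if by_occurrence else n) for c in totals}


utterances = [
    {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214},  # first utterance (partial)
    {"Hirai": 0.176, "Shirai": 0.126},                 # second utterance (partial)
]
rejected = {"Hirai"}

for mode in (True, False):
    scores = combined_reliability(utterances, by_occurrence=mode)
    allowed = {c: s for c, s in scores.items() if c not in rejected}
    print(mode, max(allowed, key=allowed.get))
```

Dividing by the number of occurrences keeps a candidate that appeared only once at full strength, while dividing by the number of utterances penalizes it, which is why the two variants pick different candidates here.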
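[0066]
In the examples above the arithmetic mean gives every utterance the same weight, but the mean can also be calculated with weights that follow the chronological order of the utterances. For example, in FIG. 4 the second-utterance reliabilities could be multiplied by a weight of 1 and the first-utterance reliabilities by a weight between 0 and 1 (for example, 0.8); the choice of weights is not limited. The arithmetic mean may also be calculated in ways other than the two examples above.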
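One possible reading of the time-ordered weighting is sketched below. The text does not say how the weighted values are averaged, so this sketch assumes each reliability is multiplied by its utterance's weight and the sum is divided by the number of utterances; the function name `time_weighted_mean` is hypothetical, and the input reuses only the FIG. 4 reliabilities quoted in the text.

```python
def time_weighted_mean(utterances, weights):
    """Weight each utterance's reliabilities, then average over all utterances.

    Assumption: a candidate absent from an utterance contributes 0 for it, and
    the divisor is the number of utterances, not the sum of the weights.
    """
    if len(utterances) != len(weights):
        raise ValueError("one weight per utterance is required")
    scores = {}
    for cands, w in zip(utterances, weights):
        for cand, p in cands.items():
            scores[cand] = scores.get(cand, 0.0) + w * p
    return {cand: s / len(utterances) for cand, s in scores.items()}


utterances = [
    {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214},  # first utterance (partial)
    {"Hirai": 0.176, "Shirai": 0.126},                 # second utterance (partial)
]
weighted = time_weighted_mean(utterances, weights=[0.8, 1.0])
print(weighted)
```

Other normalizations, such as dividing by the sum of the weights or by the number of occurrences, would be equally consistent with the description.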
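[0067]
It is clear that, besides the first, second, and third configurations described above, a further configuration that adds the function of the arithmetic mean calculation unit 380 to the configuration of the second embodiment can also be realized easily. Because none of these configurations multiplies the likelihoods of the utterances together, the correct answer can be obtained even when some utterance does not yield it as a recognition result candidate.
[0068]
[Effects of the invention]
According to the present invention, a composite probability is obtained without multiplying the normalized score (reliability) of the recognition result of a newly obtained utterance by the normalized scores of the recognition results of earlier utterances, so the correct answer can be obtained even when it does not appear in every utterance, and the recognition rate is improved.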
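[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first embodiment of the present invention.
FIG. 3 is a diagram showing a specific example of likelihoods in the first embodiment of the present invention.
FIG. 4 is a diagram showing a specific example of reliabilities in the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the second embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the second embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the third embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the third embodiment of the present invention.
FIG. 9 is a diagram showing a specific example for explaining the prior art.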
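[Explanation of reference numerals]
10: speech recognition device
100: recognition unit
110: recognition result storage unit
120: reliability calculation unit
130: result selection unit
140: correction unit
150: misrecognition result storage unit
160: result filtering unit
170: result presentation unit
20: speech recognition device
230: result selection unit
260: result filtering unit
30: speech recognition device
380: arithmetic mean calculation unit
[0001]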
[Technical field of the invention]
The present invention relates to a speech recognition device, a speech recognition method, and a program, and in particular to a speech recognition technique that, when a user's utterance has been misrecognized and the user inputs an utterance of the same content again, uses both the recognition result of the misrecognized utterance and the recognition result of the re-input utterance.
[0002]
[Prior art]
Japanese Patent Application Laid-Open No. Hei 10-133684 describes an example of a conventional speech recognition device. That device uses the previously recognized recognition result candidates and the newly recognized recognition result candidates to select the recognition result (excluding misrecognized results) that has the highest probability of matching both utterances. Specifically, to find the highest probability, it detects the recognition result candidates common to both candidate sets and selects using a composite probability obtained by multiplying their probabilities.
[0003]
As a specific example, FIG. 9 shows the recognition result candidates, and the corresponding probabilities, obtained when a speaker utters “make” twice. Referring to FIG. 9, if the decision is based only on the newly recognized result for the second utterance, “Fake” is erroneously selected because its probability of 0.4 is the highest. Even if “Fake”, which was misrecognized on the first utterance, is removed to reflect that outcome, “Mace”, which has the next highest probability of 0.3, is erroneously selected.
[0004]
However, when the misrecognized “Fake” is removed and the composite probability of each recognition result candidate over the previous and current utterances is examined, “Make” is 0.06 (= 0.3 × 0.2), “Mace” is 0.03 (= 0.1 × 0.3), and “Bake” is 0.01 (= 0.1 × 0.1), so “Make”, which has the highest composite probability, can be selected correctly.
[0005]
Japanese Patent Application Laid-Open No. 2000-250585 describes determining the voice search key for a database search as follows: the likelihood of each voice search key candidate for the input speech (for example, a municipality name) and the likelihood of related information, obtained by recognizing the spoken response to a question about an attribute item of the voice search key (for example, the prefecture name), are each normalized and then multiplied to give a recognition likelihood. Although this technique normalizes the likelihoods, it is not intended to reconcile the current result with a previous misrecognized result, and it still takes the product of the two likelihoods as the recognition likelihood.
[0006]
[Patent Document 1]
JP-A-10-133684 (page 5)
[Patent Document 2]
JP 2000-250585 A
[0007]
[Problems to be solved by the invention]
A speech recognition device that refers to the previous result as described above has the problem that the correct answer cannot be obtained unless it appears in both the previously recognized result and the currently recognized result. That is, when the correct answer appears in only one of them, the multiplication makes its composite probability 0, the minimum value, so it is not selected. This problem is also unsolved in the speech recognition device that multiplies the result for the input speech by the result for the related-information input.
[0008]
An object of the present invention is to provide a speech recognition device, a speech recognition method, and a program that improve speech recognition performance by referring to both the previously obtained and the currently obtained speech recognition results while still allowing the correct answer to be obtained even when it does not appear in both.
[0009]
[Means for Solving the Problems]
A first speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it selects a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0010]
A second speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selects a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0011]
A third speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selects a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0012]
A fourth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and a result selection unit that selects a recognition result, on the basis of the reliability, from among the stored recognition result candidates of each utterance.
[0013]
A fifth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a result selection unit that selects a recognition result, on the basis of the reliability, from among the stored recognition result candidates of each utterance; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects a recognition result.
[0014]
A sixth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; a result filtering unit that removes the misrecognition results from the recognition result candidates of each utterance; and a result selection unit that selects a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results are removed.
[0015]
A seventh speech recognition device of the present invention is characterized in that, when utterances of the same content are input a plurality of times, it calculates a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, further obtains a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate, and selects a recognition result, on the basis of the combined reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0016]
An eighth speech recognition device of the present invention is characterized by comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of one or more recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; an arithmetic mean calculation unit that obtains a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; a result selection unit that selects a recognition result, on the basis of the combined reliability, from among the stored recognition result candidates of each utterance; a misrecognition information storage unit that stores misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects.
[0017]
A first speech recognition method of the present invention is characterized by selecting, when utterances of the same content are input a plurality of times, a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0018]
A second speech recognition method of the present invention is characterized by calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selecting a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0019]
A third speech recognition method of the present invention is characterized by calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance, and selecting a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0020]
A fourth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance.
[0021]
A fifth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; selecting a recognition result, on the basis of the calculated reliability, from among the stored recognition result candidates of each utterance; and removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting a recognition result.
[0022]
A sixth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; removing from the stored recognition result candidates the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions; and selecting a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results are removed.
[0023]
A seventh speech recognition method of the present invention is characterized by: calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; further obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; and selecting a recognition result, on the basis of the combined reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0024]
An eighth speech recognition method of the present invention is characterized by: recognizing speech for each utterance and calculating and storing a likelihood indicating the certainty of each of one or more recognition result candidates; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; obtaining a combined reliability by calculating, over the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; selecting a recognition result, on the basis of the combined reliability, from among the stored recognition result candidates of each utterance; and removing from the selected recognition results the stored misrecognition results, that is, recognition results for utterances up to the previous one that were judged to be misrecognitions, and reselecting.
[0025]
A first program of the present invention causes a computer to execute a procedure of selecting, when utterances of the same content are input a plurality of times, a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the probability values that indicate the plausibility of the recognition result candidates so that they are comparable across utterances.
[0026]
A second program of the present invention causes a computer to execute: a procedure of calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; and a procedure of selecting a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.
[0027]
A third program of the present invention causes a computer to execute: a procedure of calculating, when utterances of the same content are input a plurality of times, a reliability by normalizing, for each utterance, the likelihood that indicates the plausibility of each recognition result candidate obtained by speech recognition of that utterance; and a procedure of selecting a recognition result, on the basis of the reliability, from among the recognition result candidates that remain after removing those judged to be misrecognitions up to the previous utterance.
[0028]
A fourth program of the present invention is a program for recognizing a speech for each utterance, calculating and accumulating likelihood indicating certainty for each of a plurality of candidate recognition results, and a likelihood of the recognition result candidate for each utterance. The computer performs a procedure of calculating a reliability that is a score normalized based on, and a procedure of selecting a recognition result from among the recognition result candidates of each utterance accumulated based on the calculated reliability. .
[0029]
According to a fifth program of the present invention, there is provided a program for recognizing a speech for each utterance, calculating and storing a likelihood indicating certainty for each of a single or a plurality of recognition result candidates, and a likelihood of the recognition result candidate for each utterance. Calculating the confidence level, which is a score normalized based on the utterance, selecting the recognition result from among the recognition result candidates of each utterance accumulated based on the calculated reliability, and deciding the previous utterance And re-selecting the recognition result by removing from the selected recognition result the accumulated recognition error result which is determined to be a false recognition with respect to the recognition result with respect to.
[0030]
According to a sixth program of the present invention, there is provided a program for recognizing a speech for each utterance, calculating and storing a likelihood indicating certainty for each of a single or a plurality of recognition result candidates, and a likelihood of the recognition result candidate for each utterance. A procedure for calculating the reliability, which is a score normalized based on the utterance, and a procedure for removing the accumulated erroneous recognition results determined as erroneous recognition with respect to the recognition result for the previous utterance from the accumulated recognition result candidates. And selecting a recognition result from among the recognition result candidates of each utterance after removing the erroneous recognition result.
[0031]
According to a seventh program of the present invention, when an utterance of the same content is input a plurality of times, the likelihood indicating the likelihood of each recognition result candidate obtained as a result of speech recognition for the utterance is normalized for each utterance to obtain a reliable The procedure for calculating the degree, the procedure for calculating the composite reliability by calculating the arithmetic mean for the same recognition result candidate from all the recognition result candidates for all utterances, and the recognition determined to be erroneous recognition until the previous utterance Selecting a recognition result based on the combined reliability from the recognition result candidates from which the result candidates have been removed.
[0032]
According to an eighth program of the present invention, there is provided a program for recognizing a speech for each utterance, calculating and storing a likelihood indicating certainty for each of a plurality of recognition result candidates, and a likelihood of the recognition result candidate for each utterance. A procedure for calculating a reliability which is a score normalized on the basis of the above, a procedure for calculating a composite reliability by calculating an arithmetic mean for each of the same recognition result candidates from the recognition result candidates of all utterances, The procedure for selecting a recognition result from among the recognition result candidates of each utterance accumulated and the recognition result determined as erroneous recognition with respect to the recognition result for the previous utterance from the selected recognition result. Removing and reselecting the computer.
[0033]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a first embodiment of the present invention. Referring to FIG. 1, a speech recognition apparatus 10 according to the first embodiment of the present invention includes a recognition unit 100, a reliability calculation unit 120, a result selection unit 130, a correction unit 140, a result filtering unit 160, and a result presentation unit 170, each realized by or including a program, as well as a recognition result storage unit 110 and an erroneous recognition result storage unit 150 provided in a storage unit.
[0034]
The recognition unit 100 applies a predetermined method to the speech input by the user's utterance and outputs one or more recognition result candidates together with a likelihood indicating the certainty of each candidate. The predetermined method is not particularly limited. The recognition result storage unit 110 is a storage unit that stores the recognition result candidates and likelihoods output by the recognition unit 100.
[0035]
The reliability calculation unit 120 normalizes the likelihoods of the recognition result candidates for each user utterance and calculates a score called reliability. The reliability can be calculated independently for each utterance, and its values can be compared across utterances. The result selection unit 130 pools the recognition result candidates of multiple utterances and selects results in descending order of reliability.
[0036]
The result presentation unit 170 presents to the user the recognition result of the speech recognition apparatus 10 for the user's utterance, for example by showing it on a display device or announcing it by voice; the method is not particularly limited. The correction unit 140 lets the user judge whether the speech recognition result presented by the result presentation unit 170 is correct and point out a misrecognition. The user may point out a misrecognition by, for example, operating an input means such as a keyboard or a touch panel, pressing a predetermined button, or saying "yes" or "no"; again, the method is not particularly limited.
[0037]
The misrecognition result accumulation unit 150 is a storage unit that accumulates misrecognition results indicated by the user. The result filtering unit 160 has a function of removing the erroneous recognition results stored in the erroneous recognition result storage unit 150 from the recognition result candidates whose results have been selected in descending order of reliability by the result selection unit 130.
[0038]
Next, the operation of the first embodiment of the present invention will be described in detail with reference to the drawings. FIG. 2 is a flowchart showing the operation of the first embodiment of the present invention. First, when speech uttered by the user is input (S51), the recognition unit 100 performs speech recognition and obtains recognition result candidates and a likelihood indicating the certainty of each candidate (S52). The recognition result candidates and their likelihoods are then stored in the recognition result storage unit 110 (S53).
[0039]
Next, the reliability calculation unit 120 calculates the reliability of each recognition result candidate for all the results stored in the recognition result storage unit 110 (S54). In practice, because the calculation can be performed independently for each utterance, it suffices to perform it on the result of each utterance. The reliability P(k, m) of candidate k of utterance m is calculated as follows.
$$P(k, m) = \frac{\exp\bigl(L(k, m)\bigr)}{Z(m)}$$
Here, L(k, m) is the likelihood of the k-th recognition result candidate of the m-th utterance, and Z(m) is a normalization coefficient that makes the sum of P over all recognition result candidates of the m-th utterance equal to 1.
[0040]
Z(m) is calculated as follows.
$$Z(m) = \sum_{k} \exp\bigl(L(k, m)\bigr)$$
where k ranges over all recognition result candidates of utterance m.
L is the likelihood calculated by the recognition unit 100. It may, for example, be normalized by the utterance length, or a linguistic score, learned from text data and indicating how likely the recognition result is to occur, may be added with an appropriate weight.
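The normalization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reliability` and the likelihood values are hypothetical, and L is treated as a log-domain score because the formula exponentiates it. Subtracting the largest score before exponentiating is a standard numerical safeguard that does not change the ratio.

```python
import math

def reliability(likelihoods):
    """Return P(k, m) = exp(L(k, m)) / Z(m) for the candidates of one utterance."""
    # Shifting every score by the maximum cancels in the ratio and avoids overflow.
    peak = max(likelihoods.values())
    exps = {cand: math.exp(score - peak) for cand, score in likelihoods.items()}
    z = sum(exps.values())  # normalization coefficient Z(m), up to the shared shift
    return {cand: value / z for cand, value in exps.items()}

# Hypothetical likelihoods for one utterance (not the values of FIG. 3).
utterance = {"Hirai": -10.0, "Iwai": -10.2, "Shirai": -11.0}
scores = reliability(utterance)
```

Because each utterance is normalized on its own, the resulting values sum to 1 per utterance and can be compared across utterances.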
[0041]
The result selection unit 130 selects recognition result candidates in descending order of the reliability calculated by the reliability calculation unit 120 (S55). Here, all the recognition result candidates stored in the recognition result storage unit 110 are targeted. Alternatively, only the top n candidates of each utterance may be targeted, only the results of the several most recent utterances may be targeted, or these restrictions may be combined.
[0042]
The result filtering unit 160 removes, from the recognition result candidates selected by the result selection unit 130, those that match a misrecognition result stored in the erroneous recognition result storage unit 150 (S56). Finally, one or several of the remaining candidates are taken as recognition results in descending order of reliability and presented to the user by the result presentation unit 170 (S57).
[0043]
The user judges whether the presented result is correct and inputs that judgment (S58). When the user indicates that every presented result is a misrecognition, the correction unit 140 stores the recognition result candidates presented this time in the erroneous recognition result storage unit 150 (S59), and processing moves on to the input of the user's next utterance. The presented result is not always a single candidate; when a plurality of recognition results are presented, a plurality of misrecognition results are stored in step S59.
[0044]
If the result is not a misrecognition, that is, if the speech recognition has succeeded, the result presentation unit 170 initializes the erroneous recognition result storage unit 150 and the recognition result storage unit 110 (S60), and speech recognition for the user's current utterance content ends. Speech recognition for the user's next utterance content can then be started as needed.
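The flow of steps S53 to S60 can be sketched as a small stateful object. This is a minimal illustration, not the apparatus itself: the class `RecognitionSession`, its method names, and the candidate names and reliabilities are all hypothetical, and the reliabilities are assumed to have been computed already for each utterance.

```python
class RecognitionSession:
    """Holds the state kept between repeated utterances of the same content."""

    def __init__(self):
        self.utterances = []        # role of the recognition result storage unit 110
        self.misrecognized = set()  # role of the erroneous recognition result storage unit 150

    def add_utterance(self, reliabilities):
        # Store one utterance's candidates with their reliabilities (S53, S54).
        self.utterances.append(dict(reliabilities))

    def best(self):
        # Pool all stored candidates and sort by descending reliability (S55).
        pooled = [(rel, cand) for u in self.utterances for cand, rel in u.items()]
        pooled.sort(reverse=True)
        # Skip candidates already rejected by the user (S56).
        for _, cand in pooled:
            if cand not in self.misrecognized:
                return cand
        return None

    def reject(self, candidate):
        # The user pointed out a misrecognition (S59).
        self.misrecognized.add(candidate)

    def accept(self):
        # Recognition succeeded: clear both stores (S60).
        self.utterances.clear()
        self.misrecognized.clear()

# Hypothetical two-utterance session.
session = RecognitionSession()
session.add_utterance({"A": 0.5, "B": 0.4, "C": 0.1})
first_choice = session.best()
session.reject(first_choice)
session.add_utterance({"A": 0.7, "C": 0.3})
second_choice = session.best()
```

In this sketch the second choice comes from the first utterance's candidates, which is the situation the pooled selection is meant to handle.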
[0045]
As described above, the speech recognition apparatus 10 according to the first embodiment of the present invention includes the reliability calculation unit 120, so a reliability can be calculated for each recognition result candidate of each utterance, and that value can be compared directly with the reliability of the results of other utterances. Therefore, even if not every utterance includes the correct answer among its candidates, an appropriate recognition result can be selected according to the reliability, and better recognition accuracy is obtained.
[0046]
Next, the operation of the present embodiment will be described using a specific example in which the user utters "Iwai" twice. First, when the user utters "Iwai" (S51), the recognition unit 100 executes speech recognition and obtains recognition result candidates and their likelihoods (S52).
[0047]
FIG. 3 is a diagram showing the recognition result candidates of the first and second utterances and their likelihoods. The first result in FIG. 3 is stored in the recognition result storage unit 110 (S53). The strikethrough is added only to indicate that the recognition result of the first utterance was a misrecognition and was removed at the time of the second utterance; it is not part of the stored result. In FIG. 3, the likelihood of "Hirai" is the highest for the first utterance, and the likelihood of the correct answer "Iwai" is the next highest.
[0048]
The reliability calculation unit 120 performs the normalization and obtains the first result shown in FIG. 4 (S54). FIG. 4 is a diagram showing, for each recognition result candidate, the reliability obtained by normalizing the likelihood of the recognition result candidate shown in FIG. 3. FIG. 4 shows the values for both the first and second utterances, but at this point only the first result is calculated.
[0049]
The result selection unit 130 selects recognition result candidates in descending order of the first-utterance reliability (S55). The result filtering unit 160 refers to the erroneous recognition result storage unit 150, but nothing has been stored there because this is the first utterance. Therefore no selected candidate is removed (S56), and "Hirai" has the highest reliability. The result presentation unit 170 then presents "Hirai" to the user as the recognition result (S57).
[0050]
The user sees the presented result, judges it to be a misrecognition, and inputs that judgment. The correction unit 140 receives this input, stores "Hirai" in the erroneous recognition result storage unit 150 (S59), and waits for the user's second utterance (S51).
[0051]
When the user inputs the second utterance, the recognition unit 100 executes the voice recognition, obtains the second result shown in FIG. 3 (S52), and stores it in the recognition result storage unit 110 (S53). The reliability calculation unit 120 normalizes the first and second results stored in the recognition result storage unit 110 to obtain the result of FIG. 4 (S54).
[0052]
The result selection unit 130 pools the first and second results and selects candidates in descending order of reliability (S55). Referring to FIG. 4, the first-utterance "Hirai" has the highest reliability, 0.275, followed by the first-utterance "Iwai" at 0.273. From third place on come the first-utterance "Imai", the first-utterance "Shirai", the second-utterance "Hirai", the second-utterance "Shirai", and so on.
[0053]
The result filtering unit 160 removes "Hirai", which is stored in the erroneous recognition result storage unit 150, from the selected results (S56). That is, the "Hirai" entries shown with strikethrough in FIG. 4 are removed, and the first-utterance "Iwai" now has the highest reliability. The result presentation unit 170 presents "Iwai" to the user as the recognition result (S57). Because the user inputs that the presented result is correct, the result presentation unit 170 clears the recognition results and misrecognition results stored in the recognition result storage unit 110 and the erroneous recognition result storage unit 150 (S60) and prepares for a user utterance with new content.
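The second-utterance walk-through can be reproduced with a short Python sketch. The function names are hypothetical, and the dictionaries hold only the FIG. 4 reliabilities that the text quotes; the first-utterance "Imai" is left out because its value is not stated. The candidate rejected after the first utterance is written as "Hirai".

```python
def select_by_reliability(utterances):
    """Pool the candidates of all utterances in descending order of reliability (S55)."""
    pooled = [
        (name, rel, index)
        for index, candidates in enumerate(utterances, start=1)
        for name, rel in candidates.items()
    ]
    return sorted(pooled, key=lambda item: item[1], reverse=True)

def filter_misrecognized(ranked, misrecognized):
    """Drop every entry whose candidate was already rejected by the user (S56)."""
    return [item for item in ranked if item[0] not in misrecognized]

# Reliabilities quoted in the text for FIG. 4 (other candidates omitted).
first = {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214}
second = {"Hirai": 0.176, "Shirai": 0.126}

ranked = select_by_reliability([first, second])
remaining = filter_misrecognized(ranked, {"Hirai"})
best = remaining[0]  # (candidate, reliability, utterance number)
```

With these values the top remaining entry is "Iwai" from the first utterance, matching the result presented in the example.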
[0054]
In this specific example, the correct answer "Iwai" appears only among the candidates of the first utterance. If the first and second results were multiplied as in the related art, the composite probability of "Iwai" would be 0 and the correct answer could not be obtained. Likewise, if the misrecognition result were simply removed from the second recognition result without using the first result, the correct answer could not be obtained, because it is not among the candidates of the second utterance.
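The failure of the multiplicative approach can be checked numerically. The sketch below is a simplification under stated assumptions: the function `product_scores` is hypothetical, the FIG. 4 reliabilities quoted in the text stand in for the probabilities of the related art, and a candidate missing from an utterance is treated as having probability 0 there.

```python
def product_scores(utterances):
    """Multiply each candidate's score across utterances; a missing candidate counts as 0."""
    names = set().union(*utterances)
    scores = {}
    for name in names:
        score = 1.0
        for candidates in utterances:
            score *= candidates.get(name, 0.0)
        scores[name] = score
    return scores

# Reliabilities quoted in the text for FIG. 4 (other candidates omitted).
first = {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214}
second = {"Hirai": 0.176, "Shirai": 0.126}

combined = product_scores([first, second])
# Remove the candidate rejected after the first utterance, then pick the maximum.
remaining = {name: s for name, s in combined.items() if name != "Hirai"}
product_choice = max(remaining, key=remaining.get)
```

Under these assumptions the product for "Iwai" is 0, so the multiplicative choice falls on "Shirai" rather than on the correct answer.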
[0055]
In the present method, the first and second results are each normalized to calculate reliabilities, the candidates of both utterances are pooled and selected in descending order of reliability, and the misrecognition result is taken into account, so the correct answer can be obtained. A speech recognition experiment was conducted with this method on a vocabulary covering most Japanese surnames. For cases in which the first utterance was misrecognized and the same content was uttered again, the recognition rate was 52.8% with the conventional method of removing the first misrecognition result from the recognition result of the second utterance, and it improved to 56.5% with the present method. The present method thus achieves a marked effect that cannot be obtained by simply combining a conventional reliability with the method of removing misrecognition results.
[0056]
Next, a second embodiment of the present invention will be described. FIG. 5 is a block diagram showing the configuration of the speech recognition device 20 according to the second embodiment of the present invention. Referring to FIG. 5, the speech recognition device 20 differs from the speech recognition device 10 according to the first embodiment in that the result filtering unit 260 and the result selection unit 230 are executed in a different order from the result filtering unit 160 and the result selection unit 130. In FIG. 5, components having the same functions as those in FIG. 1 are denoted by the same reference numerals, and their description is omitted.
[0057]
The result filtering unit 260 removes the misrecognition results stored in the erroneous recognition result storage unit 150 from the recognition result candidates whose reliability has been calculated by the reliability calculation unit 120. The result selection unit 230 pools the recognition result candidates from which the misrecognition results have been removed and selects them in descending order of reliability. The result filtering unit 260 and the result selection unit 230 are typically realized by a program.
[0058]
FIG. 6 is a flowchart showing the operation of the second embodiment of the present invention. Comparing the flowchart of FIG. 6 with that of FIG. 2, steps S51 to S54 and S57 to S60 in FIG. 2 correspond to steps S61 to S64 and S67 to S70 in FIG. 6, respectively, while the execution order of the two remaining steps is reversed in steps S65 and S66. That is, in FIG. 2 the recognition result candidates are first selected based on reliability (S55) and the misrecognition results are then removed (S56), whereas in FIG. 6 the result filtering unit 260 first removes the misrecognition results (S65) and the result selection unit 230 then selects the recognition result candidates (S66).
[0059]
As described above, the second embodiment of the present invention provides the same effect as the first embodiment and, in addition, allows the result selection unit 230 to use the rank order of the recognition result candidates after the misrecognition results have been removed. For example, the top-ranked remaining candidate of each utterance can be compared on the basis of reliability.
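The filter-then-select order of the second embodiment, including the comparison of each utterance's top remaining candidate, might look as follows. The function name is hypothetical, and the data are the FIG. 4 reliabilities quoted in the description of the first embodiment, with unstated values omitted and the rejected candidate written as "Hirai".

```python
def filter_then_select(utterances, misrecognized):
    """Remove rejected candidates per utterance (S65), then rank each utterance's top (S66)."""
    tops = []
    for index, candidates in enumerate(utterances, start=1):
        kept = {name: rel for name, rel in candidates.items() if name not in misrecognized}
        if kept:  # an utterance may have no candidates left after filtering
            top = max(kept, key=kept.get)
            tops.append((top, kept[top], index))
    return sorted(tops, key=lambda item: item[1], reverse=True)

# Reliabilities quoted in the text for FIG. 4 (other candidates omitted).
first = {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214}
second = {"Hirai": 0.176, "Shirai": 0.126}

ranking = filter_then_select([first, second], {"Hirai"})
```

Filtering first means the per-utterance rank is computed on the cleaned lists, which is the extra information the first embodiment's order does not expose.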
[0060]
Next, a third embodiment of the present invention will be described. FIG. 7 is a functional block diagram of the speech recognition device 30 according to the third embodiment of the present invention. Referring to FIG. 7, the speech recognition device 30 is different from the speech recognition device 10 according to the first embodiment of the present invention in that an arithmetic mean calculation unit 380 is newly added. The components having the same functions as those in FIG. 1 are denoted by the same reference numerals, and the description of these components will be omitted.
[0061]
The arithmetic mean calculation unit 380 assigns to each recognition result candidate whose reliability has been calculated by the reliability calculation unit 120 the arithmetic mean of its reliabilities over the utterances as a new reliability. The arithmetic mean calculation unit 380 is usually realized by a program and controls an arithmetic circuit (not shown) of the speech recognition device 30 to calculate the arithmetic mean.
[0062]
In calculating the arithmetic mean, if an utterance does not include the recognition result candidate in question, the mean may be taken with that utterance's reliability set to 0, or the mean may be taken only over the utterances that include the candidate. For example, if there are five utterances in total and the candidate appears in only two of them, the mean of the reliability may be taken over those two utterances.
[0063]
FIG. 8 is a flowchart showing the operation of the third embodiment of the present invention. Comparing the flowchart of FIG. 8 with that of FIG. 2, steps S51 to S60 in FIG. 2 correspond to steps S71 to S80 in FIG. 8, respectively, and step S81 is added between steps S74 and S75 in FIG. 8. In step S81, the arithmetic mean calculation unit 380 takes the reliabilities of the recognition result candidates of each utterance calculated in step S74, calculates the arithmetic mean of the reliability of each identical recognition result candidate, and uses this value as the new reliability.
[0064]
For example, when the reliabilities of FIG. 4 are averaged over the number of utterances in which each candidate appears, "Hirai" becomes (0.275 + 0.176) / 2 = 0.226, "Iwai" does not appear in the second utterance and so keeps its first-utterance value of 0.273, and "Shirai" becomes (0.214 + 0.126) / 2 = 0.170. As a result, "Iwai" at 0.273 is the maximum, and the correct recognition result is obtained at the second utterance.
[0065]
When the reliabilities of FIG. 4 are instead averaged over the total number of utterances, "Hirai" becomes (0.275 + 0.176) / 2 = 0.226, "Iwai" becomes (0.273 + 0) / 2 = 0.137, and "Shirai" becomes (0.214 + 0.126) / 2 = 0.170. "Hirai" at 0.226 is then the maximum and "Shirai" at 0.170 is next; because "Hirai", which was misrecognized in the first result, is removed in step S76, "Shirai" is selected. In this case the correct recognition result is not obtained at the second utterance, so the user judges "Shirai" to be a new misrecognition, it is stored in the erroneous recognition result storage unit 150, and processing of the third utterance begins.
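The two averaging variants can be compared side by side in Python. This is an illustrative sketch: the function `mean_reliability` and its `by_appearance` flag are hypothetical names, the inputs are the FIG. 4 reliabilities quoted in the text, and the rejected first-utterance candidate is written as "Hirai".

```python
def mean_reliability(utterances, by_appearance=True):
    """Average each candidate's reliability over the utterances.

    by_appearance=True divides by the number of utterances containing the
    candidate; False divides by the total number of utterances, which is the
    same as counting a missing candidate as reliability 0.
    """
    names = set().union(*utterances)
    means = {}
    for name in names:
        values = [u[name] for u in utterances if name in u]
        divisor = len(values) if by_appearance else len(utterances)
        means[name] = sum(values) / divisor
    return means

def choose(means, misrecognized):
    """Pick the highest-scoring candidate that has not been rejected."""
    kept = {name: m for name, m in means.items() if name not in misrecognized}
    return max(kept, key=kept.get)

# Reliabilities quoted in the text for FIG. 4 (other candidates omitted).
first = {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214}
second = {"Hirai": 0.176, "Shirai": 0.126}

per_appearance = mean_reliability([first, second], by_appearance=True)
per_utterance = mean_reliability([first, second], by_appearance=False)
choice_a = choose(per_appearance, {"Hirai"})
choice_b = choose(per_utterance, {"Hirai"})
```

The unrounded means are 0.2255 for "Hirai" and 0.1365 for the per-utterance "Iwai"; the description rounds them to 0.226 and 0.137. The first variant selects "Iwai" and the second selects "Shirai", as in the two worked examples.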
[0066]
In the above examples, the arithmetic mean gives every utterance the same weight. However, the mean may also be weighted according to the chronological order of the utterances. For example, in FIG. 4 the second-utterance reliability may be given a weight of 1 and the first-utterance reliability a weight between 0 and 1 (for example, 0.8). The choice of weights is not limited to this, and the arithmetic mean calculation is not limited to the two examples above.
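A chronologically weighted mean could be sketched as below. The text does not specify how the weights enter the average, so this sketch assumes a conventional weighted mean (sum of weight times reliability divided by the sum of the weights) taken over the utterances in which a candidate appears. The function name is hypothetical; the weights 0.8 and 1 and the reliabilities are the ones quoted in the text.

```python
def weighted_mean_reliability(utterances, weights):
    """Weighted mean of each candidate's reliability over the utterances containing it."""
    totals = {}
    weight_sums = {}
    for candidates, weight in zip(utterances, weights):
        for name, rel in candidates.items():
            totals[name] = totals.get(name, 0.0) + weight * rel
            weight_sums[name] = weight_sums.get(name, 0.0) + weight
    return {name: totals[name] / weight_sums[name] for name in totals}

# Reliabilities quoted in the text for FIG. 4 (other candidates omitted).
first = {"Hirai": 0.275, "Iwai": 0.273, "Shirai": 0.214}
second = {"Hirai": 0.176, "Shirai": 0.126}

# Older utterance weighted 0.8, most recent utterance weighted 1.
weighted = weighted_mean_reliability([first, second], [0.8, 1.0])
```

Dividing by the number of utterances instead, or by the total weight of all utterances, would be equally compatible with the description; the choice here is only one possibility.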
[0067]
In addition to the configurations of the first, second, and third embodiments described above, a new configuration in which the function of the arithmetic mean calculation unit 380 is added to the configuration of the second embodiment can clearly be realized with ease. In none of these configurations are the likelihoods of the utterances multiplied together, so the correct answer can be obtained even when some utterance does not include it among its recognition result candidates.
[0068]
[Effects of the Invention]
According to the present invention, the combined score is obtained without multiplying the score (reliability) obtained by normalizing the recognition result of a newly obtained utterance by the score obtained by normalizing the recognition result of a previous utterance. Therefore, even when the correct answer does not appear among the candidates of every utterance, the correct answer can be obtained and the recognition rate can be improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing an operation of the first exemplary embodiment of the present invention.
FIG. 3 is a diagram illustrating a specific example of likelihood according to the first embodiment of this invention;
FIG. 4 is a diagram illustrating a specific example of reliability according to the first embodiment of this invention;
FIG. 5 is a block diagram showing a configuration of a second exemplary embodiment of the present invention.
FIG. 6 is a flowchart illustrating an operation of the second exemplary embodiment of the present invention.
FIG. 7 is a block diagram illustrating a configuration of a third exemplary embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the third embodiment of the present invention.
FIG. 9 is a diagram showing a specific example for explaining a conventional technique.
[Explanation of symbols]
10 Speech recognition device
100 Recognition unit
110 Recognition result storage unit
120 Reliability calculation unit
130 Result selection unit
140 Correction unit
150 Erroneous recognition result storage unit
160 Result filtering unit
170 Result presentation unit
20 Speech recognition device
230 Result selection unit
260 Result filtering unit
30 Speech recognition device
380 Arithmetic mean calculation unit

Claims

A speech recognition device characterized in that, when an utterance with the same content is input a plurality of times, a recognition result is selected from the recognition result candidates of each utterance using values obtained by normalizing the values indicating the probability of the recognition result candidates so that they can be compared between utterances.

A speech recognition device characterized in that, when an utterance with the same content is input a plurality of times, a reliability is calculated by normalizing, for each utterance, the likelihood indicating the certainty of each recognition result candidate obtained as a result of speech recognition for the utterance, and a recognition result is selected from the recognition result candidates of each utterance based on the reliability.

A speech recognition device characterized in that, when an utterance with the same content is input a plurality of times, a reliability is calculated by normalizing, for each utterance, the likelihood indicating the certainty of each recognition result candidate obtained as a result of speech recognition for the utterance, and a recognition result is selected based on the reliability from the recognition result candidates from which the candidates determined to be misrecognitions up to the previous utterance have been removed.

A speech recognition device comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of a plurality of recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized based on the likelihoods of the recognition result candidates; and a result selection unit that selects a recognition result from the stored recognition result candidates of each utterance based on the reliability.

A speech recognition device comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of a plurality of recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized based on the likelihoods of the recognition result candidates; a result selection unit that selects a recognition result from the stored recognition result candidates of each utterance based on the reliability; a misrecognition information storage unit that stores misrecognition results determined to be misrecognitions among the recognition results up to the previous utterance; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects a recognition result.

A speech recognition device comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of a plurality of recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized based on the likelihoods of the recognition result candidates; a misrecognition information storage unit that stores misrecognition results determined to be misrecognitions among the recognition results up to the previous utterance; a result filtering unit that removes the misrecognition results from the recognition result candidates of each utterance; and a result selection unit that selects a recognition result from the recognition result candidates of each utterance that remain after the misrecognition results have been removed.

A speech recognition device characterized in that, when an utterance with the same content is input a plurality of times, a reliability is calculated by normalizing, for each utterance, the likelihood indicating the certainty of each recognition result candidate obtained as a result of speech recognition for the utterance, an arithmetic mean of the reliability is further calculated for each identical recognition result candidate across the recognition result candidates of all utterances to obtain a composite reliability, and a recognition result is selected based on the composite reliability from the recognition result candidates from which the candidates determined to be misrecognitions up to the previous utterance have been removed.

A speech recognition device comprising: a recognition unit that recognizes speech for each utterance and calculates a likelihood indicating the certainty of each of a plurality of recognition result candidates; a recognition result storage unit that stores the recognition result candidates of each utterance and their likelihoods; a reliability calculation unit that calculates, for each utterance, a reliability that is a score normalized based on the likelihoods of the recognition result candidates; an arithmetic mean calculation unit that calculates an arithmetic mean of the reliability for each identical recognition result candidate across the recognition result candidates of all utterances to obtain a composite reliability; a result selection unit that selects a recognition result from the stored recognition result candidates of each utterance based on the composite reliability; a misrecognition information storage unit that stores misrecognition results determined to be misrecognitions among the recognition results up to the previous utterance; and a result filtering unit that removes the misrecognition results stored in the misrecognition information storage unit from the recognition results selected by the result selection unit and reselects a recognition result.

When multiple utterances with the same content are input, the value indicating the probability of the recognition result candidate is recognized from each recognition result candidate of each utterance using a value normalized so that it can be compared between each utterance A speech recognition method comprising selecting a result.

When an utterance of the same content is input a plurality of times, the likelihood indicating the likelihood of each recognition result candidate obtained as a result of speech recognition for the utterance is normalized for each utterance to calculate reliability, and each utterance is recognized. A speech recognition method characterized by selecting a recognition result from among result candidates based on reliability.

When an utterance of the same content is input a plurality of times, the likelihood indicating the likelihood of each recognition result candidate obtained as a result of speech recognition for the utterance is normalized for each utterance to calculate reliability, and until the previous utterance A speech recognition method characterized by selecting a recognition result based on reliability from recognition result candidates from which recognition result candidates that have been erroneously recognized during the period are removed.

A speech recognition method characterized by: performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the calculated reliability.

A speech recognition method characterized by: performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the calculated reliability; and removing, from the selected recognition result, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions, and reselecting a recognition result.
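
The select, filter, and reselect sequence recited above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: reliabilities are taken as already calculated and non-negative, each candidate is ranked by its best reliability over all utterances, and the names `rank_candidates` and `select_with_filter` are hypothetical.

```python
def rank_candidates(reliabilities_per_utterance):
    """Order candidates by their best reliability over all utterances, highest first."""
    best = {}
    for reliabilities in reliabilities_per_utterance:
        for candidate, score in reliabilities.items():
            best[candidate] = max(score, best.get(candidate, 0.0))
    return sorted(best, key=best.get, reverse=True)

def select_with_filter(reliabilities_per_utterance, misrecognized):
    """Select the top candidate, then reselect if it is a stored misrecognition result."""
    ranked = rank_candidates(reliabilities_per_utterance)
    selected = ranked[0] if ranked else None
    if selected in misrecognized:
        # Filtering step: drop stored misrecognition results and take the next best.
        remaining = [c for c in ranked if c not in misrecognized]
        selected = remaining[0] if remaining else None
    return selected

# Reliabilities of two utterances of the same content (made-up values).
stored = [
    {"Tokyo": 0.60, "Kyoto": 0.30, "Toyota": 0.10},
    {"Tokyo": 0.50, "Kyoto": 0.45, "Toyota": 0.05},
]
# "Tokyo" was output for the first utterance and judged to be a misrecognition.
print(select_with_filter(stored, misrecognized={"Tokyo"}))
```

Here the first selection would again be "Tokyo"; because it is stored as a misrecognition result, it is removed and "Kyoto" is reselected.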

A speech recognition method characterized by: performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; removing, from the stored recognition result candidates, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions; and selecting a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results have been removed.

A speech recognition method characterized in that, when an utterance of the same content is input a plurality of times, the likelihood indicating the certainty of each recognition result candidate obtained by speech recognition of each utterance is normalized for each utterance to calculate a reliability; a combined reliability is obtained by calculating, across the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; and a recognition result is selected on the basis of the combined reliability from among the recognition result candidates that remain after removing the candidates determined to be misrecognitions up to the previous utterance.

A speech recognition method characterized by: performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; obtaining a combined reliability by calculating, across the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the combined reliability; and removing, from the selected recognition result, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions, and reselecting a recognition result.
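
The combined reliability recited above can be sketched as an arithmetic mean taken per candidate over all utterances. The sketch assumes that a candidate absent from an utterance contributes a reliability of zero to the mean; the claim does not state how such candidates are treated, so this is an assumption. The names `combined_reliability` and `select_by_combined` are hypothetical.

```python
from collections import defaultdict

def combined_reliability(reliabilities_per_utterance):
    """Arithmetic mean of each candidate's reliability over all utterances.

    Assumption: a candidate missing from an utterance counts as reliability 0.
    """
    totals = defaultdict(float)
    for reliabilities in reliabilities_per_utterance:
        for candidate, score in reliabilities.items():
            totals[candidate] += score
    count = len(reliabilities_per_utterance)
    return {candidate: total / count for candidate, total in totals.items()}

def select_by_combined(reliabilities_per_utterance, misrecognized=frozenset()):
    """Select by combined reliability, then reselect if the result is a stored misrecognition."""
    combined = combined_reliability(reliabilities_per_utterance)
    ranked = sorted(combined, key=combined.get, reverse=True)
    selected = ranked[0] if ranked else None
    if selected in misrecognized:
        remaining = [c for c in ranked if c not in misrecognized]
        selected = remaining[0] if remaining else None
    return selected

# Made-up reliabilities: "Kyoto" is the top candidate in neither utterance.
stored = [
    {"Tokyo": 0.50, "Kyoto": 0.40, "Toyota": 0.10},
    {"Toyota": 0.50, "Kyoto": 0.45, "Tokyo": 0.05},
]
print(select_by_combined(stored, misrecognized={"Tokyo"}))
```

In this example "Kyoto" is ranked second in both utterances, yet it has the highest mean reliability and is selected. This matches the aim stated in the abstract of obtaining the correct answer even when neither the earlier nor the current recognition result is correct on its own.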

A program that causes a computer to execute a procedure of, when an utterance of the same content is input a plurality of times, selecting a recognition result from among the recognition result candidates of each utterance using values obtained by normalizing the value indicating the certainty of each recognition result candidate so that the values can be compared between utterances.

A program that causes a computer to execute: a procedure of, when an utterance of the same content is input a plurality of times, normalizing for each utterance the likelihood indicating the certainty of each recognition result candidate obtained by speech recognition of that utterance to calculate a reliability; and a procedure of selecting a recognition result from among the recognition result candidates of each utterance on the basis of the reliability.

A program that causes a computer to execute: a procedure of, when an utterance of the same content is input a plurality of times, normalizing for each utterance the likelihood indicating the certainty of each recognition result candidate obtained by speech recognition of that utterance to calculate a reliability; and a procedure of selecting a recognition result on the basis of the reliability from among the recognition result candidates that remain after removing the candidates determined to be misrecognitions up to the previous utterance.

A program that causes a computer to execute: a procedure of performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; and a procedure of selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the calculated reliability.

A program that causes a computer to execute: a procedure of performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the calculated reliability; and a procedure of removing, from the selected recognition result, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions, and reselecting a recognition result.

A program that causes a computer to execute: a procedure of performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of removing, from the stored recognition result candidates, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions; and a procedure of selecting a recognition result from among the recognition result candidates of each utterance that remain after the misrecognition results have been removed.

A program that causes a computer to execute: a procedure of, when an utterance of the same content is input a plurality of times, normalizing for each utterance the likelihood indicating the certainty of each recognition result candidate obtained by speech recognition of that utterance to calculate a reliability; a procedure of obtaining a combined reliability by calculating, across the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; and a procedure of selecting a recognition result on the basis of the combined reliability from among the recognition result candidates that remain after removing the candidates determined to be misrecognitions up to the previous utterance.

A program that causes a computer to execute: a procedure of performing speech recognition for each utterance and calculating and storing, for each of a plurality of recognition result candidates, a likelihood indicating its certainty; a procedure of calculating, for each utterance, a reliability that is a score normalized on the basis of the likelihoods of the recognition result candidates; a procedure of obtaining a combined reliability by calculating, across the recognition result candidates of all utterances, the arithmetic mean for each identical recognition result candidate; a procedure of selecting a recognition result from among the stored recognition result candidates of the utterances on the basis of the combined reliability; and a procedure of removing, from the selected recognition result, the stored misrecognition results, namely recognition results for utterances up to the previous one that were determined to be misrecognitions, and reselecting a recognition result.