JP3757638B2

JP3757638B2 - Voice recognition method, voice recognition apparatus, and recording medium recording voice recognition processing program

Info

Publication number: JP3757638B2
Application number: JP25013998A
Authority: JP
Inventors: 康永宮沢
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1998-09-03
Filing date: 1998-09-03
Publication date: 2006-03-22
Anticipated expiration: 2018-09-03
Also published as: JP2000081891A; US6338036B1

Abstract

When speech to be recognized is input to a device, this invention simply informs the user whether the speech was input in an appropriate state. The device comprises: a speech input part that outputs, as digitized speech data, speech to be recognized that a user utters as a plurality of words forming one set; a speech analysis part that analyzes the speech data and calculates the speech power and feature data; a speech section detection/determination part that detects an effective speech section based on the speech power obtained by the speech analysis part and determines, from the magnitude of the speech power and the duration of the effective speech section, whether the speech to be recognized was input in an appropriate state; a speech recognition processing part that performs recognition processing on the speech to be recognized; and an information output part that, immediately after the speech to be recognized is input, outputs information indicating that the speech is appropriate.

Description

【０００１】
【発明の属する技術分野】
本発明は、話者の発話する音声を認識する際、話者の音声が適正な状態で入力されたか否かを話者に報知するようにした音声認識方法及びそれを用いた音声認識装置並びに音声認識処理プログラムを記録した記録媒体に関する。
【０００２】
【従来の技術】
近年、音声認識技術は様々な分野で幅広く利用されてきている。特に最近では、日用品とも言える家電製品や、子供用の玩具にまで用いられている。
【０００３】
このように、特定のユーザが対象ではなく、不特定多数の幅広いユーザが使用すると考えられる機器に音声認識技術を用いるとなれば、ユーザの発話した音声を高い確率で認識できるように、ユーザに対し、適正な音声の入力の仕方など使い方について分かり易くガイドするなどして、使い勝手のよいものとすることが重要な課題である。
【０００４】
たとえば、幅広いユーザを対象とした音声認識を用いた機器の１つとして最近では音声時計なるものが開発されつつある。これは、時計に設けられたボタンなどを押すと、現在の時刻を「何時何分」というように音声で教えてくれるものである。
【０００５】
この音声時計は、暗闇の中でも簡単に現在の時刻を知ることができることから、真夜中に目を覚ましたとき現在時刻を知りたいときや、夜間に外出中、暗闇の中で時刻を知るときに便利なものであり、さらには、目の不自由な人にとっても大変便利なものとなる。また、このような音声時計は子供の玩具にも適用される可能性もある。
【０００６】
また、このような音声時計は、時刻を音声によって出力するだけでなく、現在時刻合わせやアラーム時刻の設定なども音声によって行うことができる。たとえば、現在時刻が午前６時３０分であれば、音声時計を現在時刻設定モードとして、ユーザが「午前」、「６時」、「３０分」というように、必要な単語を決められた順番に発話する。そして、音声時計側では、ユーザの発話した音声を認識し、その認識結果に基づいて時刻合わせ処理を行う。アラーム時刻の設定も同様であり、アラーム時刻設定モードとして希望のアラーム時刻を発話する。
【０００７】
このような操作により時刻合わせが行われるが、このとき、ユーザ側からすると、自分の発話した音声が適正な状態（認識処理を行う上で適正な状態）で入力されたかどうかが不安となることが多い。
【０００８】
これを解消するために、ユーザの発話した１単語ごとにその単語に対する認識結果を応答しながら音声入力する方式のものもある。たとえば、前述した発話内容例においては、ユーザが「午前」と発話し、それに対する認識結果として、装置側からは、「午前」という応答が返ってきて、続いて、ユーザが「６時」と発話すると、装置側からは、「６時」という応答が返り、さらに、ユーザが「３０分」と発話すると、装置側からは、「３０分」という応答が返ってくるというような動作を行う。なお、この場合、ユーザの発話した音声が不適切であって、認識ができなかった場合には、装置側からの応答がなかったり、「もう一度発話して下さい」といった応答がなされるようにすることもできる。
【０００９】
このように、ユーザの発話した１単語ごとにその認識結果を応答したり、認識できなかった場合には、それに対する何らかの応答が返ってくるというようにすれば、ユーザ側にとっては、自分の発話した内容が適切であるか否かがわかり、しかも、それがどのように認識されたかがわかるので安心感が得られ、使い勝手のよいものとなる。
【００１０】
【発明が解決しようとする課題】
しかしながら、前述のように、１単語ごとにそれを認識して応答するのでは、たとえば、時刻合わせというような１つの設定動作を行わせるために多くの時間を要する問題がある。また、このような音声認識技術を日用品や玩具など低価格が要求される機器に適用する場合、コストをできる限り低く抑えることが必要となってくるため、ＣＰＵの処理能力やメモリの容量には大ききな制約がある。したがって、装置側ではＣＰＵに大きな負担をかけたり、メモリを多く使う処理はできるだけ少なくすることが要求される。
【００１１】
これに対処するには、前述の時刻合わせを例に取れば、ユーザが１単語発話するとそれを認識してその認識結果を応答するというのではなく、ユーザに時刻合わせに必要な内容として、たとえば、「午前」・「６時」・「３０分」を１つの組とし、この１つの組を構成する単語を、１単語ごとに少し間をおきながら断続的に発話してもらい、その発話内容について音声認識するということが考えられる。この場合、１つの組を構成する複数の単語の１つ１つに対して装置側から、前述したような認識結果の応答はないので、確かに、時刻設定時間は短くできる。
【００１２】
しかし、このように複数の単語からなる比較的長い一連の音声を始めから最後まで発話することにより装置側に入力する方法では、前述したように、ユーザにとっては、自分の発話した１単語ごとの音声が、適正な状態で入力されたかどうかということが不安となって残る。したがって、ユーザの発話した音声が、機器に適正な状態で入力されたか否かを、面倒な処理を行うことなく、何らかの形でユーザに報知することが必要となってくる。
【００１３】
そこで本発明は、ユーザの発話した音声に対し、その音声が音声認識を行う上で適正に入力されたか否かを、簡単な処理を行うだけで、ユーザに報知可能とし、音声入力操作時の使い勝手を向上させることを目的としている。
【００１４】
【課題を解決するための手段】
前述した目的を達成するために、本発明の音声認識方法は、単語間に間を有して断続的に発声された複数の単語からなる組を認識する音声認識装置における音声認識方法であって、前記複数の単語は第１から第ｎ（ｎは正の整数）の単語群に属しており、前記第１から第ｎの単語群ごとに、前記複数の単語が属する全ての単語群に対して、時間的な長さの基準を設定する工程と、前記時間的な長さの基準を設定する工程の後に、前記単語間に間を有して断続的に発声された前記単語に対応する音声を入力してディジタル化された音声データを出力する音声入力工程と、前記音声入力工程から出力された音声データを所定時間ごとに分析し、当該所定時間ごとの音声パワーと特徴データを算出する音声分析工程と、前記音声分析工程によって算出された音声パワーに基づいて当該発声された単語に対応する音声の有効音声区間を検出し、当該有効音声区間の時間的な長さが前記単語群ごとに設定された基準内か否かと、当該有効音声区間内の音声パワーの大きさによって、前記発声された単語に対応する音声が適正な状態で入力されたか否かを判断し、適正であると判断した場合に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を出力する音声区間検出・判定工程と、前記音声区間検出・判定工程において出力された、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を受けると、前記単語間の間に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報を出力する情報出力工程と、前記情報出力工程により情報が出力された後に、前記音声区間検出・判定工程において適正な状態で入力されたと判断された前記発声された単語に対応する音声を認識する音声認識工程と、を有することを特徴とする。
【００１５】
また、本発明の音声認識方法において、前記音声区間検出・判定工程が、有効音声区間の時間的な長さによって前記音声が適正な状態で入力されたか否かを判断する処理は、前記音声パワーが所定のしきい値より大きくなる時刻を始端とし、続いて、前記音声パワーが前記しきい値より小さくなる時刻を終端とし、当該終端の時刻から所定時間の間、前記音声パワーが前記しきい値より大きくならなかった場合に、前記始端から前記終端までの区間を有効音声区間として検出し、当該検出された有効音声区間が、所定の基準内である場合に、前記音声が適正な状態で入力されたと判断することを特徴とする。
【００１６】
また、本発明の音声認識方法において、前記情報出力工程において出力される前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報は、信号音、光、音声メッセージ、表示画面上での表示の少なくとも１つであることを特徴とする。
【００１８】
また、本発明の音声認識装置は、単語間に間を有して断続的に発声された複数の単語からなる組を認識する音声認識装置であって、前記複数の単語は第１から第ｎ（ｎは正の整数）の単語群に属しており、前記第１から第ｎの単語群ごとに、前記複数の単語が属する全ての単語群に対して、時間的な長さの基準を設定する手段と、前記時間的な長さの基準を設定する手段によって時間的な長さの基準の設定が行われた後に、前記単語間に間を有して断続的に発声された前記単語に対応する音声を入力してディジタル化された音声データを出力する音声入力手段と、前記音声入力手段から出力された音声データを所定時間ごとに分析し、当該所定時間ごとの音声パワーと特徴データを算出する音声分析手段と、前記音声分析手段によって算出された音声パワーに基づいて当該発声された単語に対応する音声の有効音声区間を検出し、当該有効音声区間の時間的な長さが前記単語群ごとに設定された基準内か否かと、当該有効音声区間内の音声パワーの大きさによって、前記発声された単語に対応する音声が適正な状態で入力されたか否かを判断し、適正であると判断した場合に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を出力する音声区間検出・判定手段と、前記音声区間検出・判定手段で検出された前記音声に対する有効音声区間と、前記音声分析手段で算出された特徴データと、に基づいて前記音声を認識処理する音声認識処理手段と、前記音声区間検出・判定手段から出力された、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を受けると、前記単語間の間に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報を出力する情報出力手段と、を有し、前記音声認識処理手段は、前記情報出力手段により情報が出力された後に、前記音声区間検出・判定手段において適正な状態で入力されたと判断された前記発声された単語に対応する音声を認識処理することを特徴とする。
【００１９】
また、本発明の音声認識装置において、前記音声区間検出・判定手段が、有効音声区間の時間的な長さによって前記音声が適正な状態で入力されたか否かを判断する処理は、前記音声パワーが所定のしきい値より大きくなる時刻を始端とし、続いて、前記音声パワーが前記しきい値より小さくなる時刻を終端とし、当該終端の時刻から所定時間の間、前記音声パワーが前記しきい値より大きくならなかった場合に、前記始端から前記終端までの区間を有効音声区間として検出し、当該検出された有効音声区間が、所定の基準内である場合に、前記音声が適正な状態で入力されたと判断することを特徴とする。
【００２０】
また、本発明の音声認識装置において、前記情報出力手段が出力する前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報は、信号音、光、音声メッセージ、表示画面上での表示の少なくとも１つであることを特徴とする。
【００２２】
また、本発明の音声認識処理プログラムを記録した記録媒体は、単語間に間を有して断続的に発声された複数の単語からなる組を認識する音声認識装置における認識対象音声の入力状態報知プログラムを記録した記録媒体であって、前記複数の単語は第１から第ｎ（ｎは正の整数）の単語群に属しており、前記第１から第ｎの単語群ごとに、前記複数の単語が属する全ての単語群に対して、時間的な長さの基準を設定する工程と、前記時間的な長さの基準を設定する工程の後に、前記単語間に間を有して断続的に発声された前記単語に対応する音声を入力してディジタル化された音声データを出力する音声入力工程と、前記音声入力工程から出力された音声データを所定時間ごとに分析し、当該所定時間ごとの音声パワーと特徴データを算出する音声分析工程と、前記音声分析工程によって算出された音声パワーに基づいて当該発声された単語に対応する音声の有効音声区間を検出し、当該有効音声区間の時間的な長さが前記単語群ごとに設定された基準内か否かと、当該有効音声区間内の音声パワーの大きさによって、前記発声された単語に対応する音声が適正な状態で入力されたか否かを判断し、適正であると判断した場合に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を出力する音声区間検出・判定工程と、
前記音声区間検出・判定工程において出力された、前記発声された単語に対応する音声が適正な状態で入力されたことを示す信号を受けると、前記単語間の間に、前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報を出力する情報出力工程と、前記情報出力工程により情報が出力された後に、前記音声区間検出・判定工程において適正な状態で入力されたと判断された前記発声された単語に対応する音声を認識する音声認識工程と、を音声認識装置に実行させるための音声認識処理プログラムを記録した記録媒体である。
【００２３】
また、本発明の音声認識処理プログラムを記録した記録媒体において、前記音声区間検出・判定工程が、有効音声区間の時間的な長さによって前記音声が適正な状態で入力されたか否かを判断する処理は、前記音声パワーが所定のしきい値より大きくなる時刻を始端とし、続いて、前記音声パワーが前記しきい値より小さくなる時刻を終端とし、当該終端の時刻から所定時間の間、前記音声パワーが前記しきい値より大きくならなかった場合に、前記始端から前記終端までの区間を有効音声区間として検出し、当該検出された有効音声区間が、所定の基準内である場合に、前記音声が適正な状態で入力されたと判断することを特徴とする。
【００２４】
また、本発明の音声認識処理プログラムを記録した記録媒体において、前記情報出力工程において出力される前記発声された単語に対応する音声が適正な状態で入力されたことを示す情報は、信号音、光、音声メッセージ、表示画面上での表示の少なくとも１つであることを特徴とする
【００２６】
本発明は、ユーザの入力した認識対象音声が適正な状態で入力されたか否かを簡単な処理を行うだけで話者に報知することを可能とし、使い勝手の向上を図るものである。これを実現するために、ユーザの発話した認識対象音声における有効音声区間の時間的な長さと、当該有効音声区間内の音声パワーの大きさによって、前記認識対象音声が適正な状態で入力されたか否かを判断し、適正であると判断した場合には、当該認識対象音声の入力直後に適正であることを示す情報を発するようにしている。これにより、ユーザは自分の発話した音声が適正な状態で入力されたか否かを簡単に知ることができ、音声の入力操作を行う際、ユーザに対し、自分の入力した音声が本当に適正な状態で入力されたのかどうかという不安感を与えることがなくなる。
【００２７】
また、このような適正な状態で入力されたか否かの判定対象となる認識対象音声は、複数の単語を１つの組として発話された音声であって、この１つの組を構成するそれぞれの単語に対するそれぞれの音声間に、各単語の区切りを示す間を有して発話された音声を対象としている。たとえば、現在時刻などの時刻設定を音声により設定可能な時計を例に取れば、「午前」・「何時」・「何分」というように複数の単語を１つの組とし、それを構成する各単語間に区切りとしての間をおきながら断続的に発話される音声を対象としている。
【００２８】
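Thus, when the words forming one set are input one-sidedly by the user, with no recognition result returned by the device between words, the user becomes uneasy about whether each word was really input in an appropriate state (appropriate for recognition).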
【００２９】
これを解消するために、複数の単語を１つの組として発話される状況の場合は、それぞれの単語間に装置側から何らかの情報を発信することで、ユーザに安心感を与えることができる。
【００３０】
その情報としては、各単語間の区切り時間内に瞬時的に発せられる信号音（たとえば「ピッ」というような信号音）、発光ダイオード（ＬＥＤ）などにより瞬時的に発光する光、音声メッセージ（たとえば「はい」というようなごく短い音声メッセージ）、液晶ディスプレイ（ＬＣＤ）などの表示部を有する装置にあっては、ＬＣＤ上での「ＯＫ」などの簡単な表示）などが考えられる。ユーザは、自分の発話した１単語ごとの音声のあとに、このような簡単な情報が装置側から瞬時的に発せられることによって、自分の発話した音声が適正な状態で入力されたことがわかるので、音声入力操作に対する安心感が得られる。
【００３１】
また、１つの組を構成する複数の単語は、第１から第ｎ（ｎは正の整数）までの単語群に属し、前述した有効音声区間の時間的な長さを判定する基準は、それぞれの単語群ごとに設定するようにしている。これは、各単語群に属する単語の長さ（発話に必要な時間的長さ）が、単語群間で大きく異なる可能性があるからである。したがって、有効音声区間の時間的な長さを判定する基準を、それぞれの単語群ごとに設定しておくことによって、各単語群に属する単語に対し適正な有効音声区間の長さの判定が可能となる。
【００３２】
また、本発明の音声認識装置は、以上説明したような認識対象音声の入力状態報知方法を採用することにより、使い勝手をよくすることができ、この種の機器の取り扱いに不慣れなユーザでも容易に取り扱うことができるようになる。
【００３３】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照しながら説明する。なお、この実施の形態では、音声認識技術を用いた機器として前述した音声時計を例に取り、この音声時計において、午前何時何分というような時刻合わせをする例について説明する。ここでは、具体例として、「午前６時３０分」を設定することを考える。
【００３４】
図１は本発明が適用される音声時計の外観構成を示すものであって、この音声時計は、液晶表示部など時刻の表示部分を持たない音声メッセージだけの音声時計であり、筺体１には音声出力手段としてのスピーカ２とユーザからの音声コマンドが入力されるマイクロホン３が設けられる。さらに、現在時刻合わせを行ったりアラーム時刻合わせを行ったりするときにモード設定を行うためのモード設定部４、現在時刻を知りたいときに押される時刻ボタン５などを少なくとも有している。これ以外にも、機能によっては様々な構成要素が設けられるが、本発明の要旨とは直接関係しない部分の図示およびその説明は省略する。
【００３５】
このような音声時計は、現在時刻が正確に合わせられていれば、時刻ボタン５がユーザによって押されると、その時点の現在時刻として、たとえば、「午前８時３０分」などと装置側から音声メッセージによる時刻が出力される。
【００３６】
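FIG. 2 is a block diagram showing the configuration of the speech recognition apparatus used in such a voice clock. It comprises a speech input unit 11, a speech analysis unit 12, a speech section detection/determination unit 13, a speech recognition processing unit 14, an information output unit 15, model data 16 for speech recognition, and output data 17 for outputting the various information issued by the device (the response corresponding to a recognition result, the voice messages used to prompt the user, and the information output when the speaker's speech is judged appropriate).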
【００３７】
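The speech input unit 11 includes the aforementioned microphone 3, an amplifier (not shown), an A/D converter (not shown), and the like. It takes in the speech uttered by the user through the microphone 3, amplifies it, performs A/D conversion, and outputs it as digitized speech data at, for example, 8 kHz and 10 bits.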
【００３８】
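The speech analysis unit 12 analyzes the speech data output by the speech input unit 11 in short frames of, for example, about 20 msec (with a shift of about 10 msec), and calculates the speech power and feature data (for example, a 10-dimensional LPC cepstrum) for each such frame.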
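The framing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not define how speech power is computed, so mean-square amplitude per frame is assumed here, the function name `frame_powers` is hypothetical, and the LPC cepstrum feature extraction is omitted.

```python
SAMPLE_RATE_HZ = 8000  # from the text: 8 kHz sampling
FRAME_MS = 20          # from the text: about 20 msec analysis frame
SHIFT_MS = 10          # from the text: about 10 msec shift


def frame_powers(samples, sample_rate=SAMPLE_RATE_HZ,
                 frame_ms=FRAME_MS, shift_ms=SHIFT_MS):
    """Return one power value per analysis frame.

    Power is taken as the mean of the squared samples in the frame,
    which is an assumption; the source only says "speech power".
    """
    frame_len = sample_rate * frame_ms // 1000  # 160 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000      # 80 samples at 8 kHz
    powers = []
    # Slide the frame by `shift` samples; drop a trailing partial frame.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers
```

With these settings each frame holds 160 samples and consecutive frames overlap by half, so one second of input yields 99 power values.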
【００３９】
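The speech section detection/determination unit 13 uses the speech power calculated by the speech analysis unit 12 to detect an effective speech section. If the duration of the effective speech section lies within a predetermined range (longer than L1 and shorter than L2, where L1 < L2) and the maximum speech power lies between two thresholds (greater than th2 and smaller than th3, where th2 < th3), the effective speech section is judged to be within the normal range, and the speech is judged to have been input in a state appropriate for speech recognition. When the speech is judged appropriate, the unit outputs a signal indicating this a fixed time (L4) after the end of the effective speech section.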
【００４０】
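Here, the effective speech section is defined as follows. The point at which the user's speech power rises above a threshold (th1) is taken as the start of the speech section. If the speech power then falls below th1 and does not rise above th1 again within a predetermined time (L3) from that point, the point at which it fell below th1 is taken as the end of the speech section. The span from this start to this end is called the effective speech section.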
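The start and end rule for the effective speech section could be sketched as below. This is an illustrative sketch under stated assumptions: power is given as a list of per-frame values, the wait time L3 is expressed as a number of frames (`wait_frames`), and the function name is hypothetical.

```python
def find_effective_section(powers, th1, wait_frames):
    """Return (start, end) frame indices of the effective speech section.

    start: first frame whose power exceeds th1.
    end:   first later frame whose power is below th1 and after which the
           power does not exceed th1 again for `wait_frames` frames (L3).
    Returns None if no section can be confirmed from the data given.
    """
    start = None
    for i, p in enumerate(powers):
        if start is None:
            if p > th1:
                start = i
            continue
        if p < th1:
            following = powers[i + 1:i + 1 + wait_frames]
            if len(following) < wait_frames:
                return None  # not enough data yet to confirm the end
            if all(q <= th1 for q in following):
                return (start, i)
            # Power rose again within L3: this was only a short dip,
            # so keep scanning for the real end.
    return None
```

A short dip inside a word is therefore bridged, while the longer pause between words closes the section. With a 10 msec shift, the duration in milliseconds would be `(end - start) * 10`.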
【００４１】
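The speech recognition processing unit 14 performs speech recognition using the model data 16 for speech recognition, based on the feature data within the detected effective speech section (that is, the portion of the feature data obtained by the speech analysis unit 12 that falls within the effective speech section).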
【００４２】
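As described above, the information output unit 15 uses the output data 17 to create and output response speech corresponding to recognition results and voice messages that prompt the user. In addition, when it receives the signal from the speech section detection/determination unit 13 indicating that the input was appropriate, it outputs information indicating that the speech was input properly (in this embodiment, a “beep” signal tone).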
【００４３】
このような構成された音声時計において、何らかの時刻設定（ここでは現在時刻とする）を行う例について説明する。ここでは、設定する時刻としては、前述したように、午前６時３０分であるとする。このとき、音声時計（装置という）を現在時刻設定モードとする。
【００４４】
図３（ａ）はユーザの発話した「午前」・「６時」・「３０分」の音声波形であり、音声入力部１１によってＡ／Ｄ変換されたあとの出力であるとする。この図３（ａ）からもわかるように、ある時刻に時刻合わせをするような場合、たとえば、「午前６時３０分」という時刻合わせ内容を発話する際、認識率を高くするために、時刻合わせ内容を構成する各単語（「午前」・「６時」・「３０分」）を１単語づつ、各単語間に少しの間（ΔＴ１，ΔＴ２時間）をおいて発話するようにユーザ側に予め知らせておくとよい結果が得られる。
【００４５】
このように、この実施の形態において用いられる音声は、複数の単語を１つの組として発話される音声であって、この１つの組を構成するそれぞれの単語に対するそれぞれの音声間に、各単語の区切りとしての間を有して断続的に発話される音声であるとする。
【００４６】
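Here the device is in the current time setting mode. In this mode (and likewise in the alarm time setting mode), the order of utterance is fixed: first “AM” or “PM”, second the hour, third the minute. For convenience, the part uttered first is called a word belonging to the first word group, the part uttered second a word belonging to the second word group, and the part uttered third a word belonging to the third word group.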
【００４７】
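For speech data such as that shown in FIG. 3(a), the speech analysis unit 12 divides the data into frames of, for example, 20 msec, performs speech analysis, and obtains the speech power and feature data for each frame. The feature data is used in the speech recognition processing; the determination described here, of whether the user's speech is appropriate, uses the speech power.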
【００４８】
図３（ｂ）は各フレームごとに求められた音声パワーを曲線で結んだ音声パワー曲線を示すものである。なお、図３（ａ）では、ユーザが「午前」・「６時」・「３０分」というように、１つの組を構成する単語を順番にすべて発話したときに得られる音声データであるが、本発明が行う処理は、ユーザが「午前」と発話すると、その「午前」の発話内容について、適正であるか否かを判定する処理を行い、適正であれば装置側から「ピッ」という信号音を出し、その後に、ユーザが「６時」と発話することにより、その「６時」の発話内容について、適正であるか否かを判定する処理を行い、適正であれば装置側から「ピッ」という信号音を出し、続いて、「３０分」と発話することにより、その「３０分」の発話内容について、適正であるか否かを判定する処理を行うというように、１つの組を構成するそれぞれの単語についてその単語が適正な状態で入力されたか否かの判定処理を行う。以下、それぞれの単語ごとの処理について詳細に説明する
また、前述の有効音声区間の時間的長さを判定するための基準となる時間的長さＬ１とＬ２は、実際には、前述した第１の単語群、第２の単語群、第３の単語群のそれぞれの単語群ごとに設定されるものである。
【００４９】
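In this case the user's utterance follows the fixed pattern “AM”, “hour”, “minute”. The words uttered in the “AM” position form the first word group; here the words belonging to it are “AM” and “PM”. The words uttered in the “hour” position form the second word group; here they are the words expressing hours, such as “0 o'clock”, “1 o'clock”, and “2 o'clock”. The words uttered in the “minute” position form the third word group; here they are the words expressing minutes, such as “0 minutes”, “1 minute”, and “2 minutes”. Moreover, the order of utterance is fixed: first a word from the first word group (for example “AM”), then one from the second (for example “6 o'clock”), then one from the third (for example “30 minutes”), and the device performs recognition on each input word group in that order.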
【００５０】
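Accordingly, better results are obtained when L1 and L2 are set separately for each of the first to third word groups. Below, the durations set for the first word group are denoted L11 and L21 (L11 < L21), those for the second word group L12 and L22 (L12 < L22), and those for the third word group L13 and L23 (L13 < L23).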
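The per-word-group bounds described above could be held in a small lookup table. The following is a minimal sketch: the millisecond values are placeholders chosen only for illustration (the text gives no numbers), and the names `DurationCriterion`, `CRITERIA`, and `duration_ok` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationCriterion:
    lower_ms: int  # L1x: the section must be longer than this
    upper_ms: int  # L2x: the section must be shorter than this

    def accepts(self, duration_ms: float) -> bool:
        # Strict inequalities, following L1x < T < L2x in the text.
        return self.lower_ms < duration_ms < self.upper_ms


# Placeholder values only; the source does not state actual durations.
CRITERIA = {
    1: DurationCriterion(lower_ms=200, upper_ms=900),   # L11, L21: "AM"/"PM"
    2: DurationCriterion(lower_ms=200, upper_ms=1200),  # L12, L22: hour words
    3: DurationCriterion(lower_ms=250, upper_ms=1500),  # L13, L23: minute words
}


def duration_ok(word_group: int, duration_ms: float) -> bool:
    """Check a section's duration against the criterion of its word group."""
    return CRITERIA[word_group].accepts(duration_ms)
```

Because the utterance order is fixed, the device would know which word group applies to each incoming word simply from its position in the set.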
【００５１】
まず、ユーザが第１の単語群に属する単語として「午前」と発話すると、音声分析部１２によって、前述したように、たとえば２０msecごとのフレームに区切って音声分析を行い、各フレームごとに音声パワーと特徴データを求める。
【００５２】
そして、ユーザの発話した「午前」に対して得られた音声パワー曲線から、「午前」に対する音声データの有効音声区間Ｔ１を求める。
【００５３】
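First, with the preset threshold th1 as the reference, the time at which the speech power obtained from the user's utterance first exceeds th1 is taken as the start of the speech section for “AM”. As FIG. 3(b) shows, the power exceeds th1 at time t1, so t1 is taken as the start of the speech section for “AM” (start t1).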
【００５４】
Next, the time at which the speech power for “AM” falls below th1 is found; let this be t2. If the speech power does not exceed th1 again even after a predetermined fixed time L3 has elapsed from t2, the end of the speech section T1 for “AM” is taken to be t2 (end t2).
【００５５】
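The section between start t1 and end t2 obtained in this way is the effective speech section T1, with t1 as its start and t2 as its end. The fixed time L3 is set to a very short time; specifically, it is far shorter than the pauses ΔT1 and ΔT2 that separate adjacent words.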
【００５６】
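In this way the effective speech section T1 for “AM”, which belongs to the first word group, is obtained. Next, it is determined whether the duration and the speech power of this effective speech section T1 are within preset ranges.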
【００５７】
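That is, as described above, if the duration of the effective speech section T1 (also denoted T1) satisfies L11 < T1 < L21 with respect to the preset durations L11 and L21, and the maximum speech power m1 within T1 satisfies th2 < m1 < th3 with respect to the preset thresholds th2 and th3, the extracted effective speech section T1 is judged to be within the normal range, and the user's utterance “AM” is judged to have been input in a state appropriate for speech recognition.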
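The two-part condition above (duration within its bounds and peak power within its bounds) can be written as a single predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the units of duration and power are whatever the caller uses consistently.

```python
def is_input_appropriate(section_powers, duration, l1, l2, th2, th3):
    """Return True when both the duration and the peak power are in range.

    section_powers: per-frame power values inside the effective section
    duration:       length of the section, in the same unit as l1 and l2
    l1, l2:         duration bounds for the word group (e.g. L11, L21)
    th2, th3:       lower and upper bounds on the peak power
    """
    if not section_powers:
        return False
    peak = max(section_powers)  # m1 in the text
    duration_in_range = l1 < duration < l2
    power_in_range = th2 < peak < th3
    return duration_in_range and power_in_range
```

Only when this returns true would the device go on to emit the notification signal; otherwise it stays silent.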
【００５８】
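When the effective speech section T1 for “AM”, which belongs to the first word group, is thus judged to be within the prescribed range (L11 < T1 < L21 and th2 < m1 < th3), the speech section detection/determination unit 13 outputs to the information output unit 15, a time L4 after the end t2 of the effective speech section T1 as shown in FIG. 3(c), a signal s1 indicating that the speech (here “AM”) was input in an appropriate state.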
【００５９】
情報出力部１５はこの信号ｓ１を受けると、図３（ｄ）に示すように、予め決められた瞬時的な情報として信号音を出力する。この信号音は、ユーザの発話した音声が認識を行うに適正な状態で入力された音声であることをユーザに対して報知するもので、種々の報知手段が考えられるが、この実施の形態では、「ピッ」という瞬時的な信号音を出力する。
【００６０】
すなわち、ユーザが「午前」と発話してそれが適正であると判断されると、ユーザの発話した「午前」の後につづいて「ピッ」という信号音が装置側から発せられる。これにより、ユーザは自分の発話した「午前」という音声が適正な状態で装置側に入力されたということがわかる。
【００６１】
つづいて、ユーザが第２の単語群に属する「６時」と発話すると、その音声データに対する音声パワー曲線から、「６時」に対する音声データの有効音声区間Ｔ２を求める。
【００６２】
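First, with the preset threshold th1 as the reference, the time at which the speech power obtained from the user's utterance first exceeds th1 is taken as the start of the speech section for “6 o'clock”. As FIG. 3(b) shows, the power exceeds th1 at time t3, so t3 is taken as the start of the speech section for “6 o'clock” (start t3).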
【００６３】
続いて、「６時」という単語に対する音声パワーが、しきい値ｔｈ１より小さくなる時刻を調べ、その時刻がｔ４であるとする。そして、この時刻ｔ４から予め定められたある一定時間Ｌ３が経過したのちにも音声パワーがしきい値ｔｈ１を再び越えなければ、「６時」に対する音声区間Ｔ２の終端は、時刻ｔ４であるとする。この時刻ｔ４を音声区間の終端とする（終端ｔ４という）。
【００６４】
このように求められた始端ｔ３と終端ｔ４の間の区間を有効音声区間Ｔ２とし、ｔ３を有効音声区間Ｔ２の始端とし、ｔ４を有効音声区間Ｔ２の終端とする。次に、この有効音声区間Ｔ２の時間的長さが、予め設定された範囲内にあるか否かを判断する。この場合、Ｌ１２＜Ｔ２＜Ｌ２２か否かを判断する。また、その音声区間Ｔ２内の音声パワーの最大値ｍ２がしきい値ｔｈ２，ｔｈ３の範囲内に入っているか否かを判断する。そして、これらの条件が成立すると、抽出された有効音声区間Ｔ２は正常な範囲内にあると判断され、その音声（この場合、「６時」）は音声認識処理を行う上で適正に入力されたと判断する。
【００６５】
このようにして、第２の単語群に属する「６時」に対する有効音声区間Ｔ２が正常な範囲内であると判断されると、音声区間検出・判定部１３は、図３（ｃ）に示すように、その有効音声区間Ｔ２の終端ｔ４からＬ４時間経過後に、情報出力部１５に対し、当該有効音声区間Ｔ２が正常な範囲内であったことを示す信号ｓ２を出力する。
【００６６】
情報出力部１５はこの信号ｓ２を受けると、図３（ｄ）に示すように、適正であることを示す情報として、前述したように、「ピッ」という瞬時的な信号音を出力する。
【００６７】
すなわち、ユーザが「午前」に続いて「６時」と発話し、それが適正であると判断されると、「６時」の後につづいて「ピッ」という信号音が装置側から発せられる。これにより、ユーザは自分の発話した「６時」という音声が装置側に適正な状態で入力されたということがわかる。
【００６８】
続いて、ユーザが第３の単語群に属する「３０分」と発話すると、その音声データに対する音声パワー曲線から、「３０分」に対する音声データの有効音声区間Ｔ３を求める。
【００６９】
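First, as FIG. 3(b) shows, the speech power exceeds the threshold th1 at time t5, so t5 is taken as the start of the speech section T3 for “30 minutes” (start t5). Next, the time at which the speech power for the word “30 minutes” falls below th1 is found; let this be t6. If the speech power does not exceed th1 again even after the fixed time L3 has elapsed from t6, the end of the speech section for “30 minutes” is taken to be t6 (end t6).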
【００７０】
このように求められた始端ｔ５と終端ｔ６の間の区間を有効音声区間Ｔ３とし、ｔ５を有効音声区間Ｔ３の始端とし、ｔ６を有効音声区間Ｔ３の終端とする。次に、この有効音声区間Ｔ３の時間的長さが、予め設定された範囲内にあるか否かを判断する。この場合、Ｌ１３＜Ｔ３＜Ｌ２３か否かを判断する。また、その有効音声区間Ｔ３内の音声パワーの最大値ｍ３がしきい値ｔｈ２，ｔｈ３の範囲内に入っているか否かを判断する。そして、これらの条件が成立すると、抽出された有効音声区間Ｔ３は正常な範囲内にあると判断され、その音声（この場合、「３０分」）が適正な状態で入力されたと判断する。
【００７１】
このようにして、第３の単語群に属する「３０分」に対する有効音声区間Ｔ３が正常な範囲内であると判断されると、音声区間検出・判定部１３は、図３（ｃ）に示すように、その有効音声区間Ｔ３の終端ｔ６からＬ４時間経過後に、情報出力部１５に対し、当該有効音声区間Ｔ３が正常な範囲内であったことを示す信号ｓ３を出力する。
【００７２】
情報出力部１５はこの信号ｓ３を受けると、適正であることを示す情報として、前述したように、「ピッ」という瞬時的な信号音を出力する。
【００７３】
すなわち、ユーザが「午前」、「６時」に続いて「３０分」と発話し、それが適正な状態で入力されたと判断されると、「３０分」の後につづいて「ピッ」という信号音が装置側から発せられる。これにより、ユーザは自分の発話した「３０分」が装置側に適正な状態で入力されたと判断できる。
【００７４】
このように、ユーザが、「午前」、「６時」、「３０分」と発話した場合、それぞれの単語に対する音声が適正な状態で入力されたと判断されると、「午前」、「ピッ」、「６時」、「ピッ」、「３０分」、「ピッ」というように、ユーザの発話した音声のあとに装置側から「ピッ」が応答されるので、ユーザはそれを聞くことにより自分の発話した音声が適正な状態で入力されたことがわかり、安心感が得られる。
【００７５】
なお、前述のＬ３は第１から第３の単語群において共通の時間として説明したが、これもＬ１，Ｌ２と同様に、それぞれの単語群ごとに適当な時間を設定するようにしてもよい。
【００７６】
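Examples in which an effective speech section (here T1) does not satisfy the above conditions are shown in FIGS. 4(a)(b) and 4(c)(d). FIGS. 4(a)(b) show a case in which the maximum value m1 of the effective speech section T1 exceeds the threshold th3 and the duration of T1 is shorter than L11. FIGS. 4(c)(d) show a case in which the maximum value m1 is smaller than the threshold th2 and the duration of T1 is longer than L21.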
【００７７】
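The example of FIGS. 4(a)(b) corresponds to speech that is too loud and uttered extremely quickly. Speech uttered in this way is unlikely to be recognized correctly, so the input speech is judged not appropriate.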
【００７８】
The example of FIGS. 4(c)(d) corresponds to speech that is too quiet and extremely drawn out. Speech uttered in this way is also unlikely to be recognized correctly, so the input speech is judged not appropriate.
【００７９】
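In both FIGS. 4(a)(b) and 4(c)(d), neither the speech power nor the duration of the effective speech section satisfies its condition; however, the input speech is also judged not appropriate when only one of the two fails its condition.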
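To make the failure cases above concrete, the sketch below lists which condition or conditions an input fails. It is an illustration only: the source describes the device as simply withholding the notification, not as reporting reasons, so the reason labels and the function name `rejection_reasons` are hypothetical.

```python
def rejection_reasons(duration, peak_power, l1, l2, th2, th3):
    """List every condition the input fails; an empty list means appropriate."""
    reasons = []
    if duration <= l1:
        reasons.append("too short")   # e.g. spoken very quickly
    elif duration >= l2:
        reasons.append("too long")    # e.g. very drawn-out speech
    if peak_power <= th2:
        reasons.append("too quiet")
    elif peak_power >= th3:
        reasons.append("too loud")
    return reasons
```

With bounds `l1=20, l2=100, th2=4, th3=10`, a case like FIGS. 4(a)(b) would yield both "too short" and "too loud", while an input failing only one condition is still rejected, matching the rule that either failure alone is enough.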
【００８０】
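When the user's speech is judged not to have been input in an appropriate state, the “beep” tone is not emitted. This lets the user know that the speech was not appropriate. Because the device stays silent, the user then re-enters the speech. Alternatively, when input was not appropriate, the device can notify the user with a voice message or some other signal prompting re-entry.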
【００８１】
以上説明したように、この実施の形態は、装置側に対し、「午前」、「６時」、「３０分」というような複数の単語から構成される内容を、１単語発話するごとに少し間をおいて次の単語を入力するという断続的な発話によって音声の入力を行うことで、この入力音声を認識させて、それに対応する動作（現在時刻設定など）を行わせるような場合を例にしている。
【００８２】
この場合、ユーザが、まず、「午前」と発話すると、その音声データから得られた有効音声区間の時間的長さと音声パワーとに基づいて、その音声が適正な状態で入力されたか否かを判定し、適正な音声であると判定した場合には、適正であることを示す「ピッ」というような瞬時的な信号音を発するようにしている。これにより、ユーザは、自分が発した音声が正常な状態で入力されたか否かを装置側からの信号音で知ることができ、音声入力操作を不安感を抱くことなく行うことができる。
【００８３】
このように、本発明は、装置との間で対話形式で音声を入力する方式でなく、１つの組に存在する複数の単語を断続的に続けて入力するような場合に特に有効なものとなる。
【００８４】
なお、本発明は以上説明した実施の形態に限定されるものではなく、本発明の要旨を逸脱しない範囲で種々変形実施可能となるものである。
【００８５】
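For example, the embodiment above uses the maximum speech power as one of the conditions for judging whether an effective speech section is appropriate, but the average speech power within the effective speech section may be used for this judgment instead.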
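The variation described above, judging by average rather than maximum power, amounts to swapping the statistic that is compared against the thresholds. A minimal sketch follows; the function name and the `use_mean` flag are hypothetical, and in practice the thresholds used with the mean would presumably need to differ from those used with the peak.

```python
def power_in_range(section_powers, th2, th3, use_mean=False):
    """Check section power against th2 < value < th3.

    By default the peak power is tested, as in the main embodiment;
    with use_mean=True the average power over the section is tested.
    """
    if not section_powers:
        return False
    if use_mean:
        value = sum(section_powers) / len(section_powers)
    else:
        value = max(section_powers)
    return th2 < value < th3
```

The example values in the test show why the choice matters: a section with one loud frame can fail the peak test yet pass the mean test.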
【００８６】
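Also, the embodiment above uses a signal tone such as a “beep” as the information indicating that the user's speech is appropriate, but the information is not limited to such a tone. For example, a light-emitting diode may be lit, or a short spoken response such as “yes” may be given. Furthermore, on a device with a display such as a liquid crystal display, an indication such as “OK” may be shown on the display when the speech is appropriate, immediately after the speaker's input.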
【００８７】
また、前述の実施の形態では、音声認識技術を用いた時計における時刻合わせを例にとって説明したが、本発明は、時計以外の機器にも適用できることは勿論である。
【００８８】
また、前述の実施の形態では、３つの単語を１組として入力した例を示したが、１組を構成する単語数は３つに限られるものでないことは勿論である。
【００８９】
また、以上説明した本発明の認識対象音声の入力状態報知処理を行う処理プログラムは、フロッピィディスク、光ディスク、ハードディスクなどの記録媒体に記録させておくことができ、本発明はその記録媒体をも含むものである。また、ネットワークから処理プログラムを得るようにしてもよい。
【００９０】
以上説明したように本発明によれば、話者の発話した認識対象音声における有効音声区間の時間的な長さと、当該有効音声区間内の音声パワーの大きさによって、前記認識対象音声が適正な状態で入力されたか否かを判断し、適正であると判断した場合には、当該認識対象音声の入力直後に適正であることを示す情報を発するようにしている。これにより、ユーザが装置に対して音声の入力操作を行う際、自分の入力した音声が本当に適正な状態で入力されたのかどうかという不安感を抱くことがなくなり、音声入力を行う際の操作性の向上を図ることができる。
【００９１】
特に、認識対象音声が、複数の単語を１つの組として発話される音声であって、このような音声を入力する際に効果が得られる。たとえば、現在時刻などの時刻設定を音声により設定可能な時計を例に取れば、「午前」・「何時」・「何分」というように複数の単語を１つの組とし、それを構成する各単語間に区切りとしての間をおきながら断続的に発話されるような場合である。このように、複数の単語を１つの組としてそれぞれの単語が断続的に発話される場合、各単語を発話したあとに、装置側から瞬時的に発せられる信号音などが返ってくるようにすることで、ユーザは、自分の発話した音声が適正な状態で入力されたことが即座にわかり、音声入力操作に対する安心感が得られる。
【００９２】
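Also, because only momentary information is issued to indicate that the user's speech was input in an appropriate state, the processing is lighter and the processing time is considerably shorter than, for example, recognizing each word and responding with the recognition result as is.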
【００９３】
そして、このような認識対象音声の入力状態報知方法を採用した音声認識装置は、使い勝手のよいものとなり、この種の機器の取り扱いに不慣れなユーザでも容易に取り扱うことができるようになり、また、全体的な処理を軽いものとすることができるので、ＣＰＵやメモリに低コストなものが使用でき、装置そのものの価格も低コスト化が図れる。
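[Brief description of the drawings]
[FIG. 1] A diagram schematically showing the external appearance of the voice clock used in the embodiment of the present invention.
[FIG. 2] A block diagram illustrating the schematic configuration of the speech recognition apparatus portion used in the voice clock shown in FIG. 1.
[FIG. 3] A time chart illustrating the input state notification processing for speech to be recognized when the time is set on the voice clock shown in FIG. 1.
[FIG. 4] A diagram illustrating examples in which the input speech is not appropriate in the embodiment of the present invention.
[Explanation of reference numerals]
1 Voice clock housing
2 Speaker
3 Microphone
4 Mode setting unit
5 Time button
11 Speech input unit
12 Speech analysis unit
13 Speech section detection/determination unit
14 Speech recognition processing unit
15 Information output unit
16 Model data for speech recognition
17 Output data
[0001]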
[Technical field of the invention]
The present invention relates to a speech recognition method that, when recognizing speech uttered by a speaker, notifies the speaker whether the speech was input in an appropriate state, to a speech recognition apparatus using the method, and to a recording medium on which a speech recognition processing program is recorded.
[0002]
[Prior art]
In recent years, speech recognition technology has been widely used in various fields. Especially recently, it has been used for household appliances that can be called daily necessities and toys for children.
[0003]
When speech recognition technology is used in devices intended not for a specific user but for a broad, unspecified range of users, it is important to make the device easy to use, for example by giving users clear guidance on how to input speech properly, so that the user's speech can be recognized with high probability.
[0004]
For example, a so-called voice clock has recently been under development as one device that uses speech recognition for a wide range of users. When a button or the like on the clock is pressed, it announces the current time by voice, in the form “such-and-such hour, such-and-such minute”.
[0005]
Because this voice clock makes it easy to learn the current time even in the dark, it is convenient when you wake in the middle of the night and want to know the time, or when you are out at night and need the time in the dark, and it is also very convenient for visually impaired people. Such a voice clock may also be applied to children's toys.
[0006]
Such a voice clock can not only output the time by voice but also accept the current time setting and the alarm time setting by voice. For example, if the current time is 6:30 AM, the voice clock is put into the current time setting mode and the user utters the necessary words in a predetermined order, such as “AM”, “6 o'clock”, “30 minutes”. The voice clock recognizes the user's speech and sets the time based on the recognition result. The alarm time is set in the same way: the clock is put into the alarm time setting mode and the user utters the desired alarm time.
[0007]
The time is set by this operation, but users are often unsure whether their speech was input in an appropriate state (appropriate for the recognition process).
[0008]
To address this, some devices accept speech input while responding with the recognition result for each word the user utters. In the example above, when the user utters “AM”, the device responds “AM” as the recognition result; when the user then utters “6 o'clock”, the device responds “6 o'clock”; and when the user utters “30 minutes”, the device responds “30 minutes”. If the user's speech is inappropriate and cannot be recognized, the device can be made to give no response, or to respond with something like “Please speak again”.
[0009]
If the device thus responds with the recognition result for each word, or gives some response when recognition fails, users can tell whether what they said was appropriate and how it was recognized, which reassures them and makes the device easy to use.
[0010]
[Problems to be solved by the invention]
However, recognizing and responding to each word in this way means that a single setting operation such as setting the time takes a long time. Moreover, when such speech recognition technology is applied to products that must be inexpensive, such as everyday items and toys, cost must be kept as low as possible, so the processing power of the CPU and the memory capacity are severely constrained. The device must therefore minimize processing that places a heavy load on the CPU or uses a lot of memory.
[0011]
One way to deal with this, taking the time setting above as an example, is not to recognize each word as the user utters it and respond with the result, but to treat the content needed for setting the time, for example “AM”, “6 o'clock”, “30 minutes”, as one set, have the user utter the words of the set intermittently with a short pause after each word, and then perform speech recognition on the utterance. Because the device does not respond with a recognition result for each of the words in the set, the time needed to set the time can indeed be shortened.
[0012]
However, with this method of inputting a relatively long series of speech made up of several words from beginning to end, users remain unsure, as noted above, whether each word they uttered was input in an appropriate state. It is therefore necessary to notify the user in some way, without troublesome processing, whether the speech the user uttered was input to the device in an appropriate state.
[0013]
The object of the present invention is therefore to make it possible, with only simple processing, to notify the user whether the speech the user uttered was input properly for speech recognition, and thereby to improve usability during speech input.
[0014]
[Means for Solving the Problems]
To achieve the above object, the speech recognition method of the present invention is a speech recognition method for a speech recognition apparatus that recognizes a set of a plurality of words uttered intermittently with pauses between the words, wherein the plurality of words belong to first to nth word groups (n being a positive integer), the method comprising: a step of setting a duration criterion for each of the first to nth word groups, for all the word groups to which the plurality of words belong; a speech input step of, after the step of setting the duration criterion, receiving speech corresponding to the words uttered intermittently with pauses between them and outputting digitized speech data; a speech analysis step of analyzing the speech data output from the speech input step at predetermined time intervals and calculating the speech power and feature data for each interval; a speech section detection/determination step of detecting, based on the speech power calculated in the speech analysis step, the effective speech section of the speech corresponding to an uttered word, determining whether that speech was input in an appropriate state from whether the duration of the effective speech section is within the criterion set for the word group and from the magnitude of the speech power within the effective speech section, and, when it is determined to be appropriate, outputting a signal indicating that the speech corresponding to the uttered word was input in an appropriate state; an information output step of, on receiving that signal, outputting, during the pause between words, information indicating that the speech corresponding to the uttered word was input in an appropriate state; and a speech recognition step of recognizing, after the information has been output in the information output step, the speech corresponding to the uttered word that was determined in the speech section detection/determination step to have been input in an appropriate state.
[0015]
In the speech recognition method of the present invention, when the speech section detection/determination step determines from the duration of the effective speech section whether the speech was input in an appropriate state, it takes the time at which the speech power rises above a predetermined threshold as the start and the subsequent time at which the speech power falls below the threshold as the end; if the speech power does not rise above the threshold for a predetermined time after the end, it detects the section from the start to the end as the effective speech section; and if the detected effective speech section is within a predetermined criterion, it determines that the speech was input in an appropriate state.
[0016]
In the speech recognition method of the present invention, the information output in the information output step to indicate that the speech corresponding to the uttered word was input in an appropriate state is at least one of a signal tone, light, a voice message, and an indication on a display screen.
[0018]
The speech recognition apparatus of the present invention is a speech recognition apparatus that recognizes a set of a plurality of words uttered intermittently with pauses between the words, wherein the plurality of words belong to first to nth word groups (n being a positive integer), the apparatus comprising: means for setting a duration criterion for each of the first to nth word groups, for all the word groups to which the plurality of words belong; speech input means for, after the duration criterion has been set by the setting means, receiving speech corresponding to the words uttered intermittently with pauses between them and outputting digitized speech data; speech analysis means for analyzing the speech data output from the speech input means at predetermined time intervals and calculating the speech power and feature data for each interval; speech section detection/determination means for detecting, based on the speech power calculated by the speech analysis means, the effective speech section of the speech corresponding to an uttered word, determining whether that speech was input in an appropriate state from whether the duration of the effective speech section is within the criterion set for the word group and from the magnitude of the speech power within the effective speech section, and, when it is determined to be appropriate, outputting a signal indicating that the speech corresponding to the uttered word was input in an appropriate state; speech recognition processing means for performing recognition processing on the speech based on the effective speech section detected by the speech section detection/determination means and the feature data calculated by the speech analysis means; and information output means for, on receiving the signal output from the speech section detection/determination means, outputting, during the pause between words, information indicating that the speech corresponding to the uttered word was input in an appropriate state, wherein the speech recognition processing means performs recognition processing, after the information has been output by the information output means, on the speech corresponding to the uttered word that was determined by the speech section detection/determination means to have been input in an appropriate state.
[0019]
In the speech recognition apparatus of the present invention, when the speech section detection/determination means determines from the duration of the effective speech section whether the speech was input in an appropriate state, it takes the time at which the speech power rises above a predetermined threshold as the start and the subsequent time at which the speech power falls below the threshold as the end; if the speech power does not rise above the threshold for a predetermined time after the end, it detects the section from the start to the end as the effective speech section; and if the detected effective speech section is within a predetermined criterion, it determines that the speech was input in an appropriate state.
[0020]
In the speech recognition apparatus of the present invention, the information output by the information output means to indicate that the speech corresponding to the uttered word has been input in an appropriate state is at least one of a signal sound, light, a voice message, and a display on a display screen.
[0022]
In addition, the recording medium on which the speech recognition processing program of the present invention is recorded is a recording medium recording a program for notifying the input state of recognition target speech in a speech recognition device that recognizes a set of a plurality of words uttered intermittently with a gap between the words, wherein the plurality of words belong to first to nth (n is a positive integer) word groups, the program comprising: a step of setting, for each of the first to nth word groups, a time length reference for all of the word groups to which the plurality of words belong; a voice input step of receiving, after the step of setting the time length reference, the speech corresponding to the words uttered intermittently with a gap between the words, and outputting digitized voice data; a voice analysis step of analyzing the voice data output in the voice input step at predetermined time intervals and calculating voice power and feature data for each predetermined time interval; a speech section detection/determination step of detecting, based on the voice power calculated in the voice analysis step, an effective speech section of the speech corresponding to the uttered word, determining whether the speech corresponding to the uttered word has been input in an appropriate state according to whether the time length of the effective speech section is within the reference set for each word group and according to the magnitude of the voice power in the effective speech section, and, when the speech is determined to be appropriate, outputting a signal indicating that the speech corresponding to the uttered word has been input in an appropriate state;
an information output step of outputting, between the words, information indicating that the speech corresponding to the uttered word has been input in an appropriate state, upon receiving the signal output in the speech section detection/determination step indicating that the speech corresponding to the uttered word has been input in an appropriate state; and a speech recognition step of recognizing, after the information has been output in the information output step, the speech corresponding to the uttered word determined in the speech section detection/determination step to have been input in an appropriate state; the recording medium recording a speech recognition processing program that causes a speech recognition device to execute these steps.
[0023]
Further, in the recording medium on which the speech recognition processing program of the present invention is recorded, the speech section detection/determination step determines from the time length of the effective speech section whether the speech has been input in an appropriate state, as follows. The time at which the voice power becomes greater than a predetermined threshold is taken as the start, and the time at which the voice power subsequently falls below the threshold is taken as the end. If the voice power does not exceed the threshold for a predetermined time from the end, the section from the start to the end is detected as an effective speech section, and when the time length of the detected effective speech section is within a predetermined reference, it is determined that the speech has been input in an appropriate state.
[0024]
Further, in the recording medium on which the speech recognition processing program of the present invention is recorded, the information output in the information output step to indicate that the speech corresponding to the uttered word has been input in an appropriate state is at least one of a signal sound, light, a voice message, and a display on a display screen.
[0026]
The present invention makes it possible, with simple processing, to notify the speaker of whether the recognition target speech input by the user has been input in an appropriate state, thereby improving usability. To achieve this, whether the recognition target speech has been input in an appropriate state is determined from the time length of the effective speech section in the recognition target speech uttered by the user and from the magnitude of the voice power in that effective speech section, and if the speech is determined to be appropriate, information indicating this is issued immediately after the recognition target speech is input. As a result, the user can easily know whether the speech he or she uttered has been input in an appropriate state, and no longer feels uneasy during the voice input operation about whether the input speech was really input in an appropriate state.
[0027]
In addition, the recognition target speech subject to this determination is speech uttered as one set of a plurality of words, spoken with a gap between the speech for each word of the set so that the gap marks the break between words. For example, in the case of a clock whose time settings, such as the current time, can be set by voice, a plurality of words such as "AM", the hour, and the minutes form one set, and the target is speech uttered intermittently with a gap between the words.
[0028]
Thus, in a situation where the user's utterances of a set of words are input one-sidedly, with no response of a recognition result from the device between the words, the user feels uneasy about whether each word has actually been input in an appropriate state (a state appropriate for recognition).
[0029]
In order to solve this, in a situation where a plurality of words are uttered as one set, it is possible to give the user a sense of security by transmitting some information from the device side between each word.
[0030]
Examples of such information include a signal sound emitted momentarily within the break between words (for example, a "beep"), light emitted momentarily by a light emitting diode (LED), a very short voice message (for example, "Yes"), and, for a device having a display unit such as a liquid crystal display (LCD), a simple display such as "OK" on the LCD. Because the device emits such simple information immediately after the speech for each word, the user knows that the speech he or she uttered has been input in an appropriate state, and can therefore feel secure about the voice input operation.
[0031]
Further, the plurality of words constituting one set belong to the first to nth (n is a positive integer) word groups, and the criterion for judging the time length of the effective speech section described above is set for each word group. This is because the length of the words belonging to each word group (the time needed to utter them) may differ greatly between word groups. By setting the criterion for judging the time length of the effective speech section for each word group, the length of the effective speech section can be judged in a manner suited to the words belonging to each word group.
[0032]
In addition, by adopting the recognition target speech input state notification method described above, the speech recognition apparatus of the present invention can improve usability, so that even a user unfamiliar with this type of device can operate it easily.
[0033]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings. In this embodiment, the voice clock described above is taken as an example of a device using voice recognition technology, and an example of setting the hour and minute on this voice clock will be described. As a specific example, consider setting "6:30 AM".
[0034]
FIG. 1 shows the external configuration of a voice clock to which the present invention is applied. This voice clock announces the time by voice message only and has no time display such as a liquid crystal display. It is provided with a speaker 2 as voice output means and a microphone 3 through which voice commands from the user are input. It further has at least a mode setting unit 4 for setting the mode when adjusting the current time or the alarm time, a time button 5 that is pressed when the user wants to know the current time, and the like. Various other components are provided depending on the functions, but illustration and description of portions not directly related to the gist of the present invention are omitted.
[0035]
In such a voice clock, provided the current time has been set correctly, when the user presses the time button 5 the current time at that moment is output as a voice message, for example "8:30 AM".
[0036]
FIG. 2 is a block diagram showing the configuration of the voice recognition device used in such a voice clock. The device comprises a voice input unit 11, a voice analysis unit 12, a voice section detection/determination unit 13, a voice recognition processing unit 14, an information output unit 15, voice recognition model data 16 used for voice recognition, and output data 17 for outputting various information (response content corresponding to the recognition result, voice message content for asking the user a question, and the information output when the uttered voice is determined to be appropriate).
[0037]
The voice input unit 11 includes the microphone 3 described above, an amplifier (not shown), an A/D converter (not shown), and the like. The voice uttered by the user is input through the microphone 3, amplified, A/D converted, and output as digitized voice data of, for example, 8 kHz and 10 bits.
[0038]
The voice analysis unit 12 performs voice analysis on the voice data output from the voice input unit 11 in short intervals of, for example, about 20 msec (with a shift of about 10 msec), and calculates the voice power and feature data (for example, a 10-dimensional LPC cepstrum) for each short interval (about 20 msec).
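The framing described above can be sketched as follows. This is only an illustration: the text does not say how voice power is computed, so mean-square amplitude per frame is an assumption here, and the function name `frame_powers` and its parameters are hypothetical. Feature extraction (the LPC cepstrum) is omitted.

```python
def frame_powers(samples, sample_rate=8000, frame_ms=20, shift_ms=10):
    """Return one power value per analysis frame.

    Assumption: "voice power" is taken as the mean of the squared sample
    values in the frame. The patent does not specify the formula.
    """
    frame_len = sample_rate * frame_ms // 1000  # 160 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000      # 80 samples at 8 kHz
    powers = []
    # Slide a 20 msec window over the data in 10 msec steps.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers
```

With these defaults each frame overlaps the previous one by half, so the power curve has one point every 10 msec.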
[0039]
The speech section detection/determination unit 13 detects an effective section of speech (referred to as an effective speech section) using the voice power calculated by the voice analysis unit 12. If the time length of the effective speech section is within a predetermined range (longer than L1 and shorter than L2, where L1 < L2), and the maximum value of the voice power in the section is within the range of two predetermined thresholds (greater than th2 and less than th3, where th2 < th3), the effective speech section is determined to be within the normal range and the speech is determined to have been input in a state appropriate for voice recognition. When the speech is determined to be appropriate, a signal indicating that it has been input in an appropriate state is output a predetermined time (L4) after the end of the effective speech section.
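The determination just described reduces to two range checks plus a delay. A minimal sketch follows, assuming durations in milliseconds and power in arbitrary units; the function names are hypothetical and no concrete values for L1, L2, L4, th2, or th3 are given in the text.

```python
def is_input_appropriate(duration_ms, peak_power, l1_ms, l2_ms, th2, th3):
    """True when L1 < duration < L2 and th2 < peak power < th3."""
    return l1_ms < duration_ms < l2_ms and th2 < peak_power < th3


def notification_time_ms(end_ms, l4_ms):
    """Time at which the 'appropriate' signal is issued: L4 after the end."""
    return end_ms + l4_ms
```

Both comparisons are strict, matching the "longer than" and "shorter than" wording of the description.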
[0040]
The effective speech section is obtained as follows. The time at which the voice power of the user's utterance becomes greater than a certain threshold (th1) is taken as the start of the speech section. If, after the voice power subsequently falls below the threshold th1, it does not rise above th1 again before a predetermined time (denoted L3) has elapsed, the time at which it fell below th1 is taken as the end of the speech section. The span from the start to the end of the speech section is called the effective speech section.
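The start and end rule above can be sketched over a list of per-frame power values. This is an illustrative reading, not the patented implementation: the function name is hypothetical, L3 is expressed as a number of frames, and returning `None` when too few frames remain to confirm the end is a choice made for this sketch.

```python
def find_effective_section(powers, th1, l3_frames):
    """Return (start, end) frame indices of the first effective section.

    start: first frame whose power exceeds th1.
    end:   first later frame whose power is below th1 and which is followed
           by l3_frames frames, none of which exceeds th1 again.
    Returns None if no section can be confirmed from the given frames.
    """
    start = None
    for i, power in enumerate(powers):
        if start is None:
            if power > th1:
                start = i
        elif power < th1:
            hold = powers[i + 1:i + 1 + l3_frames]
            if len(hold) < l3_frames:
                return None  # not enough data to confirm the end
            if all(p <= th1 for p in hold):
                return (start, i)
    return None
```

A brief dip below th1 that is followed by renewed speech within L3 does not end the section, which keeps short pauses inside a word from splitting it.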
[0041]
Further, the speech recognition processing unit 14 performs voice recognition processing using the voice recognition model data 16, based on the feature data in the detected effective speech section (that is, the portion of the feature data obtained by the voice analysis unit 12 that falls within the effective speech section).
[0042]
As described above, the information output unit 15 uses the output data 17 to create and output response voices corresponding to recognition results and voice messages that ask the user questions. In addition, upon receiving the signal indicating appropriateness output from the speech section detection/determination unit 13, it outputs information indicating that the speech has been input appropriately (in this embodiment, a "beep" signal sound).
[0043]
An example of setting a time (here, the current time) on the voice clock configured in this way will now be described. The time to be set is 6:30 AM, as mentioned above. The voice clock (hereinafter, the device) is assumed to be in the current time setting mode.
[0044]
FIG. 3A shows the speech waveforms of "AM", "6 o'clock", and "30 minutes" uttered by the user, as output after A/D conversion by the voice input unit 11. As can be seen from FIG. 3(a), when a time is set by voice, for example when uttering the time setting content "6:30 AM", good results are obtained if the user is told in advance to utter the words making up the content ("AM", "6 o'clock", "30 minutes") one at a time, leaving a short gap (ΔT1, ΔT2) between words, in order to increase the recognition rate.
[0045]
As described above, the voice used in this embodiment is a voice that is uttered as a set of a plurality of words, and between each of the voices corresponding to each word constituting the one set, It is assumed that the voice is spoken intermittently with a gap as a break.
[0046]
Here the device is in the current time setting mode. In this mode (and likewise in the alarm time setting mode), the order of utterance is assumed to be fixed: "AM" or "PM" is uttered first, the hour second, and the minutes third. For convenience of explanation, the word uttered first is called a word belonging to the first word group, the word uttered second a word belonging to the second word group, and the word uttered third a word belonging to the third word group.
[0047]
The voice analysis unit 12 performs voice analysis on the voice data as shown in FIG. 3A, for example, by dividing into 20 msec frames, and obtains voice power and feature data for each frame. The feature data is used at the time of the voice recognition process, and the voice power is used in the determination process of whether or not the voice spoken by the user is appropriate.
[0048]
FIG. 3B shows the voice power curve obtained by connecting the voice power values obtained for each frame. FIG. 3A shows the voice data for the case in which the user utters all the words of one set in order ("AM", "6 o'clock", "30 minutes"), but the processing of the present invention is performed word by word. When the user utters "AM", it is determined whether that utterance is appropriate. When the user then utters "6 o'clock", it is determined whether that utterance is appropriate, and if so the device emits a "beep" signal sound. When the user then utters "30 minutes", it is determined whether that utterance is appropriate. In this way, for each word making up the set, it is determined whether the word has been input in an appropriate state. The processing for each word is described in detail below.
In addition, the time lengths L1 and L2, which serve as the criteria for judging the time length of the effective speech section, are actually set separately for each of the first, second, and third word groups.
[0049]
In this case, the content of the user's utterance follows the fixed pattern of "AM" or "PM", the hour, and the minutes. The group of words uttered in the first position is called the first word group; the words belonging to it are "AM" and "PM". The group of words uttered in the hour position is called the second word group; the words belonging to it are words expressing the hour, such as "0 o'clock", "1 o'clock", and "2 o'clock". The group of words uttered in the minutes position is called the third word group; the words belonging to it are words expressing minutes, such as "0 minutes", "1 minute", and "2 minutes". The order of utterance is fixed: a word from the first word group (for example, "AM") is uttered first, then a word from the second word group (for example, "6 o'clock"), and then a word from the third word group (for example, "30 minutes"). The device performs the recognition process for each input word group in that order.
[0050]
Therefore, it is preferable to set the above-described L1 and L2 separately for each of the first to third word groups. Hereinafter, the time lengths set for the first word group are denoted L11 and L21 (L11 < L21), those set for the second word group L12 and L22 (L12 < L22), and those set for the third word group L13 and L23 (L13 < L23).
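As an illustration of per-group criteria, the sketch below judges each word of a set against the bounds of its own word group. All numeric values are hypothetical placeholders (the text gives no figures for L11 to L23 or for th2 and th3), and the names `LENGTH_CRITERIA_MS` and `judge_set` are introduced only for this example.

```python
# (lower, upper) bounds in milliseconds, one pair per word group.
# Hypothetical values chosen only to make the example concrete.
LENGTH_CRITERIA_MS = [
    (300, 900),   # group 1, "AM" / "PM":   (L11, L21)
    (200, 800),   # group 2, hour words:    (L12, L22)
    (300, 1200),  # group 3, minute words:  (L13, L23)
]


def judge_set(observations, th2, th3):
    """Judge each word of one set against its own group's length bounds.

    observations: list of (duration_ms, peak_power) in utterance order.
    Returns one bool per word; True means a beep would be issued.
    """
    results = []
    for (duration, peak), (lower, upper) in zip(observations, LENGTH_CRITERIA_MS):
        results.append(lower < duration < upper and th2 < peak < th3)
    return results
```

With these placeholder bounds, a 1000 msec utterance passes as a minute word but would fail as an hour word, which is the reason the description gives for setting the criteria per group.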
[0051]
First, when the user utters "AM" as a word belonging to the first word group, the voice analysis unit 12 divides the voice data into frames of, for example, 20 msec as described above, performs voice analysis, and obtains the voice power and feature data.
[0052]
Then, an effective voice section T1 of the voice data for “AM” is obtained from the voice power curve obtained for “AM” spoken by the user.
[0053]
First, with the threshold value th1 set in advance as a reference, the time when the voice power obtained by the user's utterance first exceeds the threshold value th1 is set as the beginning of the “AM” voice segment. As can be seen from FIG. 3B, since the threshold value th1 is exceeded at time t1, this time t1 is set as the start end of the voice section for “AM” (referred to as start end t1).
[0054]
Subsequently, the time at which the voice power for "AM" falls below the threshold th1 is found, and this time is denoted t2. If the voice power does not exceed the threshold th1 again before a predetermined time L3 has elapsed from time t2, time t2 is taken as the end of the speech section T1 for "AM" (referred to as end t2).
[0055]
Then, the section between the start end t1 and the end t2 obtained in this way is set as the effective speech section T1, t1 is set as the start end of the effective speech section T1, and t2 is set as the end of the effective speech section T1. The fixed time L3 is set to a very short time, and specifically, is a time shorter than the time ΔT1 and ΔT2 between the words existing between adjacent words.
[0056]
In this way, the effective speech section T1 for the “AM” speech belonging to the first word group is obtained. Next, it is determined whether or not the time length and voice power of the effective voice section T1 are within a preset range.
[0057]
That is, as described above, when the time length of the effective speech section T1 (the time length is also denoted T1) satisfies L11 < T1 < L21 with respect to the preset time lengths L11 and L21, and the maximum value m1 of the voice power in the effective speech section T1 satisfies th2 < m1 < th3 with respect to the preset thresholds th2 and th3, the extracted effective speech section T1 is determined to be within the normal range, and "AM" spoken by the user is determined to have been input in a state appropriate for voice recognition.
[0058]
In this way, when it is determined that the effective speech section T1 for "AM" belonging to the first word group is within the predetermined range (L11 < T1 < L21 and th2 < m1 < th3), the speech section detection/determination unit 13 outputs to the information output unit 15, as shown in FIG. 3 (c), a signal s1 indicating that the speech (in this case, "AM") has been input in an appropriate state, after the time L4 has elapsed from the end t2 of the effective speech section T1.
[0059]
Upon receiving this signal s1, the information output unit 15 outputs a signal sound as predetermined instantaneous information, as shown in FIG. 3 (d). This signal sound informs the user that the speech he or she uttered has been input in a state appropriate for recognition. Various notification means are conceivable, but in this embodiment a momentary "beep" signal sound is output.
[0060]
That is, when the user utters "AM" and it is determined to be appropriate, the device emits a "beep" signal sound after the user's "AM". The user can thus understand that the speech "AM" he or she uttered has been input to the device in an appropriate state.
[0061]
Subsequently, when the user utters “6 o'clock” belonging to the second word group, the effective voice section T2 of the audio data for “6 o'clock” is obtained from the audio power curve for the audio data.
[0062]
First, with the preset threshold th1 as the reference, the time at which the voice power obtained from the user's utterance first exceeds th1 is taken as the start of the speech section for "6 o'clock". As can be seen from FIG. 3B, the threshold th1 is exceeded at time t3, so this time t3 is taken as the start of the speech section for "6 o'clock" (referred to as start t3).
[0063]
Subsequently, the time at which the voice power for the word "6 o'clock" falls below the threshold th1 is found, and this time is denoted t4. If the voice power does not exceed the threshold th1 again before the predetermined time L3 has elapsed from time t4, time t4 is taken as the end of the speech section T2 for "6 o'clock" (referred to as end t4).
[0064]
The section between the start t3 and the end t4 determined in this way is taken as the effective speech section T2, with t3 as its start and t4 as its end. Next, it is determined whether the time length of the effective speech section T2 is within the preset range, in this case whether L12 < T2 < L22. It is also determined whether the maximum value m2 of the voice power in the effective speech section T2 falls within the range of the thresholds th2 and th3. When these conditions are satisfied, the extracted effective speech section T2 is determined to be within the normal range, and the speech (in this case, "6 o'clock") is determined to have been input in a state appropriate for voice recognition processing.
[0065]
In this way, when it is determined that the effective speech section T2 for "6 o'clock" belonging to the second word group is within the normal range, the speech section detection/determination unit 13 outputs to the information output unit 15, as illustrated in the drawing, a signal s2 indicating that the effective speech section T2 is within the normal range, after the time L4 has elapsed from the end t4 of the effective speech section T2.
[0066]
When the information output unit 15 receives this signal s2, as shown in FIG. 3D, the information output unit 15 outputs an instantaneous signal sound “beep” as information indicating appropriateness as described above.
[0067]
That is, when the user utters "6 o'clock" following "AM" and it is determined to be appropriate, the device emits a "beep" signal sound after "6 o'clock". The user can thus understand that the speech "6 o'clock" he or she uttered has been input to the device in an appropriate state.
[0068]
Subsequently, when the user speaks “30 minutes” belonging to the third word group, an effective voice section T3 of the voice data for “30 minutes” is obtained from the voice power curve for the voice data.
[0069]
First, as can be seen from FIG. 3B, the threshold th1 is exceeded at time t5, so this time t5 is taken as the start of the speech section T3 for "30 minutes" (referred to as start t5). Then the time at which the voice power for the word "30 minutes" falls below the threshold th1 is found, and this time is denoted t6. If the voice power does not exceed the threshold th1 again before the predetermined time L3 has elapsed from time t6, time t6 is taken as the end of the speech section for "30 minutes" (referred to as end t6).
[0070]
The section between the start end t5 and the end t6 thus determined is defined as an effective speech section T3, t5 is defined as the start end of the effective speech section T3, and t6 is defined as the end of the effective speech section T3. Next, it is determined whether or not the time length of the effective speech section T3 is within a preset range. In this case, it is determined whether L13 <T3 <L23. Also, it is determined whether or not the maximum value m3 of the voice power in the effective voice section T3 is within the range of the threshold values th2 and th3. When these conditions are satisfied, it is determined that the extracted effective speech section T3 is within a normal range, and it is determined that the speech (in this case, “30 minutes”) is input in an appropriate state.
[0071]
When it is determined in this way that the effective speech section T3 for "30 minutes" belonging to the third word group is within the normal range, the speech section detection/determination unit 13 outputs to the information output unit 15, as illustrated in the drawing, a signal s3 indicating that the effective speech section T3 is within the normal range, after the time L4 has elapsed from the end t6 of the effective speech section T3.
[0072]
Upon receiving this signal s3, the information output unit 15 outputs an instantaneous signal sound “beep” as information indicating that it is appropriate as described above.
[0073]
That is, when the user utters "30 minutes" after "AM" and "6 o'clock" and it is determined to have been input in an appropriate state, the device emits a "beep" signal sound following "30 minutes". The user can thus tell that "30 minutes" has been input to the device in an appropriate state.
[0074]
As described above, when the user utters "AM", "6 o'clock", and "30 minutes" and the speech for each word is determined to have been input in an appropriate state, the exchange proceeds as "AM", "beep", "6 o'clock", "beep", "30 minutes", "beep". Because a "beep" is returned after each utterance, the user knows that the speech he or she uttered has been input in an appropriate state, and can feel secure.
[0075]
The above-described L3 has been described as a common time in the first to third word groups, but an appropriate time may be set for each word group as in L1 and L2.
[0076]
FIGS. 4A, 4B, 4C, and 4D show examples in which an effective speech section (here, effective speech section T1) does not satisfy the conditions described above. FIGS. 4A and 4B show a case in which the maximum value m1 of the voice power in the effective speech section T1 exceeds the threshold th3 and the time length of the effective speech section T1 is shorter than L11. FIGS. 4C and 4D show a case in which the maximum value m1 of the voice power in the effective speech section T1 is smaller than the threshold th2 and the time length of the effective speech section T1 is longer than L21.
[0077]
The example of FIGS. 4A and 4B is one in which the voice uttered by the user is too loud and the word is spoken very quickly. Since speech uttered in such a state is highly likely not to be recognized properly, the input speech is judged not to be appropriate.
[0078]
The example of FIGS. 4C and 4D is one in which the voice uttered by the user is too quiet and the word is spoken in a very drawn-out manner. Since speech uttered in such a state is also highly likely not to be recognized properly, the input speech is judged not to be appropriate.
[0079]
Although FIGS. 4A, 4B, 4C, and 4D are examples in which neither the voice power nor the time length of the effective speech section satisfies its condition, the input speech is also determined not to be appropriate when only one of the two fails to satisfy its condition.
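The failure cases of FIGS. 4A to 4D can be expressed as a small diagnostic. The description only requires a pass or fail decision; returning named reasons is an extension added here for illustration, and the function name and reason strings are hypothetical.

```python
def diagnose(duration_ms, peak_power, l1_ms, l2_ms, th2, th3):
    """Return the reasons an utterance is inappropriate (empty list if fine).

    Any single failed condition is enough to reject the input.
    """
    reasons = []
    if peak_power >= th3:
        reasons.append("too loud")
    elif peak_power <= th2:
        reasons.append("too quiet")
    if duration_ms <= l1_ms:
        reasons.append("too short")
    elif duration_ms >= l2_ms:
        reasons.append("too long")
    return reasons
```

An empty list corresponds to the case in which the beep is issued; a device that prompts the user to repeat the input could choose its message from the returned reasons.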
[0080]
As described above, when it is determined that the voice spoken by the user has not been input in an appropriate state, the signal sound of “beep” is not emitted. Thereby, the user can know that the voice he / she spoke is not appropriate. In this case, since the reaction from the apparatus side is silent, the user performs input again. Alternatively, if an input is not made in an appropriate state, the user can be notified with a voice message or other signal prompting the user to input again.
[0081]
As described above, this embodiment has been described using an example in which content made up of a plurality of words, such as "AM", "6 o'clock", and "30 minutes", is input to the device by intermittent utterance, in which the user pauses briefly after each word before uttering the next, and the device recognizes the input speech and performs the corresponding operation (such as setting the current time).
[0082]
In this case, when the user first speaks “AM”, whether or not the voice is input in an appropriate state is determined based on the time length and voice power of the effective voice section obtained from the voice data. If it is determined that the sound is appropriate, an instantaneous signal sound such as “beep” indicating appropriateness is emitted. As a result, the user can know from the signal sound from the apparatus whether or not the voice he / she has input is in a normal state, and can perform the voice input operation without feeling uneasy.
[0083]
As described above, the present invention is effective not so much when speech is input interactively with the device as when a plurality of words forming one set are input in succession.
[0084]
The present invention is not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present invention.
[0085]
For example, in the above-described embodiment, the maximum value of the voice power is used as one of the conditions for determining whether the effective speech section is appropriate. However, the average value of the voice power in the effective speech section may be used instead to determine whether the effective speech section is appropriate.
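The choice between the maximum and the average can be sketched as a single helper. The function name and the `use_average` flag are hypothetical, and the section is assumed to be given as inclusive frame indices into a per-frame power list.

```python
def section_level(powers, start, end, use_average=False):
    """Power level of frames start..end inclusive: peak, or mean if requested."""
    section = powers[start:end + 1]
    if use_average:
        return sum(section) / len(section)
    return max(section)
```

Whichever value is used would then be compared against th2 and th3 in the same way; the thresholds would presumably need different settings for the average than for the peak.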
[0086]
In the above-described embodiment, a signal sound such as "beep" is used as the information indicating that the speech uttered by the user is appropriate, but the information is not limited to such a signal sound. A light emitting diode or the like may be lit, or a short voice response such as "Yes" may be used. Furthermore, if the device has a display unit such as a liquid crystal display, a display such as "OK" may be shown on the display unit immediately after the speaker's speech is input, when that speech is appropriate.
[0087]
In the above-described embodiment, the time adjustment in the timepiece using the voice recognition technology has been described as an example. However, the present invention is naturally applicable to devices other than the timepiece.
[0088]
Moreover, although the example which input three words as one set was shown in the above-mentioned embodiment, it is needless to say that the number of words constituting one set is not limited to three.
[0089]
In addition, the processing program for performing the input state notifying process for the recognition target voice of the present invention described above can be recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also encompasses such a recording medium. The processing program may also be obtained from a network.
[0090]
As described above, according to the present invention, whether the recognition target speech has been input in an appropriate state is determined from the time length of the effective speech section in the recognition target speech uttered by the speaker and the magnitude of the speech power in that section, and, if the input is determined to be appropriate, information indicating this is issued immediately after the recognition target speech is input. As a result, when performing a voice input operation on the device, the user need not feel anxious about whether the voice has actually been input in an appropriate state, and ease of use can be improved.
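The claims describe detecting the effective speech section by taking the time at which the speech power exceeds a threshold as the start, taking the later time at which it falls below the threshold as the end, and confirming the section once the power has stayed at or below the threshold for a predetermined time. The following is a minimal sketch of that idea over per-frame power values. The function names, the frame-based representation, the handling of values exactly equal to the threshold, and the sample numbers are assumptions.

```python
def detect_effective_section(frame_powers, threshold, hold_frames):
    """Return (start, end) frame indices of the first effective section, or None.

    start: first frame whose power exceeds the threshold.
    end:   first later frame at or below the threshold, confirmed only after
           the power has stayed at or below it for hold_frames frames.
    """
    start = None
    end = None
    for i, power in enumerate(frame_powers):
        if power > threshold:
            if start is None:
                start = i      # speech begins here
            end = None         # power came back up, so drop any tentative end
        elif start is not None:
            if end is None:
                end = i        # tentative end of speech
            if i - end + 1 >= hold_frames:
                return start, end
    return None                # no section was confirmed in the given frames


def section_length_s(section, frame_period_s):
    """Time length of a detected section, given the assumed frame period."""
    start, end = section
    return (end - start) * frame_period_s


powers = [0, 0, 5, 6, 0, 7, 0, 0, 0, 0]
section = detect_effective_section(powers, threshold=1, hold_frames=3)
print(section)
print(section_length_s(section, frame_period_s=0.02))
```

In this example the single low frame between 6 and 7 does not end the section, because the power rises above the threshold again before the hold time elapses. The resulting time length is what would then be compared with the reference.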
[0091]
In particular, the effect is obtained when the recognition target speech is uttered as one set of a plurality of words. Take, for example, a clock whose time, such as the current time, can be set by voice: a plurality of words such as “AM”, “what time”, and “how many minutes” form one set, and each word is uttered intermittently with an interval between words. When the words of such a set are uttered intermittently, an instantaneous signal sound is returned from the device after each word is uttered. The user can therefore know immediately that the spoken voice has been input in an appropriate state, and can feel secure about the voice input operation.
[0092]
Moreover, because only instantaneous information is emitted to indicate that the voice spoken by the user has been input in an appropriate state, the processing is lighter and the processing time is much shorter than when, for example, each word is recognized and the recognition result is returned directly as a response.
[0093]
A voice recognition device that adopts this method of notifying the input state of the recognition target voice is easy to use and can be handled easily even by a user unfamiliar with this type of equipment. Because the overall processing is light, low-cost CPUs and memories can be used, and the cost of the device itself can be reduced.
[Brief description of the drawings]
FIG. 1 is a diagram schematically showing the appearance of a voice timepiece used in an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a schematic configuration of a voice recognition device portion used in the voice timepiece shown in FIG. 1;
FIG. 3 is a time chart for explaining an input state notifying process of a recognition target voice when setting the time with the voice timepiece shown in FIG. 1;
FIG. 4 is a diagram for explaining an example in which the input voice is not appropriate in the embodiment of the present invention.
[Explanation of symbols]
1 Voice timepiece housing
2 Speaker
3 Microphone
4 Mode setting section
5 Time button
11 Voice input part
12 Speech analysis unit
13 Voice segment detection / determination unit
14 Voice recognition processor
15 Information output section
16 Model data for speech recognition
17 Model data for output

Claims

A speech recognition method in a speech recognition apparatus for recognizing a set of a plurality of words uttered intermittently with a gap between words, the method comprising:
a step of setting a time length reference for each of first to nth (n is a positive integer) word groups to which the plurality of words belong, for all the word groups to which the plurality of words belong;
a voice input step of, after the step of setting the time length reference, receiving the voice corresponding to a word uttered intermittently with a gap between the words, and outputting digitized voice data;
a speech analysis step of analyzing the voice data output from the voice input step at every predetermined time, and calculating the speech power and feature data for each predetermined time;
a speech section detection/determination step of detecting an effective speech section of the speech corresponding to the uttered word based on the speech power calculated in the speech analysis step, determining whether the speech corresponding to the uttered word has been input in an appropriate state according to whether the time length of the effective speech section is within the reference set for each word group, and outputting a signal indicating that the speech corresponding to the uttered word has been input in an appropriate state;
an information output step of, upon receiving the signal output in the speech section detection/determination step indicating that the speech corresponding to the uttered word has been input in an appropriate state, outputting, in the gap between the words, information indicating that the speech corresponding to the uttered word has been input in an appropriate state; and
a speech recognition step of recognizing, after the information is output in the information output step, the speech corresponding to the uttered word determined in the speech section detection/determination step to have been input in an appropriate state.

The speech recognition method according to claim 1, wherein, in determining whether the speech has been input in an appropriate state based on the time length of the effective speech section, the speech section detection/determination step takes the time at which the speech power becomes greater than a predetermined threshold as a start, takes the subsequent time at which the speech power falls below the threshold as an end, detects the section from the start to the end as the effective speech section when the speech power does not exceed the threshold for a predetermined time from the end, and determines that the speech has been input in an appropriate state when the detected effective speech section is within a predetermined reference.

The speech recognition method according to claim 1, wherein the information output in the information output step, indicating that the speech corresponding to the uttered word has been input in an appropriate state, is at least one of a signal sound, light, a voice message, and a display on a display screen.

A speech recognition apparatus for recognizing a set of a plurality of words uttered intermittently with a gap between words, the apparatus comprising:
means for setting a time length reference for each of first to nth (n is a positive integer) word groups to which the plurality of words belong, for all the word groups to which the plurality of words belong;
voice input means for, after the time length reference is set by the means for setting the time length reference, receiving the voice corresponding to a word uttered intermittently with a gap between the words, and outputting digitized voice data;
speech analysis means for analyzing the voice data output from the voice input means at every predetermined time, and calculating the speech power and feature data for each predetermined time;
speech section detection/determination means for detecting an effective speech section of the speech corresponding to the uttered word based on the speech power calculated by the speech analysis means, determining whether the speech corresponding to the uttered word has been input in an appropriate state according to whether the time length of the effective speech section is within the reference set for each word group, and outputting a signal indicating that the speech corresponding to the uttered word has been input in an appropriate state;
speech recognition processing means for recognizing the speech based on the effective speech section detected by the speech section detection/determination means and the feature data calculated by the speech analysis means; and
information output means for, upon receiving the signal output from the speech section detection/determination means indicating that the speech corresponding to the uttered word has been input in an appropriate state, outputting, in the gap between the words, information indicating that the speech corresponding to the uttered word has been input in an appropriate state,
wherein the speech recognition processing means recognizes, after the information is output by the information output means, the speech corresponding to the uttered word determined by the speech section detection/determination means to have been input in an appropriate state.

The speech recognition apparatus according to claim 4, wherein, in determining whether the speech has been input in an appropriate state based on the time length of the effective speech section, the speech section detection/determination means takes the time at which the speech power becomes greater than a predetermined threshold as a start, takes the subsequent time at which the speech power falls below the threshold as an end, detects the section from the start to the end as the effective speech section when the speech power does not exceed the threshold for a predetermined time from the end, and determines that the speech has been input in an appropriate state when the detected effective speech section is within a predetermined reference.

The speech recognition apparatus according to claim 4 or 5, wherein the information output by the information output means, indicating that the speech corresponding to the uttered word has been input in an appropriate state, is at least one of a signal sound, light, a voice message, and a display on a display screen.

A recording medium on which is recorded a speech recognition processing program for a speech recognition apparatus that recognizes a set of a plurality of words uttered intermittently with a gap between words, the program causing the speech recognition apparatus to execute:
a step of setting a time length reference for each of first to nth (n is a positive integer) word groups to which the plurality of words belong, for all the word groups to which the plurality of words belong;
a voice input step of, after the step of setting the time length reference, receiving the voice corresponding to a word uttered intermittently with a gap between the words, and outputting digitized voice data;
a speech analysis step of analyzing the voice data output from the voice input step at every predetermined time, and calculating the speech power and feature data for each predetermined time;
a speech section detection/determination step of detecting an effective speech section of the speech corresponding to the uttered word based on the speech power calculated in the speech analysis step, determining whether the speech corresponding to the uttered word has been input in an appropriate state according to whether the time length of the effective speech section is within the reference set for each word group, and outputting a signal indicating that the speech corresponding to the uttered word has been input in an appropriate state;
an information output step of, upon receiving the signal output in the speech section detection/determination step indicating that the speech corresponding to the uttered word has been input in an appropriate state, outputting, in the gap between the words, information indicating that the speech corresponding to the uttered word has been input in an appropriate state; and
a speech recognition step of recognizing, after the information is output in the information output step, the speech corresponding to the uttered word determined in the speech section detection/determination step to have been input in an appropriate state.

The recording medium on which the speech recognition processing program according to claim 7 is recorded, wherein, in determining whether the speech has been input in an appropriate state based on the time length of the effective speech section, the speech section detection/determination step takes the time at which the speech power becomes greater than a predetermined threshold as a start, takes the subsequent time at which the speech power falls below the threshold as an end, detects the section from the start to the end as the effective speech section when the speech power does not exceed the threshold for a predetermined time from the end, and determines that the speech has been input in an appropriate state when the detected effective speech section is within a predetermined reference.

The recording medium on which the speech recognition processing program according to claim 7 or 8 is recorded, wherein the information output in the information output step, indicating that the speech corresponding to the uttered word has been input in an appropriate state, is at least one of a signal sound, light, a voice message, and a display on a display screen.