JP2004101901A

JP2004101901A - Speech interaction system and speech interaction program

Info

Publication number: JP2004101901A
Application number: JP2002264081A
Authority: JP
Inventors: Takehiro Sekine; 関根　剛宏; Takashi Nishiyama; 西山　高史
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 2002-09-10
Filing date: 2002-09-10
Publication date: 2004-04-02

Abstract

PROBLEM TO BE SOLVED: To achieve interaction according to a speaker by accurately recognizing whether the speaker is a registered user or a new user.

SOLUTION: A user database storage part 4, which stores a user speech database 11 for speaker recognition and a personal information database 12, is prepared. When speech interaction with the user is carried out, a speaker recognition part 2 identifies the user from an inputted speech signal; and an interaction control part 23 reads out user information corresponding to the identified speaker and selects an interaction scenario corresponding to the user out of a plurality of interaction scenarios according to the read-out user information. Then the interaction control part 23 realizes interaction using the interaction scenario selected corresponding to the user.

COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、ユーザとの音声対話を実現するに際して、ユーザの話者認識をし、話者認識の結果に応じて対話内容を変更可能とする音声対話装置及び音声対話プログラムに関する。
【０００２】
【従来の技術】
従来より、ユーザに対して音声を放音し、更に、ユーザから発せられた音声を入力して、ユーザとの間で対話を実現する音声対話システムが知られている。この従来の音声対話システムでは、現在のユーザが誰であるかを認識せず、新規のユーザである場合や過去に利用経験のあるユーザである場合に拘わらず対話を実行することが多かった。
【０００３】
これに対し、従来の他の音声対話システムでは、例えばユーザにパスワード等のキー入力を促すことで、ユーザを認識していた。
【０００４】
【特許文献１】
特開昭６３−８５６９８号公報
【０００５】
【発明が解決しようとする課題】
しかしながら、上述の従来の音声対話システムの多くは、現在の話し相手が誰であるかを認識せずに対話を実行するため、話し相手の固有情報や知識を参照することはなされていない。したがって、従来の音声対話システムでは、ユーザとの間で有効な対話が進行しないことや、以前に対話した内容を再度繰り返し実行するため、性能面や効率面での問題があった。
【０００６】
また、従来の他の音声対話システムでは、ユーザにパスワード入カを促すことでユーザを識別していたが、対話の開始に際してユーザに操作負担を強いることがあるという問題がある。
【０００７】
そこで、本発明は、上述した実情に鑑みて提案されたものであり、既に登録済のユーザであるか、新規のユーザであるかの話者認識を正確にして、話者に応じた対話を実現する音声対話装置及び音声対話プログラムを提供することを目的とする。
【０００８】
【課題を解決するための手段】
本発明は、ユーザに関するユーザ情報を記憶するデータベース記憶手段を用意しておき、ユーザとの間で音声対話をするに際して、入力した音声信号から話者識別をし、識別した話者に対応したユーザ情報を読み出し、読み出したユーザ情報に基づいて、複数の対話シナリオのうち、ユーザに対応した対話シナリオを選択する。そして、本発明では、ユーザに応じて選択した対話シナリオを用いた対話を実現することにより、上述の課題を解決する。
【０００９】
【発明の実施の形態】
以下、本発明の実施の形態について図面を参照して説明する。
【００１０】
［音声対話装置の構成］
本発明は、例えば図１に示すように構成された音声対話装置１に適用される。この音声対話装置１は、図示しない記憶機構に音声対話プログラムが格納され、図示しないＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）等を含むコンピュータにより音声対話プログラムを実行することで後述の話者認識部２、音声対話部３及び音声記憶部５の各部の機能モジュールを構成する。また、ユーザデータベース記憶部４及び、音声対話部３内の対話シナリオ記憶部２１は、音声対話プログラムを実行することで構成される各機能モジュールによりその内容が読み込まれると共に、情報更新や新規情報登録処理などがなされる。
【００１１】
音声対話装置１は、装置正面位置に向かってマイク等の収音機構（図示せず）が設けられ、この収音機構を介してユーザから発せられた音声が電気信号の音声信号として話者認識部２及び音声対話部３に入力される。
【００１２】
この音声対話装置１では、ユーザに関する情報を管理するために、話者認識用ユーザ音声データベース１１と個人情報データベース１２とを予め用意しておき、ユーザデータベース記憶部４に記憶させている。また、この音声対話装置１では、ユーザに応じて異なる対話を実現するための複数の対話シナリオを予め用意し、この対話シナリオを対話シナリオ記憶部２１に記憶させている。なお、本例において、この対話シナリオは、新規ユーザ向け対話シナリオ３１、登録ユーザ向け対話シナリオ３２を用意しておく。
【００１３】
ユーザデータベース記憶部４は、例えばハードディスク等の大容量データが記憶可能な記録媒体からなる。このユーザデータベース記憶部４に記憶されている話者認識用ユーザ音声データベース１１には、過去に音声対話装置１を利用したユーザの音声が音声信号として格納されている。すなわち、話者認識用ユーザ音声データベース１１に記憶されている音声信号は、個人情報データベース１２の個人情報の名称等と対応づけられて登録されて記憶されている。
【００１４】
本例において、話者認識用ユーザ音声データベース１１に記憶されている音声信号としては、ユーザの名前がある。この話者認識用ユーザ音声データベース１１には、同じ名前の複数の音声信号が格納される場合があるが、声の音程等により音声信号の周波数特性が個人ごとに異なるために、後述の話者認識部２にて認識可能となっている。
【００１５】
個人情報データベース１２には、過去に音声対話装置１を利用したユーザの名称を始めとする各種の個人情報が記憶されている。この各種の個人情報とは、音声対話装置１の利用履歴や、利用目的等であって、音声対話装置１の用途によって異なる。なお、この個人情報の具体例については、後述の音声対話装置１の具体的な動作にて説明する。
【００１６】
話者認識部２は、ユーザデータベース記憶部４に格納された話者認識用ユーザ音声データベース１１の内容、外部から入力された音声信号及び音声対話部３からの音声認識結果を用いて話者認識をする。
【００１７】
このとき、話者認識部２は、音声対話部３からの音声認識結果を例えばテキスト形式にて入力すると、このテキストに類似したテキストの音声信号を話者認識用ユーザ音声データベース１１から抽出する話者識別処理をする。次いで、話者認識部２は、抽出したユーザ候補となる音声信号と、入力した音声信号とを照合して、ユーザ候補から単一の話者を決定する話者照合処理をする。
【００１８】
これにより、話者認識部２は、話者認識処理の結果、話者が決定した場合にはその旨の情報を音声対話部３に送り、新規のユーザと判定した場合にはその旨を音声対話部３に送る。
【００１９】
ここで、例えば話者の候補としてＮ人を抽出した場合であっても、実際にはＮ人以外の他人が発声したとき、話者識別処理のみではＮ人のうちの一番似通った音声を持つ人を候補として選んでしまうが、Ｎ人以外の他人（或いは新規の人）であると判断するために、話者の候補を抽出する話者識別処理後に話者照合処理を行い、Ｎ人に含まれる話者か、Ｎ人以外の他人（或いは新規の人）かを判断する。
【００２０】
また、この音声対話装置１は、音声対話部３にて入力した音声信号の音声認識結果を入力して、この音声信号を話者認識用ユーザ音声データベース１１に登録する音声記憶部５を備える。話者認識用ユーザ音声データベース１１は、音声記憶部５からの音声信号を入力すると、個人情報データベース１２に記憶された個人情報と対応させ、話者認識部２での話者認識処理時に参照可能とする。
【００２１】
音声対話部３は、外部からユーザの音声信号をする音声認識部２２、対話制御部２３、音声合成部２４を備える。この音声対話部３では、音声認識部２２により音声信号を入力すると、音声認識部２２により音声認識をする。
【００２２】
このとき、音声認識部２２では、入力した音声信号と予め用意した音声識別用データベースとを比較することで音声認識をし、音声認識結果を話者認識部２、音声記憶部５又は対話制御部２３に送る。ここで、音声認識部２２による音声認識の開始タイミングとしては、音圧レベルが所定値以上となりこの音圧レベルが所定時間以上継続したタイミングとする。そして、音声認識部２２では、音声認識の開始タイミング後に入力した音声信号にフーリエ変換等を施して音声特徴量を抽出し、その音声特徴量を用いて音声認識をする。
【００２３】
音声認識部２２は、話者認識部２にて話者認識をさせるに際して、音声認識結果をテキスト形式にして話者認識部２に送る。このとき、音声認識部２２は、音声認識のスコア（確実度）の高い上位複数のテキストを音声認識結果として話者認識部２に送る。そして、音声認識部２２は、話者認識部２により話者認識をした結果、話者が確定した場合には、そのときの音声信号を音声記憶部５に送る。これにより、音声記憶部５により、話者認識用ユーザ音声データベース１１に音声信号の新規登録や、既に登録されている音声信号の更新をさせる。
【００２４】
音声合成部２４は、例えばスピーカ等の放音機構と接続され、対話制御部２３の制御に従って各種内容の音声をユーザに向かって放音させる。本例では、対話制御部２３により対話シナリオ記憶部２１及び個人情報データベース１２を参照して生成したテキストが音声合成部２４に送られ、音声合成部２４により音声合成をすることで、ユーザに発する音声が生成されることになる。
【００２５】
対話制御部２３は、話者認識部２からの話者認識結果に従って、話者に対応した対話シナリオを対話シナリオ記憶部２１から選択する。このとき、対話制御部２３は、個人情報データベース１２を参照して、ユーザごとの対話シナリオを読み出す。そして、対話制御部２３では、個人情報データベース１２を参照し、音声認識部２２からの音声認識結果に応じて音声合成部２４を制御して話者に対応した音声を放音させる。
【００２６】
対話制御部２３は、話者認識部２から新規のユーザである旨の話者認識結果を入力した場合には、対話シナリオ記憶部２１から新規ユーザ向け対話シナリオ３１を読み出して放音させる。また、対話制御部２３は、話者認識部２からユーザデータベース記憶部４に登録済のユーザである旨の話者認識結果を入力した場合には、対話シナリオ記憶部２１から登録ユーザ向け対話シナリオ３２を読み出して放音させる。ここで、登録ユーザ向け対話シナリオ３２としては、ユーザデータベース記憶部４に登録するユーザごとに用意しても良く、所定のカテゴリなどを設定しておいて用意しても良い。
【００２７】
このような音声対話装置１では、話者認識部２により登録済のユーザと認識した場合に、個人情報データベース１２を参照しながら登録ユーザ向け対話シナリオ３２を用いて音声対話を進行する。これにより、音声対話装置１では、個人に対応した音声対話エージェントとして機能することができる。
【００２８】
すなわち、この音声対話装置１では、ユーザからの音声が入力されたことに応じて音声認識部２２により音声認識をして音声信号をテキストに変換し、このテキストを対話制御部２３に入力させることにより、対話制御部２３により個人情報データベース１２及び対話シナリオを参照することで音声対話装置１からユーザに返答すべきテキストを生成する。これにより、音声合成部２４では、返答するテキストを入力して、音声に変換し、ユーザに返答することになる。
【００２９】
また、この音声対話装置１では、話者認識部２により新規のユーザと認識した場合に、新規ユーザ向け対話シナリオ３１を用いて音声対話を進行する。これにより、音声対話装置１では、登録済のユーザと新規のユーザとを区別して音声対話を進行させる。ここで、新規のユーザの場合には、音声対話中または音声対話終了後に新たに話者認識用ユーザ音声データベース１１及び個人情報データベース１２にユーザに関する情報を登録することになる。
【００３０】
［音声対話装置の他の構成］
つぎに、本発明を適用した他の音声対話装置４０について図２を参照して説明する。なお、上述の図１に示した音声対話装置１と同様の部分については同一符号を付することによりその詳細な説明を省略する。
【００３１】
図２に示す音声対話装置４０は、話者を認識するに際して、ユーザの顔画像を撮像し、顔画像を用いて話者認識をする点で図１に示した音声対話装置１と異なる。
【００３２】
この音声対話装置４０は、図１に示した音声対話装置１に加えて、顔画像認識用データベース４１、顔画像認識部４２、ユーザ認識部４３を備える。このような音声対話装置４０では、ユーザの立ち位置に視野角を有するカメラ機構（図示せず）を備え、このカメラ機構からの顔画像データを顔画像認識部４２にて入力する。ここで、顔画像の入力タイミングとしては、例えば話者認識部２にユーザの音声が入力されて、話者認識部２による話者認識を開始するタイミングなどがある。
【００３３】
顔画像認識部４２では、顔画像データを入力すると、顔画像データから顔特徴量を抽出する。そして、顔画像認識部４２では、抽出した顔特徴量と、顔画像認識用データベース４１に登録されている複数のユーザの顔特徴量とを比較してマッチングすることで、現在音声対話装置４０を利用しようとしている複数のユーザ候補を認識する。そして、顔画像認識部４２では、複数のユーザ候補についてマッチングスコアを作成し、顔画像認識結果としてユーザ認識部４３に送る。顔画像認識用データベース４１には、過去に音声対話装置４０を利用したユーザの顔特徴量が個人情報と対応づけられて蓄積されている。
【００３４】
また、ユーザ認識部４３には、話者認識部２から話者認識結果が送られる。この音声対話装置４０では、音声対話装置１の場合とは異なり、顔画像認識部４２による顔画像を用いたマッチングスコアと総合的にユーザ認識をするために、話者認識部２により音声信号を用いて抽出したユーザ候補ごとにマッチングスコアを作成してユーザ認識部４３に送る。
【００３５】
ユーザ認識部４３は、話者認識部２からのマッチングスコアと、顔画像認識部４２からのマッチングスコアとを用いてユーザ認識をする。この時、ユーザ認識部４３では、ユーザ候補ごとに、顔画像認識部４２からのマッチングスコアと話者認識部２からのマッチングスコアとを用いた複合演算をして、所定のしきい値を超えたマッチングスコアのユーザを話者に決定する。
【００３６】
このような音声対話装置４０では、顔画像及び音声信号の双方を用いて話者認識をするので、音声信号のみを用いて話者認識する場合と比較して話者認識率を向上させる。
【００３７】
［音声対話装置の具体的な動作］
つぎに、上述した音声対話装置による具体的な音声対話処理について図３を参照して説明する。なお、図３を用いた説明では、図２に示した音声対話装置４０による音声対話処理について説明する。
【００３８】
また、本例では、例えば病院の受付案内を代行する音声対話装置４０について説明する。すなわち、音声対話装置４０では、初めての来院者に対しては新規ユーザ向け対話シナリオ３１を用いた音声対話を実行し、過去に音声対話装置４０を利用した来院者については登録ユーザ向け対話シナリオ３２を用いた音声対話を実行する場合について説明する。
【００３９】
先ず、音声対話装置４０では、ユーザの立ち位置に来院者が存在すると認識した場合に、ステップＳ１に処理を進め、対話制御部２３により、「お名前は？」との問いかけをするように音声合成部２４を制御して、ステップＳ２に処理を進める。
【００４０】
ステップＳ２においては、ステップＳ１にて問いかけをしたことに対し、「西山です」と来院者が名乗った場合に、その音声信号をマイクなどの入カデバイス及びＡ／Ｄコンバータを介してデジタルデータとして音声認識部２２及び話者認識部２にて取得して、ステップＳ３に処理を進める。
【００４１】
ステップＳ３においては、音声認識部２２により、ステップＳ２にて入力された音声信号を用いた音声認識処理をすることで音声特徴を抽出して「ニシヤマ」をテキストとして取得し、ユーザの候補として話者認識部２に送ってステップＳ４に処理を進める。なお、本例において、話者の名称を認証ＩＤとしている。なお、音声認識部２２による音声認識手法としてはＨＭＭ（Ｈｉｄｄｅｎ　Ｍａｒｋｏｖ　Ｍｏｄｅｌ）やＤＰ（Ｄｙｎａｍｉｃ　Ｐｒｏｇｒａｍｍｉｎｇ）マッチング、又はその他の手法を用いる。
【００４２】
ステップＳ４においては、話者認識部２により、ステップＳ３にて音声認識部２２から取得したユーザ候補となるテキストが話者認識用ユーザ音声データベース１１に登録されているか否かの判定をする。このとき、話者認識部２では、「ニシヤマ」のテキストの他に、「ニシヤマ」に類似した他のテキストもユーザの候補として話者認識用ユーザ音声データベース１１を検索する。話者認識部２によりユーザ候補となるテキストが話者認識用ユーザ音声データベース１１に存在すると判定した場合にはステップＳ５に処理を進め、存在しないと判定した場合にはステップＳ６に処理を進める。
【００４３】
ステップＳ５においては、話者認識部２により、ステップＳ３にて入力したテキスト（名前）と類似するテキスト（名前）と対応づけられた音声信号を話者認識用ユーザ音声データベース１１から読み出して取得し、ステップＳ７に処理を進める。
【００４４】
ステップＳ７においては、話者認識部２により、ステップＳ２にて取得した音声信号と、ステップＳ５にて読み出して取得音声信号とをマッチングしてマッチングスコアＭｓ１を作成して、ユーザ認識部４３に送ってステップＳ８に処理を進める。
【００４５】
ステップＳ８においては、ユーザ認識部４３により、ステップＳ７にて入力した音声信号を用いたマッチングスコアＭｓ１と、顔画像を用いたマッチングスコアＭｓ２とを用いて、双方のマッチングスコアを複合演算した結果の総合スコアＭｓｓを作成して、ステップＳ９に処理を進める。
【００４６】
ここで、ステップＳ１〜ステップＳ３では、話者認識部２によって音声信号に応じたテキストを取得する場合について説明したが、ステップＳ１〜ステップＳ３の処理と平行して顔画像を用いたユーザ候補の抽出をする。このとき、音声対話装置４０では、顔画像認識部４２により顔画像を入力して顔画像の特徴量を抽出し、抽出した顔特徴量と顔画像認識用データベース４１に蓄積された顔特徴量とをマッチングさせて複数のユーザ候補についてのマッチングスコアＭｓ２をユーザ認識部４３に送る。
【００４７】
ステップＳ９においては、ユーザ認識部４３により、ステップＳ８にて演算した総合スコアＭｓｓが予め設定したしきい値よりも大きいか否かを判定する。ここで、しきい値は、音声信号を用いたマッチングスコアＭｓ１と顔画像を用いたマッチングスコアＭｓ２とを複合演算したときに、登録済の話者を特定するマッチングスコアが予め設定されている。なお、このしきい値は、音声対話装置４０のシステム設計時に話者認識部２により作成するマッチングスコアＭｓ１や顔画像認識部４２により作成するマッチングスコアＭｓ２の演算手法、ユーザ認識部４３にて作成する総合スコアＭｓｓの演算手法により変化するものである。
【００４８】
ステップＳ９において、総合スコアＭｓｓがしきい値よりも大きくないと判定した場合には、新規のユーザ又は他の登録済話者と特定するためにステップＳ１０に処理を進め、総合スコアＭｓｓがしきい値よりも大きいと判定した場合には、登録済の話者を特定したと判定してステップＳ１２に処理を進める。
【００４９】
ステップＳ１０においては、ユーザ認識部４３により、話者の候補となる他のテキストが存在するか否かを判定し、存在すると判定した場合には前のステップＳ５〜ステップＳ８での処理対象となっていたテキストを除外してステップＳ５に処理を戻し、存在しないと判定した場合にはステップＳ４に処理を戻す。
【００５０】
そして、この音声対話装置４０では、ユーザの候補として話者認識部２にて取得したテキストが存在する限り、ステップＳ５、ステップＳ７〜ステップＳ１０の処理を繰り返し、ユーザの候補となるテキストが存在しないと判定した場合にステップＳ４に処理を進め、ステップＳ４からステップＳ６に処理を進める。
【００５１】
ステップＳ６においては、ステップＳ５、ステップＳ７〜ステップＳ１０の処理を繰り返した結果、ユーザ候補のテキスト（本例では「ニシヤマ」及びそれに類似したテキスト）が存在しないことから、例えば音声にて新規来院者かどうかを確認し、新規来院者でない場合にはステップＳ１に処理を戻し、新規来院者である場合には新規話者ＩＤを個人情報データベース１２に登録して、ステップＳ１３に処理を進める。
【００５２】
一方、ステップＳ９において、ユーザ認識部４３により総合スコアＭｓｓがしきい値を超えるユーザ候補のテキストが存在すると判定された場合のステップＳ１２においては、当該ユーザ候補のテキストを話者認識部２から対話制御部２３に送る。そして、対話制御部２３により、個人情報データベース１２を参照してユーザ候補のテキストに対応した個人情報を読み出してステップＳ１３に処理を進める。
【００５３】
ステップＳ１３においては、対話制御部２３により、対話シナリオ記憶部２１から新規ユーザ向け対話シナリオ３１又は登録ユーザ向け対話シナリオ３２を選択して読み出す。このとき、対話制御部２３では、ステップＳ１２にて個人情報を取得した場合には登録ユーザ向け対話シナリオ３２を読み出してステップＳ１４に処理を進め、ステップＳ１１にて新規話者ＩＤを取得した場合には新規ユーザ向け対話シナリオ３１を読み出してステップＳ１４に処理を進める。
【００５４】
ステップＳ１４においては、対話制御部２３により、ステップＳ１３にて選択した対話シナリオに応じて音声を合成するように音声合成部２４にテキストデータを送り、ステップＳ１５においては、ステップＳ１４にて音声を放音させたことに対する音声を入力して音声認識部２２により音声認識をする。
【００５５】
このステップＳ１４及びステップＳ１５を実行することで、対話制御部２３では、新規ユーザ向け対話シナリオ３１を用いて、来院者の名前、生年月日、住所、連絡先、既往症などの個人情報を音声対話によって取得する。これに対し、登録ユーザ向け対話シナリオ３２を用いた場合には、新規来院者と同じ個人情報の問い合わせは行わず、前回の来院記録などを個人情報データベース１２から参照しながら、会話を進める。
【００５６】
そして、ステップＳ１４及びステップＳ１５が完了した時点のステップＳ１６において、対話制御部２３により、対話シナリオを用いた対話が終了したか否かを判定する。対話が終了していないと判定した場合にはステップＳ１７に処理を進め、更に個人情報データベース１２を参照してステップＳ１３〜ステップＳ１５の処理を繰り返す。
【００５７】
一方、ステップＳ１６において対話が終了したと判定した場合には、ステップＳ１４及びステップＳ１５での音声対話を反映させるように話者認識用ユーザ音声データベース１１及び個人情報データベース１２の内容を更新する。すなわち、対話制御部２３では、対話が終了したと判定した場合に最新のユーザ音声を個人情報データベース１２に記憶して、最新のユーザ音声に更新させるように音声記憶部５を制御する。また、対話制御部２３では、対話内容に応じて、ユーザの個人情報を更新するように個人情報データベース１２の内容を書き換える。
【００５８】
［実施の形態の効果］
以上詳細に説明したように、本発明を適用した音声対話装置１，４０によれば、入力した音声及び／又は顔画像から登録済のユーザか、新規のユーザかを認識し、登録済のユーザと新規のユーザとで対話シナリオを選択して対話内容を区別することができる。さらに、この音声対話装置１，４０によれば、登録済のユーザについては個人情報データベース１２を参照して個人情報に応じて対話内容を変更することができ、更にユーザに対応した対話を実現することができる。
【００５９】
すなわち、この音声対話装置１，４０によれば、具体的な対話を開始する前に話者認識をすることで、登録済のユーザである場合には以前に対話した同じ内容は対話せず、以前の対話で得たユーザに関する情報や知識を次回の対話に反映させるように対話シナリオを変更することができ、より人間との対話に近い知的な機能を付与することができる。
【００６０】
また、この音声対話装置１，４０によれば、個人情報と対応させて対話シナリオを識別する情報を個人情報データベース１２に登録しておき、対話シナリオ記憶部２１に個人ごとの対話シナリオを用意しておいても良い。このようにすることで、対話制御部２３では、話者認識結果に応じて、話者に対応した個人情報から適切な対話シナリオを対話シナリオ記憶部２１から選択することができる。
【００６１】
これにより、例えば、個人情報として来院回数、来院目的などの情報を取得し、対話制御部２３により、来院者が今回お見舞いで来院したか否かをステップＳ１４及びステップＳ１５にて確認することで、次のステップＳ１３にてお見舞い用の対話シナリオを用いた対話を開始させることができる。この効果に加えて、音声対話装置１，４０では、新規ユーザ向け対話シナリオを用意しておくので、新規のユーザにとっても自然な対話を実現することができる。
【００６２】
また、この音声対話装置１，４０によれば、話者を認識するに際してパスワード入力等を促す必要が無く、ユーザ側の負担を低減することができる。
【００６３】
更に、この音声対話装置１，４０によれば、音声認識部２２により音声認識をして取得したテキストから話者認識部２にてマッチングを行う音声信号を話者認識用ユーザ音声データベース１１から取得するので、入力した音声信号と話者認識用ユーザ音声データベース１１に記憶されている全部の音声信号とのマッチングをする必要なく、話者認識部２での演算量を削減することができる。
【００６４】
更にまた、この音声対話装置１，４０によれば、ユーザの発話開始を音声信号のレベルからのみ判断しても良く、更には、顔画像認識部４２によりユーザの顔画像の口部分の動きを検出してユーザの発話開始を判断しても良い。これにより、ユーザの発話開始及び発話終了を正確に判断して、タイミングよく音声認識及び話者認識を開始及び終了することができ、音声対話装置１，４０の周囲が騒音のある環境であっても音声認識及び話者認識の精度を向上させると共に誤認識を少なくすることができる。
【００６５】
なお、上述の実施の形態は本発明の一例である。このため、本発明は、上述の実施形態に限定されることはなく、この実施の形態以外であっても、本発明に係る技術的思想を逸脱しない範囲であれば、設計等に応じて種々の変更が可能であることは勿論である。
【００６６】
すなわち、上述した一例では、音声対話装置１，４０の具体的な使用例として病院の受付に適用した場合について説明したが、これに限らず、商品紹介や説明エージェントヘの応用、コンビニエンスストアにおけるレジ・エージェントヘの応用、携帯電話での話者照合による電子決裁システムヘの応用などに適用しても、上述と同様の効果を発揮できることは勿論である。
【００６７】
【発明の効果】
本発明によれば、音声から話者認識をし、ユーザ情報に応じて対話シナリオを選択して対話シナリオに従って音声対話をするので、ユーザに応じて対話内容を変更することができ、既に登録済のユーザであるか、新規のユーザであるかの話者認識を正確にして、話者に応じた対話を実現することができる。
【図面の簡単な説明】
【図１】本発明を適用した音声対話装置の構成を示すブロック図である。
【図２】本発明を適用した他の音声対話装置の構成を示すブロック図である。
【図３】本発明を適用した音声対話装置の具体的な処理を説明するためのフローチャートである。
【符号の説明】
１，４０　音声対話装置
２　話者認識部
３　音声対話部
４　ユーザデータベース記憶部
５　音声記憶部
１１　話者認識用ユーザ音声データベース
１２　個人情報データベース
２１　対話シナリオ記憶部
２２　音声認識部
２３　対話制御部
２４　音声合成部
３１　新規ユーザ向け対話シナリオ
３２　登録ユーザ向け対話シナリオ
４１　顔画像認識用データベース
４２　顔画像認識部
４３　ユーザ認識部

[0001]
[Technical field of the invention]
The present invention relates to a voice interaction device and a voice dialogue program that perform speaker recognition on a user when conducting a voice dialogue with the user, and that can change the dialogue content according to the speaker recognition result.
[0002]
[Prior art]
Voice interaction systems that emit speech to a user, accept speech uttered by the user, and thereby realize a dialogue with the user have long been known. Such conventional systems often do not recognize who the current user is, and execute the dialogue in the same way regardless of whether the user is new or has used the system before.
[0003]
In contrast, other conventional voice interaction systems recognize the user by, for example, prompting the user for key input such as a password.
[0004]
[Patent Document 1]
JP-A-63-85698
[0005]
[Problems to be solved by the invention]
However, many of the above-described conventional voice dialogue systems perform a dialogue without recognizing who the current talker is, and therefore do not refer to the unique information or knowledge of the talker. Therefore, in the conventional voice dialogue system, there is a problem in terms of performance and efficiency in that effective dialogue with the user does not progress, and the content of the previous dialogue is repeatedly executed.
[0006]
Further, in another conventional voice interaction system, the user is identified by prompting the user to enter a password. However, there is a problem that an operation burden is imposed on the user when starting the dialogue.
[0007]
The present invention has been proposed in view of the above circumstances. Its object is to provide a voice interaction device and a voice dialogue program that accurately recognize whether the speaker is an already registered user or a new user, and that realize a dialogue suited to the speaker.
[0008]
[Means for Solving the Problems]
According to the present invention, database storage means for storing user information about users is prepared. When a voice dialogue is conducted with a user, the speaker is identified from the input voice signal, the user information corresponding to the identified speaker is read out, and, based on the read-out user information, the dialogue scenario corresponding to the user is selected from among a plurality of dialogue scenarios. The present invention solves the above problems by realizing a dialogue using the dialogue scenario selected for the user.
[0009]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0010]
[Configuration of voice interaction device]
The present invention is applied to, for example, a voice interaction device 1 configured as shown in FIG. 1. In the voice interaction device 1, a voice dialogue program is stored in a storage mechanism (not shown), and a computer including a CPU (Central Processing Unit, not shown) executes the voice dialogue program, thereby configuring the functional modules of the speaker recognition unit 2, the voice dialogue unit 3, and the voice storage unit 5 described later. The contents of the user database storage unit 4 and of the dialogue scenario storage unit 21 in the voice dialogue unit 3 are read by the functional modules configured by executing the voice dialogue program, which also perform information updates, new information registration, and similar processing.
[0011]
The voice interaction device 1 has a sound collecting mechanism (not shown), such as a microphone, facing the front of the device. Speech uttered by the user is input through this mechanism, as an electrical voice signal, to the speaker recognition unit 2 and the voice dialogue unit 3.
[0012]
To manage information about users, the voice interaction device 1 prepares a speaker recognition user voice database 11 and a personal information database 12 in advance and stores them in the user database storage unit 4. The device also prepares in advance a plurality of dialogue scenarios for realizing different dialogues depending on the user, and stores them in the dialogue scenario storage unit 21. In this example, a new-user dialogue scenario 31 and a registered-user dialogue scenario 32 are prepared.
[0013]
The user database storage unit 4 consists of a recording medium capable of storing a large amount of data, such as a hard disk. The speaker recognition user voice database 11 stored in it holds, as voice signals, the voices of users who have used the voice interaction device 1 in the past. Each voice signal in the speaker recognition user voice database 11 is registered and stored in association with the name and other items of personal information held in the personal information database 12.
[0014]
In this example, the voice signals stored in the speaker recognition user voice database 11 are utterances of the users' names. The database may hold several voice signals for the same name, but because the frequency characteristics of a voice signal differ from person to person with voice pitch and other factors, the speaker recognition unit 2 described later can still tell the speakers apart.
[0015]
The personal information database 12 stores various types of personal information including names of users who have used the voice interaction device 1 in the past. The various types of personal information are the usage history of the voice interaction device 1 and the purpose of use, and vary depending on the use of the voice interaction device 1. A specific example of the personal information will be described in a specific operation of the voice interaction device 1 described later.
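As a rough illustration of how the two databases could relate to each other, the sketch below models records of the speaker recognition user voice database 11 and the personal information database 12. The class names, field names, and the use of a plain list of floats for the stored voice signal are assumptions made for illustration only; the description does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonalInfo:
    """One record of the personal information database 12 (fields are hypothetical)."""
    user_id: str
    name: str
    usage_history: List[str] = field(default_factory=list)
    purpose: str = ""


@dataclass
class VoiceEntry:
    """One record of the speaker recognition user voice database 11."""
    name_text: str             # recognized text of the spoken name
    voice_signal: List[float]  # stored voice signal or features (assumed format)
    user_id: str               # link to the matching PersonalInfo record


def entries_for_name(voice_db: List[VoiceEntry], name_text: str) -> List[VoiceEntry]:
    """Return every stored voice entry registered under the given name text."""
    return [entry for entry in voice_db if entry.name_text == name_text]


voice_db = [
    VoiceEntry("nishiyama", [0.1, 0.9], "u1"),
    VoiceEntry("nishiyama", [0.8, 0.2], "u2"),  # same name, different person
    VoiceEntry("sekine", [0.5, 0.5], "u3"),
]
personal_db: Dict[str, PersonalInfo] = {
    "u1": PersonalInfo("u1", "Nishiyama", ["2002-09-10"], "consultation"),
    "u2": PersonalInfo("u2", "Nishiyama"),
    "u3": PersonalInfo("u3", "Sekine"),
}
```

Keeping a separate `user_id` link, rather than keying on the name, is one way to accommodate several users who share the same name.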
[0016]
The speaker recognition unit 2 performs speaker recognition using the contents of the speaker recognition user voice database 11 stored in the user database storage unit 4, the voice signal input from outside, and the voice recognition result from the voice dialogue unit 3.
[0017]
When the speaker recognition unit 2 receives the voice recognition result from the voice dialogue unit 3, for example in text form, it first performs speaker identification processing, extracting from the speaker recognition user voice database 11 the voice signals whose associated text is similar to that text. It then performs speaker verification processing, comparing the extracted voice signals of the user candidates with the input voice signal to determine a single speaker from among the candidates.
[0018]
As a result of the speaker recognition processing, the speaker recognition unit 2 sends information identifying the speaker to the voice dialogue unit 3 if a speaker was determined, and sends a notification to that effect if it judged the speaker to be a new user.
[0019]
Here, suppose that N people are extracted as speaker candidates but the utterance actually came from someone outside those N. Speaker identification processing alone would select, as the candidate, whichever of the N has the most similar voice. To be able to determine that the speaker is someone other than the N (or a new person), speaker verification processing is performed after the speaker identification processing that extracts the candidates, and it is judged whether the speaker is one of the N or someone else (or a new person).
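The difference between identification alone and identification followed by verification can be sketched as follows. This is a minimal illustration under stated assumptions: voices are represented as short feature vectors, similarity is measured with cosine similarity, and the acceptance threshold of 0.9 is arbitrary. None of these choices comes from the description, which leaves the matching method open.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(candidates: Dict[str, List[float]], probe: List[float]) -> Optional[str]:
    """Identification only: always returns the closest of the N candidates."""
    if not candidates:
        return None
    return max(candidates, key=lambda uid: cosine(candidates[uid], probe))


def recognize(candidates: Dict[str, List[float]], probe: List[float],
              threshold: float = 0.9) -> Optional[str]:
    """Identification followed by verification.

    Returns None when even the best candidate is not similar enough,
    i.e. the speaker is treated as someone outside the N candidates.
    """
    best = identify(candidates, probe)
    if best is None:
        return None
    return best if cosine(candidates[best], probe) >= threshold else None
```

With candidates `{"u1": [1.0, 0.0], "u2": [0.0, 1.0]}` and a probe of `[1.0, 1.0]`, `identify` still names a candidate, whereas `recognize` rejects the probe because its similarity to either candidate is only about 0.71.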
[0020]
The voice interaction device 1 also includes a voice storage unit 5, which receives the voice recognition result for the voice signal input to the voice dialogue unit 3 and registers that voice signal in the speaker recognition user voice database 11. When the speaker recognition user voice database 11 receives a voice signal from the voice storage unit 5, it associates the signal with the personal information stored in the personal information database 12 so that the signal can be referred to during speaker recognition processing in the speaker recognition unit 2.
[0021]
The voice dialogue unit 3 includes a voice recognition unit 22 that receives the user's voice signal from outside, a dialogue control unit 23, and a voice synthesis unit 24. In the voice dialogue unit 3, when a voice signal is input to the voice recognition unit 22, the voice recognition unit 22 performs voice recognition on it.
[0022]
The voice recognition unit 22 performs voice recognition by comparing the input voice signal with a voice identification database prepared in advance, and sends the voice recognition result to the speaker recognition unit 2, the voice storage unit 5, or the dialogue control unit 23. Voice recognition starts at the point where the sound pressure level has reached a predetermined value or more and has stayed there for a predetermined time or longer. The voice recognition unit 22 then applies a Fourier transform or similar processing to the voice signal input after that start timing to extract voice feature amounts, and performs voice recognition using them.
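The start condition for voice recognition (a level at or above a predetermined value, sustained for a predetermined time) can be sketched as a scan over per-frame levels. The function name, the frame-based representation, and the parameter names are assumptions for illustration; the description does not specify how the level is measured or framed.

```python
from typing import Optional, Sequence


def find_speech_start(levels: Sequence[float], level_threshold: float,
                      min_frames: int) -> Optional[int]:
    """Return the index of the first frame of the earliest run of at least
    `min_frames` consecutive frames whose level is >= `level_threshold`.

    Returns None if no such run exists. The decision itself only becomes
    available at frame (start + min_frames - 1), once the run is long enough.
    """
    run_start = None
    for i, level in enumerate(levels):
        if level >= level_threshold:
            if run_start is None:
                run_start = i  # a new run of loud frames begins here
            if i - run_start + 1 >= min_frames:
                return run_start
        else:
            run_start = None  # run interrupted; short bursts are ignored
    return None
```

Requiring a sustained run rather than a single loud frame is what lets short noise bursts pass without triggering recognition.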
[0023]
When having the speaker recognition unit 2 perform speaker recognition, the voice recognition unit 22 sends it the voice recognition result in text form. Here it sends, as the voice recognition result, the several texts with the highest voice recognition scores (certainty). When speaker recognition by the speaker recognition unit 2 settles on a speaker, the voice recognition unit 22 sends the voice signal from that utterance to the voice storage unit 5. The voice storage unit 5 then registers the voice signal in the speaker recognition user voice database 11 as a new entry, or uses it to update a voice signal already registered.
[0024]
The voice synthesis unit 24 is connected to a sound emitting mechanism such as a speaker, and emits speech of various content toward the user under the control of the dialogue control unit 23. In this example, text that the dialogue control unit 23 generates by referring to the dialogue scenario storage unit 21 and the personal information database 12 is sent to the voice synthesis unit 24, which synthesizes it into the speech uttered to the user.
[0025]
The dialogue control unit 23 selects, from the dialogue scenario storage unit 21, the dialogue scenario corresponding to the speaker according to the speaker recognition result from the speaker recognition unit 2. In doing so, it refers to the personal information database 12 and reads out the dialogue scenario for each user. The dialogue control unit 23 then refers to the personal information database 12 and controls the voice synthesis unit 24 according to the voice recognition result from the voice recognition unit 22 so that speech suited to the speaker is emitted.
[0026]
When the dialogue control unit 23 receives from the speaker recognition unit 2 a speaker recognition result indicating a new user, it reads the new-user dialogue scenario 31 from the dialogue scenario storage unit 21 and has it spoken. When it receives a speaker recognition result indicating a user already registered in the user database storage unit 4, it reads the registered-user dialogue scenario 32 from the dialogue scenario storage unit 21 and has it spoken. The registered-user dialogue scenario 32 may be prepared for each user registered in the user database storage unit 4, or may be prepared per predetermined category or the like.
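The selection rule in this paragraph can be sketched in a few lines. The scenario identifiers, the dictionary-based personal information records, and the optional `"scenario"` key standing in for a per-user or per-category scenario are all hypothetical names chosen for this illustration.

```python
from typing import Dict, Optional

NEW_USER_SCENARIO = "scenario_31_new_user"
REGISTERED_USER_SCENARIO = "scenario_32_registered_user"


def select_scenario(speaker_id: Optional[str],
                    personal_db: Dict[str, dict]) -> str:
    """Choose a dialogue scenario from the speaker recognition result.

    `speaker_id` is None when the speaker was judged to be a new user.
    """
    if speaker_id is None or speaker_id not in personal_db:
        return NEW_USER_SCENARIO
    # A per-user or per-category scenario, if one is recorded, takes priority;
    # otherwise fall back to the generic registered-user scenario.
    return personal_db[speaker_id].get("scenario", REGISTERED_USER_SCENARIO)
```

For example, `select_scenario("u2", {"u2": {"scenario": "visiting"}})` returns `"visiting"`, while an unknown or missing speaker ID falls through to the new-user scenario.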
[0027]
In this voice interaction device 1, when the speaker recognition unit 2 recognizes a registered user, the voice dialogue proceeds using the registered-user dialogue scenario 32 while referring to the personal information database 12. The voice interaction device 1 can thus function as a voice dialogue agent tailored to the individual.
[0028]
That is, when speech from the user is input, the voice recognition unit 22 performs voice recognition to convert the voice signal into text and passes the text to the dialogue control unit 23. The dialogue control unit 23 refers to the personal information database 12 and the dialogue scenario to generate the text with which the voice interaction device 1 should reply to the user. The voice synthesis unit 24 receives this reply text, converts it into speech, and replies to the user.
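One turn of this recognize, generate, synthesize flow can be sketched as a small pipeline. The three stages are passed in as callables, and the `fake_*` functions are stand-ins invented for this example (they simply treat bytes as UTF-8 text); a real device would plug in actual recognition and synthesis components.

```python
from typing import Callable, Dict


def dialogue_turn(voice_signal: bytes,
                  recognize_speech: Callable[[bytes], str],
                  make_reply: Callable[[str, Dict[str, str]], str],
                  synthesize: Callable[[str], bytes],
                  personal_info: Dict[str, str]) -> bytes:
    """Run one turn: voice signal -> text -> reply text -> reply audio."""
    text = recognize_speech(voice_signal)         # role of voice recognition unit 22
    reply_text = make_reply(text, personal_info)  # role of dialogue control unit 23
    return synthesize(reply_text)                 # role of voice synthesis unit 24


# Stand-in components for demonstration only.
def fake_asr(signal: bytes) -> str:
    return signal.decode("utf-8")


def fake_reply(text: str, info: Dict[str, str]) -> str:
    return f"Hello {info['name']}, you said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = dialogue_turn(b"good morning", fake_asr, fake_reply, fake_tts,
                      {"name": "Nishiyama"})
```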
[0029]
Further, in the voice dialogue apparatus 1, when the speaker recognition unit 2 recognizes a new user, the voice dialogue proceeds using the dialogue scenario 31 for a new user. As a result, in the voice interaction device 1, the registered user and the new user are distinguished from each other, and the voice interaction proceeds. Here, in the case of a new user, information on the user is newly registered in the speaker recognition user voice database 11 and the personal information database 12 during or after the voice conversation.
[0030]
[Other Configurations of Voice Dialogue Device]
Next, another voice interaction device 40 to which the present invention is applied will be described with reference to FIG. 2. Parts that are the same as in the voice interaction device 1 shown in FIG. 1 are given the same reference numerals, and their detailed description is omitted.
[0031]
The voice interaction device 40 illustrated in FIG. 2 differs from the voice interaction device 1 illustrated in FIG. 1 in that a face image of a user is captured when recognizing a speaker, and speaker recognition is performed using the face image.
[0032]
The voice interaction device 40 includes, in addition to the components of the voice interaction device 1 shown in FIG. 1, a face image recognition database 41, a face image recognition unit 42, and a user recognition unit 43. It has a camera mechanism (not shown) whose field of view covers the user's standing position, and face image data from this camera mechanism is input to the face image recognition unit 42. The face image is input, for example, at the timing when the user's voice is input to the speaker recognition unit 2 and the speaker recognition unit 2 starts speaker recognition.
[0033]
When face image data is input, the face image recognition unit 42 extracts face feature amounts from it. It then compares and matches the extracted face feature amounts against the face feature amounts of the multiple users registered in the face image recognition database 41, and thereby recognizes a plurality of user candidates for the person currently trying to use the voice interaction device 40. The face image recognition unit 42 creates a matching score for each of these user candidates and sends the scores to the user recognition unit 43 as the face image recognition result. The face image recognition database 41 accumulates the face feature amounts of users who have used the voice interaction device 40 in the past, in association with their personal information.
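Producing matching scores for several user candidates from face feature amounts might look like the sketch below. The feature vectors, the distance-to-score mapping `1 / (1 + distance)`, and the `top_k` cut-off are assumptions for illustration; the description does not say how the face matching score is computed.

```python
import math
from typing import Dict, List, Tuple


def face_candidates(probe: List[float], face_db: Dict[str, List[float]],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every registered user against the probe features and return
    the `top_k` best (user_id, matching_score) pairs, best first."""
    scored = []
    for user_id, features in face_db.items():
        distance = math.dist(probe, features)  # Euclidean distance
        scored.append((user_id, 1.0 / (1.0 + distance)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Returning several scored candidates, rather than a single best match, leaves the final decision to a later stage that can also take the voice-based score into account.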
[0034]
The speaker recognition result is also sent from the speaker recognition unit 2 to the user recognition unit 43. Unlike in the voice interaction device 1, so that user recognition can be performed comprehensively together with the face-image matching score from the face image recognition unit 42, the speaker recognition unit 2 creates a matching score for each user candidate it extracted using the voice signal and sends these scores to the user recognition unit 43.
[0035]
The user recognition unit 43 performs user recognition using the matching score from the speaker recognition unit 2 and the matching score from the face image recognition unit 42. For each user candidate, it performs a composite operation on the two matching scores and determines as the speaker the user whose resulting matching score exceeds a predetermined threshold.
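The description leaves the composite operation unspecified. The sketch below assumes, purely for illustration, a weighted sum of the voice-based score and the face-based score, and picks the highest-scoring candidate whose combined score exceeds the threshold. The weight, the threshold value, and the treatment of a missing face score as 0.0 are all assumptions.

```python
from typing import Dict, Optional


def fuse_scores(voice_scores: Dict[str, float], face_scores: Dict[str, float],
                voice_weight: float = 0.5, threshold: float = 0.7) -> Optional[str]:
    """Combine per-candidate voice and face matching scores.

    Returns the candidate with the highest combined score above `threshold`,
    or None if no candidate exceeds it (e.g. a new user).
    """
    best_user, best_total = None, threshold
    for user_id, voice_score in voice_scores.items():
        face_score = face_scores.get(user_id, 0.0)  # no face match: assume 0.0
        total = voice_weight * voice_score + (1.0 - voice_weight) * face_score
        if total > best_total:
            best_user, best_total = user_id, total
    return best_user
```

A weighted sum is only one possible choice; a product of scores or a learned combiner would fit the same interface, and the threshold would have to be retuned for whichever operation is used.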
[0036]
In such a voice interaction device 40, since the speaker is recognized using both the face image and the voice signal, the speaker recognition rate is improved as compared with the case where the speaker is recognized using only the voice signal.
[0037]
[Specific operation of voice interaction device]
Next, specific voice dialogue processing by the voice interaction device described above will be described with reference to FIG. 3. The description covers the voice dialogue processing performed by the voice interaction device 40 shown in FIG. 2.
[0038]
In this example, the voice interaction device 40 stands in for the reception desk of a hospital. The case described is one in which the voice interaction device 40 conducts a voice dialogue using the new-user dialogue scenario 31 with first-time visitors, and a voice dialogue using the registered-user dialogue scenario 32 with visitors who have used the voice interaction device 40 before.
[0039]
First, when the voice interaction device 40 recognizes that a visitor is present at the user's standing position, the process proceeds to step S1, where the dialogue control unit 23 controls the voice synthesis unit 24 so as to ask "What is your name?", and then proceeds to step S2.
[0040]
In step S2, when the visitor answers the question from step S1 with "I am Nishiyama," the voice recognition unit 22 and the speaker recognition unit 2 acquire the voice signal as digital data via an input device such as a microphone and an A/D converter, and the process proceeds to step S3.
[0041]
In step S3, the voice recognition unit 22 performs voice recognition processing on the voice signal input in step S2, extracting voice features and obtaining "Nishiyama" as text. It sends this text to the speaker recognition unit 2 as a user candidate, and the process proceeds to step S4. In this example, the speaker's name serves as the authentication ID. The voice recognition unit 22 may use HMM (Hidden Markov Model), DP (Dynamic Programming) matching, or another voice recognition method.
[0042]
In step S4, the speaker recognition unit 2 determines whether the user candidate text acquired from the voice recognition unit 22 in step S3 is registered in the user voice database 11 for speaker recognition. At this time, in addition to the text "Nishiyama," the speaker recognition unit 2 searches the user voice database 11 for speaker recognition for other texts similar to "Nishiyama" as user candidates. When the speaker recognition unit 2 determines that a user candidate text exists in the user voice database 11 for speaker recognition, the process proceeds to step S5; when it determines that no such text exists, the process proceeds to step S6.
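The candidate search of step S4 could be sketched as follows. The specification does not say how similarity between names is measured, so the use of Python's `difflib.get_close_matches`, the function name `find_user_candidates`, the cutoff of 0.6, and the sample names are all assumptions.

```python
import difflib

def find_user_candidates(recognized_name, registered_names, cutoff=0.6):
    """Return registered names similar to the recognized name, best match first."""
    return difflib.get_close_matches(
        recognized_name, registered_names, n=5, cutoff=cutoff)

# Hypothetical contents of the user voice database, keyed by name.
registered = ["nishiyama", "nishimura", "sekine", "yamada"]
candidates = find_user_candidates("nishiyama", registered)

# Step S4 branches to S5 when a candidate exists and to S6 otherwise.
next_step = "S5" if candidates else "S6"
```

Returning a ranked list rather than a single name leaves room for the later steps to try each candidate in turn.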
[0043]
In step S5, the speaker recognition unit 2 reads out, from the user voice database 11 for speaker recognition, the voice signal associated with a text (name) similar to the text (name) obtained in step S3, and the process proceeds to step S7.
[0044]
In step S7, the speaker recognition unit 2 matches the voice signal obtained in step S2 against the voice signal read out in step S5, creates a matching score Ms1, and sends it to the user recognition unit 43, and the process proceeds to step S8.
[0045]
In step S8, the user recognition unit 43 performs a compound operation on the matching score Ms1 based on the voice signal, received in step S7, and the matching score Ms2 based on the face image, and creates a total score Mss. The process then proceeds to step S9.
[0046]
Steps S1 to S3 above describe the case in which the speaker recognition unit 2 acquires the text corresponding to a voice signal. In parallel with the processing of steps S1 to S3, user candidates are also extracted using a face image. In the voice interaction device 40, the face image recognition unit 42 receives the face image, extracts its feature quantities, compares the extracted facial feature quantities with those stored in the face image recognition database 41, and sends a matching score Ms2 for each of a plurality of user candidates to the user recognition unit 43.
[0047]
In step S9, the user recognition unit 43 determines whether the total score Mss calculated in step S8 is larger than a preset threshold. The threshold is set in advance to the matching score at which a registered speaker is considered identified when the compound operation is performed on the matching score Ms1 based on the voice signal and the matching score Ms2 based on the face image. This threshold varies with the methods used to calculate the matching score Ms1 in the speaker recognition unit 2, the matching score Ms2 in the face image recognition unit 42, and the total score Mss, as chosen when the system of the voice interaction device 40 is designed.
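The compound operation and the threshold test of steps S8 and S9 might look like the sketch below. The specification does not define the compound operation, so the weighted sum, the equal weights, the score range of 0 to 1, and the threshold of 0.7 are assumptions chosen only to make the example concrete.

```python
def total_score(ms1, ms2, voice_weight=0.5):
    """Combine the voice score Ms1 and the face score Ms2 into a total score Mss.

    A weighted sum is assumed here; the source leaves the operation open.
    """
    return voice_weight * ms1 + (1.0 - voice_weight) * ms2

def is_registered_speaker(ms1, ms2, threshold=0.7):
    """Step S9: True when Mss is larger than the preset threshold."""
    return total_score(ms1, ms2) > threshold
```

Because the threshold depends on how Ms1, Ms2, and Mss are computed, changing the weights or the score scale would require choosing a new threshold.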
[0048]
If it is determined in step S9 that the total score Mss is not larger than the threshold, the process proceeds to step S10 in order to determine whether the visitor is a new user or another registered speaker. If it is determined that the total score Mss is larger than the threshold, the registered speaker is regarded as identified, and the process proceeds to step S12.
[0049]
In step S10, the user recognition unit 43 determines whether another text remains as a speaker candidate. If it determines that another text remains, the process returns to step S5, excluding the texts already processed in the preceding steps S5 to S8. If it determines that no other text remains, the process returns to step S4.
[0050]
Thus, in the voice interaction device 40, the processing of step S5 and steps S7 to S10 is repeated as long as a text acquired by the speaker recognition unit 2 remains as a user candidate. When it is determined that no user candidate text remains, the process returns to step S4 and proceeds from step S4 to step S6.
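The repetition of step S5 and steps S7 to S10 can be pictured as a loop over the candidate names. In this sketch, `identify_user`, the `score_candidate` callable (standing in for fetching the stored voice, matching, and combining scores), and the sample scores are hypothetical.

```python
def identify_user(candidates, score_candidate, threshold):
    """Try each candidate in turn (step S5 and steps S7 to S10).

    Returns the first candidate whose total score exceeds the threshold,
    or None once every candidate has been tried (the flow then reaches S6).
    """
    for name in candidates:              # names already tried are never revisited
        mss = score_candidate(name)      # S5, S7, S8: read voice, match, combine
        if mss > threshold:              # S9: compare Mss with the threshold
            return name                  # registered speaker found, go to S12
    return None                          # no candidate left, go to S6

# Hypothetical total scores Mss for two similar names.
scores = {"nishiyama": 0.55, "nishimura": 0.82}
user = identify_user(["nishiyama", "nishimura"], scores.get, threshold=0.7)
```

Here the first candidate fails the threshold, so the loop moves on to the second, mirroring the return from step S10 to step S5.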
[0051]
In step S6, since repeating the processing of step S5 and steps S7 to S10 has left no user candidate text (in this example, "Nishiyama" and texts similar to it), it is confirmed whether the visitor is a new visitor. If the visitor is not a new visitor, the process returns to step S1. If the visitor is a new visitor, a new speaker ID is registered in the personal information database 12, and the process proceeds to step S13.
[0052]
On the other hand, when the user recognition unit 43 determines that there is a user candidate text whose total score Mss exceeds the threshold, in step S12 the speaker recognition unit 2 sends the text of that user candidate to the dialogue control unit 23. The dialogue control unit 23 then refers to the personal information database 12, reads the personal information corresponding to the text of the user candidate, and the process proceeds to step S13.
[0053]
In step S13, the dialogue control unit 23 selects and reads out either the new user dialogue scenario 31 or the registered user dialogue scenario 32 from the dialogue scenario storage unit 21. When personal information has been obtained in step S12, the dialogue control unit 23 reads out the registered user dialogue scenario 32; when a new speaker ID has been obtained in step S11, it reads out the new user dialogue scenario 31. In either case, the process then proceeds to step S14.
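The selection in step S13 reduces to a branch on whether personal information was found for the speaker. The sketch below assumes a dictionary as a stand-in for the personal information database 12; the names `select_scenario`, `personal_db`, and the record fields are hypothetical.

```python
NEW_USER_SCENARIO = "dialogue scenario 31 (new user)"
REGISTERED_USER_SCENARIO = "dialogue scenario 32 (registered user)"

def select_scenario(personal_info):
    """Step S13: choose a scenario from whether personal information was found."""
    if personal_info is None:
        return NEW_USER_SCENARIO
    return REGISTERED_USER_SCENARIO

# Hypothetical stand-in for the personal information database 12.
personal_db = {"nishiyama": {"visits": 3}}

scenario_for_known = select_scenario(personal_db.get("nishiyama"))
scenario_for_unknown = select_scenario(personal_db.get("sekine"))
```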
[0054]
In step S14, the dialogue control unit 23 sends text data to the voice synthesis unit 24 so that speech is synthesized according to the dialogue scenario selected in step S13. In step S15, the user's voice responding to the speech emitted in step S14 is input, and the voice recognition unit 22 performs voice recognition on it.
[0055]
By executing steps S14 and S15 with the new user dialogue scenario 31, the dialogue control unit 23 obtains, through voice dialogue, personal information about the visitor such as name, date of birth, address, contact information, and history. When the registered user dialogue scenario 32 is used, on the other hand, the dialogue proceeds with reference to the previous visit record and the like in the personal information database 12, without asking for the same personal information that is requested from a new visitor.
[0056]
In step S16, after steps S14 and S15 are completed, the dialogue control unit 23 determines whether the dialogue using the dialogue scenario has ended. If it determines that the dialogue has not ended, the process proceeds to step S17, and the processing of steps S13 to S15 is repeated with reference to the personal information database 12.
[0057]
On the other hand, if it is determined in step S16 that the dialogue has ended, the contents of the user voice database 11 for speaker recognition and the personal information database 12 are updated to reflect the voice dialogue of steps S14 and S15. That is, when it is determined that the dialogue has ended, the dialogue control unit 23 stores the latest user voice in the personal information database 12 and controls the voice storage unit 5 so that the registered voice is updated to the latest user voice. The dialogue control unit 23 also rewrites the contents of the personal information database 12 so that the user's personal information is updated according to the contents of the dialogue.
[0058]
[Effects of Embodiment]
As described above in detail, the voice interaction devices 1 and 40 to which the present invention is applied recognize, from the input voice and/or face image, whether the user is a registered user or a new user, and can select a dialogue scenario so that the contents of the dialogue differ between registered users and new users. Further, for a registered user, the voice interaction devices 1 and 40 can change the contents of the dialogue according to the personal information by referring to the personal information database 12, and can thereby realize a dialogue better suited to the user.
[0059]
That is, because the voice interaction devices 1 and 40 perform speaker recognition before starting a specific dialogue, the dialogue scenario can be changed for a registered user so that content already covered in a previous dialogue is not repeated. The information and knowledge about the user obtained in the previous dialogue can be reflected in the next dialogue, providing an intelligent function closer to human dialogue.
[0060]
Further, in the voice interaction devices 1 and 40, information identifying a dialogue scenario can be registered in the personal information database 12 in association with the personal information, and a dialogue scenario for each individual can be prepared in the dialogue scenario storage unit 21. In this way, the dialogue control unit 23 can select an appropriate dialogue scenario from the dialogue scenario storage unit 21, based on the personal information corresponding to the speaker, according to the speaker recognition result.
[0061]
Thereby, for example, information such as the number of visits and the purpose of the visit is acquired as personal information, and the dialogue control unit 23 can confirm in steps S14 and S15 whether the visitor has come for a visit this time and, in the next step S13, start a dialogue using a visit dialogue scenario. In addition to this effect, the voice interaction devices 1 and 40 prepare a dialogue scenario for new users, so that a natural dialogue can be realized for a new user as well.
[0062]
Further, according to the voice interaction devices 1 and 40, it is not necessary to prompt for a password or the like when recognizing a speaker, and the burden on the user can be reduced.
[0063]
Further, in the voice interaction devices 1 and 40, the voice signals to be matched in the speaker recognition unit 2 are retrieved from the user voice database 11 for speaker recognition on the basis of the text obtained through voice recognition by the voice recognition unit 22. It is therefore unnecessary to match the input voice signal against all of the voice signals stored in the user voice database 11 for speaker recognition, and the amount of calculation in the speaker recognition unit 2 can be reduced.
[0064]
Furthermore, in the voice interaction devices 1 and 40, the start of the user's utterance may be determined from the level of the voice signal alone. Alternatively, the face image recognition unit 42 may detect the movement of the mouth in the user's face image to determine the start of the user's utterance. This makes it possible to determine the start and end of the user's utterance accurately and to start and end voice recognition and speaker recognition at the right time. It can also improve the accuracy of voice recognition and speaker recognition and reduce erroneous recognition.
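For the signal-level approach only, a minimal sketch of detecting the start of an utterance is given below. The frame size of 160 samples, the mean absolute level as the measure, the threshold of 0.1, and the function name `utterance_start` are assumptions; the specification states only that the level of the voice signal may be used, and the mouth-movement approach is not covered here.

```python
def utterance_start(samples, frame_size=160, level_threshold=0.1):
    """Return the index of the first frame whose mean absolute level
    exceeds the threshold, or None when no frame does."""
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        level = sum(abs(x) for x in frame) / frame_size
        if level > level_threshold:
            return start // frame_size
    return None

# Two silent frames followed by one frame of speech-like signal.
silence = [0.0] * 320
speech = [0.5] * 160
first_active_frame = utterance_start(silence + speech)
```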
[0065]
Note that the above embodiment is one example of the present invention. The present invention is therefore not limited to the above-described embodiment, and various modifications other than this embodiment can be made according to the design and the like without departing from the technical idea of the present invention.
[0066]
That is, the above example describes, as a specific use, the case where the voice interaction devices 1 and 40 are applied to reception at a hospital. However, the present invention is not limited to this. The same effects as described above can of course be achieved when the invention is applied to an agent, to an electronic approval system using speaker verification on a mobile phone, and the like.
[0067]
[Effects of the Invention]
According to the present invention, speaker recognition is performed from voice, a dialogue scenario is selected according to user information, and a voice dialogue is conducted according to that dialogue scenario. It is therefore possible to recognize accurately whether the speaker is a registered user or a new user, and to realize a dialogue suited to the speaker.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a voice interaction device to which the present invention has been applied.
FIG. 2 is a block diagram showing a configuration of another voice interactive device to which the present invention is applied.
FIG. 3 is a flowchart illustrating a specific process of the voice interaction device to which the present invention is applied.
[Explanation of symbols]
1, 40 Voice interaction device
2 Speaker recognition unit
3 Voice dialogue unit
4 User database storage unit
5 Voice storage unit
11 User voice database for speaker recognition
12 Personal information database
21 Dialogue scenario storage unit
22 Voice recognition unit
23 Dialogue control unit
24 Voice synthesis unit
31 Dialogue scenario for new users
32 Dialogue scenario for registered users
41 Face image recognition database
42 Face image recognition unit
43 User recognition unit

Claims

A voice interaction device comprising:
voice input means for receiving a voice from a user and generating a voice signal;
voice output means for emitting a voice to the user;
speaker recognition means for identifying the speaker from the voice signal generated by the voice input means;
database storage means for storing a user database in which user information about the user is stored; and
dialogue control means for reading, from the database storage means, the user information corresponding to the speaker identified by the speaker recognition means, selecting a dialogue scenario corresponding to the user from among a plurality of dialogue scenarios on the basis of the read user information, and controlling the voice output means so as to output a voice according to the selected dialogue scenario.

The voice interaction device according to claim 1, wherein the database storage means stores, as the user database, a user voice database in which voice signals previously input by the voice input means are stored and registered for each user, and a personal information database in which personal information is registered for each user,
the speaker recognition means identifies the speaker by comparing a voice signal registered in the user voice database with the voice signal generated by the voice input means, and the dialogue control means selects a dialogue scenario corresponding to the user by referring to the personal information database on the basis of the speaker identification result.

The voice interaction device according to claim 2, wherein the speaker recognition means performs voice recognition using the voice signal generated by the voice input means, extracts, on the basis of the voice recognition result, the voice signals of user candidates registered in the user voice database, and performs speaker recognition by comparing the extracted voice signals with the voice signal generated by the voice input means.

The voice interaction device according to claim 1, wherein the speaker recognition means determines, as a result of the speaker recognition, whether the user is a registered user or a new user,
and the dialogue control means selects a scenario for registered users as the dialogue scenario and conducts the voice dialogue when the speaker recognition means determines that the user is a registered user, and selects a scenario for new users as the dialogue scenario and conducts the voice dialogue when the user is determined to be a new user.

The voice interaction device according to claim 1, further comprising face image capturing means for capturing a face image of the user, wherein the speaker recognition means identifies the speaker by collating a voice signal registered in a user voice database, in which voice signals previously input by the voice input means are stored and registered for each user, with the voice signal generated by the voice input means, identifies the speaker by collating a face image registered in advance in a face image recognition database with the face image captured by the face image capturing means, and recognizes the user on the basis of the speaker identification result using the voice signal and the speaker identification result using the face image.

A voice dialogue program for causing a computer to execute, when conducting a voice dialogue with a user, processing of:
identifying the speaker from an input voice signal;
reading information corresponding to the identified speaker from database storage means that stores user information about the user;
selecting, on the basis of the read user information, a dialogue scenario corresponding to the user from among a plurality of dialogue scenarios; and
outputting a voice according to the selected dialogue scenario.

The voice dialogue program according to claim 6, causing the computer to execute processing of reading a voice signal from a user voice database in which voice signals are registered for each user, and identifying the speaker by comparing the read voice signal with the input voice signal,
and processing of selecting a dialogue scenario corresponding to the user by referring, on the basis of the speaker identification result, to a personal information database in which personal information is registered for each user.

The voice dialogue program according to claim 7, causing the computer to execute processing of performing voice recognition using the input voice signal and extracting, on the basis of the voice recognition result, the voice signals of user candidates registered in the user voice database,
and processing of performing speaker recognition by comparing the extracted voice signals with the input voice signal.

The voice dialogue program according to claim 6, causing the computer to execute processing of determining, by performing speaker recognition, whether the user is a registered user or a new user,
selecting a scenario for registered users as the dialogue scenario and conducting the voice dialogue when the user is determined to be a registered user,
and selecting a scenario for new users as the dialogue scenario and conducting the voice dialogue when the user is determined to be a new user.

The voice dialogue program according to claim 6, causing the computer to execute processing of identifying the speaker by collating a voice signal registered in a user voice database, in which voice signals are stored and registered for each user, with the input voice signal, and identifying the speaker by collating a face image registered in advance in a face image recognition database with a captured face image,
and processing of recognizing the user on the basis of the speaker identification result using the voice signal and the speaker identification result using the face image.