JP2006071936A

JP2006071936A - Dialogue agent

Info

Publication number: JP2006071936A
Application number: JP2004254794A
Authority: JP
Inventors: Takashi Nishiyama; 高史西山
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 2004-09-01
Filing date: 2004-09-01
Publication date: 2006-03-16

Abstract

PROBLEM TO BE SOLVED: To provide a dialogue agent capable of carrying on a natural dialogue that makes a user feel neither incongruity nor unpleasantness, by changing its response to the user according to the ego state.

SOLUTION: A face feeling estimation section 13 estimates a feeling from the expression of the user imaged by a camera 41. From the user's speech input from a microphone 42, a speech feeling estimation section 14 estimates a feeling, a tone estimation section 15 estimates a tone, and a text extraction section 16 extracts a text. An ego-state estimation section 20 combines four kinds of information (the feeling obtained from the expression of the user, the feeling obtained from the speech, the tone, and the text) to estimate an ego-state vector for the utterance of the user. A dialogue control section 30 determines an ego-state vector and a text for responding from the ego-state vector estimated for the utterance of the user, and responds with synthesized speech through a speaker 43.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a dialogue agent that converses with a user in response to the user's voice.

Various dialogue systems that converse with a user in response to the user's voice have been proposed. Such systems are implemented on computers, and they are expected to carry on a natural dialogue like that between humans. For example, a technique has been proposed in which recognition information derived from the user's voice is held separately as static information and dynamic information, and the dynamic information is managed per item to be recognized, so that the content used in the dialogue can be narrowed down quickly (see, for example, Patent Document 1).
Patent Document 1: JP-A-6-208389 (paragraphs 0023-0046, FIG. 3)

The technique described in Patent Document 1 suppresses the sense of incongruity caused by delayed responses by shortening the response time in a dialogue. However, the same utterance content always yields the same response, so only uniform responses are possible. For example, regardless of whether the user is an adult or a child, the dialogue inevitably feels like talking to a machine.

The present invention has been made in view of the above. Its object is to provide a dialogue agent that recognizes ego states when conversing with a user, so that it can converse in an ego state suited to the situation, and that varies its response to the user appropriately according to the ego state, thereby enabling a natural dialogue that the user finds easy to accept and free of incongruity.

The invention of claim 1 comprises voice input means to which a user's voice is input, dialogue processing means that generates text responding to the content of the voice input from the voice input means, and text output means that outputs the text generated by the dialogue processing means to the user. The dialogue processing means comprises: a voice emotion estimation unit that classifies the user's emotion into a plurality of types using prosodic features of the voice input from the voice input means and outputs the result as voice emotion data; a tone estimation unit that classifies the user's tone into a plurality of types using prosodic features of the voice input from the voice input means and outputs the result as tone data; a text extraction unit that extracts a sound string from the voice input from the voice input means and outputs it as text data; image input means that captures the user's face; a facial expression estimation unit that classifies facial expressions from changes over time in the positions of feature points set on each part of the user's face captured by the image input means; a face emotion estimation unit that receives the expressions extracted by the facial expression estimation unit, classifies the user's emotion into a plurality of types using the pattern of change in expression over time, and outputs the result as emotion summary data; an ego state estimation unit that treats the combination of the ego states (a model of the mind) of the two parties to the dialogue, including the direction from speaker to listener, as an ego-state vector, and estimates the ego-state vector of the user's utterance from the set of emotion summary data, voice emotion data, tone data, and text data; and a dialogue control unit that determines, according to preset correspondence rules, the ego-state vector to be used when responding to the user from the ego-state vector estimated by the ego state estimation unit, and automatically determines the text with which to respond to the user from the content of the text data.

With this configuration, four kinds of information are used: the emotion obtained from the user's facial expression, and the emotion, tone, and text obtained from the user's voice. From these, the combination of the user's ego state and the stimulated ego state of the dialogue agent is estimated, together with the direction of the stimulus, as an ego-state vector. This ego-state vector is used to determine both the ego-state vector for responding to the user and the response text. The response therefore varies with the ego-state vector estimated from the user's utterance, enabling a natural dialogue that the user finds easy to accept and that causes no sense of incongruity or discomfort.

In the invention of claim 2, in the invention of claim 1, the ego state estimation unit comprises: an emotion score assignment unit that obtains an emotion ego-state score indicating the likelihood of each candidate ego-state vector estimated from the combination of the emotion summary data and the voice emotion data; a tone score assignment unit that obtains a tone ego-state score indicating the likelihood of each candidate ego-state vector estimated from the combination of the voice emotion data and the tone data; a text score assignment unit that obtains a text ego-state score indicating the likelihood of each candidate ego-state vector estimated from the content of the text data; and a score integration calculation unit. The score integration calculation unit classifies the ego states contained in the candidate ego-state vectors obtained by the three score assignment units by party to the dialogue; for each ego state of each party, obtains as an integrated score (an evaluation value of likelihood) the weighted sum of that candidate's emotion, tone, and text ego-state scores, each multiplied by a weight coefficient; and estimates, for each party, the ego state with the largest integrated score to be that party's ego state in the ego-state vector of the user's utterance.

With this configuration, an ego-state vector is estimated separately from the emotion obtained from the facial expression and voice, from the tone obtained from the voice, and from the text obtained from the voice, and the likelihoods of these ego-state vectors are integrated to obtain the ego-state vector considered most plausible. This raises the accuracy with which the ego-state vector of the user's utterance is determined. The term "likelihood" is used here in the sense of degree of plausibility.
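The score integration of claim 2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the weight values, the table layout, and the function name `integrate_scores` are assumptions made for the example, and the ego-state labels (CP, AC, and so on) follow the abbreviations introduced later in the description.

```python
from collections import defaultdict

# Hypothetical weights per score source:
# (weight for the user's ego state, weight for the agent's ego state).
WEIGHTS = {"emotion": (0.4, 0.4), "tone": (0.4, 0.4), "text": (0.1, 0.2)}


def integrate_scores(emotion, tone, text, weights=WEIGHTS):
    """Return (user_state, agent_state, totals) from three candidate tables.

    Each table maps a candidate (user_state, agent_state) to a pair
    (user_score, agent_score), each score in the range 0..100.
    """
    totals = {"user": defaultdict(float), "agent": defaultdict(float)}
    tables = {"emotion": emotion, "tone": tone, "text": text}
    for source, table in tables.items():
        w_user, w_agent = weights[source]
        for (user_state, agent_state), (user_score, agent_score) in table.items():
            # Accumulate the weighted sum separately for each party.
            totals["user"][user_state] += w_user * user_score
            totals["agent"][agent_state] += w_agent * agent_score
    # For each party, pick the ego state with the largest integrated score.
    best_user = max(totals["user"], key=totals["user"].get)
    best_agent = max(totals["agent"], key=totals["agent"].get)
    return best_user, best_agent, totals


# Made-up candidate scores, for illustration only.
emotion = {("CP", "AC"): (100, 100)}
tone = {("CP", "AC"): (50, 50), ("A", "A"): (50, 50)}
text = {("NP", "FC"): (60, 60), ("CP", "AC"): (40, 40)}
user_state, agent_state, totals = integrate_scores(emotion, tone, text)
print(user_state, agent_state)
```

Keeping the per-party totals separate mirrors the claim's wording that the ego states are classified by party before the maximum is taken.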

In the invention of claim 3, in the invention of claim 2: the emotion ego-state score is a value that gives full marks to the candidate ego-state vector when the emotions indicated by the emotion summary data and the voice emotion data match, and that is divided equally among the candidate ego-state vectors obtained when the emotions do not match; the tone ego-state score is a value that gives full marks to the candidate ego-state vector when the voice emotion data and the tone data are consistent, and that is divided equally among the candidate ego-state vectors when they are inconsistent; and the text ego-state score is a value assigned, when the text data contains a specific accompanying phrase, to the candidate ego-state vectors corresponding to that phrase, with larger values given in descending order of likelihood, such that the text ego-state scores assigned to the candidates corresponding to one accompanying phrase sum to full marks.

With this configuration, the emotion ego-state score, the tone ego-state score, and the text ego-state score can be set relatively simply and appropriately.
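The allocation rules of claim 3 can be illustrated with a small sketch. It assumes full marks of 100 (the description uses integer scores from 0 to 100), a single scalar score per candidate for brevity, and a hypothetical lookup table from emotion labels to candidate ego-state vectors; none of these table entries come from the patent.

```python
FULL_MARKS = 100  # scores are integers in 0..100

# Hypothetical lookup: emotion label -> candidate (user_state, agent_state).
CANDIDATE_FOR_EMOTION = {
    "anger": ("CP", "AC"),
    "happiness": ("FC", "FC"),
    "sadness": ("AC", "NP"),
}


def emotion_ego_state_scores(face_emotion, voice_emotion, table=CANDIDATE_FOR_EMOTION):
    """Full marks when face and voice agree; otherwise an equal split."""
    # dict.fromkeys removes duplicates while preserving order.
    candidates = list(dict.fromkeys([table[face_emotion], table[voice_emotion]]))
    share = FULL_MARKS // len(candidates)
    return {candidate: share for candidate in candidates}


def is_valid_text_allocation(scores):
    """Text scores, listed from most to least likely candidate,
    must not increase and must sum to full marks."""
    return sum(scores) == FULL_MARKS and scores == sorted(scores, reverse=True)


print(emotion_ego_state_scores("anger", "anger"))
print(emotion_ego_state_scores("anger", "happiness"))
print(is_valid_text_allocation([60, 40]))
```

The tone ego-state score would follow the same pattern as `emotion_ego_state_scores`, with "consistent" in place of "match".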

In the invention of claim 4, in the invention of claim 2 or claim 3, the weight coefficients used to obtain the integrated score are such that the sum of the weight coefficient for the emotion ego-state score and the weight coefficient for the tone ego-state score is larger than the weight coefficient for the text ego-state score, and such that, for the text ego-state score, the weight coefficient for the stimulated ego state is larger than the weight coefficient for the user's ego state.

With this configuration, when the ego-state vector of the user's utterance is determined, non-verbal information is given more weight than verbal information: the emotion and tone ego-state scores dominate the text ego-state score. The ego state is thus estimated mainly from the emotion and tone ego-state scores, with the text ego-state score used as a supplement, so that the ego state of the user's utterance can be determined appropriately. Verbal information is considered to express which ego state of the listener the speaker intends to stimulate, rather than the speaker's own ego state, and the weight coefficients for the text ego-state score are set accordingly, as described above.
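The two weight constraints of claim 4 can be written as a simple check. This is a sketch under assumptions: the dictionary layout and the example values are invented, and applying the first constraint separately to each party is one possible reading of the claim.

```python
def satisfies_weight_constraints(weights):
    """weights maps 'emotion', 'tone', 'text' to {'user': w, 'agent': w}."""
    # Constraint 1: emotion weight + tone weight exceeds the text weight
    # (checked per party here, which is an assumption).
    nonverbal_dominates = all(
        weights["emotion"][party] + weights["tone"][party] > weights["text"][party]
        for party in ("user", "agent")
    )
    # Constraint 2: for the text score, the stimulated (agent) ego state
    # is weighted more heavily than the user's ego state.
    text_favours_stimulated = weights["text"]["agent"] > weights["text"]["user"]
    return nonverbal_dominates and text_favours_stimulated


# Hypothetical weight sets, for illustration only.
good = {
    "emotion": {"user": 0.4, "agent": 0.4},
    "tone": {"user": 0.4, "agent": 0.4},
    "text": {"user": 0.1, "agent": 0.2},
}
bad = {
    "emotion": {"user": 0.2, "agent": 0.2},
    "tone": {"user": 0.2, "agent": 0.2},
    "text": {"user": 0.6, "agent": 0.5},
}
print(satisfies_weight_constraints(good), satisfies_weight_constraints(bad))
```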

In the invention of claim 5, in the invention of any of claims 1 to 4, the dialogue control unit has a function of generating prosodic parameters for speech from the determined response ego-state vector and the determined response text, and the text output means comprises a speech synthesis processing unit that generates synthesized speech by applying the prosodic parameters to the response text determined by the dialogue control unit, and voice output means that outputs the synthesized speech generated by the speech synthesis processing unit.

With this configuration, the agent can respond to the user by voice, so the user can converse while doing other work. Dialogue with visually impaired users also becomes possible.

In the invention of claim 6, in the invention of any of claims 1 to 5, the facial expression estimation unit classifies seven types of expression: "no expression", "surprise", "fear", "disgust", "anger", "happiness", and "sadness".

With this configuration, the facial expression estimation unit classifies expressions into the seven types "no expression", "surprise", "fear", "disgust", "anger", "happiness", and "sadness", and these seven types are relatively easy to associate with ego states.

In the invention of claim 7, in the invention of any of claims 1 to 6, there are provided a user database in which features of each user's voice and face are registered in association with the user, and a user recognition unit that identifies the user by checking the features of the user's voice input from the voice input means and the features of the user's face image captured by the image input means against the user database. Attributes of the users identified by the user recognition unit are registered in advance in the dialogue control unit, which uses the user's attributes, in addition to the ego-state vector estimated by the ego state estimation unit and the text data, when determining the ego-state vector and text for responding to the user.

With this configuration, ego states are determined for a specific set of users registered in advance in the user database, so known information about the user can be used when determining the response ego-state vector. Compared with determining ego states for an unspecified large number of users, this raises the likelihood of a response that causes no incongruity or discomfort. By identifying the user it is conversing with, the dialogue agent can also refuse dialogue with unauthorized users.

In the invention of claim 8, in the invention of claim 7, there are added: an ego-state history storage unit that accumulates the ego-state vectors estimated by the ego state estimation unit in association with the user identified by the user recognition unit; an ego-state feature extraction unit that estimates the user's personality from the distribution pattern of the frequencies with which the user's ego-state vectors accumulated in the ego-state history storage unit appear; and an ego-state feature storage unit that stores the personality estimated by the ego-state feature extraction unit in association with the user. The dialogue control unit uses the user's personality stored in the ego-state feature storage unit to determine the ego-state vector and text for responding to the user.

With this configuration, the user's personality can be estimated from the history of how often the user's ego states appear, without a counselor's diagnosis or a self-assessment test. The estimated personality is stored in the ego-state feature storage unit and is used to determine the ego-state vector and text for responding to the user, enabling smooth responses suited to each user. Because the history of the user's ego states and the estimated personality are both stored, a counselor can also use them when counseling the user.
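A rough sketch of the frequency-based profile behind claim 8 is shown below. It simplifies the claim by counting only the user's side of each stored ego-state vector, and it stops at the frequency distribution: how a distribution pattern maps to a personality label is not specified in this section, so the "dominant state" returned here is only a placeholder assumption. The function name and sample history are hypothetical.

```python
from collections import Counter

EGO_STATES = ("CP", "NP", "A", "FC", "AC")


def ego_state_profile(history):
    """history is a list of (user_state, agent_state) pairs for one user.

    Returns the relative frequency of each user ego state and the most
    frequent state.
    """
    if not history:
        raise ValueError("history must contain at least one ego-state vector")
    counts = Counter(user_state for user_state, _agent_state in history)
    total = sum(counts.values())
    profile = {state: counts[state] / total for state in EGO_STATES}
    dominant = max(profile, key=profile.get)
    return profile, dominant


# Made-up history for one identified user.
history = [("CP", "AC"), ("CP", "AC"), ("A", "A"), ("NP", "FC")]
profile, dominant = ego_state_profile(history)
print(profile, dominant)
```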

In the invention of claim 9, in the invention of any of claims 1 to 8, the correspondence rules of the dialogue control unit take the ego state stimulated by the user's utterance as the ego state used in the response, and take the user's ego state at the time of the preceding utterance as the user's ego state to be stimulated by the response.

With this configuration, the ego-state vector with which the user speaks and the ego-state vector with which the dialogue agent responds coincide, enabling a smooth dialogue that causes the user no sense of incongruity or discomfort. In addition, once the ego-state vector of the user's utterance is known, the ego-state vector of the agent's response is uniquely determined, so the correspondence rules are simple.
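The correspondence rule of claim 9 amounts to answering from the ego state that was addressed and directing the answer back to the ego state that spoke, which transactional analysis calls a complementary transaction. A minimal sketch follows; the `Transaction` type and the function name are assumptions made for illustration.

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    speaker_state: str   # ego state the speaker talks from
    listener_state: str  # ego state of the listener that is stimulated


def complementary_response(utterance: Transaction) -> Transaction:
    """Respond from the stimulated ego state, back to the speaker's ego state."""
    return Transaction(
        speaker_state=utterance.listener_state,
        listener_state=utterance.speaker_state,
    )


# The user speaks from CP and stimulates the agent's AC;
# the agent answers from AC and stimulates the user's CP.
user_utterance = Transaction(speaker_state="CP", listener_state="AC")
agent_response = complementary_response(user_utterance)
print(agent_response)
```

Because the rule is a pure swap, the response is fully determined by the utterance, which is the simplicity the paragraph above points out.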

In the invention of claim 10, in the invention of any of claims 1 to 9, there are added a body model expression unit that performs expression accompanied by body motion, and a body expression control unit that converts the ego-state vector and text determined by the dialogue control unit into body motions of the body model expression unit.

With this configuration, the text and ego-state vector determined by the dialogue control unit are reflected in the body motion of the body model expression unit, so the agent can accompany its responses with gestures and hand motions, and messages are conveyed to the user more smoothly.

In the invention of claim 11, in the invention of any of claims 1 to 10, the ego states are classified into the five types of the model of the mind based on transactional analysis: "critical parent", "nurturing parent", "adult", "free child", and "adapted child".

With this configuration, the ego states are the "critical parent", "nurturing parent", "adult", "free child", and "adapted child" of the model of the mind based on transactional analysis, so the ego-state vector for a response can be set relatively easily in accordance with transactional analysis.

According to the configuration of the present invention, four kinds of information are used: the emotion obtained from the user's facial expression, and the emotion, tone, and text obtained from the user's voice. From these, the ego-state vector, which is the combination of the user's ego state and the stimulated ego state, is estimated, and this ego-state vector is used to determine both the ego-state vector for responding to the user and the response text. The response therefore varies with the ego-state vector estimated from the user's utterance, which has the advantage of enabling a natural dialogue that the user finds easy to accept and that causes no sense of incongruity or discomfort.

(Basic operation)
The dialogue agent described below is implemented on a computer and generates response speech with attention to ego states in order to achieve a natural dialogue with the user. Ego states are a model of the mind based on transactional analysis (see, for example, Mineyasu Sugita, "Transactional Analysis", Nihon Bunka Kagakusha, 1985). They are classified into three states: parent (P), adult (A), and child (C). The parent is further divided into the critical parent (CP) and the nurturing parent (NP), and the child into the free child (FC) and the adapted child (AC), giving five ego states in total. The abbreviations CP, NP, A, FC, and AC are used below when describing ego states. In the embodiments described below, the user is assumed to be a human and, as a rule, the user speaks first.

In transactional analysis, a smooth dialogue is established when the ego states of the two parties are in an appropriate relationship. When a speaker speaks in a certain ego state, the utterance stimulates an ego state in the responder. If the relationship between the speaker's ego state and the responder's stimulated ego state is appropriate, the response causes the speaker no sense of incongruity or discomfort, and its content is conveyed efficiently. If the relationship is not appropriate, the response causes the speaker incongruity or discomfort and comes across as overfamiliar or curt. In other words, it is known that a smooth dialogue requires an appropriate relationship between the ego states of the speaker and the responder.

To take ego states into account during a dialogue, the dialogue agent basically operates according to the procedure shown in FIG. 3. When the user's voice is input from the microphone (S1), the user's voice and facial expression are acquired using the microphone and a TV camera (S2), and the combination of the user's ego state and the ego state of the dialogue agent stimulated by the user is estimated (S3). Here, the combination of the ego states of the user and the dialogue agent, including the direction from speaker to listener, is called an "ego-state vector". When the user stimulates the dialogue agent's ego state it is called the stimulated ego-state vector, and when the dialogue agent stimulates the user's ego state it is called the stimulating ego-state vector. Once the stimulated ego-state vector has been estimated, the stimulating ego-state vector, which is the combination of the dialogue agent's ego state and the user's ego state that the agent stimulates, is determined (S4). Response text suited to the stimulating ego-state vector is then generated (S5) and output to the user (S6). In the present invention, when the stimulated ego-state vector is estimated, the combination of information acquired from the user is converted into a score, which is an evaluation value, and the stimulating ego-state vector is determined using the score. Scores are integers from 0 to 100.
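The control flow of steps S1 to S6 can be sketched as a single turn function. Every callable passed in below is a hypothetical stand-in for a component of the agent (capture, estimation, response rule, text generation, output); only the ordering of the steps is taken from the description.

```python
def dialogue_turn(capture, estimate_stimulated, decide_stimulating, generate_text, output):
    """Run one user-utterance / agent-response cycle (steps S1 to S6)."""
    audio, frames = capture()                        # S1, S2: voice and face images
    stimulated = estimate_stimulated(audio, frames)  # S3: user -> agent vector
    stimulating = decide_stimulating(stimulated)     # S4: agent -> user vector
    reply = generate_text(stimulating, audio)        # S5: response text
    output(reply)                                    # S6: present it to the user
    return stimulated, stimulating, reply


# Stub components, for illustration only.
spoken = []
result = dialogue_turn(
    capture=lambda: ("audio-bytes", ["frame-0"]),
    estimate_stimulated=lambda audio, frames: ("CP", "AC"),
    decide_stimulating=lambda vector: vector,
    generate_text=lambda vector, audio: "I understand.",
    output=spoken.append,
)
print(result, spoken)
```

Passing the components in as callables keeps the sketch independent of any particular recognizer or synthesizer.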

An ego-state vector is written in parentheses with the user's ego state on the left and the dialogue agent's ego state on the right. For a stimulated vector (the user stimulates the agent) the two are joined by a right-pointing arrow from the user to the agent, and for a stimulating vector (the agent stimulates the user) they are joined by a left-pointing arrow. For example, when the user's ego state is CP and the dialogue agent's ego state is AC, the stimulated vector is written (CP→AC) and the stimulating vector is written (CP←AC). The scores corresponding to the ego states are likewise written in parentheses with the user's score on the left and the dialogue agent's score on the right, separated by a comma. For example, if the user's score is 50 and the dialogue agent's score is 50, the scores are written (50, 50). When the stimulated ego-state vector is (CP→AC) and its scores are (50, 50), this is written (CP→AC) = (50, 50).
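The notation above maps naturally onto a small data type. The following is one possible representation, not something the patent prescribes; the class name, the `stimulated` flag, and the helper `with_score` are assumptions made for the example.

```python
from dataclasses import dataclass

EGO_STATES = ("CP", "NP", "A", "FC", "AC")


@dataclass(frozen=True)
class EgoStateVector:
    user: str         # user's ego state (written on the left)
    agent: str        # dialogue agent's ego state (written on the right)
    stimulated: bool  # True: user stimulates agent; False: agent stimulates user

    def __post_init__(self):
        if self.user not in EGO_STATES or self.agent not in EGO_STATES:
            raise ValueError("unknown ego state")

    def __str__(self):
        arrow = "→" if self.stimulated else "←"
        return f"({self.user}{arrow}{self.agent})"


def with_score(vector, user_score, agent_score):
    """Format a vector with its (user, agent) score pair."""
    return f"{vector} = ({user_score}, {agent_score})"


print(with_score(EgoStateVector("CP", "AC", stimulated=True), 50, 50))
```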

(Embodiment 1)
FIG. 1 shows the configuration of the dialogue agent of this embodiment. The dialogue agent has a camera 41, a TV camera, as image input means for acquiring the user's facial expression, and a microphone 42 as voice input means for acquiring the user's voice. The field of view of the camera 41 is set so as to image the area around the user's face. The dialogue agent outputs text in response to the user's utterance; this embodiment shows an example in which the text is output both as speech and as characters. Accordingly, as text output means, it has a speaker 43, which is voice output means for outputting synthesized speech generated from the text by a speech synthesis processing unit 17, and a display 44 such as a CRT or liquid crystal display, which is image output means for displaying the text on a screen. The camera 41, microphone 42, speaker 43, and display 44 are connected to dialogue processing means 10, which is implemented by a computer running a suitable program. Through the processing described below, the dialogue processing means 10 generates text that responds to the user's voice input from the microphone 42.

The dialogue processing means 10 basically comprises: means for estimating the stimulated (user-to-agent) ego-state vector from the image captured by the camera 41 and the voice input from the microphone 42; means for determining the stimulating (agent-to-user) ego-state vector from the stimulated ego-state vector; and means for generating response text from the text contained in the voice input from the microphone 42 and the stimulating ego-state vector, and presenting the response text to the user through at least one of the speaker 43 and the display 44.

In this embodiment, the means for estimating the stimulated (user-to-agent) ego-state vector consists of a facial expression estimation unit 11, an expression database 12, a face emotion estimation unit 13, a voice emotion estimation unit 14, a tone estimation unit 15, a text extraction unit 16, and an ego state estimation unit 20. The means for determining the stimulating (agent-to-user) ego-state vector and the means for presenting the response text to the user consist of a dialogue control unit 30 and the speech synthesis processing unit 17.

The camera 41 captures a moving image, which is input to the facial expression estimation unit 11. The facial expression estimation unit 11 sets feature points on parts such as the eyebrows, eyes, and mouth in the images of the user's face captured by the camera 41 (for example, at 30 frames per second), and classifies the facial expression from the changes in the feature point positions over time. Seven types of facial expression are used, as described in Ekman, "Introduction to facial expression analysis" and elsewhere: "no expression", "surprise", "fear", "disgust", "anger", "happiness", and "sadness".

The facial expression estimation unit 11 extracts the pattern of temporal change of each part of the face and collates the extracted pattern with the facial expression database 12. The facial expression database 12 stores patterns of temporal change of the feature points in association with the seven types of facial expression described above, so the facial expression estimation unit 11 classifies the facial expression by pattern matching against the facial expression database 12.
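The collation against the facial expression database 12 can be pictured as nearest-template matching. The sketch below is only an illustration under assumptions: the text says no more than "pattern matching", so the one-dimensional displacement patterns, the contents of `EXPRESSION_DB`, and the squared-distance measure are all hypothetical.

```python
# Hypothetical templates: each expression maps to a short sequence of
# feature-point displacements (e.g. mouth-corner height per frame).
# The real entries of facial expression database 12 are not given in the text.
EXPRESSION_DB = {
    "no expression": [0.0, 0.0, 0.0, 0.0],
    "happiness":     [0.0, 0.3, 0.6, 0.8],
    "sadness":       [0.0, -0.2, -0.4, -0.5],
}

def squared_distance(a, b):
    """Sum of squared differences between two equal-length patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_expression(pattern, db=EXPRESSION_DB):
    """Return the label of the stored pattern closest to the observed one."""
    return min(db, key=lambda label: squared_distance(pattern, db[label]))

observed = [0.0, 0.25, 0.55, 0.9]
print(classify_expression(observed))
```

A real system would compare multi-dimensional trajectories for many feature points, but the lookup structure would be similar.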

The images handled by the facial expression estimation unit 11 are digital images. The camera 41 may output digital images directly, or the facial expression estimation unit 11 may perform analog-to-digital conversion on an analog video signal output from the camera 41.

The facial expression extracted by the facial expression estimation unit 11 is input to the face emotion estimation unit 13. The face emotion estimation unit 13 estimates the user's emotion from the frame-by-frame pattern of change of the facial expression over time. When estimating the emotion from the facial expression, it detects the start and end of the user's utterance, evaluates the consistency of the facial expression between the interval from start to end and the interval after the end, and adopts the result if the emotions estimated in the two intervals show no obvious contradiction. So that the user's emotion can be classified from the face image in real time, the emotions output by the face emotion estimation unit 13 are limited to three types: "calm or anger", "joy", and "cannot be estimated".
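The two-interval consistency check might be sketched as follows. This is a hedged illustration, not the patented procedure: it assumes per-frame labels have already been reduced to the output categories, uses a simple majority vote within each interval, and assumes that a contradiction yields "cannot be estimated" (the text only says the result is adopted when there is no contradiction). The function names are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most frequent label in a list of per-frame emotion labels."""
    return Counter(labels).most_common(1)[0][0]

def summarize_face_emotion(during_utterance, after_utterance):
    """Return d1: the shared emotion if both intervals agree, else 'cannot be estimated'."""
    if not during_utterance or not after_utterance:
        return "cannot be estimated"
    first = majority(during_utterance)
    second = majority(after_utterance)
    return first if first == second else "cannot be estimated"

print(summarize_face_emotion(["joy", "joy", "calm or anger"], ["joy", "joy"]))
```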

Meanwhile, the voice signal output from the microphone 42 is input to the voice emotion estimation unit 14, the tone estimation unit 15, and the text extraction unit 16. Each of these units segments the voice signal and extracts frequency components as feature amounts using a technique such as the FFT. Because this feature extraction is the same in all three units, a dedicated processing unit for feature extraction may be provided and shared by the voice emotion estimation unit 14, the tone estimation unit 15, and the text extraction unit 16, in which case the three units need not perform feature extraction themselves.
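The shared feature extraction described above can be sketched as a single function whose output is handed to all three consumers. This is a minimal sketch under assumptions: the frame length, the non-overlapping segmentation, and the consumer callbacks are hypothetical, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for one frame (an FFT yields the same values faster)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def extract_features(signal, frame_len=8):
    """Segment the signal into non-overlapping frames; return one spectrum per frame."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    return [magnitude_spectrum(signal[i:i + frame_len]) for i in starts]

def run_pipeline(signal, consumers):
    """Compute the features once and pass the same result to every consumer."""
    features = extract_features(signal)
    return {name: fn(features) for name, fn in consumers.items()}

# Test tone with 2 cycles per 8-sample frame; the consumers are placeholders
# standing in for units 14, 15 and 16.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
result = run_pipeline(signal, {
    "voice_emotion": len,
    "tone": len,
    "text": len,
})
print(result)
```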

The voice emotion estimation unit 14 classifies emotion from the prosodic features of the user's speech, using a Gaussian mixture model (GMM) to classify the prosodic features. Three emotions are distinguished: "anger", "joy", and "calm".

The tone estimation unit 15 likewise uses a Gaussian mixture model to classify the tone of the user's speech. The above-mentioned literature gives 16 tones for each of the five ego states, but in this embodiment four tones are associated with each ego state, as shown in Table 1, so that the user's voice is classified into 20 tones. The GMM used to classify the tone has 64 Gaussian mixture components. Table 2 shows the conditions for extracting feature amounts from the voice signal. The training method for the voice emotion estimation unit 14 and the tone estimation unit 15 is described later.
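One common way to classify with GMMs is to keep one mixture per class and choose the class with the highest likelihood. The sketch below assumes that scheme; the text does not state the decision rule. The two-dimensional features, the two-component mixtures, and every parameter value in `MODELS` are hypothetical (the embodiment uses 64 components for tone classification, with trained parameters).

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """log sum_k w_k N(x | mean_k, var_k), computed with log-sum-exp for stability."""
    logs = [math.log(w) + log_gaussian(x, mean, var) for w, mean, var in components]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def classify(x, models):
    """Pick the class whose GMM assigns x the highest likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(x, models[label]))

# Hypothetical models: each component is (weight, mean, variance).
MODELS = {
    "calm":  [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [0.5, 0.2], [1.0, 1.0])],
    "anger": [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 2.5], [1.0, 1.0])],
    "joy":   [(0.5, [3.0, -1.0], [1.0, 1.0]), (0.5, [2.5, -2.0], [1.0, 1.0])],
}
print(classify([3.2, 2.8], MODELS))
```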

The text extraction unit 16 extracts text (a sound sequence) from the user's voice. Well-known techniques using a GMM or a hidden Markov model (HMM) can be used for speech recognition in the text extraction unit 16.

As described above, the face emotion estimation unit 13 extracts one of three emotions from the image of the user captured by the camera 41. The voice emotion estimation unit 14 extracts one of the three emotions "anger", "joy", and "calm" from the user's voice, the tone estimation unit 15 classifies the user's voice into the 20 tones shown in Table 1, and the text extraction unit 16 extracts text from the user's voice. Hereinafter, the output of the face emotion estimation unit 13 is called facial expression summary data d1, the output of the voice emotion estimation unit 14 is called voice emotion data d2, the output of the tone estimation unit 15 is called tone data d3, and the output of the text extraction unit 16 is called text data d4. The facial expression summary data d1, the voice emotion data d2, the tone data d3, and the text data d4 are input to the ego state estimation unit 20, which estimates the stimulated ego state vector.

The ego state estimation unit 20 includes an emotion score assigning unit 21 that converts the combination of the facial expression summary data d1 and the voice emotion data d2 into an emotion ego state score Sa, a tone score assigning unit 22 that converts the combination of the voice emotion data d2 and the tone data d3 into a tone ego state score Sb, and a text score assigning unit 23 that obtains a text ego state score Sc from the content of the text data d4. For a stimulated ego state vector, each of the scores Sa, Sb, and Sc assigns a value from 0 to 100 to the ego state of the user and to the ego state of the dialogue agent; the closer the value is to 100, the more likely that ego state is.

Since there are five ego states, 25 ego state vectors are conceivable. However, consider a dialogue agent such as one that presents cooking recipes, where the user makes requests of or puts questions to the dialogue agent and the agent responds so as not to cause the user discomfort or displeasure. In that case NP and AC cannot occur as the user's ego state when speaking, and CP cannot occur as the dialogue agent's ego state when responding. That is, the user's ego state can be one of three types (CP, A, and FC), and the dialogue agent's ego state can be one of four types (NP, A, FC, and AC). Furthermore, taking into account the ego state vectors for which a complementary transaction pattern holds (a transaction pattern in which both parties to the dialogue have the same ego state) and those for which the commonly occurring transaction patterns described in the above-mentioned literature hold, the stimulated ego state vectors reduce to four types: (A→A), (CP→AC), (FC→FC), and (FC→NP).
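The counting in this paragraph can be checked with a short enumeration. The sketch below is illustrative only; the variable names are hypothetical, and the final four vectors are simply copied from the text rather than derived, since the literature-based selection rule is not reproduced here.

```python
from itertools import product

EGO_STATES = ["CP", "NP", "A", "FC", "AC"]

# All ordered (user state, agent state) pairs: 5 x 5 = 25 ego state vectors.
all_vectors = list(product(EGO_STATES, repeat=2))

# Restrictions stated for the recipe-presenting agent.
user_states = [s for s in EGO_STATES if s not in ("NP", "AC")]   # CP, A, FC
agent_states = [s for s in EGO_STATES if s != "CP"]              # NP, A, FC, AC
candidates = [(u, a) for u, a in all_vectors if u in user_states and a in agent_states]

# The four stimulated ego state vectors the embodiment keeps.
selected = [("A", "A"), ("CP", "AC"), ("FC", "FC"), ("FC", "NP")]

print(len(all_vectors), len(candidates), all(v in candidates for v in selected))
```

The restrictions alone leave 3 × 4 = 12 candidates; the four selected vectors are a subset of those.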

For the user to converse with the dialogue agent, the agent must first estimate which of CP, A, and FC is the user's ego state. It is known that the FC ego state is characterized by traits such as "instinctive", "free expression of emotion", and "excitement", and the CP ego state by traits such as "high-handed", "criticism", "reprimand", and "short-tempered, stubborn old man". The emotion ego state score Sa and the tone ego state score Sb both incorporate the user's emotion obtained from the voice emotion data d2. Viewing the FC ego state in terms of whether it has the characteristics of the three emotions of the voice emotion data d2 ("anger", "joy", and "calm"), there are two cases: FC with the characteristic of "joy" or something similar to "joy", and FC with the characteristic of "anger".

Therefore, when obtaining the emotion ego state score Sa and the tone ego state score Sb, FC is divided into FCa, which has the characteristic of "joy", and FCb, which has the characteristic of "anger". That is, five stimulated ego state vectors are considered: (A→A), (CP→AC), (FCa→FCa), (FCa→NP), and (FCb→FCb). In addition, when the user's emotion obtained from the voice emotion data d2 is "calm", the ego state is taken to be A; when it is "joy", FCa; and when it is "anger", CP or FCb.

For the emotion ego state score Sa, the facial expression summary data d1 takes three values ("calm or anger", "joy", and "cannot be estimated") and the voice emotion data d2 takes three values ("anger", "joy", and "calm"), so there are nine combinations, as shown in Table 3. Ego state vectors are associated with each of the nine combinations of d1 and d2, and an emotion ego state score Sa is defined for each ego state vector. In Table 3, when the emotions indicated by d1 and d2 match, Sa is set to full marks (100 in this embodiment); when they do not match, Sa is set in order of the likelihood of each ego state vector, according to the contents of d1 and d2.

When several ego state vectors can be estimated for one combination of the facial expression summary data d1 and the voice emotion data d2, the emotion ego state scores Sa are apportioned so that their total for that combination equals full marks. When none of the five ego state vectors described above is considered applicable, the ego state vector is set to "unknown" and an emotion ego state score Sa is defined for the unknown vector. Specifically, when d1 is "cannot be estimated", an "unknown" ego state vector is defined regardless of the content of d2, with Sa = (20, 20), and the remaining (80, 80) is distributed among the other possible ego state vectors. When d1 is "joy" and d2 is "anger", the emotions contradict each other, so the ego state vector is "unknown" with Sa = (100, 100).

The tone data d3 used to obtain the tone ego state score Sb associates four tones with each of the five ego states in this embodiment, so 20 tones can be distinguished. However, because the user's ego state when speaking is limited to the three types A, CP, and FC (FCa, FCb) as described above, the number of tones is 12 in total, as shown in Table 4. Furthermore, in this embodiment the tones used to obtain Sb are grouped into five types, as shown in Table 4: those corresponding to CP, those corresponding to A, those corresponding to FC1 and FC3, those corresponding to FC2, and those corresponding to FC4. The tone ego state score Sb is therefore defined for 15 combinations of the voice emotion data d2 and the tone data d3, as shown in Table 4. The letter-number combinations in Table 4 correspond to the letters along the vertical axis and the numbers along the horizontal axis of Table 1; the cell where a letter and a number intersect is the tone. N in Table 4 denotes "no tone". Ego state vectors are associated with each of the 15 combinations of d2 and d3, and a tone ego state score Sb is defined for each ego state vector. In estimating the ego state vector for Sb as well, when none of the five ego state vectors is considered applicable, the ego state vector is set to "unknown" and given an appropriate tone ego state score Sb.

Because this embodiment assumes that the user makes requests of or puts questions to the dialogue agent, the text ego state score Sc is obtained by extracting from the text data d4, as an incidental phrase, the phrase that expresses the request or question, associating ego state vectors with the incidental phrase, and defining a text ego state score Sc for each ego state vector. Fourteen incidental phrases are assumed, as shown in Table 5. (FCb→FCb) is also conceivable as an ego state vector obtained from an incidental phrase in d4, but the example in Table 5 does not include FCb as an ego state.

Tables 3 to 5 are registered in the emotion score assigning unit 21, the tone score assigning unit 22, and the text score assigning unit 23, respectively. Tables 3 to 5 were determined from experimental results, but they may be changed as appropriate depending on the configuration used to extract the facial expression summary data d1, the voice emotion data d2, the tone data d3, and the text data d4, and on the intended use of the dialogue agent.

The ego state estimation unit 20 includes a score integration calculation unit 24 that multiplies the emotion ego state score Sa obtained by the emotion score assigning unit 21, the tone ego state score Sb obtained by the tone score assigning unit 22, and the text ego state score Sc obtained by the text score assigning unit 23 by respective weighting factors and adds the products to obtain a weighted sum. That is, the score integration calculation unit 24 obtains the integrated score SI by the following equation.
SI = w1·Sa + w2·Sb + w3·Sc
Here, w1, w2, and w3 are weighting factors. As shown in Tables 3 to 5, the emotion ego state score Sa, the tone ego state score Sb, and the text ego state score Sc each give a score for both the user and the dialogue agent, so the integrated score SI is obtained for both the user and the dialogue agent. In other words, Sa, Sb, and Sc are each a pair consisting of a user score and a dialogue agent score; when the integrated score SI is obtained, the above equation is evaluated separately for the user scores and for the dialogue agent scores, and the results become the user score and the dialogue agent score of SI. The integrated score SI is an evaluation value of the stimulated ego state vector, and the stimulated ego state vector can be estimated from it.

In Tables 3 to 5, the user's score and the dialogue agent's score are equal for every ego state vector in the emotion ego state score Sa, the tone ego state score Sb, and the text ego state score Sc. By setting the following conditions on the weighting factors w1, w2, and w3, however, the integrated score SI can take different values for the user and the dialogue agent. Here, (u) is appended to the integrated score SI and the weighting factors w1, w2, and w3 for the user, and (a) is appended for the dialogue agent. The integrated score SI can thus be written as (SI(u), SI(a)). The weighting factors w1(u), w2(u), w3(u), w1(a), w2(a), and w3(a) must satisfy the following three conditions.
w1(u) + w2(u) > w3(u)
w1(a) + w2(a) > w3(a)
w3(u) < w3(a)
Weighting factors satisfying these conditions can be set, for example, as w1(u) = w2(u) = w2(a) = w3(a) = 0.4 and w3(u) = w1(a) = 0.2, which gives the following relations.
w1(u) + w2(u) = 0.8 > 0.2 = w3(u)
w1(a) + w2(a) = 0.6 > 0.4 = w3(a)
w3(u) = 0.2 < 0.4 = w3(a)
In this example, w3(u) is made relatively small because the incidental phrase extracted from the text data d4, although it reflects the user's ego state, reflects it to a lesser degree than the emotion does. For the dialogue agent, w1(a) is made relatively small because the ego state stimulated in the dialogue agent by the user is reflected more strongly by the tone data d3 and the text data d4.
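The three conditions on the weighting factors are easy to verify mechanically. The following is a minimal sketch; the function name and the tuple layout `(w1, w2, w3)` are illustrative choices, while the example values are the ones given in the text.

```python
def satisfies_conditions(wu, wa):
    """wu and wa are (w1, w2, w3) for the user and for the dialogue agent."""
    return (
        wu[0] + wu[1] > wu[2]      # w1(u) + w2(u) > w3(u)
        and wa[0] + wa[1] > wa[2]  # w1(a) + w2(a) > w3(a)
        and wu[2] < wa[2]          # w3(u) < w3(a)
    )

W_USER = (0.4, 0.4, 0.2)    # w1(u), w2(u), w3(u)
W_AGENT = (0.2, 0.4, 0.4)   # w1(a), w2(a), w3(a)
print(satisfies_conditions(W_USER, W_AGENT))
```

A check of this kind could guard against weight settings that accidentally let the text score dominate.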

By using data such as Tables 3 to 5, one or more ego state vectors are extracted for one combination of the facial expression summary data d1, the voice emotion data d2, the tone data d3, and the text data d4. For a given combination, more than one value may also be defined for at least one of the emotion ego state score Sa, the tone ego state score Sb, and the text ego state score Sc. When several ego state vectors are obtained in this way, the ego states contained in the vectors are sorted separately for the user and for the dialogue agent, and for each ego state the scores are multiplied by the weighting factors w1(u), w2(u), w3(u), w1(a), w2(a), and w3(a) and added. The resulting weighted sum is the integrated score SI of that ego state for the user or for the dialogue agent. Since SI is obtained per ego state both for the user's ego state and for the ego state stimulated in the dialogue agent, the ego state with the largest integrated score SI(u) is estimated to be the user's ego state, and the ego state with the largest integrated score SI(a) is estimated to be the dialogue agent's ego state.

An example of the procedure for estimating the stimulated ego state vector from the user's utterance follows. Assume that the facial expression summary data d1 is "calm", the voice emotion data d2 is "calm", the tone data d3 is "no tone", and the incidental phrase in the text data d4 is 「〜して」 (a request form, roughly "do ..."). According to Table 3, the combination of d1 = "calm" and d2 = "calm" gives (A→A) = (100, 100). According to Table 4, the combination of d3 = "no tone" and d2 = "calm" gives (A→A) = (100, 100). According to Table 5, when the incidental phrase extracted from d4 is 「〜して」, several ego state vectors are obtained: (A→A) = (50, 50), (CP→AC) = (30, 30), (FCa→FCa) = (10, 10), and (FCa→NP) = (10, 10).

There are five ego states (CP, NP, A, FC, and AC), and since this embodiment divides FC into FCa and FCb, there are six ego states in total. Let the weighting factors be w1(u) = 0.4, w2(u) = 0.4, w3(u) = 0.2, w1(a) = 0.2, w2(a) = 0.4, and w3(a) = 0.4. To distinguish the ego states, <X> is appended to each of the integrated scores SI(u) and SI(a), the emotion ego state score Sa, the tone ego state score Sb, and the text ego state score Sc, where X is the ego state. The integrated scores SI(u)<X> and SI(a)<X> for each ego state of the user and the dialogue agent are then given by the following equations.
SI(u)<X> = 0.4 × Sa<X> + 0.4 × Sb<X> + 0.2 × Sc<X>
SI(a)<X> = 0.2 × Sa<X> + 0.4 × Sb<X> + 0.4 × Sc<X>
In the above example, for the user, Sa<A> = 100, Sb<A> = 100, Sc<A> = 50, Sc<CP> = 30, and Sc<FCa> = 10 + 10 (meaning that there are two ego state vectors in which the user's ego state is FCa). For the dialogue agent, Sa<A> = 100, Sb<A> = 100, Sc<A> = 50, Sc<AC> = 30, Sc<FCa> = 10, and Sc<NP> = 10. All other values of Sa, Sb, and Sc are 0.

Using these values, the integrated scores SI(u) and SI(a) for each of the six ego states, computed separately for the user and the dialogue agent, are as follows.
SI(u)<CP> = 0.4 × 0 + 0.4 × 0 + 0.2 × 30 = 6
SI(u)<NP> = 0
SI(u)<A> = 0.4 × 100 + 0.4 × 100 + 0.2 × 50 = 90
SI(u)<FCa> = 0.4 × 0 + 0.4 × 0 + 0.2 × (10 + 10) = 4
SI(u)<AC> = 0
SI(u)<FCb> = 0
SI(a)<CP> = 0
SI(a)<NP> = 0.2 × 0 + 0.4 × 0 + 0.4 × 10 = 4
SI(a)<A> = 0.2 × 100 + 0.4 × 100 + 0.4 × 50 = 80
SI(a)<FCa> = 0.2 × 0 + 0.4 × 0 + 0.4 × 10 = 4
SI(a)<AC> = 0.2 × 0 + 0.4 × 0 + 0.4 × 30 = 12
SI(a)<FCb> = 0
Taking the maximum of the integrated scores SI(u) and SI(a) for the user and the dialogue agent respectively gives SI(u)<A> = 90 and SI(a)<A> = 80. The user's ego state is therefore A and the dialogue agent's ego state is A, and the stimulated ego state vector can be estimated as (A→A) = (90, 80).
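The worked example can be reproduced with a few lines of code. The sketch below is only an illustration: the dictionary layout and function names are hypothetical, and only the score entries quoted in the example (not the full Tables 3 to 5) are included.

```python
STATES = ["CP", "NP", "A", "FCa", "AC", "FCb"]
W_USER = (0.4, 0.4, 0.2)    # w1(u), w2(u), w3(u)
W_AGENT = (0.2, 0.4, 0.4)   # w1(a), w2(a), w3(a)

# Scores for this example only:
# {(user state, agent state): (user score, agent score)}
SA = {("A", "A"): (100, 100)}
SB = {("A", "A"): (100, 100)}
SC = {("A", "A"): (50, 50), ("CP", "AC"): (30, 30),
      ("FCa", "FCa"): (10, 10), ("FCa", "NP"): (10, 10)}

def per_state(vector_scores):
    """Split vector scores into per-state totals for the user and for the agent."""
    user = dict.fromkeys(STATES, 0)
    agent = dict.fromkeys(STATES, 0)
    for (u_state, a_state), (u_score, a_score) in vector_scores.items():
        user[u_state] += u_score
        agent[a_state] += a_score
    return user, agent

def integrate(weights, sa, sb, sc):
    """SI<X> = w1*Sa<X> + w2*Sb<X> + w3*Sc<X> for every ego state X."""
    w1, w2, w3 = weights
    return {x: round(w1 * sa[x] + w2 * sb[x] + w3 * sc[x], 6) for x in STATES}

(sa_u, sa_a), (sb_u, sb_a), (sc_u, sc_a) = per_state(SA), per_state(SB), per_state(SC)
si_user = integrate(W_USER, sa_u, sb_u, sc_u)
si_agent = integrate(W_AGENT, sa_a, sb_a, sc_a)

# The ego state with the largest integrated score is selected on each side.
user_state = max(si_user, key=si_user.get)
agent_state = max(si_agent, key=si_agent.get)
print((user_state, agent_state), (si_user[user_state], si_agent[agent_state]))
```

Note how the two FCa vectors contribute 10 + 10 on the user side but only 10 each to FCa and NP on the agent side, which is why the per-state totals differ between the two sides.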

When the ego state estimation unit 20 has determined the ego state vector by estimating the user's ego state and the stimulated ego state of the dialogue agent, the ego state vector is passed to the dialogue control unit 30 together with the text data d4. The dialogue control unit 30 determines the stimulating ego state vector from the stimulated ego state vector estimated by the ego state estimation unit 20. The stimulating ego state vector used when the dialogue agent responds to the user is determined in the response strategy determination unit 31 using correspondence rules set in advance in association with stimulated ego state vectors. The dialogue control unit 30 also includes a response text determination unit 32 that determines the text with which to respond to the user, using the ego state vector determined by the response strategy determination unit 31 and the text data d4.

The dialogue control unit 30 includes a scenario database 33 in which the correspondence rules between stimulated ego state vectors and stimulating ego state vectors, shown in Table 6, are collected and registered. When a stimulated ego state vector is given by the ego state estimation unit 20, the response strategy determination unit 31 extracts the stimulating ego state vector using the correspondences of Table 6 stored in the scenario database 33. In Table 6, the arrows connecting ego state vectors indicate a causal relationship: when the stimulated ego state vector in the parentheses to the left of the arrow is given, the stimulating ego state vector in the parentheses to the right of the arrow is used.

In a dialogue agent that proposes cooking recipes, NP and AC do not occur as the user's ego state, so No. 6 and No. 7 in Table 6 are not used. Because the ego state vector may be "unknown", No. 8 is used when none of No. 1 to No. 5 applies. No. 3 corresponds to the case where the user seeks indulgence, and No. 5 corresponds to the case where the user is angry.
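The rule lookup with a fallback for "unknown" vectors might be sketched as follows. This is a hypothetical illustration: the right-hand sides of Table 6 are not reproduced in this section, so `reverse` and `FALLBACK` below are placeholders chosen only to make the example runnable, not the actual contents of the scenario database 33.

```python
def reverse(vector):
    """Placeholder response: swap roles, so the agent answers from the addressed state."""
    user_state, agent_state = vector
    return (agent_state, user_state)

# Hypothetical stand-in for the correspondence rules. The keys are the five
# stimulated ego state vectors named in the text; the values are placeholders,
# NOT the right-hand sides of Table 6.
STIMULATED = [("A", "A"), ("CP", "AC"), ("FCa", "FCa"), ("FCa", "NP"), ("FCb", "FCb")]
RULES = {v: reverse(v) for v in STIMULATED}
FALLBACK = ("A", "A")   # placeholder for the rule applied to "unknown" vectors

def decide_stimulating_vector(stimulated):
    """Look up the response vector; use the fallback rule when nothing matches."""
    return RULES.get(stimulated, FALLBACK)

print(decide_stimulating_vector(("CP", "AC")))
print(decide_stimulating_vector("unknown"))
```

Keeping the rules in a plain mapping mirrors the idea that they are registered data that can be swapped per application rather than fixed logic.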

The response text determination unit 32 automatically determines the agent's response text from the stimulating ego state vector obtained by the response strategy determination unit 31 and the keywords contained in the text data d4. Response keywords are defined to correspond to the keywords in the text data d4 from the user's utterance, and the response text is generated by adding to the response keyword an incidental phrase chosen according to the ego states of the dialogue agent and the user. The response keywords and the incidental phrases determined by the ego state vector are stored in the scenario database 33. That is, the scenario database 33 registers stimulating ego state vectors in association with stimulated ego state vectors, incidental phrases in association with stimulating ego state vectors, and response keywords in association with the keywords expected to appear in the text data d4. In addition, any keyword in the text data d4 that is interpreted as a command from the user is associated with a command for carrying out other work, such as a recipe search. Because the response prosody is determined by the response text and the ego state vector, the response text determination unit 32 also generates prosody parameters for prosody control.
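The assembly of a response from a response keyword and an incidental phrase can be sketched as follows. Every table entry, the fallback sentence, and the name `build_response` are assumptions made for illustration; the actual contents of the scenario database 33 are not given in this text, and prosody parameters are not modelled.

```python
from typing import Dict, List, Optional, Tuple

# All entries below are hypothetical examples, not data from the scenario database 33.
RESPONSE_KEYWORDS: Dict[str, str] = {"curry": "curry recipes"}
INCIDENTAL_PHRASES: Dict[Tuple[str, str], str] = {
    ("A", "A"): "Here are the {kw}.",
    ("NP", "FC"): "Of course! Let me find some nice {kw} for you.",
}
COMMANDS: Dict[str, str] = {"search": "RECIPE_SEARCH"}


def build_response(keywords: List[str],
                   response_vector: Tuple[str, str]) -> Tuple[str, Optional[str]]:
    """Return (response_text, command) for the keywords of one utterance.

    keywords: keywords extracted from the user's utterance (text data d4).
    response_vector: (agent ego state, user ego state) chosen for the reply.
    """
    command: Optional[str] = None
    response_kw: Optional[str] = None
    for kw in keywords:
        if kw in COMMANDS:  # keyword interpreted as a command word
            command = COMMANDS[kw]
        elif kw in RESPONSE_KEYWORDS and response_kw is None:
            response_kw = RESPONSE_KEYWORDS[kw]
    if response_kw is None:
        # Assumed fallback when no registered keyword was recognized.
        return "Could you say that again?", command
    template = INCIDENTAL_PHRASES.get(response_vector, "{kw}")
    return template.format(kw=response_kw), command
```

For example, `build_response(["curry", "search"], ("A", "A"))` yields the neutral sentence together with the hypothetical `RECIPE_SEARCH` command, while the same keywords with `("NP", "FC")` yield the warmer phrasing.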

The response text and prosody parameters determined by the response text determination unit 32 are passed to the speech synthesis processing unit 17, which performs text-to-speech synthesis. The speech synthesis processing unit 17 generates synthesized speech for the response from the text and the prosody parameters and outputs it through the speaker 43 as the spoken response to the user.

A concrete example follows in which the dialogue agent is asked to search for cooking recipes. In this example the dialogue system is built so that, when the user tells the agent the ingredients orally, the agent searches for and presents several candidate dishes that can be made from them, and when the user selects a dish from the candidates, the agent presents the recipe for that dish. Dish names and recipes are presented on the display 44. Instructions from the user to the agent could also be given with an operation unit such as a switch, but here they are given by voice only. The recipe data that the agent searches is either registered in the agent beforehand or obtained through a web search function provided in the agent. Alternatively, the agent may present registered recipes preferentially and obtain recipes by web search only when the user asks for a recipe that is not registered.

Table 7 shows a concrete example of requesting a recipe search from the dialogue agent. In Table 7, “ego state (user)” is the user's ego state and “ego state (agent)” is the dialogue agent's ego state. In the configuration described above, the stimulated ego state vector (the one estimated from the user's utterance) is determined from the facial expression summary data d1, the voice emotion data d2, the tone data d3, and the text data d4. Table 7, however, assumes that the ego state estimation unit 20 has used d1, d2, and d3 to determine that the user's emotion is one of “calm”, “anger”, “joy”, or “unknown”, and that keywords and incidental phrases have been obtained from the text data d4. The ego state estimation unit 20 determines the stimulated ego state vector from the emotion and the incidental phrase, and the dialogue control unit 30 then determines the stimulating ego state vector (the one used in the response) using the correspondence rules of the response strategy determination unit 31. Once the stimulating ego state vector is determined, the response text determination unit 32 determines the response text and the response prosody using the keywords in the text data d4 and the incidental phrases registered in advance.

The dialogue agent is also provided with a user recognition unit 50, which recognizes whether a user has been registered in advance and thereby makes it possible to raise the recognition rate by exploiting the characteristics of a specific user, or to authenticate the user. As shown in FIG. 2, the user recognition unit 50 includes a user recognition processing unit 52, which matches feature values of the user's voice input from the microphone 42 against a user database 51 to extract user candidates, and a face image recognition processing unit 54, which matches feature values of the user's face image captured by the camera 41 against a face image database 53 to extract user candidates. The figure shows the user database 51 and the face image database 53 separately for convenience, but the two can be combined into a single user database. The candidates extracted by the user recognition processing unit 52 and by the face image recognition processing unit 54 are input to a user determination unit 55, which combines them using fuzzy logic or a similar method to determine the user and outputs the identification information assigned to that user. Because candidates are extracted by matching both the voice features and the face image features against the user database, and the user is determined from those candidates, the user recognition rate is increased.
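One way to combine the voice-based and face-based candidates is a fuzzy AND, taking the minimum of the two membership degrees for each candidate. The text only says that fuzzy logic or the like is used, so the min operator, the 0.5 threshold, and the function name `decide_user` are assumptions of this sketch.

```python
from typing import Dict, Optional


def decide_user(voice_scores: Dict[str, float],
                face_scores: Dict[str, float],
                threshold: float = 0.5) -> Optional[str]:
    """Combine voice and face candidate scores and return the best user ID.

    Scores are membership degrees in [0, 1]. A candidate missing from one
    modality gets degree 0 there. Returns None if no candidate reaches the
    threshold, i.e. the user is treated as unregistered.
    """
    combined = {
        user: min(voice_scores.get(user, 0.0), face_scores.get(user, 0.0))
        for user in set(voice_scores) | set(face_scores)
    }
    if not combined:
        return None
    best = max(combined, key=combined.get)
    return best if combined[best] >= threshold else None
```

With the min operator a candidate must be plausible in both modalities, which is what lets the combination reject a voice match that the face image contradicts.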

So that the dialogue control unit 30 can make use of the identification information output by the user recognition unit 50, the scenario database 33 stores information such as the user's preferences in dialogue and the user's personality in association with the identification information. When identification information is obtained from the user recognition unit 50, the dialogue control unit 30 selects the correspondence rules for that user and determines the ego state vector with them. To make this possible, the scenario database 33 associates correspondence rules suited to each user's preferences and personality with the identification information; the dialogue control unit 30 uses the rules associated with the identification information when it is available and default rules when it is not. Obtaining the user's identification information also allows the dialogue agent to address the user by name.

The dialogue agent can express itself with body movements using a virtual body shown on the display 44, and such bodily expression makes its responses feel more familiar. For this purpose, the dialogue agent of this embodiment includes a body model expression unit 18, which holds in a storage device the data groups for representing the virtual body, and a body expression control unit 19, which converts the response content determined by the dialogue control unit 30 (the text, the agent's ego state, and the user's ego state to be stimulated) into body movements of the body model expression unit 18. Converting into body movements here means extracting, from the bodily expression data groups stored in the body model expression unit 18, the data groups needed for the expression and applying them to the virtual body shown on the display 44. This processing makes it possible to control the gestures of the agent shown on the display 44 (raising a hand, reaching out, shaking the head, nodding, and so on) according to the agent's response content.

The voice emotion estimation unit 14 and the tone estimation unit 15 use Gaussian mixture models and therefore must be trained in advance. The training method for the tone estimation unit 15 is described first, followed by that for the voice emotion estimation unit 14. Training the tone estimation unit 15 requires speech data for the 20 tones listed in Table 1. The inventors first collected text from the Web with a search engine, using search phrases that combine a closing quotation bracket 〔」〕 with a conjugated verb form. For the “command” tone, for example, the search phrase 〔」と命令した〕 (a closing bracket followed by “ordered”) retrieves text such as 〔「前へ出ろ」と命令した〕 (roughly, “ordered, ‘Step forward’”), and the quoted part 「前へ出ろ」 can then be used as text in the “command” tone. The extracted text was formatted automatically, and sentences containing inappropriate expressions were removed by hand. For some tones the Web did not yield enough example sentences, so only tones for which 50 or more sentences were collected were used for training; the tones in Table 1 are those tones. To identify each tone, the tone data d3 uses the combination of ego state and number from Table 1 (see Table 4). For example, the “command” tone has tone data d3 of “CP1”, and the “comfort” tone has “NP2”.
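The extraction step for collected text can be sketched with a regular expression that captures the quoted utterance preceding と plus the verb form. The function name `extract_quoted` and the exact pattern are illustrative assumptions; the text does not describe how the inventors' extraction and automatic formatting were implemented.

```python
import re
from typing import List


def extract_quoted(text: str, verb: str) -> List[str]:
    """Return the quoted utterances in `text` that are followed by と + verb.

    For verb="命令した" the pattern matches 「...」と命令した and captures
    the part between the brackets.
    """
    pattern = re.compile("「([^「」]+)」と" + re.escape(verb))
    return pattern.findall(text)
```

Running this over retrieved pages once per verb form would give one candidate sentence list per tone, which would still need the manual screening the text mentions.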

After the text for each tone had been collected, actors read the texts aloud in a simple soundproof room, also acting out the facial expressions as they did so. There were six actors in total: two men in father roles, two women in mother roles, and two women in child roles. Each read 30 sentences for each of the 20 tones in Table 1, and also read, without any tone, five sentences each that were different from the 30 toned sentences. The acoustic analysis parameters are as shown in Table 2.

After the tone estimation unit 15 had been trained by this method, its ability to identify tone was tested with evaluation speech. Each of the six actors uttered five sentences, and the rate at which the tone estimation unit 15 identified the tone correctly was averaged over all actors. Toned speech was identified with the correct tone 49.5% of the time, and untoned speech was recognized as having no tone 90% of the time. Tones with tone data d3 of CP2, FC1, AC1, AC3, and AC4 were identified at a high rate, whereas CP4 and NP3 were identified at a somewhat lower rate.

For training the voice emotion estimation unit 14, the six actors read texts aloud as in the training of the tone estimation unit 15. As noted above, the voice emotion estimation unit 14 distinguishes three user emotions: “joy”, “anger”, and “calm”. Treating “joy” as corresponding to “FC1” in Table 1, “anger” to “FC2”, and “calm” to A1 through A4, the voice emotion estimation unit 14 was trained on the same texts as the tone estimation unit 15.

The trained voice emotion estimation unit 14 was then tested on its ability to identify emotion from evaluation speech. The evaluation speech for “calm” consisted of the “joy” texts read without tone and the “anger” texts read without tone. For “joy” and “anger”, each of the six actors read 5 sentences, for a total of 30 sentences; for “calm”, each of the six actors read 10 sentences, for a total of 60. Table 8 shows the identification results of the voice emotion estimation unit 14. All three emotions were identified with high accuracy, with an average identification rate of 85.6%.

(Embodiment 2)
This embodiment uses the ability of the user recognition unit 50 to identify the user in order to record a history of ego states for each user. It estimates the user's personality from that history and thereby enables responses suited to the user's personality.

As shown in FIG. 4, this embodiment adds an ego state history storage unit 61, which accumulates the user's ego states estimated by the ego state estimation unit 20 in association with the user identified by the user recognition unit 50; an ego state feature extraction unit 62, which estimates the user's personality from the distribution pattern of the appearance frequencies of the ego states stored in the ego state history storage unit 61; and an ego state feature storage unit 63, which stores the estimated personality in association with the user. The ego state history storage unit 61 stores the user's ego states as a time-series history and also stores the appearance frequency of each ego state.

The ego state feature extraction unit 62 consults an ego state feature database 64, in which distribution patterns of ego state appearance frequencies are registered in association with personalities, and estimates the best-matching personality as the user's personality. Specifically, the ego state feature database 64 registers, for each personality, a distribution pattern of the appearance frequencies of the five ego states of transactional analysis (with the frequencies normalized), and the per-user distribution pattern stored in the ego state history storage unit 61 is matched against the registered patterns. The matching is pattern matching: the registered pattern with the highest similarity is selected, and its personality is taken as the user's estimated personality.
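The matching of a user's normalized frequency pattern against registered patterns can be sketched as below. The text does not name the similarity measure, so cosine similarity is an assumption here, and the pattern names and values in the usage note are hypothetical rather than entries of the ego state feature database 64.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

EGO_STATES = ["CP", "NP", "A", "FC", "AC"]  # the five ego states


def normalized_pattern(history: Iterable[str]) -> List[float]:
    """Turn an ego state history into appearance frequencies that sum to 1."""
    counts = Counter(history)
    total = sum(counts[s] for s in EGO_STATES)
    if total == 0:
        raise ValueError("history contains no known ego states")
    return [counts[s] / total for s in EGO_STATES]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def estimate_personality(history: Iterable[str],
                         patterns: Dict[str, List[float]]) -> str:
    """Return the name of the registered pattern most similar to the history."""
    user = normalized_pattern(history)
    return max(patterns, key=lambda name: cosine(user, patterns[name]))
```

For instance, with hypothetical patterns `{"low_NP_FC_high_CP_AC": [0.35, 0.05, 0.2, 0.05, 0.35], "even": [0.2, 0.2, 0.2, 0.2, 0.2]}`, a history dominated by CP and AC selects the first pattern.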

In the ego state feature database 64, for example, a pattern in which NP and FC are low and CP and AC are high is associated with personalities such as “finds self-expression difficult and is prone to depression”, “school-refusal type”, and “has a sufficient sense of responsibility, reality-testing ability, and cooperativeness, but lacks compassion”. The personality estimated by the ego state feature extraction unit 62 is stored in the ego state feature storage unit 63 in association with the user.

When a user's personality is stored in the ego state feature storage unit 63, the dialogue control unit 30 retrieves it from there when conversing with that user. Because the scenario database 33 registers correspondence rules in association with user identification information, the dialogue control unit 30 can select from the scenario database 33 the correspondence rules suited to the user's personality, which results in smooth dialogue matched to that personality.

When the dialogue agent of this embodiment is used in a nursing robot or the like, estimating the user's personality allows the agent to function like a therapist. For example, when the user's personality tends toward pessimism or self-deprecation, the agent can respond in a way that improves the user's psychological state. As noted above, a user whose ego state frequencies show low NP and FC and high CP and AC is known to have a tendency toward depression. If the dialogue with an ordinary user would proceed as in [1] below, then for a user with low NP and FC and high CP and AC the agent can instead use positive expressions to encourage the user, as in [2], in a dialogue that improves the user's psychological state.
[1]
Dialogue agent: It's time for your medicine.
User: All right, I know.
Dialogue agent: Well then, please go ahead.
[2]
Dialogue agent: It's time for your medicine.
User: All right, I know.
Dialogue agent: This will make you better, so cheer up. You're almost there.

As described above, the ego state history storage unit 61 stores each user's ego state history (including the appearance frequencies), and the ego state feature storage unit 63 stores each user's estimated personality. When a user receives counseling, these data can be provided to the counselor as reference material. The other configurations and functions are the same as in Embodiment 1.

(Embodiment 3)
The embodiments above assume a single user. This embodiment describes a configuration in which two users are present and a three-party dialogue including the dialogue agent is possible. As shown in FIG. 5, this embodiment adds a dialogue partner recognition unit 65, which monitors the gaze direction of each person in the image captured by the camera 41 and recognizes the dialogue partner from that direction, and a dialogue recording unit 66, which accumulates dialogue data consisting of the user, the time of the dialogue, the dialogue partner, the text, and the ego state vector. The user and the time of the utterance can be obtained from the user recognition unit 50, the dialogue partner from the dialogue partner recognition unit 65, and the utterance text and the ego states of the user and the dialogue partner from the dialogue control unit 30. In this embodiment the camera 41 must have enough resolution to track the positions of the users' eyes and pupils.

When the face image recognition processing unit 54 recognizes two users in the image captured by the camera 41, the user recognition unit 50 notifies the dialogue control unit 30, the emotion recognition unit 13, the ego state estimation unit 20, and the dialogue partner recognition unit 65, switching them to operation for a dialogue in which two users are present. The user recognition unit 50 then identifies the user who spoke, using the user recognition processing unit 52 on the voice input from the microphone 42, and notifies the same four units of the identified user. In short, in this embodiment the image captured by the camera 41 is used to determine the number of users and each user's dialogue partner, the emotion of the user who spoke is estimated from the voice input from the microphone 42, and the ego state vector between the speaking user and the dialogue partner is estimated. The ego state vector estimated by the ego state estimation unit 20 and the text data d4 obtained by the text extraction unit 16 are passed to the dialogue control unit 30, which records the dialogue data described above in the dialogue recording unit 66.

In this embodiment the scenario database 33 stores three-party dialogue scenarios separately from the two-party scenarios of the preceding embodiments. When the dialogue partner recognition unit 65 notifies the dialogue control unit 30 that two users are present, the dialogue control unit 30 selects a three-party scenario from the scenario database 33. In a three-party scenario the dialogue agent is set to speak only when the dialogue partner includes the agent, and not to speak during dialogue between the users themselves. That is, the dialogue control unit 30 decides whether the agent should speak using the utterance text extracted by the text extraction unit 16, the dialogue partner recognized by the dialogue partner recognition unit 65, and the ego state estimated by the ego state estimation unit 20 (all recorded as dialogue data in the dialogue recording unit 66). When the agent does not speak, processing returns to confirming the number of users from the image captured by the camera 41; when it does speak, it produces an utterance accompanied by bodily expression through the speech synthesis processing unit 17 and the body expression control unit 19, as in the preceding embodiments.

An example dialogue scenario using this embodiment is shown below. In the example, the dialogue agent is installed in a nursing robot and there are two users, a patient and a visitor. In [1] the dialogue agent does not speak; in [2] it does.
[1]
Patient → visitor (FCa → FCa): Anything new at school?
Visitor → patient (FCa → FCa): Same as always.
[2]
Patient → visitor (FCa → NP): I feel awful.
Visitor → patient (NP → FCa): You'll be better soon.
Nursing robot → patient (NP → FCa): That's right, hang in there.

As described above, the configuration of this embodiment allows dialogue even when several users are present. When a nursing robot is equipped with the dialogue agent, for example, a three-party dialogue among the patient, a visitor, and the nursing robot becomes possible. The dialogue control unit 30 refers to the dialogue data in the dialogue recording unit 66 in order to judge whether a response from the agent is being requested; the agent speaks only when the dialogue partner includes the agent.
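The addressing condition alone can be sketched as a check on one dialogue record. This is a simplification: the text says the judgement also draws on the utterance text and the ego state, which are stored in the record here but not used in the decision. The `DialogueRecord` fields, the `AGENT` identifier, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

AGENT = "agent"  # hypothetical identifier for the dialogue agent itself


@dataclass(frozen=True)
class DialogueRecord:
    """One entry of dialogue data, modelled on the items listed in the text."""
    speaker: str               # user identified by voice
    partners: FrozenSet[str]   # who the speaker addresses, from gaze direction
    text: str                  # utterance text
    ego_vector: str            # e.g. "FCa->NP"
    time: float                # time of the utterance


def agent_should_speak(record: DialogueRecord) -> bool:
    """Speak only if someone other than the agent addressed the agent."""
    return record.speaker != AGENT and AGENT in record.partners
```

A record whose partners are only the other user returns `False`, which corresponds to the agent staying silent during dialogue between the users.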

This embodiment illustrates a three-party dialogue between two users and the dialogue agent, but extending the technique to three or more users allows dialogue among even more participants. The other configurations and operations are the same as in Embodiment 1.

(Embodiment 4)
This embodiment judges the user's arousal level and adjusts the speed of the synthesized speech with which the dialogue agent responds. When the user's arousal level is low, a response at a fast tempo may be hard for the user to follow; conversely, when the arousal level is high, a response at a slow tempo may irritate the user. The agent therefore judges the user's arousal level and adjusts the speed of the synthesized speech output from the speaker 43 accordingly.

In the present embodiment, the skin potential level (SPL) is used to determine the user's arousal level. To obtain the skin potential level, as shown in FIG. 6, a pair of electrodes 71 is provided that can contact a site where the amount of sweating readily changes with mental state, such as the user's palm or the sole of the foot. The electrodes 71 are connected to the potential measuring unit 72, which measures the potential difference between the pair of electrodes 71. Because the user may find it bothersome to wear the electrodes 71, it is desirable, for example when the dialogue agent is incorporated into a nursing robot, to have the user touch the electrodes 71 through an action such as a handshake. The potential difference measured by the potential measuring unit 72 is input to the arousal level determination unit 70, which converts the potential difference into an arousal level. In general, the larger the potential difference (SPL), the higher the arousal level, so the arousal level determination unit 70 determines that the user's arousal level is high if the potential difference exceeds an appropriately set threshold. In the present embodiment the arousal level is determined in two levels, high and low, but it may be determined in more levels.
The potential difference measured by the potential measuring unit 72 varies from user to user. The detection results of the potential measuring unit 72 may therefore be stored and accumulated in association with the user recognized by the user recognition unit 50, and the average of the accumulated results for each user may be used as the threshold described above.
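The per-user thresholding described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the class name `ArousalJudge`, the millivolt unit, and the fallback value `default_threshold_mv` (used before any readings have been accumulated for a user) are hypothetical and do not appear in the original text.

```python
from collections import defaultdict
from statistics import mean


class ArousalJudge:
    """Hypothetical stand-in for the arousal level determination unit 70."""

    def __init__(self, default_threshold_mv=10.0):
        # Fallback threshold used until a user has a measurement history.
        # The value is an arbitrary placeholder, not taken from the text.
        self.default_threshold_mv = default_threshold_mv
        self.history = defaultdict(list)  # user id -> accumulated SPL readings

    def threshold_for(self, user_id):
        # The threshold is the average of the readings accumulated for this user.
        readings = self.history[user_id]
        return mean(readings) if readings else self.default_threshold_mv

    def judge(self, user_id, spl_mv):
        # Compare the new reading with the threshold derived from earlier
        # readings, then accumulate the new reading for later thresholds.
        threshold = self.threshold_for(user_id)
        self.history[user_id].append(spl_mv)
        return "high" if spl_mv > threshold else "low"
```

Whether the current reading should itself contribute to the average before the comparison is not specified in the text; the sketch compares first and accumulates afterwards.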

The user's arousal level obtained by the arousal level determination unit 70 as described above is given to the dialogue control unit 30. When the user's arousal level is low (SPL is lower than the threshold), the dialogue control unit 30 adjusts the output speed of the synthesized speech generated by the speech synthesis processing unit 17 so that the response is delivered at a relatively slow tempo. Responding at a slow tempo to a user with a low arousal level makes the content of the text easier to hear. Conversely, when the user's arousal level is high (SPL is higher than the threshold), responding at a slightly faster tempo maintains the user's arousal level and keeps the user from becoming irritated by a slow response. In short, the speed of the synthesized speech is adjusted according to the user's arousal level: a slow tempo makes the response easier to follow when arousal is low, and a fast tempo avoids keeping the user waiting when arousal is high. Other configurations and operations are the same as those of Embodiment 1.
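The tempo adjustment can be illustrated with a small mapping from arousal level to a speech-rate multiplier. The sketch below assumes a normalized arousal level between 0 (lowest) and 1 (highest), which also covers the multi-level case mentioned for this embodiment; the function name `speech_rate_for` and the rate values 0.75 and 1.25 are hypothetical and are not given in the original text.

```python
def speech_rate_for(arousal_level, slow_rate=0.75, fast_rate=1.25):
    """Return a rate multiplier for the synthesized speech.

    arousal_level is assumed to lie in [0, 1]; values outside that
    range are clamped. A two-level judgment maps to 0.0 and 1.0.
    """
    level = min(1.0, max(0.0, arousal_level))
    # Linear interpolation between the slow and fast tempos.
    return slow_rate + (fast_rate - slow_rate) * level
```

With the two-level judgment of the embodiment, a low arousal level would map to `speech_rate_for(0.0)` and a high one to `speech_rate_for(1.0)`.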

(Embodiment 5)
In the present embodiment, the user's biological information is detected to determine whether the ego state estimated by the dialogue agent was appropriate for the dialogue. When the user and the dialogue agent converse, the ego state estimation unit 20 estimates, from the user's utterance, the user's ego state and the ego state stimulated in the dialogue agent. In general, a smooth dialogue is possible if a complementary transaction is carried out using the estimated ego states. Here, a smooth dialogue means one in which the user converses without being upset. In a crossed transaction, where the ego states cross instead of complementing each other, the user may be upset and the dialogue may not proceed smoothly.

In the present embodiment, therefore, the user's instantaneous heart rate (heart rate expressed in beats per minute) is used to judge the user's emotion. It is generally known that the instantaneous heart rate rises when defensive or aggressive emotions such as anger, stress, or fear arise, and falls when the person is calm. The present embodiment uses this physiological phenomenon: the instantaneous heart rate of the user is obtained during the dialogue and its fluctuation is monitored to determine whether a defensive or aggressive emotion has arisen in the user. An ECG is used to detect the instantaneous heart rate, and a pair of electrodes 73 is provided as in Embodiment 4. The electrodes 73 are attached to the user's chest or limbs.

As shown in FIG. 7, the electrodes 73 are connected to the ECG measurement unit 74, which detects the potential difference between the pair of electrodes 73. The ECG measurement unit 74 detects a potential change such as that shown in FIG. 8. This potential change includes a P wave indicating atrial excitation, a QRS complex indicating ventricular depolarization, and a T wave indicating ventricular repolarization. The R-R interval is under the antagonistic control of the cardiac sympathetic and cardiac parasympathetic nerves, and the instantaneous heart rate can be obtained by converting the R-R interval into a heart rate per minute. The output of the ECG measurement unit 74, as shown in FIG. 8, is therefore input to the instantaneous heart rate detection unit 75, which obtains the R-R interval and from it the instantaneous heart rate. The electrodes 73, the ECG measurement unit 74, and the instantaneous heart rate detection unit 75 thus constitute the biological information measuring means. Because the instantaneous heart rate rises above its usual level when the user has a defensive or aggressive emotion, the suitability determination unit 76 compares the instantaneous heart rate obtained by the instantaneous heart rate detection unit 75 with an appropriate threshold to determine whether the user's emotion is defensive or aggressive. In effect, the suitability determination unit 76 determines whether the ego state estimated by the ego state estimation unit 20 was appropriate. When the instantaneous heart rate is lower than the threshold, the suitability determination unit 76 determines that the user's ego state and the stimulated ego state of the dialogue agent estimated by the ego state estimation unit 20 were appropriate; when the instantaneous heart rate is equal to or higher than the threshold, it determines that the estimation result was not appropriate.
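The conversion from R-R interval to instantaneous heart rate, and the threshold comparison that follows it, can be sketched as below. The sketch assumes that R-wave detection has already been done and that the R-peak times are available in seconds; the function names and the choice of seconds as the unit are hypothetical, and the threshold value is left to the caller because the text only calls it "appropriate".

```python
def instantaneous_heart_rates(r_peak_times_s):
    """Convert R-peak timestamps (seconds) into beats per minute.

    One value is produced per R-R interval: rate = 60 / interval.
    """
    rates = []
    for earlier, later in zip(r_peak_times_s, r_peak_times_s[1:]):
        rr_interval = later - earlier
        if rr_interval <= 0:
            raise ValueError("R-peak times must be strictly increasing")
        rates.append(60.0 / rr_interval)
    return rates


def estimate_was_appropriate(heart_rate_bpm, threshold_bpm):
    # Mirrors the rule stated for the suitability determination unit 76:
    # below the threshold -> appropriate; at or above it -> not appropriate.
    return heart_rate_bpm < threshold_bpm
```

For example, R peaks at 0.0 s, 1.0 s, and 1.5 s give R-R intervals of 1.0 s and 0.5 s, that is, 60 and 120 beats per minute.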

As indicated by the broken line in FIG. 7, if the determination result of the suitability determination unit 76 is fed back to the ego state estimation unit 20 and used for its learning, the likelihood of the estimated ego state can be increased; as a result, the dialogue agent's responses become less likely to upset the user, and a smooth dialogue becomes possible. That is, because the suitability determination unit 76 can determine whether the ego state vector estimated by the ego state estimation unit 20 has upset the user, feeding its determination result back into the estimation of the ego state vector makes it possible to estimate an appropriate ego state vector that does not upset the user. Instead of the electrodes 73, an optical sensor that monitors blood flow may be used to detect the instantaneous heart rate (sensors of this kind are used in the heart rate monitors of various exercise machines). Other configurations and operations are the same as those of Embodiment 1.

FIG. 1 is a block diagram showing Embodiment 1.
FIG. 2 is a block diagram of the main part of the same.
FIG. 3 is an operation explanatory diagram of the basic configuration.
FIG. 4 is a block diagram showing Embodiment 2.
FIG. 5 is a block diagram showing Embodiment 3.
FIG. 6 is a block diagram showing Embodiment 4.
FIG. 7 is a block diagram showing Embodiment 5.
FIG. 8 is a diagram showing an example of the ECG used in the same.

Explanation of symbols

10 Dialogue processing means
11 Facial expression estimation unit
13 Face emotion estimation unit
14 Voice emotion estimation unit
15 Tone estimation unit
16 Text extraction unit
17 Speech synthesis processing unit
18 Body model expression unit
19 Body expression control unit
20 Ego state estimation unit
21 Emotion score assigning unit
22 Tone score assigning unit
23 Text score assigning unit
24 Score integration calculation unit
30 Dialogue control unit
41 Camera (image input means)
42 Microphone (voice input means)
43 Speaker (voice output means)
44 Display (image output means)
50 User recognition unit
51 User database
52 User recognition processing unit
53 Face image database
54 Face image recognition processing unit
55 User determination unit
61 Ego state history storage unit
62 Ego state feature extraction unit
63 Ego state feature storage unit
65 Dialogue partner recognition unit
66 Dialogue recording unit
70 Arousal level determination unit
73 Electrode
74 ECG measurement unit
75 Instantaneous heart rate detection unit
76 Suitability determination unit

Claims

A dialogue agent comprising: voice input means for inputting a user's voice; dialogue processing means for generating text that responds to the content of the voice input from the voice input means; text output means for outputting the text generated by the dialogue processing means to the user; a voice emotion estimation unit that classifies the user's emotion into a plurality of types using prosodic features of the voice input from the voice input means and outputs the result as voice emotion data; a tone estimation unit that classifies the user's tone into a plurality of types using prosodic features of the voice input from the voice input means and outputs the result as tone data; a text extraction unit that extracts a sound string from the voice input from the voice input means and outputs it as text data; image input means for capturing an image of the user's face; a facial expression estimation unit that classifies facial expressions based on temporal changes in the positions of feature points set for each part of the user's face captured by the image input means; a face emotion estimation unit that classifies the user's emotion into a plurality of types using the pattern of change over time of the facial expressions extracted by the facial expression estimation unit and outputs the result as emotion summary data; an ego state estimation unit that estimates, from the set of the emotion summary data, the voice emotion data, the tone data, and the text data, an ego state vector based on the user's utterance, the ego state vector comprising a combination of the ego states, which are models of the mind, of both the speaker and the listener together with the direction from the speaker to the listener; and a dialogue control unit that determines, according to a preset correspondence rule, the ego state vector for responding to the user from the ego state vector estimated by the ego state estimation unit, and automatically determines the text with which to respond to the user from the content of the text data.

The dialogue agent according to claim 1, wherein the ego state estimation unit comprises: an emotion score assigning unit that obtains an emotion ego state score indicating the likelihood of each candidate ego state vector estimated from a combination of the emotion summary data and the voice emotion data; a tone score assigning unit that obtains a tone ego state score indicating the likelihood of each candidate ego state vector estimated from a combination of the voice emotion data and the tone data; a text score assigning unit that obtains a text ego state score indicating the likelihood of each candidate ego state vector estimated from the content of the text data; and a score integration calculation unit that classifies, by dialogue participant, the ego states included in the candidate ego state vectors obtained by the emotion score assigning unit, the tone score assigning unit, and the text score assigning unit, obtains for each ego state of each participant an integrated score, serving as a likelihood evaluation value, as the weighted sum of the candidates' emotion ego state score, tone ego state score, and text ego state score each multiplied by its weighting factor, and estimates the ego state having the maximum integrated score for each participant as that participant's ego state in the ego state vector based on the user's utterance.

The emotion ego state score is given to the candidate of ego state vector when the emotions indicated by the emotion summary data and the voice emotion data match, and the candidate of the ego state vector obtained when the emotion does not match The tone ego state score is given to the candidate of ego state vector when there is no contradiction between the voice emotion data and the tone data, and there is a contradiction The text ego state score is a numerical value that is allocated to the ego state vector candidates when the text data includes a specific incidental phrase. A text ego letter assigned to a candidate for an ego state vector corresponding to one supplementary phrase, in which numerical values are assigned in descending order of likelihood. Dialogue agent of claim 2, wherein the sum of the scores is characterized by comprising a scale.

The dialogue agent according to claim 2, wherein the weighting factors for obtaining the integrated score are set such that the sum of the weighting factor for the emotion ego state score and the weighting factor for the tone ego state score is greater than the weighting factor for the text ego state score, and, for the text ego state score, the weighting factor for the stimulated ego state is larger than the weighting factor for the user's ego state.

The dialogue agent according to any one of the preceding claims, wherein the dialogue control unit has a function of generating prosodic parameters of speech from the determined ego state vector for the response and the determined response text, and the text output means comprises a speech synthesis processing unit that generates synthesized speech by applying the prosodic parameters to the response text determined by the dialogue control unit, and voice output means for outputting the synthesized speech generated by the speech synthesis processing unit.

The dialogue agent according to claim 5, wherein the facial expression estimation unit classifies facial expressions into seven types: “no expression”, “surprise”, “fear”, “disgust”, “anger”, “happiness”, and “sadness”.

The dialogue agent according to any one of claims 1 to 6, further comprising: a user database in which features of each user's voice and features of each user's face are registered in association with the user; and a user recognition unit that identifies the user by collating the features of the user's voice input from the voice input means and the features of the user's face image captured by the image input means against the user database, wherein attributes of the user identified by the user recognition unit are registered in advance in the dialogue control unit, and the dialogue control unit uses the user's attributes, in addition to the ego state vector estimated by the ego state estimation unit and the text data, when determining the ego state vector and text for responding to the user.

The dialogue agent according to claim 7, further comprising: an ego state history storage unit that stores and accumulates the ego state vectors estimated by the ego state estimation unit in association with the user identified by the user recognition unit; an ego state feature extraction unit that estimates the user's personality from the distribution pattern of the appearance frequencies of the user's ego state vectors accumulated in the ego state history storage unit; and an ego state feature storage unit that stores the personality estimated by the ego state feature extraction unit in association with the user, wherein the dialogue control unit uses the user's personality stored in the ego state feature storage unit when determining the ego state vector and text for responding to the user.

The dialogue agent according to any one of claims 1 to 8, wherein the correspondence rule of the dialogue control unit sets the ego state stimulated by the user's utterance as the ego state at the time of the response, and sets the user's ego state stimulated at the time of the response as the ego state of the user's immediately preceding utterance.

The dialogue agent according to any one of claims 1 to 9, further comprising: a body model expression unit that performs expression accompanied by body motion; and a body expression control unit that converts the ego state vector and text determined by the dialogue control unit into body motion of the body model expression unit.

The dialogue agent according to any one of claims 1 to 10, wherein the ego state is classified into five types, “critical parent”, “protective parent”, “adult”, “free child”, and “adapted child”, which are models of the mind based on transactional analysis.