JP2018132624A

JP2018132624A - Voice interaction apparatus

Info

Publication number: JP2018132624A
Application number: JP2017025582A
Authority: JP
Inventors: 佐和樋口; Sawa Higuchi; 生聖渡部; Seisho Watabe
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-02-15
Filing date: 2017-02-15
Publication date: 2018-08-23
Anticipated expiration: 2037-02-15
Also published as: JP6766675B2

Abstract

PROBLEM TO BE SOLVED: To estimate emotion while suppressing the influence due to a difference depending on interaction scenes. SOLUTION: A voice interaction apparatus 1 which recognizes voice of a user on the other side of interaction and outputs voice to the user includes: a feature quantity acquisition unit 101 for acquiring feature quantity to be used for estimating emotion from the user; an interaction scene determination unit 104 which determines whether the user or the device is a listener and whether the user or the device is a speaker, on the basis of interaction history of the user and the device; an emotion threshold setting unit 106 which sets a threshold for estimating emotion, in accordance with a result determined by the interaction scene determination unit 104; and an emotion estimation unit 105 which calculates an index value of emotion of the user, on the basis of the feature quantity, and estimates emotion of the user, in accordance with a result of comparing the calculated index value with the threshold set by the emotion threshold setting unit 106. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice interaction apparatus, and more particularly to a voice interaction apparatus that estimates emotions.

Voice interaction apparatuses that converse with a user are known. Regarding such technology, Patent Document 1, for example, discloses a dialogue processing apparatus that responds in accordance with the user's emotion. Based on the content of the user's utterance, this dialogue processing apparatus estimates, according to a fixed criterion, whether the user's emotion is positive, negative, or neutral, and responds according to the estimation result.

As described above, there are techniques that index a user's emotion based on feature quantities such as linguistic information, phonological information, or image information, and estimate the emotion using a fixed criterion regardless of the user.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2006-178063 (JP 2006-178063 A)

However, the degree to which the user's emotion is reflected in the feature quantity used for emotion estimation may differ depending on the interaction scene. For example, when the user is interacting mainly as the speaker, the emotion is reflected in the feature quantity to a greater degree than when the user is interacting mainly as the listener. With the technique described in Patent Document 1, in an interaction scene where the user's emotion is not readily reflected in the feature quantity, the estimation result is frequently neutral, whereas in an interaction scene where the user's emotion is readily reflected in the feature quantity, the estimation result is frequently positive or negative. The estimation results thus vary widely with the interaction scene, and appropriate emotion estimation cannot be performed.

The present invention has been made against the background described above, and an object thereof is to provide a voice interaction apparatus capable of estimating emotion while suppressing the influence of differences between interaction scenes.

One aspect of the present invention for achieving the above object is a voice interaction apparatus that recognizes the speech of a user who is a conversation partner and outputs voice to the user, the apparatus including: a feature quantity acquisition unit that acquires, from the user, a feature quantity used for emotion estimation; an interaction scene determination unit that determines, based on the interaction history between the user and the apparatus, which of the user and the apparatus is interacting as the listener and which is interacting as the speaker; an emotion threshold setting unit that sets a threshold for estimating emotion according to the determination result of the interaction scene determination unit; and an emotion estimation unit that calculates an index value of the user's emotion based on the feature quantity and estimates the user's emotion according to the result of comparing the calculated index value with the threshold set by the emotion threshold setting unit.
According to this voice interaction apparatus, it is determined whether the user is interacting as the listener or as the speaker, and the threshold for emotion estimation is set according to the determination result. This voice interaction apparatus can therefore estimate emotion while suppressing the influence of differences between interaction scenes.

According to the present invention, it is possible to provide a voice interaction apparatus capable of estimating emotion while suppressing the influence of differences between interaction scenes.

FIG. 1 is a diagram illustrating the hardware configuration of the voice interaction apparatus according to the first embodiment.
FIG. 2 is a block diagram illustrating the configuration of the control device of the voice interaction apparatus according to the first embodiment.
FIG. 3 is a flowchart illustrating an example of the operation of the voice interaction apparatus according to the first embodiment.

Embodiments of the present invention will be described below with reference to the drawings. In the drawings, identical elements are denoted by identical reference numerals, and redundant description is omitted where appropriate.

FIG. 1 is a diagram illustrating the hardware configuration of the voice interaction apparatus 1 according to the first embodiment. The voice interaction apparatus 1 converses with a user by voice. Specifically, the voice interaction apparatus 1 converses with the user by outputting voice to the user in response to the user's utterances. That is, the voice interaction apparatus 1 recognizes the speech of the user who is its conversation partner and outputs voice to that user. The voice interaction apparatus 1 can be mounted on, for example, robots such as life support robots and small robots, cloud systems, and smartphones.

The voice interaction apparatus 1 includes a microphone 2 that collects surrounding sound, a speaker 3 that emits sound, and a control device 10. The control device 10 functions, for example, as a computer. The control device 10 is connected to the microphone 2 and the speaker 3 by wire or wirelessly. The voice interaction apparatus 1 may further include a camera (not shown) that captures images of the surroundings, in which case the control device 10 may also be connected to the camera by wire or wirelessly.

The control device 10 includes, as its main hardware components, a CPU (Central Processing Unit) 12, a ROM (Read Only Memory) 14, and a RAM (Random Access Memory) 16. The CPU 12 functions as an arithmetic unit that performs control processing, arithmetic processing, and the like. The ROM 14 stores the control programs, arithmetic programs, and the like executed by the CPU 12. The RAM 16 temporarily stores processing data and the like.

The control device 10 analyzes the user's utterance collected by the microphone 2 and generates a response to the user according to that utterance. The control device 10 then outputs, via the speaker 3, voice corresponding to the generated response (response voice).

FIG. 2 is a block diagram illustrating the configuration of the control device 10 of the voice interaction apparatus 1 according to the first embodiment. The control device 10 includes a feature quantity acquisition unit 101, a speech act type determination unit 102, an interaction history storage unit 103, an interaction scene determination unit 104, an emotion estimation unit 105, an emotion threshold setting unit 106, a response generation unit 107, and a speech synthesis unit 108. The feature quantity acquisition unit 101, the speech act type determination unit 102, the interaction scene determination unit 104, the emotion estimation unit 105, the emotion threshold setting unit 106, the response generation unit 107, and the speech synthesis unit 108 shown in FIG. 2 can be realized, for example, by the CPU 12 executing a program stored in the ROM 14. Alternatively, the necessary programs may be recorded on any nonvolatile recording medium and installed as needed. The interaction history storage unit 103 is realized, for example, by a storage device such as the ROM 14. The components are not limited to being realized by software as described above, and may instead be realized by hardware such as circuit elements.

The feature quantity acquisition unit 101 acquires, from the user who is the conversation partner, the feature quantity used for emotion estimation. In the present embodiment, the feature quantity acquisition unit 101 performs speech recognition on the user's voice data acquired by the microphone 2 and thereby generates text data as the feature quantity. The feature quantity acquisition unit 101 may therefore also be referred to as a speech recognition unit. In the present embodiment, the feature quantity acquired by the feature quantity acquisition unit 101, that is, the text data, is used not only for emotion estimation but also for response generation. The feature quantity acquisition unit 101 also identifies the user's utterance time based on the speech recognition result and stores the identified utterance time in the interaction history storage unit 103 described later.

The speech act type determination unit 102 determines the user's speech act type. The speech act type determination unit 102 distinguishes, for example, five speech act types: question, response, greeting, information provision, and backchannel. The speech act type "question" is an utterance that asks the conversation partner a question. The speech act type "response" is an utterance that answers a question from the conversation partner. The speech act type "greeting" is an utterance that greets the conversation partner. The speech act type "information provision" is an utterance that volunteers a topic, such as an explanation of an event or information about the speaker. The speech act type "backchannel" is an utterance that acknowledges the conversation partner's utterance. Various known methods can be used as the specific determination method of the speech act type determination unit 102. For example, the speech act type determination unit 102 may use machine learning, but the determination method is not limited to this. The speech act type determination unit 102 stores the determined speech act type in the interaction history storage unit 103.

The interaction history storage unit 103 stores the interaction history. In the present embodiment, the interaction history storage unit 103 stores the interaction history between the user and the voice interaction apparatus 1 for the most recent N turns (where N is a predetermined positive integer). That is, it stores the speech act type and utterance time of the user and the speech act type and utterance time of the voice interaction apparatus 1 for the most recent N turns.
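A history limited to the most recent N turns can be sketched with a bounded queue. The following is a minimal illustration, not the apparatus's actual storage layout: the class and field names are hypothetical, and it assumes that one turn consists of one user utterance paired with one apparatus utterance and that utterance time is held in seconds, neither of which the text specifies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speech_act: str    # e.g. "question", "response", "greeting", "information_provision", "backchannel"
    duration_s: float  # utterance time; seconds is an assumed unit


@dataclass(frozen=True)
class Turn:
    # Assumption: one turn = one user utterance plus one apparatus utterance.
    user: Utterance
    device: Utterance


class InteractionHistory:
    """Keeps only the most recent N turns; older turns are discarded."""

    def __init__(self, max_turns: int) -> None:
        if max_turns <= 0:
            raise ValueError("max_turns (N) must be a positive integer")
        # A deque with maxlen drops the oldest entry automatically when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, turn: Turn) -> None:
        self._turns.append(turn)

    def turns(self) -> list:
        return list(self._turns)
```

With `max_turns=2`, adding a third turn silently evicts the first, so later scene determination only ever sees the latest N turns.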

Based on the interaction history between the user and the voice interaction apparatus 1 stored in the interaction history storage unit 103, the interaction scene determination unit 104 determines which of the user and the voice interaction apparatus 1 is currently interacting as the listener and which is interacting as the speaker. That is, the interaction scene determination unit 104 determines the current interaction scene using the utterance times or the speech act types stored in the interaction history storage unit 103.

For example, the interaction scene determination unit 104 calculates the utterance ratio of each party over the most recent N turns from the utterance times of the user and the voice interaction apparatus 1 stored in the interaction history storage unit 103, and determines the interaction scene from the result as follows. Here, the interaction scene determination unit 104 calculates the utterance ratio of the voice interaction apparatus 1 as a value from 0 to 100%. The user's utterance ratio is therefore expressed as 100% - (utterance ratio of the voice interaction apparatus 1).

(1) When the utterance ratio of the voice interaction apparatus 1 is at least the threshold T1 (for example, 60%) and at most 100%, that is, when the user's utterance ratio is at least 0% and at most the threshold T2 (for example, 40%):
The interaction scene determination unit 104 determines that the interaction scene is an information provision scene by the voice interaction apparatus 1. That is, it determines that the current interaction scene is one in which the voice interaction apparatus 1 is interacting as the speaker and the user is interacting as the listener.
(2) When the utterance ratio of the voice interaction apparatus 1 is at least the threshold T3 (for example, 40%) and less than the threshold T1 (for example, 60%), that is, when the user's utterance ratio is higher than the threshold T2 (for example, 40%) and at most the threshold T4 (for example, 60%):
The interaction scene determination unit 104 determines that the interaction scene is a chat scene between the voice interaction apparatus 1 and the user. That is, it determines that the current interaction scene is one in which it cannot be identified which party is mainly the speaker and which is mainly the listener.
(3) When the utterance ratio of the voice interaction apparatus 1 is at least 0% and less than the threshold T3 (for example, 40%), that is, when the user's utterance ratio is higher than the threshold T4 (for example, 60%) and at most 100%:
The interaction scene determination unit 104 determines that the interaction scene is a listening scene by the voice interaction apparatus 1. That is, it determines that the current interaction scene is one in which the voice interaction apparatus 1 is interacting as the listener and the user is interacting as the speaker.
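The three cases above can be sketched as a small classifier over the apparatus's utterance ratio. This is an illustrative sketch under stated assumptions: the function and enum names are hypothetical, the example thresholds T1 = 60% and T3 = 40% from the text are used as defaults, and the ratio is assumed to be computed from total utterance time over the stored turns.

```python
from enum import Enum


class Scene(Enum):
    INFORMATION_PROVISION = "information provision"  # apparatus speaks, user listens
    CHAT = "chat"                                    # no clear speaker or listener
    LISTENING = "listening"                          # user speaks, apparatus listens


def device_utterance_ratio(user_times, device_times) -> float:
    """Return the apparatus's share of total utterance time, in percent (0 to 100)."""
    total = sum(user_times) + sum(device_times)
    if total <= 0:
        raise ValueError("no utterance time recorded")
    return 100.0 * sum(device_times) / total


def determine_scene(device_ratio: float, t1: float = 60.0, t3: float = 40.0) -> Scene:
    """Map the apparatus's utterance ratio to a scene, following cases (1) to (3)."""
    if not 0.0 <= device_ratio <= 100.0:
        raise ValueError("ratio must be within 0 to 100")
    if device_ratio >= t1:   # case (1): T1 <= ratio <= 100
        return Scene.INFORMATION_PROVISION
    if device_ratio >= t3:   # case (2): T3 <= ratio < T1
        return Scene.CHAT
    return Scene.LISTENING   # case (3): 0 <= ratio < T3
```

Because the user's ratio is 100% minus the apparatus's ratio, only the apparatus-side thresholds T1 and T3 are needed; T2 and T4 describe the same boundaries from the user's side.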

In other words, when the determination is made in this way, the interaction scene determination unit 104 determines that the scene is one in which the voice interaction apparatus 1 is interacting as the speaker and the user is interacting as the listener when, over a predetermined number of turns of conversation, the utterance ratio of the voice interaction apparatus 1 exceeds that of the user by at least a predetermined value. Conversely, it determines that the scene is one in which the voice interaction apparatus 1 is interacting as the listener and the user is interacting as the speaker when, over a predetermined number of turns of conversation, the utterance ratio of the voice interaction apparatus 1 is lower than that of the user by at least a predetermined value.

The interaction scene determination unit 104 only needs to determine the interaction scene based on the interaction history, and the above determination method is one example. For instance, the interaction scene determination unit 104 may determine the interaction scene based on the speech act types of both parties over the most recent N turns. Specifically, among the user's speech act types in the most recent N turns, if the proportion of "information provision" is at or above a threshold, the interaction scene determination unit 104 may determine that the scene is a listening scene by the voice interaction apparatus 1, and if the proportion of "backchannel" is at or above a threshold, it may determine that the scene is an information provision scene by the voice interaction apparatus 1. The interaction scene determination unit 104 may also use both the utterance times and the speech act types stored in the interaction history storage unit 103. For example, it may determine the interaction scene based on the utterance ratios of the voice interaction apparatus 1 and the user for a predetermined speech act type.
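The speech-act-based variant can be sketched as follows. The text gives no threshold value and does not say what happens when neither condition holds, so this sketch makes three labeled assumptions: a single hypothetical threshold of 0.5 for both checks, a fallback to the chat scene, and checking "information provision" first when both proportions reach the threshold.

```python
def determine_scene_from_speech_acts(user_acts, threshold: float = 0.5) -> str:
    """Classify the scene from the user's speech act types in the last N turns.

    threshold=0.5 is an assumed value; the source does not state one.
    """
    if not user_acts:
        raise ValueError("no speech acts recorded")
    n = len(user_acts)
    # The user mostly volunteers information, so the apparatus is the listener.
    if user_acts.count("information_provision") / n >= threshold:
        return "listening_scene"
    # The user mostly gives backchannel responses, so the apparatus is the speaker.
    if user_acts.count("backchannel") / n >= threshold:
        return "information_provision_scene"
    # Assumed fallback: neither party is clearly the speaker.
    return "chat_scene"
```

For example, a user history of three "information_provision" utterances and one "question" yields the listening scene, since 3/4 is above the assumed threshold.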

The emotion estimation unit 105 calculates an index value of the user's emotion based on the feature quantity acquired by the feature quantity acquisition unit 101. Specifically, the emotion estimation unit 105 analyzes the text data generated by the feature quantity acquisition unit 101 and calculates an index value indicating the user's emotion according to a predetermined calculation rule. The emotion estimation unit 105 then estimates the user's emotion according to the result of comparing the calculated index value with a threshold. This threshold is set by the emotion threshold setting unit 106 described later and depends on the interaction scene.

The emotion estimation unit 105 only needs to estimate emotion by calculating an emotion index value and comparing it with a threshold, and any known emotion estimation method of this kind can be applied. For example, the technique described in "Emotion Estimation Based on an Emotion-Causing Factor Corpus Acquired from the Web" (Ryoko Tokuhisa et al., Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, March 2008) may be used.

In the present embodiment, the emotion estimation unit 105 calculates, as the index value, a number in the range of -1.0 to +1.0 from the result of analyzing the text data representing the user's utterance. When the analysis result indicates a negative emotion, the index value is negative, and when it indicates a positive emotion, the index value is positive.

In the present embodiment, the emotion estimation unit 105 uses the calculated index value and the thresholds to decide whether the emotion is positive, negative, or neutral. Neutral means an emotion that is neither positive nor negative. For example, suppose the threshold for estimating a positive emotion is +0.5 and the threshold for estimating a negative emotion is -0.5. In this case, the emotion estimation unit 105 decides that the user's emotion is negative when the index value calculated from the feature quantity acquired by the feature quantity acquisition unit 101 is -0.5 or less, and that it is positive when the index value is +0.5 or more. When the index value is greater than -0.5 and less than +0.5, the emotion estimation unit 105 decides that the user's emotion is neutral.
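The comparison rule in this paragraph can be written as a short function. This is a minimal sketch with hypothetical names; the default thresholds are the example values +0.5 and -0.5 from the text, and how the index value itself is computed from the text data is outside its scope.

```python
def estimate_emotion(index: float,
                     positive_threshold: float = 0.5,
                     negative_threshold: float = -0.5) -> str:
    """Classify an emotion index value in [-1.0, +1.0] against two thresholds."""
    if not -1.0 <= index <= 1.0:
        raise ValueError("index value must be within -1.0 to +1.0")
    if index >= positive_threshold:   # at or above the positive threshold
        return "positive"
    if index <= negative_threshold:   # at or below the negative threshold
        return "negative"
    return "neutral"                  # strictly between the two thresholds
```

Both boundaries are inclusive on the emotional side, so an index of exactly +0.5 is classified as positive and exactly -0.5 as negative.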

The emotion threshold setting unit 106 sets the thresholds for estimating emotion according to the determination result of the interaction scene determination unit 104. For example, when the interaction scene determination unit 104 determines that the interaction scene is an information provision scene by the voice interaction apparatus 1, that is, a scene in which the voice interaction apparatus 1 is interacting as the speaker and the user is interacting as the listener, the user's emotion is considered unlikely to be reflected in the feature quantity. In this case, the emotion threshold setting unit 106 therefore sets thresholds that are looser than the predetermined base thresholds so that a positive or negative emotion is estimated more readily.

In contrast, when the interaction scene determination unit 104 determines that the interaction scene is a listening scene by the voice interaction apparatus 1, that is, a scene in which the voice interaction apparatus 1 is interacting as the listener and the user is interacting as the speaker, the user's emotion is considered likely to be reflected in the feature quantity. In this case, the emotion threshold setting unit 106 therefore sets thresholds that are stricter than the base thresholds so that a positive or negative emotion is estimated less readily.

Specifically, when the interaction scene determination unit 104 determines that the interaction scene is an information provision scene by the voice interaction apparatus 1, the emotion threshold setting unit 106, for example, lowers the threshold for estimating a positive emotion by 0.2 from its base value of +0.5 to +0.3, and raises the threshold for estimating a negative emotion by 0.2 from its base value of -0.5 to -0.3. When the interaction scene is determined to be a listening scene by the voice interaction apparatus 1, the emotion threshold setting unit 106, for example, raises the threshold for estimating a positive emotion by 0.2 from its base value of +0.5 to +0.7, and lowers the threshold for estimating a negative emotion by 0.2 from its base value of -0.5 to -0.7. When the interaction scene is determined to be a chat scene between the voice interaction apparatus 1 and the user, the emotion threshold setting unit 106, for example, sets the threshold for estimating a positive emotion to its base value of +0.5 and the threshold for estimating a negative emotion to its base value of -0.5.
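The scene-dependent thresholds in this paragraph amount to a lookup from scene to a pair of values. The sketch below uses the example figures from the text; the scene keys and the function name are hypothetical, and the values are stored as literals rather than computed as base plus or minus 0.2 to avoid floating-point rounding in comparisons.

```python
# (positive threshold, negative threshold) per scene, using the example values in the text.
SCENE_THRESHOLDS = {
    "information_provision": (0.3, -0.3),  # looser: user is the listener
    "chat": (0.5, -0.5),                   # base thresholds
    "listening": (0.7, -0.7),              # stricter: user is the speaker
}


def set_emotion_thresholds(scene: str) -> tuple:
    """Return (positive_threshold, negative_threshold) for the given scene."""
    try:
        return SCENE_THRESHOLDS[scene]
    except KeyError:
        raise ValueError(f"unknown scene: {scene!r}") from None
```

Keeping the mapping in a table makes the symmetry explicit: the band classified as neutral narrows when the user is mostly listening and widens when the user is mostly speaking.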

The response generation unit 107 generates a response to the utterance of the user who is the conversation partner. The response is typically text data. In the present embodiment, the response generation unit 107 generates an appropriate response according to the emotion estimated by the emotion estimation unit 105. For example, the response generation unit 107 may generate a response sentence by referring to a response sentence table containing response sentences associated with types of emotion and selecting an appropriate response sentence from the table. The response generation unit 107 stores the speech act type and utterance time of the generated response in the interaction history storage unit 103. The speech act type and utterance time of a generated response may be stored in advance in a storage device such as the ROM 14 in association with the response sentences in the response sentence table, or may be obtained through analysis by the speech act type determination unit 102 or the like.

The speech synthesis unit 108 converts the response generated by the response generation unit 107 into voice data. That is, the speech synthesis unit 108 converts the text data of the response sentence generated by the response generation unit 107 into voice data. Voice data can be generated from text data using various known speech synthesis techniques. Typically, a D/A converter (not shown) then converts the voice data into an analog audio signal, and the speaker 3 outputs the analog audio signal as sound.

Next, the operation of the voice interaction apparatus 1 will be described. FIG. 3 is a flowchart illustrating an example of the operation of the voice interaction apparatus 1. An operation example of the voice interaction apparatus 1 is described below with reference to FIG. 3.

In step 100 (S100), the feature quantity acquisition unit 101 performs speech recognition on the user's voice data acquired by the microphone 2 and generates text data. For example, in step 100, the text data "It is sunny today" is generated from the user's utterance. The feature quantity acquisition unit 101 also identifies the user's utterance time and stores it in the interaction history storage unit 103.

Next, in step 101 (S101), the speech act type determination unit 102 determines the speech act type of the user's utterance acquired in step 100. For example, the speech act type determination unit 102 determines that the speech act type of the utterance "It is sunny today" is "information provision", and stores the determined speech act type in the dialogue history storage unit 103.

Next, in step 102 (S102), the dialogue scene determination unit 104 determines the current dialogue scene on the basis of the dialogue history of the most recent N turns. For example, the dialogue scene determination unit 104 determines that the current dialogue scene is a listening scene, that is, a scene in which the voice interaction apparatus 1 is listening.
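The following Python sketch shows one way a scene could be derived from the most recent N entries of the dialogue history. The decision rule used here (comparing total speaking time of the user and the device) is purely an illustrative assumption and is not taken from this passage. The names `Utterance` and `determine_scene`, the scene labels, and the parameters `n_turns` and `ratio` are likewise hypothetical, and each history entry is treated as one turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Utterance:
    """One entry of a hypothetical dialogue history."""
    speaker: str               # "user" or "device"
    speech_act_type: str       # e.g. "information provision"
    utterance_time_sec: float


def determine_scene(history: List[Utterance], n_turns: int = 3,
                    ratio: float = 2.0) -> str:
    """Classify the current scene from the last n_turns history entries.

    Assumed rule: whoever has spoken at least `ratio` times longer than
    the other party is the speaker side, and the other party is listening.
    """
    recent = history[-n_turns:]
    user_time = sum(u.utterance_time_sec for u in recent if u.speaker == "user")
    device_time = sum(u.utterance_time_sec for u in recent if u.speaker == "device")
    if user_time > ratio * device_time:
        return "device_listening"  # listening scene: the device is the listener
    if device_time > ratio * user_time:
        return "user_listening"    # the user is mainly the listener
    return "balanced"
```

A rule based on speech act types, which the history also records, would fit the same interface.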

Next, in step 103 (S103), the emotion threshold setting unit 106 sets thresholds according to the dialogue scene determined in step 102. For example, when the dialogue scene is a listening scene, the emotion threshold setting unit 106 sets the threshold for estimating a positive emotion to +0.7, which is 0.2 higher than its base threshold, and sets the threshold for estimating a negative emotion to −0.7, which is 0.2 lower than its base threshold.
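The threshold adjustment in this example can be sketched in Python as below. The base thresholds of +0.5 and −0.5 and the 0.2 offset for the listening scene follow the example values in the description; the function and constant names, the scene label, and the zero offset for any other scene are assumptions of this sketch.

```python
from typing import Tuple

# Base thresholds from the example in the description.
BASE_POSITIVE_THRESHOLD = 0.5
BASE_NEGATIVE_THRESHOLD = -0.5

# Offset per dialogue scene. Only the listening-scene value (0.2) comes
# from the example; scenes not listed here get no offset (assumption).
SCENE_OFFSETS = {
    "device_listening": 0.2,
}


def set_thresholds(scene: str) -> Tuple[float, float]:
    """Return (positive_threshold, negative_threshold) for the given scene."""
    offset = SCENE_OFFSETS.get(scene, 0.0)
    # A positive offset widens the neutral band symmetrically.
    return (BASE_POSITIVE_THRESHOLD + offset,
            BASE_NEGATIVE_THRESHOLD - offset)
```

Under these assumptions, `set_thresholds("device_listening")` yields thresholds of approximately +0.7 and −0.7, matching the example.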

Next, in step 104 (S104), the emotion estimation unit 105 calculates an index value of the user's emotion from the text data obtained in step 100, and estimates the user's emotion according to the result of comparing the calculated index value with the thresholds set in step 103. For example, when the index value of the user's emotion is +0.6, it exceeds the base threshold of +0.5 but is below +0.7, the threshold adjusted in step 103, so the user's emotion is judged to be neutral rather than positive.
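The comparison in this step can be illustrated with a short Python sketch. How the index value is computed from the text is outside this passage, so it is taken as an input here. The function name, the emotion labels, and the treatment of a value exactly equal to a threshold (counted as positive or negative) are assumptions of this sketch.

```python
def estimate_emotion(index_value: float,
                     positive_threshold: float,
                     negative_threshold: float) -> str:
    """Classify an emotion index value against scene-dependent thresholds."""
    # Assumption: a value exactly at a threshold counts as reaching it.
    if index_value >= positive_threshold:
        return "positive"
    if index_value <= negative_threshold:
        return "negative"
    return "neutral"


# Example from the description: an index value of +0.6 is positive under
# the base thresholds but neutral under the listening-scene thresholds.
with_base = estimate_emotion(0.6, 0.5, -0.5)
with_listening = estimate_emotion(0.6, 0.7, -0.7)
```

Here `with_base` is `"positive"` and `with_listening` is `"neutral"`, which is the effect the adjusted thresholds are meant to have.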

Next, in step 105 (S105), the response generation unit 107 generates a response according to the emotion estimated in step 104. For example, when the emotion estimated in step 104 is neutral, the response "I see" is generated in step 105.

Next, in step 106 (S106), the speech synthesis unit 108 converts the text data of the response generated in step 105 into voice data. As a result, the response generated in step 105 is output as sound from the speaker 3. Processing then returns to step 100, and the dialogue is repeated.

The voice interaction apparatus 1 according to the embodiment has been described above. As described, the voice interaction apparatus 1 determines the dialogue scene and dynamically changes the thresholds according to the determination result. The voice interaction apparatus 1 can therefore estimate emotion while suppressing the influence of differences between dialogue scenes. For example, emotion can be estimated appropriately even in a scene in which the user is mainly the listener, where the user's emotion is not readily reflected in the feature quantity. Because a response can be generated after the emotion has been captured accurately, the voice interaction apparatus 1 can communicate with the user more smoothly.

The present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit. For example, although the content of the user's utterance is used as the feature quantity in the above embodiment, the index value may instead be calculated, and the emotion estimated, on the basis of other feature quantities such as the user's facial expression or prosody. As one such example, the feature quantity acquisition unit 101 may perform image processing on an image acquired by a camera (not shown) to obtain the user's facial expression as a feature quantity. The specific index values and thresholds given above are examples, and the values are not limited to them.

DESCRIPTION OF SYMBOLS
1 Voice interaction apparatus
101 Feature quantity acquisition unit
102 Speech act type determination unit
103 Dialogue history storage unit
104 Dialogue scene determination unit
105 Emotion estimation unit
106 Emotion threshold setting unit
107 Response generation unit
108 Speech synthesis unit

Claims

A voice interaction apparatus that recognizes the speech of a user who is a conversation partner and outputs voice to the user, the apparatus comprising:
a feature quantity acquisition unit that acquires, from the user, a feature quantity used for emotion estimation;
a dialogue scene determination unit that determines, on the basis of a dialogue history between the user and the apparatus, which of the user and the apparatus is interacting as the listener and which is interacting as the speaker;
an emotion threshold setting unit that sets a threshold for estimating emotion according to the determination result of the dialogue scene determination unit; and
an emotion estimation unit that calculates an index value of the user's emotion on the basis of the feature quantity and estimates the user's emotion according to a result of comparing the calculated index value with the threshold set by the emotion threshold setting unit.