JP2024112283A

JP2024112283A - Dialogue system, dialogue control method, and program

Info

Publication number: JP2024112283A
Application number: JP2023221852A
Authority: JP
Inventors: 将樹能勢; Masaki Nose; 悠斗後藤; Yuto Goto; 千尋麻田; Chihiro Asada
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2023-02-07
Filing date: 2023-12-27
Publication date: 2024-08-20

Abstract

PROBLEM TO BE SOLVED: To provide a dialogue system for conducting a dialogue with a user using a dialogue agent, configured to generate a response of the dialogue agent on the basis of verbal information of the user and non-verbal information of the user.

SOLUTION: A dialogue system for conducting a dialogue with a user using a dialogue agent includes: a first acquisition unit which acquires verbal information of the user from the dialogue; a second acquisition unit which acquires non-verbal information of the user from the dialogue; a generation unit which generates a response containing a verbal response and a non-verbal response of the dialogue agent, on the basis of the verbal information of the user and the non-verbal information of the user; and a control unit which controls the dialogue agent on the basis of the response generated by the generation unit.

SELECTED DRAWING: Figure 7

Description

The present invention relates to a dialogue system, a dialogue control method, and a program.

There are dialogue systems in which a dialogue agent automatically responds to messages from a user. Agent systems are also known that learn from dialogues with a user and change attributes of the dialogue agent, such as its appearance or personality (see, for example, Patent Document 1).

The conventional technology has the problem that the response content of the dialogue agent cannot be generated based on both the user's linguistic information and the user's non-verbal information.

One embodiment of the present invention has been made in view of the above problem. In a dialogue system that conducts a dialogue with a user using a dialogue agent, it makes it possible to generate the response content of the dialogue agent based on the user's linguistic information and the user's non-verbal information.

In order to solve the above problem, a dialogue system according to one embodiment is a dialogue system that conducts a dialogue with a user using a dialogue agent, and includes: a first acquisition unit that acquires linguistic information of the user from the dialogue; a second acquisition unit that acquires non-verbal information of the user from the dialogue; a generation unit that generates response content including a verbal response and a non-verbal response of the dialogue agent, based on the linguistic information of the user and the non-verbal information of the user; and a control unit that controls the dialogue agent based on the response content generated by the generation unit.

According to one embodiment of the present invention, in a dialogue system that conducts a dialogue with a user using a dialogue agent, the response content of the dialogue agent can be generated based on the user's linguistic information and the user's non-verbal information.

The drawings show, in order:
An example of the system configuration of the dialogue system according to an embodiment.
An example of a dialogue agent according to an embodiment.
Another example of a dialogue agent according to an embodiment.
An overview of dialogue processing according to an embodiment.
An example of the hardware configuration of a computer according to an embodiment.
An example of the hardware configuration of a terminal device according to an embodiment.
An example of the functional configuration of the dialogue system according to an embodiment.
A flowchart giving an overview of dialogue processing according to an embodiment.
An example of the functional configuration of the generation unit according to the first embodiment.
Flowchart (1) of an example of dialogue processing according to the first embodiment.
Flowchart (2) of an example of dialogue processing according to the first embodiment.
An example of the use of non-verbal information according to the first embodiment.
Diagram (1) of an example of dialogue scenario transitions according to the second embodiment.
Diagram (2) of an example of dialogue scenario transitions according to the second embodiment.
An example of a dialogue screen according to the third embodiment.
An example of the functional configuration of the dialogue system according to the third embodiment.
A flowchart of an example of dialogue processing according to the third embodiment.
An example of the functional configuration of the dialogue system according to the fourth embodiment.
An example of a dialogue log according to the fourth embodiment.
An example of the functional configuration of the dialogue system according to the fifth embodiment.
A flowchart of an example of catchphrase presentation processing according to the fifth embodiment.
An example of the functional configuration of the dialogue system according to the sixth embodiment.
An example of input/output information according to the sixth embodiment.
A flowchart of an example of dialogue processing according to the sixth embodiment.
An example of the system configuration of usage scene 1 according to an embodiment.
A flowchart of an example of dialogue start processing of usage scene 2 according to an embodiment.
An example of the system configuration of usage scene 2 according to an embodiment.
A flowchart of an example of dialogue start processing of usage scene 2 according to an embodiment.
An example of the system configuration of usage scene 3 according to an embodiment.
A flowchart of an example of dialogue start processing of usage scene 2 according to an embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<System Configuration>
FIG. 1 is a diagram showing an example of the system configuration of a dialogue system according to an embodiment. In the example of FIG. 1, the dialogue system 1 includes a server device 100 and a terminal device 10, each connected to a communication network N such as the Internet or a LAN (Local Area Network).

The server device 100 is, for example, an information processing device having the configuration of a computer, or a system composed of a plurality of computers. A computer included in the server device 100 executes a predetermined program, whereby the server device 100 provides a dialogue service in which a dialogue agent automatically responds to messages from a user 11 who uses the terminal device 10.

The terminal device 10 is an information terminal used by the user 11, such as a PC (Personal Computer), a tablet terminal, or a smartphone. The terminal device 10 can communicate with the server device 100 via the communication network N. The user 11 can use the dialogue service provided by the server device 100 by using the terminal device 10.

Preferably, the dialogue system 1 supports the performance of a predetermined task, such as a business negotiation or nursing care, through a dialogue in which the dialogue agent automatically responds to messages from the user.

Note that the system configuration of the dialogue system 1 shown in FIG. 1 is an example. The terminal device 10 is not limited to a general-purpose information terminal and may be, for example, a dedicated terminal device or any of various electronic devices. The dialogue system 1 may also be realized by a single information processing device having the configuration of a computer. The following description assumes that the dialogue system 1 has the system configuration shown in FIG. 1.

(Image of the dialogue agent)
A dialogue agent is a system that automatically responds to inquiries from a user, a customer, or the like, using a knowledge base containing registered information and know-how, AI (Artificial Intelligence), or the like.

As use cases, the dialogue agent may be used, for example, in a web conference, on a website, in a smartphone app, or as an unattended AI avatar in a metaverse space.

FIG. 2 shows an example of an image of a dialogue agent according to an embodiment. This figure shows an example of a dialogue screen 200 for business negotiations that the server device 100 causes the terminal device 10 to display. In the example of FIG. 2, a virtual human 201 generated by 3D (three-dimensional) modeling is displayed on the dialogue screen 200. The virtual human 201 is an example of a dialogue agent. On this dialogue screen 200, the server device 100 controls the virtual human 201 so that, for example, it advances the business negotiation while conversing with the user 11.

As a preferred example, a large display 202 is shown on the dialogue screen 200 for business negotiations. The server device 100 can, for example, display a product to be proposed to the user on this display 202 and control the virtual human 201 so that it explains the product.

FIG. 3 shows another example of an image of a dialogue agent according to an embodiment. This figure shows an example of a dialogue screen 300 for nursing care that the server device 100 causes the terminal device 10 to display. In the example of FIG. 3, as in FIG. 2, another virtual human 301 generated by 3D modeling is displayed on the dialogue screen 300. The virtual human 301 is another example of a dialogue agent. On this dialogue screen 300, the server device 100 controls the virtual human 301 so that it communicates with, for example, an elderly person living alone, with the aim of preventing dementia.

As a preferred example, the dialogue between the user 11 and the virtual human 301 can also take the form of a text-based dialogue 302 in addition to (or instead of) voice, as shown in FIG. 3.

In this way, by changing the dialogue scenario, the dialogue system 1 can adapt the dialogue content to various uses, such as business negotiations, nursing care, lessons, or counseling.

(Overview of dialogue processing)
FIG. 4 is a diagram for explaining an overview of dialogue processing according to an embodiment. With time on the horizontal axis, this figure shows an example of the relationship between the linguistic information and non-verbal information of the user 11 and the verbal responses and non-verbal responses of the dialogue agent in a dialogue between the user 11 and the dialogue agent.

In FIG. 4, when the user 11 performs a start operation, at time t1 the server device 100 causes the dialogue agent to make an utterance 401, such as a greeting or an ice-breaker, as a verbal response, and to perform an ice-breaker 402, such as a bow or a smile, as a non-verbal response.

When the user speaks in response at time t2, the server device 100 acquires the linguistic information of the user 11 and the non-verbal information of the user 11. At this time, the server device 100 may cause the dialogue agent to make a non-verbal response such as a nod 403.

Here, the linguistic information of the user 11 includes, for example, information indicating the content of the utterance 411 of the user 11, converted into text by speech recognition technology. The non-verbal information of the user 11 includes, for example, information other than linguistic information, such as the facial expression, gaze, posture, or emotion of the user 11, acquired by image recognition technology or the like. The non-verbal information of the user 11 may also include non-linguistic audio information (paralanguage) acquired from the audio contained in the video of the user 11, such as tone of voice, speaking speed, pitch, loudness, throat clearing, sighs, laughter, or silence. In this way, non-verbal information such as images and audio is used multimodally.

Linguistic information is information in which the content of an utterance is conveyed through words. It is information whose meaning is conveyed according to clearly defined rules of language and dictionaries, such as words, grammar, sentence structure, and context. For example, linguistic information includes information indicating the content of the utterance 411 of the user 11, converted into text by speech recognition technology.

Non-verbal information is information conveyed through means other than words. It includes, for example, information other than linguistic information, such as the facial expression, gaze, posture, or emotion of the user 11, acquired by image recognition technology or the like. The non-verbal information of the user 11 also includes non-linguistic audio information (paralanguage) acquired from the audio contained in the video of the user 11, such as tone of voice, speaking speed, pitch, loudness, throat clearing, sighs, laughter, or silence. In this way, this embodiment uses non-verbal information such as images and audio multimodally.
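The two kinds of information defined above can be pictured as a simple data model. The following Python sketch is illustrative only: the class names, field names, and units are assumptions made for this example and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinguisticInfo:
    """Text of the user's utterance, e.g. produced by speech recognition."""
    text: str


@dataclass
class NonVerbalInfo:
    """Cues conveyed by means other than words (all field names are hypothetical)."""
    facial_expression: Optional[str] = None  # e.g. "smile", from image recognition
    gaze: Optional[str] = None               # e.g. "at_screen" or "away"
    posture: Optional[str] = None
    emotion: Optional[str] = None
    # Paralanguage taken from the audio track of the user's video.
    speaking_rate: Optional[float] = None    # assumed unit: syllables per second
    pitch_hz: Optional[float] = None
    vocal_events: list[str] = field(default_factory=list)  # e.g. ["sigh", "laughter"]


@dataclass
class UserTurn:
    """One user turn: what was said plus how it was said."""
    linguistic: LinguisticInfo
    non_verbal: NonVerbalInfo


turn = UserTurn(
    LinguisticInfo("That sounds fine."),
    NonVerbalInfo(facial_expression="frown", vocal_events=["sigh"]),
)
```

Keeping the two channels in separate structures mirrors the split between the first and second acquisition units, while `UserTurn` bundles them for whatever component generates the response.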

The server device 100 also interprets the intention of the utterance of the user 11 based on the linguistic information of the user 11, taking the non-verbal information of the user 11 into account. This allows the server device 100 to interpret the intention more accurately than when interpreting it from linguistic information alone.

Furthermore, the server device 100 generates response content of the dialogue agent corresponding to the intention of the utterance of the user 11. This response content includes a verbal response representing what the dialogue agent says, and a non-verbal response representing, for example, the facial expression or gestures of the dialogue agent. Preferably, the server device 100 changes the non-verbal response of the dialogue agent according to the acquired non-verbal information of the user 11.
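As a rough illustration of interpreting intent from words first and then adjusting it with non-verbal cues, consider the minimal rule-based Python sketch below. The keyword sets, cue labels, intent names, and response strings are all assumptions invented for this example; the specification does not prescribe a rule-based method.

```python
import re
from dataclasses import dataclass


@dataclass
class Response:
    verbal: str      # what the agent says
    non_verbal: str  # expression or gesture the agent performs


# Hypothetical vocabularies, chosen only for illustration.
AGREE_WORDS = {"yes", "fine", "ok", "okay", "sure"}
NEGATIVE_CUES = {"frown", "sigh", "gaze_away"}


def interpret_intent(text: str, cues: set[str]) -> str:
    """Start from the words, then let non-verbal cues adjust the reading."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    base = "agree" if words & AGREE_WORDS else "unclear"
    # Agreeing words accompanied by negative cues are read as hesitation.
    if base == "agree" and cues & NEGATIVE_CUES:
        return "hesitant"
    return base


def generate_response(text: str, cues: set[str]) -> Response:
    """Pick a verbal and a non-verbal response for the interpreted intent."""
    intent = interpret_intent(text, cues)
    if intent == "agree":
        return Response("Great, let's move on.", "smile")
    if intent == "hesitant":
        return Response("Is there anything you'd like me to go over again?", "concerned_nod")
    return Response("Could you tell me a little more?", "tilt_head")
```

With these rules, the same sentence "That sounds fine." yields a smile when no negative cue is present and a concerned nod when it is accompanied by a sigh, which is the kind of difference that text alone would miss.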

At time t3, the server device 100 controls the dialogue agent according to the generated response content. For example, the server device 100 converts the generated verbal response into speech by speech synthesis processing and causes the dialogue agent to make an utterance 404. Preferably, the server device 100 moves the mouth of the dialogue agent in time with the utterance 404 (lip sync). Furthermore, the server device 100 causes the dialogue agent to perform a non-verbal response, such as a facial expression or a gesture, according to the generated non-verbal response.

In this way, the dialogue system 1 according to this embodiment changes the response content (verbal response and non-verbal response) of the dialogue agent (virtual human 201, 301) according to the non-verbal information of the user 11. Therefore, according to this embodiment, the dialogue system 1 that conducts a dialogue with a user using a dialogue agent can react more appropriately to the user 11.

<Hardware Configuration>
(Hardware configuration of the computer)
The server device 100 has, for example, the hardware configuration of a computer 500 as shown in FIG. 5. Alternatively, the server device 100 is composed of a plurality of computers 500. The terminal device 10 may also have, for example, the hardware configuration of a computer 500 as shown in FIG. 5.

FIG. 5 is a diagram showing an example of the hardware configuration of a computer according to an embodiment. As shown in FIG. 5, the computer 500 includes, for example, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, a RAM (Random Access Memory) 503, an HD (Hard Disk) 504, an HDD (Hard Disk Drive) controller 505, a display 506, an external device connection I/F (Interface) 507, a network I/F 508, a keyboard 509, a pointing device 510, a DVD-RW (Digital Versatile Disk Rewritable) drive 512, a media I/F 514, and a bus line 515.

When the computer 500 is the terminal device 10, the computer 500 further includes a microphone 521, a speaker 522, a sound input/output I/F 523, a CMOS (Complementary Metal Oxide Semiconductor) sensor 524, an image sensor I/F 525, and the like.

Of these, the CPU 501 controls the overall operation of the computer 500. The ROM 502 stores programs used to start up the computer 500, such as an IPL (Initial Program Loader). The RAM 503 is used, for example, as a work area for the CPU 501. The HD 504 stores programs such as an OS (Operating System), applications, and device drivers, as well as various data. The HDD controller 505 controls the reading of data from and the writing of data to the HD 504, for example, under the control of the CPU 501. The HD 504 and the HDD controller 505 are examples of a storage device.

The display 506 displays various information such as a cursor, menus, windows, characters, or images. The display 506 may be provided outside the computer 500. The external device connection I/F 507 is an interface for connecting various external devices to the computer 500. The network I/F 508 is an interface for connecting the computer 500 to the communication network 2 and communicating with other devices.

The keyboard 509 is a type of input means having a plurality of keys for inputting characters, numerical values, various instructions, and the like. The pointing device 510 is a type of input means for selecting and executing various instructions, selecting a processing target, moving a cursor, and the like. The keyboard 509 and the pointing device 510 may be provided outside the computer 500.

The DVD-RW drive 512 controls the reading of data from and the writing of data to a DVD-RW 511, which is an example of a removable recording medium. The DVD-RW 511 is not limited to a DVD-RW and may be another removable recording medium. The media I/F 514 controls the reading of data from and the writing (storing) of data to a medium 513 such as a flash memory. The bus line 515 includes an address bus, a data bus, various control signals, and the like for electrically connecting the above components.

The microphone 521 is a built-in circuit that converts sound into an electrical signal. The speaker 522 is a built-in circuit that converts an electrical signal into physical vibration to produce sound such as music or voice. The sound input/output I/F 523 is a circuit that processes the input and output of sound signals to and from the microphone 521 and the speaker 522 under the control of the CPU 501.

The CMOS sensor 524 is a type of built-in imaging means that captures an image of a subject (for example, the user's own image) under the control of the CPU 501 to obtain image data. The computer 500 may have another imaging means, such as a CCD (Charge Coupled Device) sensor, instead of the CMOS sensor 524. The image sensor I/F 525 is a circuit that controls the driving of the CMOS sensor 524.

(Example of the hardware configuration of the terminal device)
FIG. 6 is a diagram showing an example of the hardware configuration of a terminal device according to an embodiment. Here, an example of the hardware configuration of the terminal device 10 is described for the case where the terminal device 10 is an information terminal such as a smartphone or a tablet terminal.

In the example of FIG. 6, the terminal device 10 includes a CPU 601, a ROM 602, a RAM 603, a storage device 604, a CMOS sensor 605, an image sensor I/F 606, an acceleration/orientation sensor 607, a media I/F 609, and a GPS (Global Positioning System) receiver 610.

Of these, the CPU 601 controls the operation of the entire terminal device 10 by executing a predetermined program. The ROM 602 stores programs used to start up the CPU 601, such as an IPL. The RAM 603 is used as a work area for the CPU 601. The storage device 604 is a large-capacity storage device that stores programs such as an OS and apps, as well as various data, and is realized by, for example, an SSD (Solid State Drive) or a flash ROM.

The CMOS sensor 605 is a type of built-in imaging means that captures an image of a subject (mainly the user's own image) under the control of the CPU 601 to obtain image data. The terminal device 10 may have another imaging means, such as a CCD sensor, instead of the CMOS sensor 605. The image sensor I/F 606 is a circuit that controls the driving of the CMOS sensor 605. The acceleration/orientation sensor 607 comprises various sensors, such as an electronic magnetic compass that detects geomagnetism, a gyrocompass, and an acceleration sensor. The media I/F 609 controls the reading of data from and the writing (storing) of data to a medium (storage medium) 608 such as a flash memory. The GPS receiver 610 receives GPS signals (positioning signals) from GPS satellites.

The terminal device 10 also includes a long-range communication circuit 611, an antenna 611a of the long-range communication circuit 611, a CMOS sensor 612, an image sensor I/F 613, a microphone 614, a speaker 615, a sound input/output I/F 616, a display 617, an external device connection I/F 618, a short-range communication circuit 619, an antenna 619a of the short-range communication circuit 619, and a touch panel 620.

Of these, the long-range communication circuit 611 is a circuit that communicates with other devices, for example, via the communication network 2. The CMOS sensor 612 is a type of built-in imaging means that captures an image of a subject under the control of the CPU 601 to obtain image data. The image sensor I/F 613 is a circuit that controls the driving of the CMOS sensor 612. The microphone 614 is a built-in circuit that converts sound into an electrical signal. The speaker 615 is a built-in circuit that converts an electrical signal into physical vibration to produce sound such as music or voice. The sound input/output I/F 616 is a circuit that processes the input and output of sound signals to and from the microphone 614 and the speaker 615 under the control of the CPU 601.

The display 617 is a type of display means, such as a liquid crystal or organic EL (Electro Luminescence) display, that displays an image of a subject, various icons, and the like. The external device connection I/F 618 is an interface for connecting various external devices. The short-range communication circuit 619 includes a circuit for performing short-range wireless communication. The touch panel 620 is a type of input means with which the user operates the terminal device 10 by pressing on the display 617.

The terminal device 10 also includes a bus line 621. The bus line 621 includes an address bus, a data bus, and the like for electrically connecting the components shown in FIG. 6, such as the CPU 601.

Note that the hardware configuration of the terminal device 10 shown in FIG. 6 is an example. The terminal device 10 may have another hardware configuration as long as it has the configuration of a computer, a communication circuit, a display, a microphone, a speaker, and the like.

<Functional Configuration>
FIG. 7 is a diagram showing an example of the functional configuration of a dialogue system according to an embodiment.

（サーバ装置の機能構成） (Functional configuration of the server device)
サーバ装置１００は、サーバ装置１００が備えるコンピュータ５００が、記憶媒体に記憶した所定のプログラムを実行することにより、例えば、図７に示すような機能構成を実現している。図７の例では、サーバ装置１００は、通信部７０１、第１の取得部７０２、第２の取得部７０３、生成部７０４、音声合成部７１１、描画部７１２、及び出力部７１３等を有している。なお、上記の各機能構成のうち、少なくとも一部は、ハードウェアによって実現されるものであってもよい。 The server device 100 realizes, for example, a functional configuration as shown in FIG. 7 by the computer 500 included in the server device 100 executing a predetermined program stored in a storage medium. In the example of FIG. 7, the server device 100 has a communication unit 701, a first acquisition unit 702, a second acquisition unit 703, a generation unit 704, a voice synthesis unit 711, a drawing unit 712, and an output unit 713. At least a part of the above functional configurations may be realized by hardware.

また、サーバ装置１００は、例えば、ＨＤ５０４、及びＨＤＤコントローラ５０５等のストレージデバイス等により、記憶部７１０を実現している。なお、記憶部７１０は、例えば、サーバ装置１００の外部に設けられたストレージサーバ、又はクラウドサービス等によって実現されるものであってもよい。 The server device 100 also implements a storage unit 710 using storage devices such as the HD 504 and the HDD controller 505. The storage unit 710 may also be implemented using a storage server provided outside the server device 100, a cloud service, or the like.

通信部７０１は、例えば、ネットワークＩ／Ｆ５０８等を用いて、サーバ装置１００を通信ネットワークＮに接続し、端末装置１０等の他の装置と通信する通信処理を実行する。 The communication unit 701, for example, uses the network I/F 508 to connect the server device 100 to the communication network N and executes communication processing to communicate with other devices such as the terminal device 10.

第１の取得部７０２は、端末装置１０を利用するユーザ１１との対話から、ユーザ１１の言語情報を取得する第１の取得処理を実行する。例えば、第１の取得部７０２は、通信部７０１が、端末装置１０から受信したユーザ１１の映像（動画像、及び音声）から、ＶＡＤ（Voice Activity Detection）等の技術により音声区間を検出し、ユーザ１１の発話音声を取得する。また、第１の取得部７０２は、取得したユーザ１１の発話音声に対して、音声認識処理を実行して、ユーザ１１の発話音声をテキスト化する。さらに、第１の取得部７０２は、テキスト化したユーザ１１の発話テキストを、ユーザ１１の言語情報として取得する。 The first acquisition unit 702 executes a first acquisition process to acquire language information of the user 11 from a dialogue with the user 11 who uses the terminal device 10. For example, the first acquisition unit 702 detects speech segments from the video (moving image and audio) of the user 11 received from the terminal device 10 by the communication unit 701 using a technique such as VAD (Voice Activity Detection) to acquire the speech of the user 11. The first acquisition unit 702 also executes a speech recognition process on the acquired speech of the user 11 to convert the speech of the user 11 into text. Furthermore, the first acquisition unit 702 acquires the converted speech text of the user 11 as the language information of the user 11.
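As a rough illustration of the first acquisition process, the sketch below marks speech segments with a simple energy threshold and passes each segment to a recognizer. The energy-based detector, the threshold value, and the `recognize` callback are assumptions made for illustration; the description only states that a technique such as VAD and a speech recognition process are used.

```python
from typing import Callable, List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def detect_speech_segments(
    frames: List[Sequence[float]], threshold: float = 0.01
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for runs of
    frames whose energy exceeds the (hypothetical) threshold."""
    segments = []
    start = None
    for i, frame in enumerate(frames):
        active = frame_energy(frame) > threshold
        if active and start is None:
            start = i                      # a speech segment begins
        elif not active and start is not None:
            segments.append((start, i))    # the segment ends
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments


def acquire_verbal_info(
    frames: List[Sequence[float]],
    recognize: Callable[[list], str],
    threshold: float = 0.01,
) -> List[str]:
    """Run the caller-supplied recognizer on each detected segment and
    return the utterance texts (the user's verbal information)."""
    return [
        recognize(frames[s:e])
        for s, e in detect_speech_segments(frames, threshold)
    ]
```

A production system would more likely use a trained VAD model; the threshold approach is shown only because it makes the segment-then-transcribe order easy to see.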

第２の取得部７０３は、端末装置１０を利用するユーザ１１との対話から、ユーザ１１の非言語情報を取得する第２の取得処理を実行する。例えば、第２の取得部７０３は、通信部７０１が、端末装置１０から受信したユーザ１１の映像（動画像、及び音声）から、画像処理により、例えば、表情、視線、又は感情等のユーザ１１の非言語情報を取得する。また、第２の取得部７０３は、通信部７０１が、端末装置１０から受信したユーザ１１の映像（動画像、及び音声）から、例えば、声の大小、声の抑揚、又は声の音色等のユーザ１１の非言語情報を取得する。 The second acquisition unit 703 executes a second acquisition process to acquire non-verbal information of the user 11 from a dialogue with the user 11 using the terminal device 10. For example, the second acquisition unit 703 acquires non-verbal information of the user 11, such as facial expressions, gaze, or emotions, by image processing from the video (moving images and audio) of the user 11 received from the terminal device 10 by the communication unit 701. The second acquisition unit 703 also acquires non-verbal information of the user 11, such as the volume of the voice, intonation, or tone of the voice, from the video (moving images and audio) of the user 11 received from the terminal device 10 by the communication unit 701.
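One of the audio cues named above, the volume of the voice, could for instance be measured as an RMS level. The following is a minimal sketch under the assumption that samples are floats normalized to the range [-1, 1]; the function name and the floor value are hypothetical and do not come from the description.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float], floor_db: float = -120.0) -> float:
    """RMS level of normalized samples in dB relative to full scale.

    Returns floor_db for empty or all-zero input so callers never see -inf.
    """
    if not samples:
        return floor_db
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms))
```

Intonation and timbre would need additional features such as a pitch contour, which this sketch does not attempt.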

生成部７０４は、第１の取得部７０２が取得したユーザ１１の言語情報と、第２の取得部７０３が取得したユーザ１１の非言語情報とに基づいて、対話エージェントの言語応答（対話内容）と非言語応答（対話エージェントの動作、又はパラ言語等）とを含む応答内容を生成する生成処理を実行する。例えば、生成部７０４は、対話制御部７０５、意図解釈部７０６、及び応答生成部７０７を含む。また、図示はしていないが、応答生成部７０７のバックエンドには、実際に行われた対話情報（音声や画像等）が大量に蓄積されており、その対話情報が応答生成部７０７の構築に用いられる。応答生成部７０７が後述の機械学習モデルの場合、それらの対話情報は学習データとして用い、対話生成の精度向上に寄与する。 The generation unit 704 executes a generation process to generate response contents including a verbal response (dialogue contents) of the dialogue agent and a non-verbal response (action of the dialogue agent, or paralanguage, etc.) based on the linguistic information of the user 11 acquired by the first acquisition unit 702 and the non-verbal information of the user 11 acquired by the second acquisition unit 703. For example, the generation unit 704 includes a dialogue control unit 705, an intention interpretation unit 706, and a response generation unit 707. Although not shown, a large amount of actual dialogue information (sound, image, etc.) is accumulated in the backend of the response generation unit 707, and the dialogue information is used to construct the response generation unit 707. When the response generation unit 707 is a machine learning model described later, the dialogue information is used as learning data to contribute to improving the accuracy of dialogue generation.

対話制御部７０５は、ユーザ１１の言語情報と非言語情報とを入力する処理、及び対話エージェントの言語応答と非言語応答を出力する処理等を含む対話制御処理を実行する。 The dialogue control unit 705 executes dialogue control processing, including processing for inputting linguistic and non-verbal information of the user 11 and processing for outputting linguistic and non-verbal responses of the dialogue agent.

意図解釈部７０６は、ユーザ１１の言語情報をベースに、ユーザ１１の非言語情報を加味して、ユーザ１１の発話の意図を解釈する意図解釈処理を実行する。例えば、ユーザ１１が「それは、いいです」と発話した場合、ユーザ１１の言語情報（発話テキスト）だけでは、ユーザ１１が、それが「良い」ことを意図しているのか、それが「不要である」ことを意図しているのか判断することが難しい場合がある。そこで、本実施形態に係る意図解釈部７０６は、ユーザ１１の言語情報（発話テキスト）だけではなく、ユーザ１１の非言語情報を用いて、ユーザ１１の発話の意図を解釈する。これにより、意図解釈部７０６は、意図解釈処理の精度を向上することができる。 The intention interpretation unit 706 executes an intention interpretation process that interprets the intention of the user 11's utterance based on the linguistic information of the user 11, taking into account the non-verbal information of the user 11. For example, if the user 11 utters 「それは、いいです」 (a Japanese phrase that can mean either "that is good" or "no, thank you"), it may be difficult to determine from the linguistic information (utterance text) alone whether the user 11 means that it is "good" or that it is "unnecessary." Therefore, the intention interpretation unit 706 according to this embodiment uses not only the linguistic information (utterance text) of the user 11 but also the non-verbal information of the user 11 to interpret the intention of the utterance. This allows the intention interpretation unit 706 to improve the accuracy of the intention interpretation process.

例えば、意図解釈部７０６は、複数のユーザの言語情報と非言語情報とを入力データとして、ユーザの意図を解釈するように、予め機械学習した機械学習モデルに、ユーザ１１の言語情報と非言語情報とを入力して、ユーザ１１の発話の意図を解釈してもよい。 For example, the intention interpretation unit 706 may input the linguistic information and non-linguistic information of user 11 into a machine learning model that has been trained in advance to interpret the user's intention using the linguistic information and non-linguistic information of multiple users as input data, and interpret the intention of the user's speech.
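As a stand-in for the trained model described above, the following sketch resolves the ambiguous utterance 「それは、いいです」 with hand-written rules. The `NonVerbalInfo` fields, the expression labels, the scoring weights, and the intent names are all hypothetical; a real implementation would replace the scoring with the machine learning model.

```python
from dataclasses import dataclass

# Hypothetical weights for facial-expression labels.
EXPRESSION_SCORE = {"smile": 1.0, "neutral": 0.0, "frown": -1.0}


@dataclass
class NonVerbalInfo:
    expression: str    # hypothetical label produced by image processing
    voice_tone: float  # hypothetical score in [-1, 1]; higher means brighter


def interpret_intent(text: str, cues: NonVerbalInfo) -> str:
    """Resolve an ambiguous utterance by combining text with non-verbal cues."""
    if text != "それは、いいです":
        return "unknown"  # other utterances are out of scope for this sketch
    score = cues.voice_tone + EXPRESSION_SCORE.get(cues.expression, 0.0)
    if score > 0:
        return "accept"          # read as "that is good"
    if score < 0:
        return "decline"         # read as "no, thank you"
    return "ask_to_confirm"      # cues cancel out; ask the user again
```

The same text yields different intents depending only on the non-verbal cues, which is the behavior the description attributes to the intention interpretation unit.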

ここで、機械学習とは、コンピュータに人のような学習能力を獲得させるための技術であり、コンピュータが、データ識別等の判断に必要なアルゴリズムを、事前に取り込まれる学習データから自律的に生成し、新たなデータについてこれを適用して予測を行う技術のことをいう。機械学習のための学習方法は、教師あり学習、教師なし学習、半教師学習、強化学習、深層学習のいずれかの方法でもよく、さらに、これらの学習方法を組み合わせた学習方法でもよく、機械学習のための学習方法は問わない。 Here, machine learning refers to a technology that allows a computer to acquire human-like learning capabilities, in which the computer autonomously generates algorithms necessary for judgments such as data identification from training data that is previously loaded, and applies these to new data to make predictions. The learning method for machine learning may be any of supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning, or may be a combination of these learning methods; any learning method for machine learning is acceptable.

応答生成部７０７は、ユーザ１１の発話の意図に対応する、対話エージェントの応答内容を生成する応答生成処理を実行する。この応答内容には、対話エージェントが発話する発話内容を表す言語応答と、例えば、対話エージェントの表情、又はジェスチャー等を表す非言語応答とが含まれる。好ましくは、サーバ装置１００は、取得したユーザ１１の非言語情報に応じて、対話エージェントの言語応答、及び非言語応答を変える。 The response generation unit 707 executes a response generation process to generate a response content of the dialogue agent corresponding to the speech intention of the user 11. This response content includes a verbal response that represents the speech content uttered by the dialogue agent, and a non-verbal response that represents, for example, the facial expression or gestures of the dialogue agent. Preferably, the server device 100 changes the verbal response and non-verbal response of the dialogue agent according to the acquired non-verbal information of the user 11.

例えば、応答生成部７０７は、ユーザ１１の非言語情報に応じて、対話エージェントのアクションの内容を変更する。また、応答生成部７０７は、ユーザ１１の非言語情報に応じて、対話エージェントのアクションのタイミングを変更する。 For example, the response generation unit 707 changes the content of the action of the dialogue agent in response to the non-verbal information of the user 11. The response generation unit 707 also changes the timing of the action of the dialogue agent in response to the non-verbal information of the user 11.

応答内容の生成には、例えば、ルールベース、又は大規模言語モデルによる自然言語処理を用いることができる。大規模言語モデルとしては、一例として、GPT-3（Generative Pre-trained Transformer 3）と呼ばれる文章生成言語モデルを適用することができる。また、ルールベースの自然言語処理では、ユーザの発話の意図に対して、応答内容を予め記述したルールに基づいて、対話エージェントの応答内容を生成する。 For example, rule-based natural language processing or natural language processing using a large language model can be used to generate the response content. As one example of a large language model, the text-generation language model called GPT-3 (Generative Pre-trained Transformer 3) can be applied. In rule-based natural language processing, the response content of the dialogue agent is generated from rules that describe, in advance, the response content for each intention of the user's utterance.
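The rule-based option can be pictured as a lookup from an interpreted intent to a response that carries both a verbal and a non-verbal part. The sketch below is illustrative only: the intent names, the response texts, and the expression and gesture labels are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    verbal: str      # what the dialogue agent says
    expression: str  # non-verbal response: facial expression label
    gesture: str     # non-verbal response: gesture label


# Rules written in advance: one response per utterance intention.
RULES = {
    "accept": Response("Thank you. I will proceed with that.", "smile", "nod"),
    "decline": Response("Understood. I will leave that out.", "neutral", "none"),
}

# Used when no rule matches the interpreted intent.
FALLBACK = Response("Could you tell me a little more?", "neutral", "tilt_head")


def generate_response(intent: str) -> Response:
    """Return the pre-written response for the intent, or the fallback."""
    return RULES.get(intent, FALLBACK)
```

Keeping the verbal and non-verbal parts in one record makes it straightforward to hand them to separate speech synthesis and drawing stages afterwards.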

また、応答内容の進行には、シナリオ型とスロット・フィリング型がある（参考：https://goqsmile.com/function/about/）。 In addition, there are two types of response content progression: scenario type and slot-filling type (reference: https://goqsmile.com/function/about/).

音声合成部７１１は、生成部７０４が生成した言語応答を、音声合成技術により音声化する音声合成処理を実行する。 The speech synthesis unit 711 executes speech synthesis processing to convert the linguistic response generated by the generation unit 704 into speech using speech synthesis technology.

描画部７１２は、生成部７０４が生成した非言語応答に従って、対話エージェントを描画した対話画面を描画する描画処理を実行する。例えば、描画部７１２は、非言語応答に従って、表情、視線、姿勢、又は感情等を、図２に示すようなバーチャルヒューマン（対話エージェント）２０１に反映する。 The drawing unit 712 executes a drawing process to draw a dialogue screen depicting a dialogue agent according to the non-verbal response generated by the generation unit 704. For example, the drawing unit 712 reflects facial expressions, gaze, posture, emotions, etc., in a virtual human (dialogue agent) 201 as shown in FIG. 2 according to the non-verbal response.

好ましくは、描画部７１２は、対話エージェントの発話に合わせて、対話エージェントの口を動かすリップシンクの描画も行う。 Preferably, the drawing unit 712 also draws lip sync to move the dialogue agent's mouth in sync with the dialogue agent's speech.

出力部７１３は、音声合成部７１１が音声化した対話エージェントの音声と、描画部７１２が描画した対話画面とを含む映像を出力する出力処理を実行する。例えば、出力部７１３は、音声合成部７１１が音声化した対話エージェントの音声と、描画部７１２が描画した対話画面とを含む映像を、通信部７０１を介して、端末装置１０に送信する。 The output unit 713 executes an output process to output a video including the voice of the dialogue agent vocalized by the voice synthesis unit 711 and the dialogue screen drawn by the drawing unit 712. For example, the output unit 713 transmits a video including the voice of the dialogue agent vocalized by the voice synthesis unit 711 and the dialogue screen drawn by the drawing unit 712 to the terminal device 10 via the communication unit 701.

なお、音声合成部７１１、描画部７１２、及び出力部７１３は、生成部７０４で生成した応答内容に基づいて対話エージェントを制御する制御部７１４の一例である。 The voice synthesis unit 711, the drawing unit 712, and the output unit 713 are an example of a control unit 714 that controls the dialogue agent based on the response content generated by the generation unit 704.

記憶部７１０は、サーバ装置１００が用いる機械学習モデル、ルール、設定情報、及び対話ログ等の様々な情報、データ、及びプログラム等を記憶する。 The storage unit 710 stores various information, data, programs, etc., such as machine learning models, rules, configuration information, and dialogue logs used by the server device 100.

（端末装置の機能構成） (Functional configuration of terminal device)
端末装置１０は、例えば、端末装置１０が備えるウェブブラウザ等を用いて、サーバ装置１００にアクセスして、図２に示すような対話画面２００を表示し、ユーザ１１の映像を送信できるものであれば、任意の機能構成であってよい。 The terminal device 10 may have any functional configuration as long as it can access the server device 100 using, for example, a web browser provided in the terminal device 10, display an interactive screen 200 such as that shown in FIG. 2, and transmit video of the user 11.

なお、図７に示した対話システム１のシステム構成は一例である。例えば、対話システム１は、図７に示したサーバ装置１００の機能構成を有する１台の情報処理装置によって構成されるものであってもよい。また、サーバ装置１００の各機能構成のうち、少なくとも一部は、端末装置１０が有していてもよい。例えば、端末装置１０は、第１の取得部７０２、第２の取得部７０３、音声合成部７１１、描画部７１２、及び出力部７１３等を有していてもよい。この場合、端末装置１０は、言語情報、及び非言語情報をサーバ装置１００に送信し、サーバ装置１００から受信した言語応答、及び非言語応答に基づいて、対話画面を表示してもよい。 The system configuration of the dialogue system 1 shown in FIG. 7 is an example. For example, the dialogue system 1 may be configured by a single information processing device having the functional configuration of the server device 100 shown in FIG. 7. At least some of the functional configurations of the server device 100 may be included in the terminal device 10. For example, the terminal device 10 may have a first acquisition unit 702, a second acquisition unit 703, a voice synthesis unit 711, a drawing unit 712, and an output unit 713. In this case, the terminal device 10 may transmit linguistic information and non-linguistic information to the server device 100 and display a dialogue screen based on the linguistic response and non-linguistic response received from the server device 100.

＜処理の流れ＞ <Processing flow>
図８は、一実施形態に係る対話システムが実行する対話処理の概要を示すフローチャートである。この処理は、例えば、図７に示すような機能構成を有する対話システム１が、繰り返し実行する処理の一例を示している。なお、図８の処理の開始時点において、端末装置１０を利用するユーザ１１と、サーバ装置１００が提供する対話エージェントとの間で対話が既に行われているものとする。 FIG. 8 is a flowchart showing an outline of a dialogue process executed by a dialogue system according to an embodiment. This process shows an example of a process repeatedly executed by a dialogue system 1 having a functional configuration as shown in FIG. 7. It is assumed that at the start of the process in FIG. 8, a dialogue is already taking place between a user 11 using a terminal device 10 and a dialogue agent provided by a server device 100.

ステップＳ８０１において、第１の取得部７０２は、ユーザ１１と、対話エージェントとの間の対話から、ユーザの言語情報を取得する。例えば、第１の取得部７０２は、通信部７０１が、端末装置１０から受信したユーザ１１の映像から、ユーザ１１の発話音声を取得する。また、第１の取得部７０２は、取得したユーザ１１の発話音声に対して音声認識処理を実行し、ユーザ１１の発話音声をテキスト化したユーザ１１の発話テキスト（言語情報）を取得する。 In step S801, the first acquisition unit 702 acquires user language information from the dialogue between the user 11 and the dialogue agent. For example, the first acquisition unit 702 acquires the speech of the user 11 from the video of the user 11 received from the terminal device 10 by the communication unit 701. The first acquisition unit 702 also performs a speech recognition process on the acquired speech of the user 11, and acquires the speech text (language information) of the user 11, which is a text version of the speech of the user 11.

ステップＳ８０２において、第２の取得部７０３は、ステップＳ８０１の処理と並行して、ユーザ１１と、対話エージェントとの間の対話から、ユーザ１１の非言語情報を取得する。例えば、第２の取得部７０３は、通信部７０１が、端末装置１０から受信したユーザ１１の映像から、画像処理により、ユーザ１１の表情、視線、又は感情等の非言語情報を取得する。また、第２の取得部７０３は、通信部７０１が、端末装置１０から受信したユーザ１１の映像から、音声処理により、声の大小、声の抑揚、又は声の音色等の非言語情報を取得する。 In step S802, the second acquisition unit 703 acquires non-verbal information of the user 11 from the dialogue between the user 11 and the dialogue agent in parallel with the processing of step S801. For example, the second acquisition unit 703 acquires non-verbal information such as facial expression, gaze, or emotions of the user 11 through image processing from the video of the user 11 received by the communication unit 701 from the terminal device 10. In addition, the second acquisition unit 703 acquires non-verbal information such as the volume, intonation, or tone of the voice through audio processing from the video of the user 11 received by the communication unit 701 from the terminal device 10.

ステップＳ８０３において、生成部７０４は、第１の取得部７０２が取得したユーザ１１の言語情報と、第２の取得部７０３が取得したユーザ１１の非言語情報とに基づいて、ユーザ１１の発話の意図を解釈する。 In step S803, the generation unit 704 interprets the intention of the user 11's speech based on the language information of the user 11 acquired by the first acquisition unit 702 and the non-verbal information of the user 11 acquired by the second acquisition unit 703.

ステップＳ８０４において、生成部７０４は、ユーザ１１の発話の意図に対応する言語応答、及び非言語応答を生成する。 In step S804, the generation unit 704 generates a verbal response and a non-verbal response that correspond to the speech intent of the user 11.

ステップＳ８０５において、音声合成部７１１は、生成部７０４が生成した言語応答に基づいて、対話エージェントの発話音声を合成する。 In step S805, the speech synthesis unit 711 synthesizes the speech of the dialogue agent based on the linguistic response generated by the generation unit 704.

ステップＳ８０６において、描画部７１２は、ステップＳ８０５の処理と並行して、生成部７０４が生成した非言語応答に基づいて、対話エージェントを描画する。 In step S806, the drawing unit 712 draws the dialogue agent based on the non-verbal response generated by the generation unit 704, in parallel with the processing of step S805.

ステップＳ８０７において、出力部７１３は、音声合成部７１１が合成した対話エージェントの発話音声と、描画部７１２が描画した対話エージェントを含む対話画面を出力する。例えば、出力部７１３は、通信部７０１を用いて、端末装置１０に対話画面を送信する。 In step S807, the output unit 713 outputs a dialogue screen including the spoken voice of the dialogue agent synthesized by the voice synthesis unit 711 and the dialogue agent drawn by the drawing unit 712. For example, the output unit 713 transmits the dialogue screen to the terminal device 10 using the communication unit 701.

対話システム１は、図８の処理を繰り返し実行することにより、ユーザ１１の非言語情報に基づいて、対話エージェントの発話音声だけではなく、対話エージェントの非言語対応を変更することができる。これにより、本実施形態によれば、対話エージェントを用いてユーザ１１と対話を行う対話システム１において、ユーザ１１に対してより適切なリアクションを行えるようになる。 By repeatedly executing the process of FIG. 8, the dialogue system 1 can change not only the dialogue agent's speech but also the dialogue agent's non-verbal responses based on the non-verbal information of the user 11. As a result, according to this embodiment, the dialogue system 1 that uses the dialogue agent to dialogue with the user 11 can react more appropriately to the user 11.
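One pass through the flow of FIG. 8 can be sketched as follows, with the two acquisition steps running in parallel and speech synthesis running in parallel with drawing. Every callable passed to `run_dialogue_turn` is a hypothetical placeholder for the corresponding unit, and the use of a thread pool is an assumption; the description does not say how the parallelism is realized.

```python
from concurrent.futures import ThreadPoolExecutor


def run_dialogue_turn(
    video,
    acquire_verbal,
    acquire_nonverbal,
    interpret,
    generate,
    synthesize,
    draw,
):
    """Run one turn of the dialogue process and return the output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # S801 and S802: acquire verbal and non-verbal information in parallel.
        verbal_future = pool.submit(acquire_verbal, video)
        nonverbal_future = pool.submit(acquire_nonverbal, video)
        verbal = verbal_future.result()
        nonverbal = nonverbal_future.result()

        # S803: interpret the intention of the utterance from both inputs.
        intent = interpret(verbal, nonverbal)

        # S804: generate the verbal and non-verbal responses.
        verbal_response, nonverbal_response = generate(intent)

        # Speech synthesis and drawing of the agent run in parallel.
        audio_future = pool.submit(synthesize, verbal_response)
        screen_future = pool.submit(draw, nonverbal_response)

        # S807: combine the synthesized voice and the drawn screen.
        return {
            "audio": audio_future.result(),
            "screen": screen_future.result(),
        }
```

Calling this function in a loop, once per user utterance, corresponds to the repeated execution described above.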

［第１の実施形態］ [First embodiment]
本実施形態に係る対話システム１は、対話シナリオを変更することにより、様々な用途に対応することができる。第１の実施形態では、商談用途に対応した対話処理の例について説明する。 The dialogue system 1 according to the present embodiment can be adapted to various applications by changing the dialogue scenario. In the first embodiment, an example of dialogue processing adapted to a business negotiation application will be described.

＜機能構成＞ <Functional configuration>
第１の実施形態に係る対話システム１は、例えば、図７に示すような機能構成を有している。また、第１の実施形態に係る生成部７０４は、例えば、図９に示すような機能構成を有している。 The dialogue system 1 according to the first embodiment has, for example, a functional configuration as shown in FIG. 7. Moreover, the generation unit 704 according to the first embodiment has, for example, a functional configuration as shown in FIG. 9.

図９は、第１の実施形態に係る生成部の機能構成の例を示す図である。図９に示すように、生成部７０４の対話制御部７０５は、例えば、入力フィルタ部９０１、対話状態管理部９０２、及び出力フィルタ部９０３を含む。 Figure 9 is a diagram showing an example of the functional configuration of the generation unit according to the first embodiment. As shown in Figure 9, the dialogue control unit 705 of the generation unit 704 includes, for example, an input filter unit 901, a dialogue state management unit 902, and an output filter unit 903.

入力フィルタ部９０１は、例えば、ユーザ１１の言語情報と非言語情報との入力を受け付ける入力Ｉ／Ｆの機能、誤認識対応機能、及び不適切な入力を検知する機能等を有している。なお、誤認識対応機能、及び不適切な入力を検知する機能は、オプションであり必須ではない。 The input filter unit 901 has, for example, an input I/F function that accepts input of linguistic information and non-linguistic information from the user 11, a function to respond to misrecognition, and a function to detect inappropriate input. Note that the function to respond to misrecognition and the function to detect inappropriate input are optional and not essential.

対話状態管理部９０２は、例えば、入力情報を記録する機能、現在の商談段階を記憶する機能、商談段階を制御する機能、及び出力情報を記録する機能等を有している。商談段階とは、商談の進行度を数値で定義した一例である。 The dialogue state management unit 902 has, for example, a function for recording input information, a function for storing the current negotiation stage, a function for controlling the negotiation stage, and a function for recording output information. The negotiation stage is an example of a negotiation progress level defined numerically.

出力フィルタ部９０３は、例えば、対話エージェントの言語対応、及び非言語対応を出力する出力Ｉ／Ｆの機能、及び不適切な出力を検知する機能等を有している。なお、不適切な出力を検知する機能は、オプションであり必須ではない。 The output filter unit 903 has, for example, an output I/F function that outputs the verbal and non-verbal responses of the dialogue agent, and a function for detecting inappropriate output. Note that the function for detecting inappropriate output is optional and not essential.

意図解釈部７０６は、対話制御部７０５が受け付けたユーザ１１の言語情報と非言語情報とに基づいて、ユーザ１１の発話の意図を解釈する意図解釈処理を実行する。意図解釈部７０６は、例えば、ユーザ１１の言語情報と文脈から、ユーザ１１の意図を推定することも可能であるが、ユーザ１１の非言語情報を加味した方がより的確にユーザ１１の意図を解釈できる可能性が高くなる。 The intention interpretation unit 706 executes an intention interpretation process to interpret the intention of the user 11's speech based on the linguistic information and non-linguistic information of the user 11 received by the dialogue control unit 705. The intention interpretation unit 706 can estimate the intention of the user 11, for example, from the linguistic information and context of the user 11, but is more likely to be able to interpret the intention of the user 11 more accurately by taking into account the non-linguistic information of the user 11.

例えば、「嘘でしょう？」というユーザ１１の発話は、ネガティブな応答に用いられる場合が多いが、良い意味で期待を上回った場合に、ユーザ１１が喜んで「嘘でしょう？」というときにも用いられる。このような場合、意図解釈部７０６は、ユーザ１１の非言語情報を手がかりにして、ユーザ１１の意図をより的確に解釈することが望ましい。 For example, the utterance "You're kidding, right?" by user 11 is often used as a negative response, but it can also be used when user 11 is happy and says "You're kidding, right?" when expectations are exceeded in a good way. In such a case, it is desirable for the intention interpretation unit 706 to use the non-verbal information of user 11 as a clue to more accurately interpret the intention of user 11.

例えば、ユーザ１１の非言語情報として、ユーザ１１の音声のトーンが高く、ユーザ１１の表情が明るい場合、意図解釈部７０６は、ユーザ１１の「嘘でしょう？」という発話を「ポジティブ（喜んで）」と判断してもよい。この場合、生成部７０４は、例えば、対話エージェントの表情を笑顔とし、現在の対話シナリオを維持してもよい。 For example, if the non-verbal information of the user 11 indicates that the user 11 has a high-pitched voice tone and a bright facial expression, the intention interpretation unit 706 may determine that the utterance of the user 11, "You're kidding, right?", is "positive (happily)." In this case, the generation unit 704 may, for example, set the facial expression of the dialogue agent to a smile and maintain the current dialogue scenario.

一方、ユーザ１１の音声のトーンが低く、ユーザ１１の画像の表情が暗い場合、意図解釈部７０６は、ユーザ１１の「嘘でしょう？」という発話を「ネガティブ」と判断してもよい。この場合、生成部７０４は、例えば、対話エージェントの身振り、手振りを低減し、より詳細な実例を含む対話（商材）シナリオに遷移してもよい（或いは、他の商材のシナリオに遷移してもよい。） On the other hand, if the tone of the user 11's voice is low and the facial expression of the image of the user 11 is gloomy, the intention interpretation unit 706 may determine that the utterance of the user 11, "You're kidding, right?", is "negative." In this case, the generation unit 704 may, for example, reduce the gestures and hand movements of the dialogue agent and transition to a dialogue (product) scenario that includes more detailed examples (or transition to a scenario for another product).

商談というビジネスシーンでは商談相手の喜怒哀楽が表れにくいところ、ネガティブと判断される非言語情報は、商談の進行だけでなく次回の商談にも影響する長期的な心証形成にかかわる重要な情報となるため、慎重な対応が求められる。たとえばトーンの低さの程度や表情の暗さの程度までも考慮し、シナリオの遷移の可否を判断することが望ましい。 In a business negotiation, the other party's emotions tend not to show. Non-verbal information judged to be negative is therefore important: it bears not only on the progress of the current negotiation but also on the long-term impression that carries over to the next negotiation, so it must be handled carefully. For example, it is desirable to consider even the degree to which the tone is low and the degree to which the expression is gloomy when deciding whether to transition to another scenario.
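Taking the degree of each negative cue into account could look like the sketch below, which averages two graded scores and compares the result with a threshold. The score ranges, the equal weighting, and the threshold of 0.6 are hypothetical choices made for illustration; the description does not specify how the degrees are combined.

```python
def should_switch_scenario(
    tone_lowness: float,
    expression_darkness: float,
    threshold: float = 0.6,
) -> bool:
    """Decide whether to transition to another scenario.

    Both inputs are hypothetical scores in [0, 1], where 1 is most negative.
    """
    for name, value in (
        ("tone_lowness", tone_lowness),
        ("expression_darkness", expression_darkness),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Average the two cues so that a single mild cue does not force a switch.
    return (tone_lowness + expression_darkness) / 2 >= threshold
```

With these assumed values, a clearly low tone combined with a clearly gloomy expression triggers a transition, while mild values of both keep the current scenario.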

応答生成部７０７は、複数の商談段階１～７に対応する複数の対話シナリオ９１１～９１７、商材レコメンド部９１８、及び判断部９１９等を含む。なお、商材レコメンド部９１８、及び判断部９１９は、応答生成部７０７の外部に設けられていてもよい。 The response generation unit 707 includes a plurality of dialogue scenarios 911-917 corresponding to a plurality of negotiation stages 1-7, a product recommendation unit 918, and a judgment unit 919. The product recommendation unit 918 and the judgment unit 919 may be provided outside the response generation unit 707.

第１段階に対応する対話シナリオ９１１は、商談を開始するときに用いられる対話シナリオであり、例えば、商談の開始の挨拶、又は顧客データの検索等を行う。第２段階に対応する対話シナリオ９１２は、例えば、名刺交換、又はスモールトーク等の対話を行う。第３段階に対応する対話シナリオ９１３は、例えば、業務内容のヒアリング、又は使用機器のヒアリング等の対話を行う。第４段階に対応する対話シナリオ９１４は、例えば、顕在ニーズの確認、又は潜在ニーズの掘り起こし等の対話を行う。 The dialogue scenario 911 corresponding to the first stage is a dialogue scenario used when starting a business negotiation, and includes, for example, a greeting to start a business negotiation or a search for customer data. The dialogue scenario 912 corresponding to the second stage includes, for example, a dialogue such as exchanging business cards or small talk. The dialogue scenario 913 corresponding to the third stage includes, for example, a dialogue such as inquiring about the business content or the equipment used. The dialogue scenario 914 corresponding to the fourth stage includes, for example, a dialogue such as confirming actual needs or uncovering latent needs.

第５段階に対応する対話シナリオ９１５は、例えば、レコメンドする商材の提示、購買意欲を増進させるキャッチコピーの提示、商談延期の判断、又は商談終了の判断等の対話を行う。第６段階に対応する対話シナリオ９１６は、例えば、納期確認、又は電子契約誘導等の対話を行う。第７段階に対応する対話シナリオ９１７は、例えば、日報作成、又はアンケート生成・送付等のインタラクションを行う。 The dialogue scenario 915 corresponding to the fifth stage performs dialogues such as presenting recommended products, presenting catchphrases that increase the desire to purchase, deciding whether to postpone the business negotiation, or deciding whether to end it. The dialogue scenario 916 corresponding to the sixth stage performs dialogues such as confirming delivery dates or guiding the user to an electronic contract. The dialogue scenario 917 corresponding to the seventh stage performs interactions such as creating daily reports or generating and sending questionnaires.

対話制御部７０５の対話状態管理部９０２は、現在の商談の状態に応じて、複数の対話シナリオ９１１～９１７から、使用する対話シナリオを選択する。例えば、対話制御部７０５は、第１段階に対応する対話シナリオ９１１から商談を開始し、商談が進行するに伴い、商談段階を上げる。また、対話制御部７０５は、ユーザ１１が商談に否定的な場合、商談段階を下げる。 The dialogue state management unit 902 of the dialogue control unit 705 selects a dialogue scenario to use from a number of dialogue scenarios 911 to 917 depending on the current state of the negotiation. For example, the dialogue control unit 705 starts the negotiation from dialogue scenario 911 corresponding to the first stage, and increases the negotiation stage as the negotiation progresses. In addition, the dialogue control unit 705 decreases the negotiation stage if the user 11 is negative about the negotiation.

これにより、生成部７０４は、予め設定された複数の商談段階に応じて、対話エージェントの応答内容を変更することができる。なお、商談段階は、予め設定された複数の対話段階の一例である。 This allows the generation unit 704 to change the response content of the dialogue agent according to multiple pre-set negotiation stages. Note that the negotiation stage is an example of multiple pre-set dialogue stages.
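The raising and lowering of the negotiation stage can be sketched as a small state holder. The class and method names are hypothetical, and the mapping from stage n to scenario number 910 + n simply mirrors the numbering of dialogue scenarios 911 to 917 for stages 1 to 7; it is not a rule stated in the description.

```python
class NegotiationStageManager:
    """Tracks the current negotiation stage and selects a dialogue scenario."""

    def __init__(self, num_stages: int = 7) -> None:
        if num_stages < 2:
            raise ValueError("at least two stages are required")
        self.num_stages = num_stages
        self.stage = 1  # the negotiation starts at the first stage

    def update(self, progressed: bool, user_negative: bool) -> int:
        """Lower the stage if the user is negative, raise it on progress."""
        if user_negative:
            self.stage = max(1, self.stage - 1)
        elif progressed:
            self.stage = min(self.num_stages, self.stage + 1)
        return self.stage

    def scenario_id(self) -> int:
        """Scenario number for the current stage (assumed 911 to 917)."""
        return 910 + self.stage
```

In this sketch a negative reaction takes precedence over progress, so a turn that is both productive and negative still lowers the stage; that ordering is a design assumption.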

商材レコメンド部９１８は、例えば、第５段階において、第１段階～第４段階の対話内容に基づいて、ユーザ１１に薦める商材を選択する商材レコメンド処理を実行する。判断部９１９は、例えば、第５段階において、第１段階～第５段階の対話内容に基づいて、商談を延期するか否か、又は商談を終了するか否か等を判断する判断処理を実行する。 The product recommendation unit 918 executes a product recommendation process for selecting a product to recommend to the user 11, for example, in the fifth stage, based on the dialogue content in the first to fourth stages. The judgment unit 919 executes a judgment process for determining, for example, in the fifth stage, based on the dialogue content in the first to fifth stages, whether to postpone the business negotiation, or whether to end the business negotiation.

なお、図９に示した、複数の商談段階１～７の数は一例であり、２つ以上の他の数であってもよい。また、図９に示した、複数の対話シナリオ９１１～９１７の対話内容は一例であり、他の内容であってもよい。 Note that the number of multiple negotiation stages 1 to 7 shown in FIG. 9 is just an example, and may be any other number greater than or equal to two. Also, the dialogue content of the multiple dialogue scenarios 911 to 917 shown in FIG. 9 is just an example, and may be other content.

＜処理の流れ＞ <Processing flow>
図１０Ａは、第１の実施形態に係る対話処理の例を示すフローチャートである。この処理は、例えば、図７に示すようなサーバ装置１００の機能構成と、図９に示すような生成部７０４の機能構成とを有する対話システム１が実行する対話処理の例を示している。 FIG. 10A is a flowchart showing an example of dialogue processing according to the first embodiment. This processing shows an example of dialogue processing executed by a dialogue system 1 having, for example, the functional configuration of a server device 100 as shown in FIG. 7 and the functional configuration of a generation unit 704 as shown in FIG. 9.

ステップＳ１００１において、対話システム１は、第１段階に対応する対話シナリオ９１１で対話を開始するとともに、ユーザ１１に関する顧客データがあるか否かを判断する。顧客データがある場合、対話システム１は、処理をステップＳ１００２に移行させる。一方、顧客データがない場合、対話システム１は、処理をステップＳ１００８に移行させる。 In step S1001, the dialogue system 1 starts a dialogue with the dialogue scenario 911 corresponding to the first stage, and determines whether or not there is customer data related to the user 11. If there is customer data, the dialogue system 1 transitions the process to step S1002. On the other hand, if there is no customer data, the dialogue system 1 transitions the process to step S1008.

ここで、ステップＳ１００２～Ｓ１００５の処理と、ステップＳ１００８～Ｓ１０１１の処理は、同様の商談段階になっているが、利用する対話シナリオが異なる。例えば、ステップＳ１００２～Ｓ１００５の処理では、対話システム１は、顧客データを持っているので、過去の商談データに基づいて、商談を進める対話シナリオを用いることが望ましい。一方、ステップＳ１００８～Ｓ１０１１の処理では、対話システム１は、顧客データを持っていないので、顧客データの作成に必要な情報も含めて、丁寧にヒアリングする対話シナリオを用いることが望ましい。これにより、対話エージェントが、ユーザ１１に、毎回、同じような内容をヒアリングしてしまうことを抑制することができる。 Here, the processing of steps S1002 to S1005 and the processing of steps S1008 to S1011 are at similar stages of negotiation, but the dialogue scenarios used are different. For example, in the processing of steps S1002 to S1005, since the dialogue system 1 has customer data, it is desirable to use a dialogue scenario that advances negotiations based on past negotiation data. On the other hand, in the processing of steps S1008 to S1011, since the dialogue system 1 does not have customer data, it is desirable to use a dialogue scenario that carefully listens to the customer, including information necessary to create customer data. This makes it possible to prevent the dialogue agent from asking the user 11 the same questions every time.

ステップＳ１００２に移行すると、対話システム１は、第２段階に対応する対話シナリオ９１２で対話を行うとともに、名刺交換、又はスモールトークができたか否かを判断する。名刺交換、又はスモールトークができた場合、対話システム１は、処理をステップＳ１００３に移行させる。一方、名刺交換、又はスモールトークができていない場合、対話システム１は、例えば、図１０Ａの処理（商談）を終了する。好ましくは、対話システム１は、第２段階に対応する対話シナリオ９１２で対話を開始してから、所定の時間を経過しても、名刺交換、又はスモールトークができていない場合、商談を終了する。 When the process proceeds to step S1002, the dialogue system 1 conducts a dialogue using the dialogue scenario 912 corresponding to the second stage, and determines whether business card exchange or small talk has taken place. If it has, the dialogue system 1 transitions the process to step S1003. If it has not, the dialogue system 1 ends, for example, the process (business negotiation) of FIG. 10A. Preferably, the dialogue system 1 ends the business negotiation if business card exchange or small talk has not taken place even after a predetermined time has elapsed since the dialogue was started using the dialogue scenario 912 corresponding to the second stage.

ステップＳ１００３に移行すると、対話システム１は、第３段階に対応する対話シナリオ９１３で対話を行うとともに、例えば、業務内容、又は使用機器等の状況をヒアリングできたか否かを判断する。状況をヒアリングできた場合、対話システム１は、処理をステップＳ１００４に移行させる。一方、状況をヒアリングできていない場合、対話システム１は、例えば、図１０Ａの処理（商談）を終了する。好ましくは、対話システム１は、第３段階に対応する対話シナリオ９１３で対話を開始してから、所定の時間を経過しても、状況をヒアリングできていない場合、商談を終了する。 When the process proceeds to step S1003, the dialogue system 1 conducts a dialogue using the dialogue scenario 913 corresponding to the third stage, and determines whether or not, for example, the business content or the status of the equipment being used, etc., has been obtained. If the status has been obtained, the dialogue system 1 proceeds to step S1004. On the other hand, if the status has not been obtained, the dialogue system 1 ends the process (business negotiation) of FIG. 10A, for example. Preferably, the dialogue system 1 ends the business negotiation if the status has not been obtained even after a predetermined time has elapsed since the dialogue was started using the dialogue scenario 913 corresponding to the third stage.

ステップＳ１００４に移行すると、対話システム１は、第４段階に対応する対話シナリオ９１４で対話を行うとともに、例えば、潜在ニーズ、又は予測ニーズ等のニーズを聞き取りできたか否かを判断する。ニーズを聞き取りできた場合、対話システム１は、処理をステップＳ１００５に移行させる。一方、ニーズを聞き取りできていない場合、対話システム１は、処理をステップＳ１００３に戻す。 When the process proceeds to step S1004, the dialogue system 1 conducts a dialogue using the dialogue scenario 914 corresponding to the fourth stage, and determines whether or not it has been able to hear needs, such as potential needs or predicted needs. If it has been able to hear needs, the dialogue system 1 transitions the process to step S1005. On the other hand, if it has not been able to hear needs, the dialogue system 1 returns the process to step S1003.

ステップＳ１００５に移行すると、対話システム１は、第５段階に対応する対話シナリオ９１５で対話を行うとともに、商材を提案できたか否かを判断する。商材を提案できた場合、対話システム１は、処理をステップＳ１００６に移行させる。一方、商材を提案できていない場合、対話システム１は、処理をステップＳ１００４又はステップＳ１００５に戻す。 When the process proceeds to step S1005, the dialogue system 1 conducts a dialogue using the dialogue scenario 915 corresponding to the fifth stage, and determines whether or not the product has been proposed. If the product has been proposed, the dialogue system 1 proceeds to step S1006. On the other hand, if the product has not been proposed, the dialogue system 1 returns the process to step S1004 or step S1005.

例えば、対話システム１は、ステップＳ１００３、Ｓ１００４で取得した情報に基づいて、商材レコメンド部９１８を用いて、ユーザ１１に提案する商材を選択する。ただし、取得した情報が不十分であり、商材レコメンド部９１８が、ユーザ１１に提案する商材を選択できない場合、対話システム１は、処理をステップＳ１００４又はステップＳ１００５に戻す。 For example, the dialogue system 1 uses the product recommendation unit 918 to select a product to propose to the user 11 based on the information acquired in steps S1003 and S1004. However, if the acquired information is insufficient and the product recommendation unit 918 cannot select a product to propose to the user 11, the dialogue system 1 returns the process to step S1004 or step S1005.

ステップＳ１００６に移行すると、対話システム１は、第６段階に対応する対話シナリオ９１６で対話を行うとともに、契約を締結できたか否かを判断する。契約を締結できた場合、対話システム１は、処理をステップＳ１００７に移行させる。一方、契約を締結できていない場合、対話システム１は、例えば、処理をステップＳ１００５に戻す。 When the process proceeds to step S1006, the dialogue system 1 conducts a dialogue using the dialogue scenario 916 corresponding to the sixth stage, and determines whether or not a contract has been concluded. If a contract has been concluded, the dialogue system 1 proceeds to step S1007. On the other hand, if a contract has not been concluded, the dialogue system 1 returns the process to step S1005, for example.

ステップＳ１００７に移行すると、対話システム１は、第７段階に対応する対話シナリオ９１７で対話を行うとともに、商談の整理ができたか否かを判断する。商談の整理ができた場合、対話システム１は、図１０Ａの処理を終了する。 When the process proceeds to step S1007, the dialogue system 1 conducts a dialogue using the dialogue scenario 917 corresponding to the seventh stage, and determines whether the business negotiations have been organized. If the business negotiations have been organized, the dialogue system 1 ends the process of FIG. 10A.

一方、ステップＳ１００１からステップＳ１００８に移行すると、対話システム１は、第２段階に対応する対話シナリオ９１２（新規顧客用）で対話を行うとともに、名刺交換、又はスモールトークができたか否かを判断する。名刺交換、又はスモールトークができた場合、対話システム１は、処理をステップＳ１００９に移行させる。一方、名刺交換、又はスモールトークができていない場合、対話システム１は、図１０Ａの処理（商談）を終了する。好ましくは、対話システム１は、第２段階に対応する対話シナリオ９１２（新規顧客用）で対話を開始してから、所定の時間を経過しても、名刺交換、又はスモールトークができていない場合、商談を終了する。 On the other hand, when the process moves from step S1001 to step S1008, the dialogue system 1 performs a dialogue in dialogue scenario 912 (for a new customer) corresponding to the second stage, and determines whether or not business card exchange or small talk has occurred. If business card exchange or small talk has occurred, the dialogue system 1 transitions the process to step S1009. On the other hand, if business card exchange or small talk has not occurred, the dialogue system 1 ends the process (business negotiation) of FIG. 10A. Preferably, the dialogue system 1 ends the business negotiation if business card exchange or small talk has not occurred even after a predetermined time has elapsed since starting the dialogue in dialogue scenario 912 (for a new customer) corresponding to the second stage.

ステップＳ１００９に移行すると、対話システム１は、第３段階に対応する対話シナリオ９１３（新規顧客用）で対話を行うとともに、例えば、業務内容、又は使用機器等の状況をヒアリングできたか否かを判断する。状況をヒアリングできた場合、対話システム１は、処理をステップＳ１０１０に移行させる。一方、状況をヒアリングできていない場合、対話システム１は、図１０Ａの処理（商談）を終了する。好ましくは、対話システム１は、第３段階に対応する対話シナリオ９１３（新規顧客用）で対話を開始してから、所定の時間を経過しても、状況をヒアリングできていない場合、商談を終了する。 When the process proceeds to step S1009, the dialogue system 1 conducts a dialogue using dialogue scenario 913 (for a new customer) corresponding to the third stage, and determines whether or not it has been able to hear about, for example, the business content or the status of the equipment being used. If it has been able to hear about the status, the dialogue system 1 transitions the process to step S1010. On the other hand, if it has not been able to hear about the status, the dialogue system 1 ends the process (business negotiation) in FIG. 10A. Preferably, if the dialogue system 1 has not been able to hear about the status even after a predetermined time has elapsed since starting the dialogue using dialogue scenario 913 (for a new customer) corresponding to the third stage, it ends the business negotiation.

ステップＳ１０１０に移行すると、対話システム１は、第４段階に対応する対話シナリオ９１４（新規顧客用）で対話を行うとともに、例えば、潜在ニーズ、又は予測ニーズ等のニーズを聞き取りできたか否かを判断する。ニーズを聞き取りできた場合、対話システム１は、処理をステップＳ１０１１に移行させる。一方、ニーズを聞き取りできていない場合、対話システム１は、処理をステップＳ１００９に戻す。 When the process proceeds to step S1010, the dialogue system 1 conducts a dialogue using the dialogue scenario 914 (for a new customer) corresponding to the fourth stage, and determines whether or not it has been able to hear needs, such as potential needs or predicted needs. If it has been able to hear needs, the dialogue system 1 transitions the process to step S1011. On the other hand, if it has not been able to hear needs, the dialogue system 1 returns the process to step S1009.

ステップＳ１０１１に移行すると、対話システム１は、第５段階に対応する対話シナリオ９１５（新規顧客用）で対話を行うとともに、商材を提案できたか否かを判断する。商材を提案できた場合、対話システム１は、処理をステップＳ１００６に移行させる。一方、商材を提案できていない場合、対話システム１は、処理をステップＳ１０１０に戻す。 When the process proceeds to step S1011, the dialogue system 1 conducts a dialogue using the dialogue scenario 915 (for a new customer) corresponding to the fifth stage, and determines whether or not the product has been proposed. If the product has been proposed, the dialogue system 1 transitions the process to step S1006. On the other hand, if the product has not been proposed, the dialogue system 1 returns the process to step S1010.

図１０Ａの処理により、対話システム１は、予め設定された複数の対話段階に応じて、対話エージェントの応答内容を変更することができる。 By the process of FIG. 10A, the dialogue system 1 can change the response content of the dialogue agent according to multiple pre-set dialogue stages.
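
The stage transitions described for FIG. 10A can be read as a small state machine. The following is a minimal Python sketch under stated assumptions, not the implementation of the embodiment: the names `TRANSITIONS`, `END`, and `run` are hypothetical, each step's judgment is reduced to a boolean, steps S1001 and S1002 are left out, the option of retrying S1005 itself is not modeled, and the behavior when step S1007 does not succeed (which the text does not state) is assumed to be a retry of S1007.

```python
END = "END"

# step -> (next step if the stage goal was achieved, next step otherwise).
# Mirrors the transitions described for FIG. 10A; S1001 and S1002 are omitted.
TRANSITIONS = {
    # existing-customer path
    "S1003": ("S1004", END),
    "S1004": ("S1005", "S1003"),
    "S1005": ("S1006", "S1004"),   # the text also allows retrying S1005
    # new-customer path
    "S1008": ("S1009", END),
    "S1009": ("S1010", END),
    "S1010": ("S1011", "S1009"),
    "S1011": ("S1006", "S1010"),
    # shared closing stages
    "S1006": ("S1007", "S1005"),
    "S1007": (END, "S1007"),       # failure branch is an assumption
}


def run(start, outcomes):
    """Follow the transitions for a sequence of boolean stage outcomes."""
    step, path = start, [start]
    for achieved in outcomes:
        if step == END:
            break
        on_success, on_failure = TRANSITIONS[step]
        step = on_success if achieved else on_failure
        path.append(step)
    return path


# New customer: needs are not heard on the first attempt at S1010.
print(run("S1008", [True, True, False, True, True, True, True, True]))
```

Keeping the transitions in a table rather than in nested conditionals makes it straightforward to add a branch such as the one in FIG. 10B by changing a single entry.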

なお、図１０Ａの処理は一例である。例えば、対話システム１は、ステップＳ１００６において、契約締結できていない場合、図１０ＢのステップＳ１０２１、Ｓ１０２２の処理を実行してもよい。 The process in FIG. 10A is an example. For example, if a contract has not been concluded in step S1006, the dialogue system 1 may execute the processes in steps S1021 and S1022 in FIG. 10B.

図１０Ｂは、第１の実施形態に係る対話処理の例を示すフローチャート（２）である。ステップＳ１００６において、契約を締結できていない場合、対話システム１は、処理をステップＳ１０２１に移行させる。 FIG. 10B is a flowchart (2) showing an example of dialogue processing according to the first embodiment. If a contract has not been concluded in step S1006, the dialogue system 1 transitions the processing to step S1021.

ステップＳ１０２１に移行すると、対話システム１は、ユーザ１１の感情分析がポジティブであるか否かを判断する。感情分析がポジティブである場合、対話システム１は、処理をステップＳ１００５に戻す。一方、感情分析がポジティブでない場合（ネガティブである場合）、対話システム１は、処理をステップＳ１０２２に移行させる。 When the process proceeds to step S1021, the dialogue system 1 determines whether the emotion analysis of the user 11 is positive. If the emotion analysis is positive, the dialogue system 1 returns the process to step S1005. On the other hand, if the emotion analysis is not positive (if it is negative), the dialogue system 1 proceeds to step S1022.

ステップＳ１０２２に移行すると、対話システム１は、例えば、終了（又は延期）の挨拶をして、図１０Ｂの処理を終了する。例えば、対話システム１は、対話エージェントに、商談終了の挨拶をさせるとともに、お辞儀をさせてもよい。 When the process proceeds to step S1022, the dialogue system 1, for example, issues a closing (or postponement) greeting and ends the process of FIG. 10B. For example, the dialogue system 1 may cause the dialogue agent to issue a closing greeting and bow.

図１１は、第１の実施形態に係る非言語情報の利用例について説明するための図である。例えば、対話システム１は、ユーザ１１の映像１１００から、ユーザ１１の顔が向いている方向を示す方向ベクトル１１０１を取得し、取得した方向ベクトル１１０１と、ユーザ１１の瞳の位置１１０２とに基づいて、ユーザ１１の視線を表す視線情報を取得する。 Figure 11 is a diagram for explaining an example of the use of non-verbal information according to the first embodiment. For example, the dialogue system 1 acquires, from an image 1100 of the user 11, a direction vector 1101 indicating the direction in which the face of the user 11 is facing, and acquires gaze information representing the gaze of the user 11 based on the acquired direction vector 1101 and the position 1102 of the pupils of the user 11.

例えば、ユーザ１１が、対話エージェントが提示した商材に対して関心を示している場合、ユーザ１１は、対話画面に表示した商材を凝視する傾向にあるため、例えば、視線１１０３ａ、１１０３ｂのように、視線はあまり変動しない（分散が小さい）。一方、ユーザ１１が、対話エージェントが提示した商材に対して関心を示していない場合、注意力が低下するので、例えば、視線１１０３ｃのように、視線が変動する（分散が大きい）。 For example, if user 11 is interested in the product presented by the dialogue agent, user 11 tends to stare at the product displayed on the dialogue screen, and so the gaze does not fluctuate much (small variance), for example, as in gazes 1103a and 1103b. On the other hand, if user 11 is not interested in the product presented by the dialogue agent, their attention decreases, and so the gaze fluctuates (large variance), for example, as in gaze 1103c.

従って、対話システム１は、例えば、ユーザ１１に商材を提示した後に、ユーザ１１の視線を表す視線情報を取得して、視線の分散が小さい場合、ユーザ１１の感情分析がポジティブ（商談を続ける）と判断してもよい。また、対話システム１は、例えば、ユーザ１１に商材を提示した後に、ユーザ１１の視線を表す視線情報を取得して、視線の分散が大きい場合、ユーザ１１の感情分析がネガティブ（商談を終了、又は延期する）と判断してもよい。 Therefore, for example, the dialogue system 1 may obtain gaze information representing the gaze of the user 11 after presenting a product to the user 11, and if the gaze variance is small, determine that the emotion analysis of the user 11 is positive (continue the business negotiation). Also, for example, the dialogue system 1 may obtain gaze information representing the gaze of the user 11 after presenting a product to the user 11, and if the gaze variance is large, determine that the emotion analysis of the user 11 is negative (end or postpone the business negotiation).
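
The variance-based judgment described above can be sketched as follows. This is only an illustration under stated assumptions: gaze samples are taken to be normalized (x, y) screen coordinates, the total variance is taken as the sum of the per-axis population variances, and the threshold `0.01` as well as the names `gaze_variance` and `judge_sentiment` are hypothetical; the description does not specify any of these.

```python
from statistics import pvariance


def gaze_variance(points):
    """Sum of the per-axis population variances of (x, y) gaze samples."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return pvariance(xs) + pvariance(ys)


def judge_sentiment(points, threshold=0.01):
    """Small variance -> 'positive' (continue); otherwise 'negative'."""
    return "positive" if gaze_variance(points) < threshold else "negative"


# Illustrative samples only (assumed values).
steady = [(0.50, 0.50), (0.51, 0.50), (0.50, 0.49)]      # like gazes 1103a, 1103b
wandering = [(0.10, 0.90), (0.80, 0.20), (0.40, 0.60)]   # like gaze 1103c
print(judge_sentiment(steady), judge_sentiment(wandering))
```

In practice the threshold would have to be tuned to the camera, the screen size, and the length of the observation window.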

なお、この方法は、商談の終了（又は延期）の判断に限られず、例えば、より高い商談段階に移行するか、より低い商談段階に戻るかを判断するために用いてもよい。 Note that this method is not limited to determining whether to end (or postpone) a negotiation, but may also be used, for example, to determine whether to move to a higher negotiation stage or return to a lower negotiation stage.

［第２の実施形態］ [Second embodiment]
第２の実施形態では、介護用途に対応した対話処理の一例について説明する。介護用途では、回想法に対応する対話シナリオを用いることができる。回想法とは、高齢者等が、自分の過去のことを話すことで精神を安定させ、認知機能の改善も期待できる心理療法のことである。 In the second embodiment, an example of dialogue processing for nursing care applications will be described. In nursing care applications, a dialogue scenario corresponding to reminiscence therapy can be used. Reminiscence therapy is a form of psychotherapy in which elderly people and others talk about their own past, which stabilizes their state of mind and can also be expected to improve cognitive function.

回想法で懐かしい思い出を話題にして対話することは、右脳で浮かんだイメージ映像を、左脳が言語化していく作業だと言われている。起承転結の会話は「５Ｗ（When, Where, Who, What, Why）話法」と言い、場面の様子やどんな風だったかを中心にした会話を「１Ｈ（How）話法」という。起承転結をともなうストーリーよりも、その時の様子や場面を対話する方が、楽しさが倍増すると言われている。 It is said that talking about fond memories using reminiscence therapy is a process in which the left brain verbalizes the images that arise in the right brain. A conversation with an introduction, development, twist and conclusion is called the "5W (When, Where, Who, What, Why) method of speaking," while a conversation that focuses on the situation and what it was like is called the "1H (How) method of speaking." It is said that talking about the situation and the scene at the time is twice as enjoyable as a story with an introduction, development, twist and conclusion.

そこで、第２の実施形態では、対話システム１は、回想法の対話シナリオを用いて、対話の進行に伴い、具体的に対話を深掘りさせるために、１Ｈ話法を重ねる対話シナリオを設け、その対話シナリオに基づいて、対話エージェントの応答内容を生成する。 Therefore, in the second embodiment, the dialogue system 1 uses a reminiscence-therapy dialogue scenario that repeatedly applies the 1H method of speaking, so that the dialogue is explored in more concrete depth as it progresses, and generates the response content of the dialogue agent based on that dialogue scenario.

なお、第２の実施形態に係る対話システム１の機能構成は、図７で説明した対話システム１の機能構成と同様でよい。 The functional configuration of the dialogue system 1 according to the second embodiment may be the same as the functional configuration of the dialogue system 1 described in FIG. 7.

＜処理の流れ＞ <Processing flow>
図１２、１３は、第２の実施形態に係る対話シナリオの遷移の一例を示す図である。この図は、回想法の対話シナリオの遷移の一例を示している。なお、実際の遷移は、ユーザ１１の発話によって変わるため、この図は、図１２、１３に示すように、ユーザ１１が発話したときの遷移の一例を示している。 FIGS. 12 and 13 are diagrams showing an example of transitions of a dialogue scenario according to the second embodiment. These diagrams show an example of transitions of a dialogue scenario for reminiscence therapy. Note that the actual transitions depend on the utterances of the user 11, so these diagrams show the transitions for the case where the user 11 speaks as shown in FIGS. 12 and 13.

例えば、状態１２０１において、対話エージェントは、「学生時代はなにかスポーツをやっていましたか？」と発話し、状態１２０２において、ユーザ１１は、一例として、「スポーツＡをやっていた」と発話したものとする。 For example, in state 1201, the dialogue agent utters, "Did you play any sports when you were a student?", and in state 1202, the user 11 utters, as an example, "I played sport A."

この場合、対話システム１は、第１段階として、対話エージェントに、スポーツＡの全般の知識を振り返る発話をさせる。例えば、状態１２０３において、対話エージェントは、「ポジションはどこでしたか？」と発話する。また、状態１２０４において、ユーザ１１は、一例として、「ポジションＢだった」と発話したものとする。 In this case, in the first step, the dialogue system 1 has the dialogue agent make an utterance that reflects on general knowledge of sport A. For example, in state 1203, the dialogue agent utters, "What position were you in?". Also, in state 1204, the user 11, as an example, utters, "I was in position B."

この場合、対話システム１は、第２段階として、対話エージェントに、スポーツＡの話題を深掘りする発話をさせる。例えば、対話エージェントは、状態１２０５、１２０９、１２１３、１２１５から、ランダムに１つの状態を選択し、選択した状態に遷移させる。 In this case, in the second stage, the dialogue system 1 causes the dialogue agent to make an utterance that delves deeper into the topic of sport A. For example, the dialogue agent randomly selects one state from states 1205, 1209, 1213, and 1215, and transitions to the selected state.

一例として、状態１２０５に遷移すると、対話エージェントは、「試合にでたことはありますか？」と発話する。また、状態１２０６において、ユーザ１１は、一例として、「何度もでていた」と発話したものとする。 As an example, when transitioning to state 1205, the dialogue agent utters, "Have you ever played in a match?". Also, in state 1206, the user 11 utters, as an example, "I've played many times."

ここで、対話システム１は、第３段階として、対話エージェントに、状態１２０５の話題をさらに深掘りする発話をさせる。例えば、状態１２０７において、対話エージェントは、「なにか賞をとりましたか」と発話する。また、状態１２０８において、ユーザ１１は、一例として、「県大会に出場した」と発話したものとする。ここで、対話システム１は、一例として、状態１２１７に状態を遷移させる。 Here, as a third stage, the dialogue system 1 causes the dialogue agent to make an utterance that delves deeper into the topic of state 1205. For example, in state 1207, the dialogue agent utters, "Did you win any awards?". Also, in state 1208, as an example, the user 11 utters, "I participated in the prefectural tournament." Here, as an example, the dialogue system 1 transitions the state to state 1217.

別の一例として、状態１２０４から状態１２０９に遷移すると、対話エージェントは、「どのくらいの頻度でスポーツＡをやっていましたか？」と発話する。また、状態１２１０において、ユーザ１１は、一例として、「週に３回以上やっていた」と発話したものとする。 As another example, when transitioning from state 1204 to state 1209, the dialogue agent utters, "How often did you play sport A?". In addition, in state 1210, the user 11 utters, as an example, "I played it more than three times a week."

ここで、対話システム１は、第３段階として、対話エージェントに、状態１２０９の話題をさらに深掘りする発話をさせる。例えば、状態１２１１において、対話エージェントは、「スポーツＡのどこが好きでしたか？」と発話する。また、状態１２１２において、ユーザ１１は、一例として、「チームでプレイできるところ」と発話したものとする。ここで、対話システム１は、一例として、状態１２１７に状態を遷移させる。 Here, as a third stage, the dialogue system 1 causes the dialogue agent to make an utterance that delves deeper into the topic of state 1209. For example, in state 1211, the dialogue agent utters, "What did you like about sport A?". Also, in state 1212, as an example, the user 11 utters, "The fact that you can play in a team." Here, as an example, the dialogue system 1 transitions the state to state 1217.

別の一例として、状態１２０４から状態１２１３に遷移すると、対話エージェントは、「スポーツＡをすきでしたか？」と発話する。また、状態１２１４において、ユーザ１１は、一例として、「はい」と発話したものとする。ここで、対話システム１は、一例として、状態１２１１に状態を遷移させる。 As another example, when transitioning from state 1204 to state 1213, the dialogue agent utters "Did you like sport A?". Also, in state 1214, as an example, the user 11 utters "Yes." Here, as an example, the dialogue system 1 transitions the state to state 1211.

別の一例として、状態１２０４から状態１２１５に遷移すると、対話エージェントは、「スポーツＡを観戦することはありますか？」と発話する。また、状態１２１６において、ユーザ１１は、一例として、「あります」と発話したものとする。ここで、対話システム１は、一例として、状態１２１７に状態を遷移させる。このように、対話システム１は、第３段階の深掘りを省略してもよい。 As another example, when transitioning from state 1204 to state 1215, the dialogue agent utters, "Do you ever watch sport A?". Also, in state 1216, as an example, the user 11 utters, "Yes." Here, as an example, the dialogue system 1 transitions the state to state 1217. In this way, the dialogue system 1 may omit the third stage of digging deeper.

状態１２１７に遷移すると、対話エージェントは、「教えてくれてありがとうございます。スポーツＡを楽しめているのですね。素晴らしいです。」と発話し、状態１２１８において、ユーザ１１は、一例として、「はい」と発話したものとする。 When the state transitions to 1217, the dialogue agent utters, "Thank you for letting me know. I hear you're enjoying sport A. That's great.", and in state 1218, the user 11 utters, for example, "Yes."

ここで、対話システム１は、例えば、対話を終了してもよいし、図１３の状態１３０１に、さらに状態を遷移させてもよい。 At this point, the dialogue system 1 may, for example, end the dialogue, or may further transition the state to state 1301 in FIG. 13.

状態１３０１に遷移すると、対話エージェントは、例えば、「好きなチームはありましたか？」と発話する。また、状態１３０２において、ユーザ１１は、一例として、「チームＣが好きだった」と発話したものとする。 When the state transitions to 1301, the dialogue agent utters, for example, "Did you have a favorite team?". Also, in state 1302, it is assumed that the user 11 utters, for example, "I liked team C."

この場合、対話システム１は、第４段階として、対話エージェントに、スポーツＡで好きなチーム（又は選手）について深掘りする発話をさせる。例えば、状態１３０３において、対話エージェントは、「チームＣのどんなところが好きでしたか？」と発話する。また、状態１３０４において、ユーザ１１は、一例として、「強いところ」と発話したものとする。この場合、対話システム１は、対話エージェントに、終了の挨拶をさせる。例えば、状態１３０５において、対話エージェントは、「そうなんですね。教えてくれてありがとうございます。お時間を頂きありがとうございました。対話を終了します。」等と発話して、対話を終了する。 In this case, as a fourth stage, the dialogue system 1 causes the dialogue agent to make an utterance that delves deeper into the user's favorite team (or player) in sport A. For example, in state 1303, the dialogue agent utters, "What did you like about team C?". Also, in state 1304, it is assumed that the user 11 utters, as an example, "That they were strong." In this case, the dialogue system 1 causes the dialogue agent to give a closing greeting. For example, in state 1305, the dialogue agent ends the dialogue by uttering, "I see. Thank you for letting me know. Thank you for your time. I will end the dialogue." or the like.

図１２、１３の遷移により、対話システム１は、回想法の対話シナリオを用いて、対話の進行に伴い、具体的に対話を深掘りさせるために、１Ｈ話法を重ねて、対話エージェントに対話させることができる。 Through the transitions in FIGS. 12 and 13, the dialogue system 1 can use a reminiscence-therapy dialogue scenario and have the dialogue agent repeatedly apply the 1H method of speaking, so that the dialogue is explored in more concrete depth as it progresses.
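
The example transitions of FIGS. 12 and 13 can be written down as a directed graph over the agent-side states. The sketch below is illustrative only and rests on assumptions: the names `NEXT_AGENT_STATES` and `walk` are hypothetical, the user-side states (1202, 1204, and so on) are folded into the edges, the dialogue is assumed to continue from state 1217 to state 1301 rather than end there, and only the example paths given in the text are encoded.

```python
import random

# Agent-side state -> candidate next agent-side states (example paths only).
NEXT_AGENT_STATES = {
    1201: [1203],                     # opening question -> first stage
    1203: [1205, 1209, 1213, 1215],   # second stage: one state chosen at random
    1205: [1207],                     # third stage follows 1205
    1207: [1217],
    1209: [1211],                     # third stage follows 1209
    1211: [1217],
    1213: [1211],
    1215: [1217],                     # third stage omitted
    1217: [1301],                     # assumption: continue to FIG. 13
    1301: [1303],                     # fourth stage
    1303: [1305],
    1305: [],                         # closing greeting, dialogue ends
}


def walk(start=1201, rng=None):
    """Return one path of agent-side states from start to the closing state."""
    rng = rng or random.Random()
    path = [start]
    while NEXT_AGENT_STATES[path[-1]]:
        path.append(rng.choice(NEXT_AGENT_STATES[path[-1]]))
    return path


print(walk(rng=random.Random(0)))
```

A real system would select the next state from the recognized user utterance as well; the random choice here only reproduces the selection described for the second stage.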

［第３の実施形態］ [Third embodiment]
例えば、図３に示すような対話画面３００において、音声による対話と、バーチャルヒューマン３０１の所作だけではなく、補助的な視覚情報を追加することにより、商談においても、介護においても、対話の深掘りが容易になる。 For example, on a dialogue screen 300 as shown in FIG. 3, adding auxiliary visual information in addition to the voice dialogue and the gestures of the virtual human 301 makes it easier to deepen the dialogue, both in business negotiations and in nursing care.

図１４は、第３の実施形態に係る対話画面の一例を示す図である。図１４の例では、対話画面１４００には、バーチャルヒューマン（対話エージェント）１４０１、及び文字列による対話３０２に加えて、対話内容に基づいて生成した画像である挿絵１４０３が表示されている。この挿絵１４０３により、ユーザ１１は、対話内容であるクロスカントリースキーのイメージを、容易に思い浮かべることができる。なお、挿絵１４０３には、例えば、効果音、又は対話内容とは別の音声等の音情報が含まれていてもよい。 Figure 14 is a diagram showing an example of a dialogue screen according to the third embodiment. In the example of Figure 14, in addition to a virtual human (dialogue agent) 1401 and a dialogue 302 using character strings, an illustration 1403, which is an image generated based on the dialogue content, is displayed on the dialogue screen 1400. This illustration 1403 allows the user 11 to easily visualize the image of cross-country skiing, which is the dialogue content. Note that the illustration 1403 may include sound information such as sound effects or voices separate from the dialogue content.

＜機能構成＞ <Functional configuration>
図１５は、第３の実施形態に係る対話システムの機能構成の例を示す図である。図１５に示すように、第３の実施形態に係るサーバ装置１００は、図７で説明したサーバ装置１００の機能構成に加えて、画像生成部１５０１を有している。 Fig. 15 is a diagram showing an example of the functional configuration of the dialogue system according to the third embodiment. As shown in Fig. 15, the server device 100 according to the third embodiment has an image generation unit 1501 in addition to the functional configuration of the server device 100 described in Fig. 7.

画像生成部１５０１は、例えば、生成部７０４に含まれ、ユーザ１１との対話内容に基づいて生成した画像である挿絵１４０３を生成する画像生成処理を実行する。例えば、画像生成部１５０１は、テキスト情報から画像を生成する学習済の機械学習モデル（例えば、ＤＡＬＬ・Ｅ、ＤＡＬＬ・Ｅ２、又はStable Diffusion等）を利用して、挿絵１４０３を生成することができる。また、画像生成部１５０１は、ユーザ１１の言語情報と非言語情報とのうち、少なくとも１つに基づいて、対話内容に関する画像である挿絵１４０３を生成してもよい。 The image generation unit 1501 is, for example, included in the generation unit 704, and executes an image generation process that generates an illustration 1403, which is an image based on the content of the dialogue with the user 11. For example, the image generation unit 1501 can generate the illustration 1403 using a trained machine learning model that generates images from text information (e.g., DALL·E, DALL·E 2, or Stable Diffusion). The image generation unit 1501 may also generate the illustration 1403, which is an image related to the content of the dialogue, based on at least one of the linguistic information and the non-linguistic information of the user 11.

例えば、画像生成部１５０１は、ユーザ１１が発話した「クロスカントリースキー」という言語情報と、ユーザ１１の音声の「トーンが高い」という非言語情報とから、ユーザ１１の感情分析を「ポジティブ」と判断したときに、挿絵１４０３を生成してもよい。これにより、対話システム１は、ユーザ１１の回想をより誘発し、効果的な対話を行うことができる。 For example, the image generating unit 1501 may generate the illustration 1403 when it determines that the emotion analysis of the user 11 is "positive" based on the linguistic information of "cross-country skiing" spoken by the user 11 and the non-linguistic information of the user 11's voice being "high-pitched." This allows the dialogue system 1 to more effectively induce recollections from the user 11 and engage in dialogue.

なお、画像生成部１５０１以外の各機能構成は、図７で説明した一実施形態に係る対話システム１の機能構成と同様でよい。 Note that the functional configurations other than the image generation unit 1501 may be similar to the functional configurations of the dialogue system 1 according to one embodiment described in FIG. 7.

＜処理の流れ＞ <Processing flow>
図１６は、第３の実施形態に係る対話処理の例を示すフローチャートである。この処理は、例えば、図１５に示した機能構成を有する対話システム１が実行する対話処理の一例を示している。 FIG. 16 is a flowchart showing an example of dialogue processing according to the third embodiment. This processing is an example of the dialogue processing executed by the dialogue system 1 having the functional configuration shown in FIG. 15.

ステップＳ１６０１において、第１の取得部７０２は、ユーザ１１の発話音声を取得する。また、ステップＳ１６０２において、第１の取得部７０２は、取得したユーザ１１の発話音声に対して、音声認識処理を実行する。これにより、第１の取得部７０２は、ユーザ１１の発話音声をテキスト化した、ユーザ１１の言語情報を出力する。なお、ステップＳ１６０１、Ｓ１６０２の処理は、例えば、図８のステップＳ８０１の処理を利用してもよい。 In step S1601, the first acquisition unit 702 acquires the speech of the user 11. In addition, in step S1602, the first acquisition unit 702 executes a speech recognition process on the acquired speech of the user 11. As a result, the first acquisition unit 702 outputs language information of the user 11, which is the speech of the user 11 converted into text. Note that the processes of steps S1601 and S1602 may utilize, for example, the process of step S801 in FIG. 8.

ステップＳ１６０３において、画像生成部１５０１は、ユーザ１１の発話音声から、要約、又はキーワード等を抽出する。また、ステップＳ１６０４において、画像生成部１５０１は、抽出した要約、又はキーワード等に基づいて、例えば、図１４で説明した挿絵１４０３等の画像を生成する。 In step S1603, the image generating unit 1501 extracts a summary, keywords, etc. from the speech of the user 11. In addition, in step S1604, the image generating unit 1501 generates an image such as the illustration 1403 described in FIG. 14 based on the extracted summary or keywords, etc.

ステップＳ１６０５において、生成部７０４は、例えば、対話エージェントに発話させる音声を生成する。なお、この処理は、例えば、図８のステップＳ８０３、Ｓ８０４の処理を利用してもよい。また、生成部７０４は、画像生成部１５０１が、図１４に示すようなクロスカントリースキーの挿絵１４０３を生成した場合、対話エージェントにクロスカントリースキーに関する発話をさせる音声を生成してもよい。 In step S1605, the generation unit 704 generates, for example, speech for the dialogue agent to utter. Note that this process may utilize, for example, the processes of steps S803 and S804 in FIG. 8. Furthermore, when the image generation unit 1501 has generated an illustration 1403 of cross-country skiing as shown in FIG. 14, the generation unit 704 may generate speech that makes the dialogue agent talk about cross-country skiing.

ステップＳ１６０６において、生成部７０４は、画像生成部１５０１が生成した画像と、生成部７０４が生成した音声を、対話画面１４００に出力する。このとき、対話システム１は、バーチャルヒューマン１４０１に、表示した挿絵１４０３をアシストする動作（例えば、指で指し示す等）をさせてもよい。 In step S1606, the generation unit 704 outputs the image generated by the image generation unit 1501 and the speech generated by the generation unit 704 to the dialogue screen 1400. At this time, the dialogue system 1 may cause the virtual human 1401 to perform an action that supports the displayed illustration 1403 (e.g., pointing at it with a finger).

図１６の処理により、対話システム１は、例えば、図１４に示すように、対話画面１４００に、対話内容に関する画像である挿絵１４０３を表示することができる。 Through the processing of FIG. 16, the dialogue system 1 can display an illustration 1403, which is an image related to the dialogue content, on the dialogue screen 1400, for example, as shown in FIG. 14.
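
The flow of FIG. 16 can be pictured as a simple per-turn pipeline. The following Python sketch is a hedged illustration, not the implementation of the embodiment: `TurnOutput`, `run_turn`, and all four callables are hypothetical names, the components are passed in as plain functions so that no particular speech-recognition or image-generation service is implied, keywords are assumed to be extracted from the recognized text, and the stubs at the bottom return fixed strings purely for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TurnOutput:
    user_text: str
    keywords: List[str]
    image: str          # stand-in for generated image data
    agent_speech: str   # stand-in for synthesized audio


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    extract_keywords: Callable[[str], List[str]],
    generate_image: Callable[[List[str]], str],
    generate_speech: Callable[[str, List[str]], str],
) -> TurnOutput:
    user_text = recognize(audio)                         # S1601, S1602
    keywords = extract_keywords(user_text)               # S1603
    image = generate_image(keywords)                     # S1604
    agent_speech = generate_speech(user_text, keywords)  # speech generation step
    # S1606: the caller shows the image and plays the speech on the dialogue screen.
    return TurnOutput(user_text, keywords, image, agent_speech)


# Stub components, for illustration only.
result = run_turn(
    b"...",
    recognize=lambda audio: "I used to enjoy cross-country skiing",
    extract_keywords=lambda text: [w for w in ("cross-country skiing",) if w in text],
    generate_image=lambda kws: "illustration of " + ", ".join(kws),
    generate_speech=lambda text, kws: "Tell me more about " + kws[0] + ".",
)
print(result)
```

Injecting the components as callables keeps the sketch independent of whether recognition and image generation run locally or in an external service.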

［第４の実施形態］ [Fourth embodiment]
図１７は、第４の実施形態に係る対話システムの機能構成の例を示す図である。図１７に示すように、第４の実施形態に係るサーバ装置１００は、図７で説明したサーバ装置１００の機能構成に加えて、要約部１７０１を有している。 Fig. 17 is a diagram showing an example of the functional configuration of the dialogue system according to the fourth embodiment. As shown in Fig. 17, the server device 100 according to the fourth embodiment has a summarizing unit 1701 in addition to the functional configuration of the server device 100 described in Fig. 7.

要約部１７０１は、例えば、生成部７０４に含まれ、対話制御部７０５が記憶部７１０に記憶した対話ログを要約して、例えば、報告書等を作成する要約処理を実行する。 The summarization unit 1701 is, for example, included in the generation unit 704, and performs a summarization process to summarize the dialogue log stored in the memory unit 710 by the dialogue control unit 705, for example, to create a report, etc.

対話システム１の対話制御部７０５は、ユーザ１１と、対話エージェントとの対話が行われると、例えば、図１８に示すような対話ログ１８００を作成し、記憶部７１０等に記憶する。 When a dialogue takes place between the user 11 and the dialogue agent, the dialogue control unit 705 of the dialogue system 1 creates, for example, a dialogue log 1800 as shown in FIG. 18 and stores it in the memory unit 710, etc.

図１８の例では、対話ログ１８００は、項目として、「タイムスタンプ」、「話者」、「発話テキスト」、及び「ファイル名」等の情報を含む。「タイムスタンプ」は、ユーザ１１、又は対話エージェントによる発話が行われた日時を示す情報である。「話者」は、「発話テキスト」の発話を、ユーザが行ったか、対話エージェントが行ったかを示す情報である。「発話テキスト」は、ユーザ１１、又は対話エージェントの発話をテキスト化した情報である。「ファイル名」は、ユーザ１１の発話音声のファイル名を示す情報である。 In the example of FIG. 18, the dialogue log 1800 includes items such as "timestamp," "speaker," "utterance text," and "file name." "Timestamp" is information indicating the date and time when an utterance was made by the user 11 or the dialogue agent. "Speaker" is information indicating whether the utterance in "utterance text" was made by the user or by the dialogue agent. "Utterance text" is the utterance of the user 11 or the dialogue agent converted into text. "File name" is information indicating the file name of the speech audio of the user 11.
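
The four items of the dialogue log 1800 map naturally onto a small record type. This is a minimal sketch under stated assumptions: the class name `DialogueLogEntry`, the field names, the speaker labels `"user"` and `"agent"`, and the sample values are all hypothetical, and `file_name` is made optional on the assumption that agent utterances have no recorded user audio.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DialogueLogEntry:
    timestamp: datetime               # when the utterance was made
    speaker: str                      # "user" or "agent" (assumed labels)
    utterance_text: str               # utterance converted into text
    file_name: Optional[str] = None   # audio file of the user's utterance, if any


# Illustrative entries only; the values are made up for the example.
log = [
    DialogueLogEntry(datetime(2024, 1, 1, 10, 0, 0), "agent",
                     "Did you play any sports when you were a student?"),
    DialogueLogEntry(datetime(2024, 1, 1, 10, 0, 5), "user",
                     "I played sport A.", "utterance_0001.wav"),
]
user_turns = [entry for entry in log if entry.speaker == "user"]
print(user_turns)
```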

図１８に示すように、対話ログ１８００は、ユーザ１１と対話エージェントとの間の対話を漏れなく記録したものなので、例えば、報告書として提出する場合には、これを要約することが望ましい。 As shown in FIG. 18, the dialogue log 1800 is a complete record of the dialogue between the user 11 and the dialogue agent, so it is desirable to summarize it, for example, when submitting it as a report.

要約部１７０１は、例えば、大規模言語モデルを応用して、対話ログ１８００を要約してもよいし、文章の要約ＡＩ（Artificial Intelligence）として公開されているクラウドサービスを利用して、対話ログ１８００を要約してもよい。 The summarization unit 1701 may, for example, summarize the dialogue log 1800 by applying a large-scale language model, or may summarize the dialogue log 1800 by using a cloud service that is publicly available as a text summarization AI (Artificial Intelligence).

要約する場合に重要な情報としては、例えば、日時と場所、ユーザ情報（属性、及び新規顧客か既存顧客か等）のような５Ｗ１Ｈ情報と、ユーザが抱える課題又はニーズと、提案した商材の情報と、アクションアイテム又は次の予定等の情報がある。要約部１７０１は、対話ログ１８００を要約して、これらの情報を含む報告書、又は対話の議事録等を作成する。 Information that is important when summarizing includes, for example, 5W1H information such as the date, time, and place and user information (attributes, whether the customer is new or existing, etc.), the issues or needs of the user, information on the proposed products, and information such as action items or the next appointment. The summarizing unit 1701 summarizes the dialogue log 1800 and creates a report, minutes of the dialogue, or the like that includes this information.
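
One way to make sure a summary covers the items listed above is to spell them out in the request given to whatever summarizer is used. The sketch below only builds such a request as plain text and does not call any model or service; `REQUIRED_ITEMS`, `build_summary_prompt`, the wording of the request, and the sample dialogue lines are all assumptions made for illustration.

```python
# Items the report should cover, paraphrased from the description.
REQUIRED_ITEMS = [
    "date, time, and place",
    "user information (attributes; new or existing customer)",
    "issues or needs of the user",
    "products that were proposed",
    "action items or next appointment",
]


def build_summary_prompt(log_lines):
    """Build a plain-text summarization request from (speaker, text) pairs."""
    items = "\n".join(f"- {item}" for item in REQUIRED_ITEMS)
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in log_lines)
    return (
        "Summarize the following dialogue as a report.\n"
        "Make sure the report covers:\n"
        f"{items}\n\n"
        f"Dialogue log:\n{transcript}"
    )


# Illustrative dialogue lines only.
prompt = build_summary_prompt([
    ("agent", "What issues are you facing with your current equipment?"),
    ("user", "Searching for past customer data takes too long."),
])
print(prompt)
```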

また、要約部１７０１は、ユーザ１１が発話した「はい」等の言語情報と、ユーザ１１の音声の「トーンが高い」、及びユーザ１１の「表情が明るい」等の非言語情報とに基づいて、ユーザ１１が、対話エージェントが提示した商材に興味があると判断してもよい。この場合、要約部１７０１は、要約文を作成するときに、当該商材に関する記述が漏れないように文章を作成することが望ましい。 The summarizing unit 1701 may also determine that the user 11 is interested in the product presented by the dialogue agent, based on linguistic information such as "yes" spoken by the user 11 and non-linguistic information such as the "high tone" of the user 11's voice and the "cheerful expression" of the user 11. In this case, when creating the summary, the summarizing unit 1701 desirably writes it so that the description of that product is not omitted.

［第５の実施形態］ [Fifth embodiment]
図１９は、第５の実施形態に係る対話システムの機能構成の例を示す図である。図１９に示すように、第５の実施形態に係るサーバ装置１００は、図７で説明したサーバ装置１００の機能構成に加えて、キャッチコピー生成部１９０１を有している。 Fig. 19 is a diagram showing an example of the functional configuration of the dialogue system according to the fifth embodiment. As shown in Fig. 19, the server device 100 according to the fifth embodiment has a catch phrase generation unit 1901 in addition to the functional configuration of the server device 100 described in Fig. 7.

キャッチコピー生成部１９０１は、例えば、図９に示した第５段階に対応する対話シナリオ９１５において、商材レコメンドとともに、ユーザ１１に提示するキャッチコピーを生成するキャッチコピー生成処理を実行する。キャッチコピーとは、人の注意をひく広告文、又は宣伝文等であり、ここでは、ユーザ１１に提案する商材を、ユーザ１１にアピールするための文字列である。 The catch phrase generation unit 1901 executes a catch phrase generation process for generating a catch phrase to be presented to the user 11 together with a product recommendation in, for example, a dialogue scenario 915 corresponding to the fifth stage shown in FIG. 9. A catch phrase is an advertising copy or promotional copy that attracts people's attention, and in this case, is a character string for appealing to the user 11 about the product proposed to the user 11.

（キャッチコピーの例１） (Example of catch phrase 1)
一例として、対話エージェントがユーザに提案する商材の概要が次のような内容のニーズ分析サービスであるものとする。 As an example, assume that the product proposed to the user by the dialogue agent is a needs analysis service whose outline is as follows.

「小売り・卸、食品飲料、製造、情報通信、サービス、医薬品・化粧品、観光などサポートセンター・コールセンター窓口の返答品質・時間短縮を支援。また、顧客から寄せられる膨大な問い合わせ等をコンテキスト化分析し、販売促進施策の立案、新商品・サービス開発へのヒントを手助けします。」 "We help support centers and call-center desks in industries such as retail and wholesale, food and beverages, manufacturing, information and communications, services, pharmaceuticals and cosmetics, and tourism improve response quality and shorten response times. We also perform contextual analysis of the vast number of inquiries received from customers, helping you plan sales promotion measures and find hints for developing new products and services."
しかし、このままでは、ユーザ１１に商材の特徴が伝わりにくい。そこで、キャッチコピー生成部１９０１は、例えば、次のようなキャッチコピーを生成してもよい。 However, in this form, it is difficult to convey the characteristics of the product to the user 11. Therefore, the catch phrase generation unit 1901 may generate, for example, the following catch phrase.
１）お客様対応から施策立案までサポート！ 1) Support from customer service to policy planning!
お客様のことを徹底分析するＡＩ AI that thoroughly analyzes customers
或いは、キャッチコピー生成部１９０１は、例えば、次のようなキャッチコピーを生成してもよい。 Alternatively, the catch phrase generation unit 1901 may generate, for example, the following catch phrase.
２）蓄積した顧客の声をＡＩが学習し分析！ 2) AI learns and analyzes accumulated customer feedback!
タイムリーに最適な解決へ導く Providing the best possible solution in a timely manner

（キャッチコピーの例２）
別の一例として、対話エージェントがユーザに提案する商材の概要が次のような内容の営業支援サービスであるものとする。 (Example of catchphrase 2)
As another example, assume that the outline of a product proposed to the user by the dialogue agent is a sales support service with the following contents.

「顧客とのやりとりの履歴や営業ノウハウの蓄積は、個人に依存してしまい、チーム内に共有されないまま。引き継ぎ時には、ちらばった顧客データの探索に時間がかかるなど、非効率でした。属人的になりがちな営業現場の情報共有で、手間のかかる検索作業を軽減します。たとえば、ベテランの作成した類似案件の提案書等参考情報が共有できれば、スキルでばらつく資料作成といった課題をクリアにし、商談を成功させるドキュメント開発に貢献します。」
しかし、このままでは、ユーザ１１に商材の特徴が伝わりにくい。そこで、キャッチコピー生成部１９０１は、例えば、次のようなキャッチコピーを生成してもよい。
３）お客様の関心事を即効インストール！
商談成功をサポートするＡＩ
或いは、キャッチコピー生成部１９０１は、例えば、次のようなキャッチコピーを生成してもよい。
４）属人的な営業スタイルをＡＩが学習！
お客様の関心事に応じた提案書をＡＩがレコメンド
このようなキャッチコピーは、例えば、大規模言語モデルを用いることで、効率よく生成することができる。また、キャッチコピー生成部１９０１は、外部のクラウドサービス等が提供する、キャッチコピー生成サービス等を利用して、キャッチコピーを生成してもよい。 "Customer interaction history and sales know-how tend to depend on individuals and remain unshared within the team. At handover, searching for scattered customer data took time and was inefficient. By sharing information in the sales field, which tends to depend on individuals, time-consuming search work can be reduced. For example, if reference information such as proposal documents for similar cases created by veterans can be shared, issues such as document quality varying with skill can be resolved, and this contributes to the development of documents that lead to successful sales negotiations."
However, in this state, it is difficult to convey the characteristics of the product to the user 11. Therefore, the catch phrase generating unit 1901 may generate, for example, the following catch phrase.
3) Instant installation of what interests your customers!
AI to support successful business negotiations
Alternatively, the catch phrase generating unit 1901 may generate a catch phrase such as the following, for example.
4) AI learns personal sales styles!
AI recommends proposals based on customer interests
Such catch phrases can be generated efficiently by using, for example, a large language model. The catch phrase generation unit 1901 may also generate a catch phrase by using a catch phrase generation service or the like provided by an external cloud service or the like.

＜処理の流れ＞
図２０は、第５の実施形態に係る情報提供処理の例を示すフローチャートである。この処理は、例えば、図９に示すような、第５段階に対応する対話シナリオ９１５において、ユーザ１１に提案する商材に対応するキャッチコピーを生成する処理の一例を示している。 <Processing flow>
Fig. 20 is a flowchart showing an example of an information provision process according to the fifth embodiment. This process is an example of a process for generating a catch phrase corresponding to a product to be proposed to the user 11 in the dialogue scenario 915 corresponding to the fifth stage shown in Fig. 9.

ステップＳ２００１において、図９の商材レコメンド部９１８は、例えば、図１０ＡのステップＳ１００３～Ｓ１００４で行われた対話内容に基づいて、ユーザ１１に提案する商材を決定する。 In step S2001, the product recommendation unit 918 in FIG. 9 determines the product to be proposed to the user 11, for example, based on the content of the dialogue carried out in steps S1003 to S1004 in FIG. 10A.

ステップＳ２００２において、図１９のキャッチコピー生成部１９０１は、決定した商材に関する商材データを、記憶部７１０等から取得する。 In step S2002, the catch phrase generation unit 1901 in FIG. 19 acquires product data related to the determined product from the storage unit 710 or the like.

ステップＳ２００３において、キャッチコピー生成部１９０１は、取得した商材データを用いて、商材レコメンド部９１８が決定した商材のキャッチコピーを生成する。一例として、キャッチコピー生成部１９０１は、外部のクラウドサービス等が提供する、キャッチコピー生成サービスを利用して、キャッチコピーを生成してもよい。別の一例として、キャッチコピー生成部１９０１は、大規模言語モデルを用いて、キャッチコピーを生成してもよい。 In step S2003, the catch phrase generation unit 1901 uses the acquired product data to generate a catch phrase for the product determined by the product recommendation unit 918. As one example, the catch phrase generation unit 1901 may generate the catch phrase using a catch phrase generation service provided by an external cloud service or the like. As another example, the catch phrase generation unit 1901 may generate the catch phrase using a large language model.

ステップＳ２００４において、対話システム１は、ユーザ１１に提案する商材と、当該商材のキャッチコピーを、ユーザ１１に提示する。例えば、対話システム１は、図２に示すような対話画面２００に表示されているディスプレイ２０２に、提案する商材の情報と、商材のキャッチフレーズを表示させる。 In step S2004, the dialogue system 1 presents the product proposed to the user 11 and a catchphrase for the product to the user 11. For example, the dialogue system 1 displays information about the proposed product and a catchphrase for the product on the display 202 displayed on the dialogue screen 200 as shown in FIG. 2.

なお、図２０に示す処理は一例である。例えば、ユーザ１１に提案する商材は、複数の商材を組み合わせたパッケージ商材であってもよい。この場合、キャッチコピー生成部１９０１は、ステップＳ２００２において、複数の商材の商材データを取得し、ステップＳ２００３において、複数の商材の商材データを用いて、キャッチフレーズを生成する。 The process shown in FIG. 20 is an example. For example, the product proposed to the user 11 may be a package product that combines multiple products. In this case, the catch phrase generation unit 1901 acquires the product data of the multiple products in step S2002, and generates a catchphrase using the product data of the multiple products in step S2003.
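The flow of steps S2001 to S2004, including the package-product variation, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in this specification: `Product`, `PRODUCT_DB`, `template_generator`, and `provide_information` are hypothetical names, the in-memory dictionary merely stands in for the product data held in the storage unit 710, and the template function is only a placeholder for the large language model or external catch phrase generation service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Product:
    product_id: str
    name: str
    summary: str


# Hypothetical stand-in for the product data held in the storage unit 710.
PRODUCT_DB: Dict[str, Product] = {
    "needs": Product("needs", "Needs analysis service",
                     "Analyzes customer inquiries to support sales planning."),
    "sales": Product("sales", "Sales support service",
                     "Shares proposal documents and sales know-how in a team."),
}


def template_generator(products: List[Product]) -> str:
    """Placeholder for S2003; a real system might call an LLM or an external service."""
    names = " + ".join(p.name for p in products)
    return f"{names}: AI support from analysis to proposal!"


def provide_information(
    product_ids: List[str],
    generator: Callable[[List[Product]], str] = template_generator,
) -> dict:
    # S2001 is assumed to have already chosen product_ids from the dialogue.
    products = [PRODUCT_DB[pid] for pid in product_ids]  # S2002: acquire product data
    catch_phrase = generator(products)                    # S2003: generate the catch phrase
    # S2004: return what would be shown on the display 202.
    return {"products": [p.name for p in products], "catch_phrase": catch_phrase}


single = provide_information(["needs"])
package = provide_information(["needs", "sales"])  # package product: several products
print(single["catch_phrase"])
print(package["catch_phrase"])
```

Passing the generator as a parameter keeps the choice between an external service and a language model outside the flow itself, which mirrors the two alternatives mentioned for step S2003.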

第５の実施形態により、対話システム１は、商材の価値を分かりやすく端的に、ユーザ１１に伝えることができる。 The fifth embodiment enables the dialogue system 1 to communicate the value of a product to the user 11 in an easy-to-understand and concise manner.

［第６の実施形態］
図２１は、第６の実施形態に係る対話システムの機能構成の例を示す図である。図２１に示すように、第６の実施形態に係るサーバ装置１００は、図７で説明したサーバ装置１００の機能構成に加えて、記憶部７１０等に過去履歴ＤＢ（Database）２１０１、及び非言語情報の入出力情報（以下、単に入出力情報と呼ぶ）２１０２等を有している（記憶している）。 [Sixth embodiment]
Fig. 21 is a diagram showing an example of the functional configuration of a dialogue system according to the sixth embodiment. As shown in Fig. 21, the server device 100 according to the sixth embodiment has (stores) a past history DB (Database) 2101, input/output information of non-verbal information (hereinafter simply referred to as input/output information) 2102, and the like in the storage unit 710 or the like, in addition to the functional configuration of the server device 100 described in Fig. 7.

過去履歴ＤＢ２１０１は、例えば、ユーザ１１の過去の対話ログ、非言語情報、及び体調等の情報を記憶したデータベースである。 The past history DB 2101 is, for example, a database that stores information such as the user 11's past dialogue logs, non-verbal information, and physical condition.

入出力情報２１０２には、例えば、図２２に示すように、ユーザ１１の画像、及び音声から取得した（入力された）非言語情報が、ポジティブであるか、ネガティブであるかを判断するための情報が含まれる。また、入出力情報２１０２には、例えば、図２２に示すように、対話エージェントの画像、及び音声が表す非言語情報が、ポジティブであるか、ネガティブであるかを示す情報が含まれる。 The input/output information 2102 includes, for example, information for determining whether the non-verbal information acquired (input) from the image and voice of the user 11 is positive or negative, as shown in FIG. 22. The input/output information 2102 also includes, for example, information indicating whether the non-verbal information expressed by the image and voice of the dialogue agent is positive or negative, as shown in FIG. 22.

これにより、意図解釈部７０６は、入出力情報２１０２を用いて、ユーザ１１の画像、及び音声に含まれる非言語情報が、ポジティブであるか、ネガティブであるかを容易に判断することができる。また、応答生成部７０７は、入出力情報２１０２を用いて、対話エージェントのポジティブな非言語情報、又はネガティブな非言語情報の例を取得することができる。 This allows the intention interpretation unit 706 to easily determine whether the non-verbal information contained in the image and voice of the user 11 is positive or negative, using the input/output information 2102. Furthermore, the response generation unit 707 can use the input/output information 2102 to obtain examples of positive non-verbal information or negative non-verbal information of the dialogue agent.
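One way to picture the input/output information 2102 is as a pair of lookup tables, one for judging the user's cues and one for choosing the agent's cues. The sketch below is only an illustration under stated assumptions: the cue names in `USER_CUE_POLARITY` are hypothetical (the actual entries of FIG. 22 are not reproduced here), the agent-side entries borrow the examples given later in this description (a smile and large hand gestures, or a sad face with nodding and bowing), and the majority vote with ties treated as positive is an assumed rule, not one stated in the specification.

```python
from typing import Dict, Iterable, List

# Hypothetical user-side entries; the actual contents of FIG. 22 are not reproduced here.
USER_CUE_POLARITY: Dict[str, str] = {
    "smile": "positive",
    "nod": "positive",
    "frown": "negative",
    "sigh": "negative",
}

# Agent-side examples follow the ones given later in the description.
AGENT_CUES: Dict[str, List[str]] = {
    "positive": ["smile", "large hand gestures"],
    "negative": ["sad face", "nod", "bow"],
}


def judge_polarity(observed_cues: Iterable[str]) -> str:
    """Return "positive" or "negative" by majority vote over known cues.

    Unknown cues are ignored; a tie is treated as positive (an assumption).
    """
    score = 0
    for cue in observed_cues:
        polarity = USER_CUE_POLARITY.get(cue)
        if polarity == "positive":
            score += 1
        elif polarity == "negative":
            score -= 1
    return "positive" if score >= 0 else "negative"


def agent_cues_for(observed_cues: Iterable[str]) -> List[str]:
    """Look up example agent cues that match the judged polarity."""
    return AGENT_CUES[judge_polarity(observed_cues)]


print(agent_cues_for(["smile", "nod", "sigh"]))
print(agent_cues_for(["frown", "sigh"]))
```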

また、第６の実施形態に係る第２の取得部７０３は、端末装置１０を利用するユーザ１１との対話から、ユーザ１１の非言語情報を取得する際に、非言語情報（感情系）と、非言語情報（個性系）とを取得する。ここで、非言語情報（感情系）は、例えば、ユーザ１１の感情、態度、言葉（強さ、早さ、又は抑揚等）、生理的特徴、又は身体動作（視線、表情等）等、そのときによって変化する非言語情報を含む。例えば、意図解釈部７０６は、第２の取得部７０３が取得した非言語情報（感情系）に基づいて、ユーザ１１がポジティブであるか、ネガティブであるかを判断することができる。 Furthermore, when acquiring non-verbal information of the user 11 from a dialogue with the user 11 who uses the terminal device 10, the second acquisition unit 703 according to the sixth embodiment acquires non-verbal information (emotion-related) and non-verbal information (personality-related). Here, the non-verbal information (emotion-related) includes non-verbal information that changes from moment to moment, such as the user 11's emotions, attitude, speech (strength, speed, intonation, etc.), physiological characteristics, or body movements (gaze, facial expression, etc.). For example, the intention interpretation unit 706 can determine whether the user 11 is positive or negative based on the non-verbal information (emotion-related) acquired by the second acquisition unit 703.

一方、非言語情報（個性系）は、例えば、ユーザ１１の性別、年齢、身体的な特徴、又は身なり等、そのときによって変化しない、又は変化が少ない非言語情報（属性情報）を含む。例えば、応答生成部７０７は、第２の取得部７０３が取得した非言語情報（個性系）に基づいて、ユーザの属性（例えば、性別、年齢、又は体躯等）に応じた言語応答、又は非言語応答を生成することができる。なお、非言語情報（個性系）は、ユーザ１１の属性を示す非言語情報の一例である。 On the other hand, the non-verbal information (personality-related) includes non-verbal information (attribute information) that does not change, or changes little, from moment to moment, such as the gender, age, physical characteristics, or appearance of the user 11. For example, the response generation unit 707 can generate a verbal response or a non-verbal response according to the user's attributes (for example, gender, age, or physique) based on the non-verbal information (personality-related) acquired by the second acquisition unit 703. Note that the non-verbal information (personality-related) is an example of non-verbal information indicating the attributes of the user 11.

なお、第６の実施形態に係る対話システム１の他の機能構成は、図７で説明した対話システム１の機能構成と同様でよい。 Note that other functional configurations of the dialogue system 1 according to the sixth embodiment may be similar to the functional configuration of the dialogue system 1 described in FIG. 7.

＜処理の流れ＞
図２３は、第６の実施形態に係る対話処理の例を示すフローチャートである。この処理は、ユーザ１１と対話エージェントとの対話を開始した後に、図２１に示すような対話システム１が実行する処理の例を示している。なお、ここでは、図８で説明した一実施形態に係る対話処理の概要と同様の処理内容に対する詳細な説明は省略する。 <Processing flow>
Fig. 23 is a flowchart showing an example of a dialogue process according to the sixth embodiment. This process shows an example of a process executed by the dialogue system 1 shown in Fig. 21 after starting a dialogue between the user 11 and the dialogue agent. Note that a detailed description of the process contents similar to the outline of the dialogue process according to the embodiment explained in Fig. 8 will be omitted here.

ステップＳ２３０１において、第１の取得部７０２は、ユーザ１１と、対話エージェントとの間の対話から、ユーザの言語情報を取得する。 In step S2301, the first acquisition unit 702 acquires the user's linguistic information from the dialogue between the user 11 and the dialogue agent.

ステップＳ２３０２、Ｓ２３０３において、第２の取得部７０３は、ステップＳ２３０１の処理と並行して、ユーザ１１と、対話エージェントとの間の対話から、ユーザ１１の非言語情報（感情系）と、非言語情報（個性系）とを取得する。 In steps S2302 and S2303, in parallel with the processing of step S2301, the second acquisition unit 703 acquires the non-verbal information (emotion-related) and the non-verbal information (personality-related) of the user 11 from the dialogue between the user 11 and the dialogue agent.

ステップＳ２３０４において、生成部７０４は、第１の取得部７０２が取得した言語情報と、第２の取得部７０３が取得した非言語情報（感情系）とに基づいて、ユーザ１１の発話の意図を解釈する。 In step S2304, the generation unit 704 interprets the intention of the user 11's speech based on the linguistic information acquired by the first acquisition unit 702 and the non-linguistic information (emotion-related) acquired by the second acquisition unit 703.

ステップＳ２３０５において、生成部７０４は、第２の取得部７０３が取得した非言語情報（個性系）、又は過去履歴ＤＢ２１０１を参照して、ユーザ１１の発話の意図に対応する言語応答（対話文）を生成する。例えば、生成部７０４は、過去履歴ＤＢ２１０１のユーザ１１との過去の対話履歴等から、ユーザ１１の性別、趣味、又は体型等を判断し、ユーザ１１の性別、趣味、又は体型等に応じて異なる言語応答（対話文）を生成する。 In step S2305, the generation unit 704 generates a linguistic response (dialogue) corresponding to the intention of the user 11's speech by referring to the non-verbal information (personality-related) acquired by the second acquisition unit 703 or the past history DB 2101. For example, the generation unit 704 determines the gender, hobbies, body type, etc. of the user 11 from the past dialogue history with the user 11 in the past history DB 2101, and generates a different linguistic response (dialogue) according to the gender, hobbies, body type, etc. of the user 11.

なお、ユーザ１１の過去履歴がない場合、生成部７０４は、例えば、ユーザ１１の画像から、顔領域を検出し、年齢性別推定ＡＩ（Artificial Intelligence）等を用いて、ユーザ１１の性別、又は年齢等を推定してもよい。また、生成部７０４は、ユーザ１１の画像から、体型推定ＡＩ等を用いて、ユーザ１１の体型を推定してもよい。さらに、生成部７０４は、ユーザ１１の言語情報から、ユーザ１１の趣味等を判断してもよい。なお、生成部７０４は、推定したユーザ１１の性別、年齢、又は体型等を、過去履歴ＤＢ２１０１に記憶しておく。 If there is no past history of the user 11, the generation unit 704 may, for example, detect a face area from an image of the user 11 and estimate the gender, age, etc. of the user 11 using age/gender estimation AI (Artificial Intelligence) or the like. The generation unit 704 may also estimate the body type of the user 11 from an image of the user 11 using body type estimation AI or the like. Furthermore, the generation unit 704 may determine the hobbies, etc. of the user 11 from the linguistic information of the user 11. The generation unit 704 stores the estimated gender, age, body type, etc. of the user 11 in the past history DB 2101.

具体的な一例として、商談中に、生成部７０４が、ユーザ１１の言語情報と非言語情報から、ユーザ１１が、年齢が４０代の女性で、趣味がコスメティック（以下、コスメと呼ぶ）であると判断したものとする。この場合、生成部７０４は、４０代向けのコスメ商材の紹介、又は提案をする価値ありと判断し、例えば、具体的な商材を紹介する言語応答を生成してもよい。 As a specific example, assume that during a business negotiation, the generation unit 704 determines from the linguistic and non-linguistic information of the user 11 that the user 11 is a woman in her 40s whose hobby is cosmetics. In this case, the generation unit 704 may determine that it is worth introducing or proposing cosmetic products for people in their 40s, and may generate, for example, a linguistic response introducing specific products.

別の一例として、生成部７０４は、商談中に、ユーザ１１の画像からユーザ１１の体型を推定し、ユーザ１１の過去の体型の履歴とを比較して、ユーザ１１の体型の推移、又は過去の体型との比較等を行ってもよい。これにより、生成部７０４は、例えば、最近になって太ったユーザ１１に対して、低糖質の食材、又は体重管理アプリケーション等の商材を紹介する言語応答を生成してもよい。 As another example, the generation unit 704 may estimate the body type of the user 11 from an image of the user 11 during a business negotiation and compare it with the history of the user 11's past body types, to track changes in the user 11's body type or compare it with past body types. As a result, the generation unit 704 may generate, for example, a linguistic response introducing products such as low-carbohydrate food ingredients or a weight management application to the user 11 who has recently gained weight.

別の一例として、生成部７０４は、商談中に、ユーザ１１の画像から、ユーザ１１の服装おしゃれ度を推定し、ユーザ１１の過去の服装おしゃれ度を比較してもよい。これにより、生成部７０４は、服飾関係の商材を優先的に紹介する価値があると判断したユーザ１１に対して、具体的な商材を紹介する言語応答を生成してもよい。 As another example, the generation unit 704 may estimate the fashion sense of the user 11's clothing from an image of the user 11 during a business negotiation, and compare this with the user 11's past fashion sense of clothing. As a result, the generation unit 704 may generate a verbal response introducing specific products to the user 11 who is determined to be worthy of preferentially introducing clothing-related products.

別の一例として、生成部７０４は、商談中に、ユーザ１１の画像からユーザ１１の体型を推定し、過去履歴の病歴情報等と合わせて、ユーザ１１の体調を確認する必要があるかを判断してもよい。これにより、生成部７０４は、体調を確認する必要があると判断したユーザ１１に対して、現状の体調を確認する言語応答を生成してもよい。 As another example, the generation unit 704 may estimate the body type of the user 11 from an image of the user 11 during a business negotiation, and determine whether it is necessary to check the physical condition of the user 11 based on medical history information from past history, etc. As a result, the generation unit 704 may generate a verbal response to check the current physical condition of the user 11 who is determined to need to have their physical condition checked.

ステップＳ２３０６において、生成部７０４は、生成した言語応答と、さらに、ユーザ１１の非言語情報とに基づいて、対話エージェントのパラ言語（例えば、声のトーン、話す速さ、声の高さ、声の強さ、咳払い、ため息、笑い、又は沈黙等）を決定する。例えば、生成部７０４は、図２２に示すような入出力情報２１０２を参照して、ユーザ１１の感情分析がポジティブであると判断した場合、入出力情報２１０２から、対話エージェントのポジティブな非言語情報（パラ言語）を取得してもよい。同様に、生成部７０４は、図２２に示すような入出力情報２１０２を参照して、ユーザ１１の感情分析がネガティブであると判断した場合、入出力情報２１０２から、対話エージェントのネガティブな非言語情報（パラ言語）を取得してもよい。 In step S2306, the generation unit 704 determines the paralanguage of the dialogue agent (e.g., tone of voice, speaking rate, voice pitch, voice intensity, throat clearing, sigh, laughter, silence, etc.) based on the generated linguistic response and further on the non-linguistic information of the user 11. For example, when the generation unit 704 refers to the input/output information 2102 as shown in FIG. 22 and determines that the emotion analysis of the user 11 is positive, it may acquire the positive non-linguistic information (paralanguage) of the dialogue agent from the input/output information 2102. Similarly, when the generation unit 704 refers to the input/output information 2102 as shown in FIG. 22 and determines that the emotion analysis of the user 11 is negative, it may acquire the negative non-linguistic information (paralanguage) of the dialogue agent from the input/output information 2102.

なお、図２２に示した入出力情報２１０２は一例である。入出力情報２１０２には、様々な、ユーザ１１のポジティブな非言語情報、及びネガティブな非言語情報と、対話エージェントのポジティブな非言語情報、及びネガティブな非言語情報とを、予め登録しておく。 Note that the input/output information 2102 shown in FIG. 22 is an example. In the input/output information 2102, various positive non-verbal information and negative non-verbal information of the user 11, and positive non-verbal information and negative non-verbal information of the dialogue agent are registered in advance.

ステップＳ２３０７において、制御部７１４は、生成部７０４が生成した言語応答と、生成部７０４が決定したパラ言語とに基づいて、対話エージェントの応答音声を合成する。 In step S2307, the control unit 714 synthesizes a response voice of the dialogue agent based on the linguistic response generated by the generation unit 704 and the paralanguage determined by the generation unit 704.

また、サーバ装置１００は、ステップＳ２３０６、Ｓ２３０７の処理と並行して、ステップＳ２３０８、Ｓ２３０９の処理を実行する。 In addition, the server device 100 executes the processes of steps S2308 and S2309 in parallel with the processes of steps S2306 and S2307.

ステップＳ２３０８において、生成部７０４は、ユーザ１１の非言語情報に基づいて、対話エージェントの表情、視線、又は所作等を決定する。例えば、生成部７０４は、図２２に示すような入出力情報２１０２を参照して、ユーザ１１の感情分析がポジティブであると判断した場合、入出力情報２１０２から、対話エージェントのポジティブな非言語情報（表情、視線、又は所作等）を取得する。同様に、生成部７０４は、図２２に示すような入出力情報２１０２を参照して、ユーザ１１の感情分析がネガティブであると判断した場合、入出力情報２１０２から、対話エージェントのネガティブな非言語情報（表情、視線、又は所作等）を取得する。 In step S2308, the generation unit 704 determines the facial expression, gaze, or behavior of the dialogue agent based on the non-verbal information of the user 11. For example, when the generation unit 704 refers to the input/output information 2102 as shown in FIG. 22 and determines that the emotion analysis of the user 11 is positive, the generation unit 704 acquires the positive non-verbal information (facial expression, gaze, or behavior, etc.) of the dialogue agent from the input/output information 2102. Similarly, when the generation unit 704 refers to the input/output information 2102 as shown in FIG. 22 and determines that the emotion analysis of the user 11 is negative, the generation unit 704 acquires the negative non-verbal information (facial expression, gaze, or behavior, etc.) of the dialogue agent from the input/output information 2102.

ステップＳ２３０９において、生成部７０４は、決定した対話エージェントの表情、視線、又は所作等に基づいて、対話エージェントの動作（モーション）を決定する。 In step S2309, the generation unit 704 determines the motion of the dialogue agent based on the determined facial expression, gaze, gestures, or the like of the dialogue agent.

具体的な一例として、生成部７０４は、商談中に、ユーザ１１の感情分析がポジティブであると判断した場合、例えば、対話エージェントを笑顔とし、手振りを大きくしてもよい。また、生成部７０４は、ユーザ１１の感情分析がネガティブであると判断した場合、例えば、対話エージェントを寂しい顔とし、頷き、お辞儀等をさせてもよい。 As a specific example, if the generation unit 704 determines that the emotion analysis of the user 11 is positive during a business negotiation, the generation unit 704 may, for example, make the dialogue agent smile and make larger hand gestures. Also, if the generation unit 704 determines that the emotion analysis of the user 11 is negative, the generation unit 704 may, for example, make the dialogue agent have a sad face and nod, bow, etc.

ポジティブ・ネガティブの判断に加え非言語情報（個性系）に基づいて対話エージェントの動作（モーション）をさせてもよい。たとえば、ポジティブの場合で、過去履歴ＤＢに記録されたユーザの手振りや腕組みの形、会話のペースやリズム、などユーザの非言語情報（個性系）に合わせた（類似した）動作を対話エージェントに実行させる。 In addition to the positive/negative judgment, the motion of the dialogue agent may be determined based on the non-verbal information (personality-related). For example, when the judgment is positive, the dialogue agent is made to perform motions that match (resemble) the user's non-verbal information (personality-related) recorded in the past history DB, such as the user's hand gestures, way of crossing the arms, or pace and rhythm of conversation.

ステップＳ２３１０において、制御部７１４は、生成部７０４が決定した対話エージェントの動作に基づいて、対話エージェントを描画し、描画した対話エージェント、及び合成した応答音声を含む対話画面を出力する。例えば、出力部７１３は、通信部７０１を用いて、端末装置１０に対話画面を送信する。 In step S2310, the control unit 714 renders the dialogue agent based on the motion of the dialogue agent determined by the generation unit 704, and outputs a dialogue screen including the rendered dialogue agent and the synthesized response voice. For example, the output unit 713 transmits the dialogue screen to the terminal device 10 using the communication unit 701.
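The parallel structure of steps S2306 to S2310 (a voice branch and a motion branch that are joined when the dialogue screen is output) can be sketched as follows. This is a simplified illustration, not the implementation of the specification: the function names, the `PARALANGUAGE` and `MOTION` tables, and their values are hypothetical, speech synthesis and rendering are replaced by plain dictionaries, and a thread pool is only one of several ways to run the two branches concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

# Hypothetical parameter tables; real values would come from the
# input/output information 2102.
PARALANGUAGE = {
    "positive": {"tone": "bright", "rate": 1.1},
    "negative": {"tone": "calm", "rate": 0.9},
}
MOTION = {
    "positive": ["smile", "large hand gestures"],
    "negative": ["sad face", "nod", "bow"],
}


def voice_branch(verbal_response: str, polarity: str) -> Dict:
    paralanguage = PARALANGUAGE[polarity]        # S2306: decide the paralanguage
    # S2307: a real system would run speech synthesis here.
    return {"text": verbal_response, **paralanguage}


def motion_branch(polarity: str) -> Dict:
    expression_and_gestures = MOTION[polarity]   # S2308: expression, gaze, gestures
    return {"motion": expression_and_gestures}   # S2309: decide the motion


def build_dialogue_screen(verbal_response: str, polarity: str) -> Dict:
    # Run the voice branch and the motion branch concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        voice = pool.submit(voice_branch, verbal_response, polarity)
        motion = pool.submit(motion_branch, polarity)
        # S2310: combine both results into one output for the terminal.
        return {"voice": voice.result(), "agent": motion.result()}


screen = build_dialogue_screen("Thank you for your interest.", "positive")
print(screen)
```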

対話システム１は、例えば、図８の処理を繰り返し実行することにより、ユーザ１１の非言語情報（個性系）、又は過去履歴ＤＢ２１０１等に基づいて、ユーザ１１に対してより適切なリアクションを行えるようになる。 For example, by repeatedly executing the process of FIG. 8, the dialogue system 1 can react more appropriately to the user 11 based on the non-verbal information (personality-related) of the user 11 or the past history DB 2101, etc.

＜利用シーンの例＞
続いて、本実施形態に係る対話システム１の利用シーンの例について説明する。 <Examples of usage scenarios>
Next, an example of a usage scene of the dialogue system 1 according to this embodiment will be described.

（利用シーンＡ）
図２４は、一実施形態に係る利用シーンＡのシステム構成の例を示す図である。利用シーンＡは、図１の端末装置１０がデジタルサイネージのサイネージ端末２４００である場合の例を示している。図２４の例では、サイネージ端末２４００は、カメラ、及びマイク等の入力デバイス２４０１と、コンピュータのハードウェア構成を備えている。 (Usage scene A)
Fig. 24 is a diagram showing an example of a system configuration of usage scene A according to an embodiment. Usage scene A shows an example in which the terminal device 10 in Fig. 1 is a signage terminal 2400 for digital signage. In the example of Fig. 24, the signage terminal 2400 includes an input device 2401, such as a camera and a microphone, and the hardware configuration of a computer.

図２５は、一実施形態に係る利用シーンＡの対話開始処理の例を示すフローチャートである。 Figure 25 is a flowchart showing an example of a dialogue start process for usage scene A according to one embodiment.

ステップＳ２５０１において、対話システム１は、サイネージ端末２４００が備える入力デバイス２４０１で撮影した画像からユーザ１１の顔を検知する。具体的な一例として、対話システム１は、入力デバイス２４０１で撮影した画像から人物の顔画像を抽出し、抽出した顔画像に対して顔認証を行う。また、対話システム１は、抽出した顔画像が顔認証ＯＫとなった場合、ユーザ１１の顔を検知したと判断する。 In step S2501, the dialogue system 1 detects the face of the user 11 from an image captured by the input device 2401 provided in the signage terminal 2400. As a specific example, the dialogue system 1 extracts a person's facial image from the image captured by the input device 2401, and performs facial recognition on the extracted facial image. Furthermore, if the extracted facial image passes facial recognition, the dialogue system 1 determines that it has detected the face of the user 11.

ステップＳ２５０２において、対話システム１は、顔検知が所定の時間継続したかを判断する。例えば、対話システム１は、ユーザ１１の顔を検知した状態が、所定の時間（例えば、５秒間）継続したか否かを判断する。顔検知が所定の時間継続した場合、対話システム１は、処理をステップＳ２５０３に移行させる。一方、顔検知が所定の時間継続しなかった場合、対話システム１は、処理をステップＳ２５０１に戻す。なお、ステップＳ２５０１、Ｓ２５０２の処理は、サイネージ端末２４００が行ってもよいし、サーバ装置１００が行ってもよい。 In step S2502, the dialogue system 1 determines whether face detection has continued for a predetermined time. For example, the dialogue system 1 determines whether the state in which the face of the user 11 is detected has continued for a predetermined time (e.g., 5 seconds). If face detection has continued for the predetermined time, the dialogue system 1 advances the process to step S2503. On the other hand, if face detection has not continued for the predetermined time, the dialogue system 1 returns the process to step S2501. Note that the processes of steps S2501 and S2502 may be performed by the signage terminal 2400 or by the server device 100.
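The loop formed by steps S2501 and S2502 amounts to a dwell-time check over a stream of detection results. The following sketch assumes that face detection and authentication have already been reduced to timestamped booleans; the function name `first_sustained_detection` and the sample format are hypothetical, and the 5-second default simply reuses the example value given above.

```python
from typing import Iterable, Optional, Tuple


def first_sustained_detection(
    samples: Iterable[Tuple[float, bool]],
    hold_seconds: float = 5.0,
) -> Optional[float]:
    """Return the timestamp at which detection has lasted hold_seconds, else None.

    Each sample is (timestamp in seconds, whether the user's face was detected).
    """
    detected_since: Optional[float] = None
    for timestamp, detected in samples:
        if not detected:
            detected_since = None          # detection lost: back to S2501
            continue
        if detected_since is None:
            detected_since = timestamp     # detection (re)started
        if timestamp - detected_since >= hold_seconds:
            return timestamp               # S2502 satisfied: go on to S2503
    return None


samples = [(0.0, True), (2.0, True), (3.0, False), (4.0, True), (6.0, True), (9.0, True)]
print(first_sustained_detection(samples))
```

In this sample sequence the detection that starts at 0 s is interrupted at 3 s, so the dwell time is counted again from 4 s.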

ステップＳ２５０３に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴があるかを判断する。例えば、サーバ装置１００は、過去履歴ＤＢ２１０１等を参照して、ユーザ１１の過去の対話ログがある場合、ユーザ１１の過去の履歴があると判断する。過去の履歴がある場合、サーバ装置１００は、処理をステップＳ２５０４に移行させる。一方、過去の履歴がない場合、サーバ装置１００は、処理をステップＳ２５０５に移行させる。 When the process proceeds to step S2503, the server device 100 determines whether there is past history for the user 11. For example, the server device 100 refers to the past history DB 2101, etc., and if there is a past dialogue log for the user 11, it determines that there is past history for the user 11. If there is past history, the server device 100 proceeds to step S2504. On the other hand, if there is no past history, the server device 100 proceeds to step S2505.

ステップＳ２５０４に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴（過去の対話ログ等）から、対話処理に用いるシナリオを決定する。これにより、対話システム１は、同じユーザ１１に、何度も同じ質問、又は発話を繰り返し行ってしまうことを抑制することができる。 When the process proceeds to step S2504, the server device 100 determines a scenario to be used in the dialogue process from the past history of the user 11 (past dialogue logs, etc.). This allows the dialogue system 1 to prevent the same question or utterance from being repeatedly made to the same user 11.

ステップＳ２５０５に移行すると、サーバ装置１００は、対話処理に用いるシナリオとして、定型のシナリオ（例えば、新規顧客用のシナリオ等）を選択する。 When the process proceeds to step S2505, the server device 100 selects a standard scenario (e.g., a scenario for a new customer) as the scenario to be used for the dialogue processing.

ステップＳ２５０６に移行すると、対話システム１は、サイネージ端末２４００との間で、例えば、図１～２３で説明した対話処理を実行する。図２５の処理により、対話システム１は、サイネージ端末２４００を利用して、ユーザ１１に対話サービスを提供することができる。また、対話システム１は、ユーザ１１の過去の対話履歴等に基づいて、ユーザ１１に提供する対話内容を変更することができる。なお、ステップＳ２５０３～Ｓ２５０５の処理はオプションであり、必須ではない。例えば、対話システム１は、ステップＳ２５０６の対話処理の中で、対話に用いるシナリオを決定してもよい。 When the process proceeds to step S2506, the dialogue system 1 executes, for example, the dialogue processing described in Figs. 1 to 23 with the signage terminal 2400. Through the processing in Fig. 25, the dialogue system 1 can provide a dialogue service to the user 11 using the signage terminal 2400. Furthermore, the dialogue system 1 can change the dialogue content provided to the user 11 based on the user 11's past dialogue history, etc. Note that the processing in steps S2503 to S2505 is optional and not required. For example, the dialogue system 1 may determine the scenario to be used for the dialogue during the dialogue processing in step S2506.
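The branch in steps S2503 to S2505 can be sketched as a small selection function. The data layout is an assumption made for illustration only: `HistoryDB` models the past history DB 2101 as a mapping from a user identifier to lists of topics covered in past dialogues, and `skip_topics` is a hypothetical way of representing "do not repeat the same question"; the specification does not define how a scenario is encoded.

```python
from typing import Dict, List

# Hypothetical layout: user id -> past dialogues, each a list of covered topics.
HistoryDB = Dict[str, List[List[str]]]


def select_scenario(user_id: str, history_db: HistoryDB) -> dict:
    past_dialogues = history_db.get(user_id, [])   # S2503: is there past history?
    if not past_dialogues:
        # S2505: no history, so fall back to a standard scenario.
        return {"scenario": "new_customer", "skip_topics": []}
    # S2504: derive the scenario from the history so that topics
    # already covered are not asked about again.
    covered = sorted({topic for dialogue in past_dialogues for topic in dialogue})
    return {"scenario": "returning_customer", "skip_topics": covered}


history: HistoryDB = {"user-11": [["budget", "industry"], ["industry", "schedule"]]}
print(select_scenario("user-11", history))
print(select_scenario("unknown-user", history))
```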

（利用シーンＢ）
図２６は、一実施形態に係る利用シーンＢのシステム構成の例を示す図である。利用シーンＢは、図１の端末装置１０がメタバース用のディスプレイ端末２６００である場合の例を示している。ディスプレイ端末２６００は、例えば、ヘッドマウントディスプレイ、又は空間再現ディスプレイのメタバース用のディスプレイと、コンピュータの構成とを備えている。また、対話システム１は、仮想空間上の対話エージェントを用いて、ユーザ１１に対話サービスを提供する。 (Usage scene B)
Fig. 26 is a diagram showing an example of a system configuration of a usage scene B according to an embodiment. Usage scene B shows an example in which the terminal device 10 in Fig. 1 is a display terminal 2600 for the metaverse. The display terminal 2600 includes, for example, a head-mounted display or a spatial reproduction display for the metaverse, and a computer configuration. In addition, the dialogue system 1 provides a dialogue service to the user 11 by using a dialogue agent in the virtual space.

図２７は、一実施形態に係る利用シーンＢの対話開始処理の例を示すフローチャートである。 Figure 27 is a flowchart showing an example of a dialogue start process for usage scene B according to one embodiment.

ステップＳ２７０１において、対話システム１は、仮想空間上で、ユーザ１１のアバターの接近を検知する。例えば、対話システム１は、ユーザ１１のログイン情報、仮想空間上のユーザ１１のアバターの座標と対話エージェントの座標から、ユーザ１１のアバターが所定の距離（例えば、１ｍ等）以内に接近したか否かを検知する。 In step S2701, the dialogue system 1 detects the approach of the avatar of the user 11 in the virtual space. For example, the dialogue system 1 detects whether the avatar of the user 11 has approached within a predetermined distance (e.g., 1 m) from the login information of the user 11, the coordinates of the avatar of the user 11 in the virtual space, and the coordinates of the dialogue agent.

ステップＳ２７０２において、対話システム１は、ユーザ１１のアバターが所定の距離（例えば、１ｍ等）以内に接近した状態が、所定の時間（例えば、５秒等）継続したか否かを判断する。ユーザ１１のアバターの接近が所定の時間継続した場合、対話システム１は、処理をステップＳ２７０３に移行させる。一方、ユーザ１１のアバターの接近が所定の時間継続しなかった場合、対話システム１は、処理をステップＳ２７０１に戻す。 In step S2702, the dialogue system 1 determines whether the state in which the avatar of the user 11 approaches within a predetermined distance (e.g., 1 m) continues for a predetermined time (e.g., 5 seconds). If the approach of the avatar of the user 11 continues for the predetermined time, the dialogue system 1 transitions the process to step S2703. On the other hand, if the approach of the avatar of the user 11 does not continue for the predetermined time, the dialogue system 1 returns the process to step S2701.
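Steps S2701 and S2702 combine a distance test with a dwell-time test. A minimal sketch is shown below, assuming that avatar positions arrive as timestamped coordinates in the same units as the threshold; the names `is_near` and `approach_confirmed_at` are hypothetical, and the 1 m and 5 s defaults reuse the example values given above.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple

Position = Sequence[float]


def is_near(avatar: Position, agent: Position, max_distance: float = 1.0) -> bool:
    """S2701: Euclidean distance between avatar and agent is within max_distance."""
    return math.dist(avatar, agent) <= max_distance


def approach_confirmed_at(
    track: Iterable[Tuple[float, Position]],
    agent: Position,
    max_distance: float = 1.0,
    hold_seconds: float = 5.0,
) -> Optional[float]:
    """S2702: return the time at which the avatar has stayed near for hold_seconds."""
    near_since: Optional[float] = None
    for timestamp, avatar in track:
        if not is_near(avatar, agent, max_distance):
            near_since = None              # avatar moved away: back to S2701
            continue
        if near_since is None:
            near_since = timestamp
        if timestamp - near_since >= hold_seconds:
            return timestamp               # proceed to S2703
    return None


agent_position = (0.0, 0.0, 0.0)
track = [
    (0.0, (3.0, 0.0, 4.0)),   # 5 m away
    (1.0, (0.3, 0.0, 0.4)),   # 0.5 m away
    (3.0, (0.5, 0.0, 0.0)),
    (6.0, (0.2, 0.0, 0.2)),
]
print(approach_confirmed_at(track, agent_position))
```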

ステップＳ２７０３に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴があるかを判断する。例えば、サーバ装置１００は、過去履歴ＤＢ２１０１等を参照して、ユーザ１１の過去の対話ログがある場合、ユーザ１１の過去の履歴があると判断する。過去の履歴がある場合、サーバ装置１００は、処理をステップＳ２７０４に移行させる。一方、過去の履歴がない場合、サーバ装置１００は、処理をステップＳ２７０５に移行させる。 When the process proceeds to step S2703, the server device 100 determines whether there is past history for the user 11. For example, the server device 100 refers to the past history DB 2101, etc., and if there is a past dialogue log for the user 11, it determines that there is past history for the user 11. If there is past history, the server device 100 proceeds to step S2704. On the other hand, if there is no past history, the server device 100 proceeds to step S2705.

ステップＳ２７０４に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴（過去の対話ログ等）から、対話処理に用いるシナリオを決定する。一方、ステップＳ２７０５に移行すると、サーバ装置１００は、対話処理に用いるシナリオとして、定型のシナリオ（例えば、新規ユーザのシナリオ等）を選択する。 When the process proceeds to step S2704, the server device 100 determines a scenario to be used in the dialogue process from the past history of the user 11 (past dialogue logs, etc.). On the other hand, when the process proceeds to step S2705, the server device 100 selects a standard scenario (e.g., a scenario for a new user) as the scenario to be used in the dialogue process.

ステップＳ２７０６に移行すると、対話システム１は、仮想空間上で、例えば、図１～２３で説明した対話処理を実行する。図２７の処理により、対話システム１は、メタバース用のディスプレイ端末２６００を利用して、仮想空間上でユーザ１１に対話サービスを提供することができる。 When the process proceeds to step S2706, the dialogue system 1 executes, for example, the dialogue processing described in Figs. 1 to 23 in the virtual space. Through the processing in Fig. 27, the dialogue system 1 can provide a dialogue service to the user 11 in the virtual space using the metaverse display terminal 2600.

（利用シーンＣ）
図２８は、一実施形態に係る利用シーンＣのシステム構成の例を示す図である。利用シーンＣは、ユーザ１１が、端末装置１０を用いて、サーバ装置１００が提供する対話エージェントとウェブ会議を行う場合の例を示している。なお、ユーザ１１は、システム外の会議サーバ２８１０等が提供するウェブ会議に参加するものであってもよいし、サーバ装置１００が、ウェブ会議を提供するものであってもよい。 (Usage scene C)
Fig. 28 is a diagram showing an example of a system configuration of usage scene C according to an embodiment. Usage scene C shows an example in which the user 11 uses the terminal device 10 to hold a web conference with a dialogue agent provided by the server device 100. Note that the user 11 may participate in a web conference provided by a conference server 2810 or the like outside the system, or the server device 100 may provide the web conference.

図２９は、一実施形態に係る利用シーンＣの対話開始処理の例を示すフローチャートである。 Figure 29 is a flowchart showing an example of a dialogue start process for usage scene C according to one embodiment.

ステップＳ２９０１において、ユーザ１１が、端末装置１０を用いて、対話システム１が提供する対話エージェントと同じウェブ会議に参加するものとする。例えば、ユーザ１１は、端末装置１０を用いて、対話エージェントとウェブ会議に参加するためのリンクにアクセスすることにより、当該ウェブ会議に参加する。 In step S2901, it is assumed that the user 11 uses the terminal device 10 to participate in the same web conference as the dialogue agent provided by the dialogue system 1. For example, the user 11 uses the terminal device 10 to access a link for participating in the web conference with the dialogue agent, thereby participating in the web conference.

ステップＳ２９０２において、対話システム１は、ウェブ会議において、ユーザ１１による対話開始操作を受け付けたか否かを判断する。ユーザ１１による対話開始操作を受け付けた場合、対話システム１は、処理をステップＳ２９０３に移行させる。一方、ユーザ１１による対話開始操作を受け付けていない場合、対話システム１は、例えば、ステップＳ２９０２の処理を繰り返し実行する。 In step S2902, the dialogue system 1 determines whether or not a dialogue start operation by the user 11 has been accepted during the web conference. If a dialogue start operation by the user 11 has been accepted, the dialogue system 1 transitions the process to step S2903. On the other hand, if a dialogue start operation by the user 11 has not been accepted, the dialogue system 1, for example, repeatedly executes the process of step S2902.

ステップＳ２９０３に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴があるかを判断する。例えば、サーバ装置１００は、過去履歴ＤＢ２１０１等を参照して、ユーザ１１の過去の対話ログがある場合、ユーザ１１の過去の履歴があると判断する。過去の履歴がある場合、サーバ装置１００は、処理をステップＳ２９０４に移行させる。一方、過去の履歴がない場合、サーバ装置１００は、処理をステップＳ２９０５に移行させる。 When the process proceeds to step S2903, the server device 100 determines whether there is past history for the user 11. For example, the server device 100 refers to the past history DB 2101, etc., and if there is a past dialogue log for the user 11, it determines that there is past history for the user 11. If there is past history, the server device 100 proceeds to step S2904. On the other hand, if there is no past history, the server device 100 proceeds to step S2905.

ステップＳ２９０４に移行すると、サーバ装置１００は、ユーザ１１の過去の履歴（過去の対話ログ等）から、対話処理に用いるシナリオを決定する。一方、ステップＳ２９０５に移行すると、サーバ装置１００は、対話処理に用いるシナリオとして、定型のシナリオ（例えば、新規ユーザのシナリオ等）を選択する。 When the process proceeds to step S2904, the server device 100 determines a scenario to be used in the dialogue process from the past history of the user 11 (past dialogue logs, etc.). On the other hand, when the process proceeds to step S2905, the server device 100 selects a standard scenario (e.g., a scenario for a new user) as the scenario to be used in the dialogue process.

In step S2906, the dialogue system 1 executes, in the web conference, for example, the dialogue process described with reference to FIGS. 1 to 23. Through the process of FIG. 29, the dialogue system 1 can provide a dialogue service to the user 11 by means of a web conference.
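The flow of FIG. 29 from step S2902 to step S2906 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `select_scenario` and `run_web_conference_dialogue`, the dictionary standing in for the past history DB 2101, the scenario names, and the rule that any existing log yields a "returning user" scenario are all hypothetical. The specification does not state how the scenario is derived from the past history.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the past history DB 2101: user ID -> past dialogue logs.
PastHistoryDB = Dict[str, List[str]]

DEFAULT_SCENARIO = "new_user"  # standard scenario of step S2905 (name is assumed)


def select_scenario(user_id: str, past_history_db: PastHistoryDB) -> str:
    """Steps S2903 to S2905: choose the scenario for the dialogue process."""
    logs = past_history_db.get(user_id, [])
    if logs:
        # S2904: derive the scenario from the past history. The derivation rule
        # is an assumption; here any existing log selects "returning_user".
        return "returning_user"
    # S2905: no past history, so fall back to the standard scenario.
    return DEFAULT_SCENARIO


def run_web_conference_dialogue(
    user_id: str,
    past_history_db: PastHistoryDB,
    start_requested: Callable[[], bool],
    run_dialogue: Callable[[str], None],
) -> str:
    """Steps S2902 to S2906 for a user who has already joined the conference."""
    # S2902: repeat until the dialogue start operation is accepted.
    # A real system would block on an event instead of spinning.
    while not start_requested():
        pass
    scenario = select_scenario(user_id, past_history_db)
    # S2906: execute the dialogue process with the chosen scenario.
    run_dialogue(scenario)
    return scenario
```

Passing `start_requested` and `run_dialogue` as callables keeps the sketch independent of any particular web conference service or dialogue engine.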

As described above, according to each embodiment of the present invention, the dialogue system 1, which conducts a dialogue with the user 11 using a dialogue agent, can react more appropriately to the user 11.

<Additional Information>
Each function of each embodiment described above can be realized by one or more processing circuits. Here, a "processing circuit" in this specification includes a processor programmed by software to execute each function, such as a processor implemented by an electronic circuit, as well as devices designed to execute each function described above, such as an ASIC (application specific integrated circuit), a DSP (digital signal processor), an FPGA (field programmable gate array), and conventional circuit modules.

Furthermore, the group of devices described in the examples is merely one of a plurality of computing environments for implementing the embodiments disclosed herein. In one embodiment, the server device 100 includes a plurality of computing devices, such as a server cluster. The plurality of computing devices are configured to communicate with one another via any type of communication link, including a network or shared memory, and perform the processes disclosed herein.

Furthermore, the functional components of the server device 100 may be integrated into a single server device or may be distributed across a plurality of devices. In addition, at least some of the functional components of the server device 100 may be included in the terminal device 10.

<Additional Notes>
This specification discloses the dialogue system, dialogue control method, and program of each of the following items.
(Item 1)
A dialogue system that conducts a dialogue with a user using a dialogue agent, the dialogue system comprising:
a first acquisition unit that acquires verbal information of the user from the dialogue;
a second acquisition unit that acquires non-verbal information of the user from the dialogue;
a generation unit that generates response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a control unit that controls the dialogue agent based on the response content generated by the generation unit.
(Item 2)
The dialogue system according to Item 1, wherein
the response content of the dialogue agent includes the non-verbal response of the dialogue agent, and
the generation unit changes the non-verbal response of the dialogue agent in accordance with the non-verbal information of the user.
(Item 3)
The dialogue system according to Item 2, wherein the generation unit changes the content of an action of the dialogue agent in accordance with the non-verbal information of the user.
(Item 4)
The dialogue system according to Item 2 or 3, wherein the generation unit changes the timing of an action of the dialogue agent in accordance with the non-verbal information of the user.
(Item 5)
The dialogue system according to any one of Items 1 to 4, wherein the non-verbal information of the user includes information on a facial expression, gaze, posture, or emotion obtained from an image of the user.
(Item 6)
The dialogue system according to any one of Items 1 to 5, wherein the non-verbal information of the user includes information on voice volume, voice intonation, or voice timbre obtained from the voice of the user.
(Item 7)
The dialogue system according to any one of Items 1 to 6, wherein the generation unit changes the response content of the dialogue agent in accordance with a scenario of the dialogue.
(Item 8)
The dialogue system according to any one of Items 1 to 7, wherein the generation unit changes the response content of the dialogue agent in accordance with a plurality of preset dialogue stages.
(Item 9)
The dialogue system according to Item 8, wherein the generation unit changes the dialogue stage based on gaze information of the user.
(Item 10)
The dialogue system according to any one of Items 1 to 9, further comprising an image generating unit that generates an image related to the dialogue content based on the verbal information of the user, wherein
the dialogue system conducts the dialogue with the user using the dialogue agent and the image.
(Item 11)
The dialogue system according to any one of Items 1 to 10, further comprising a summarizing unit that summarizes the dialogue based on a dialogue log of the dialogue.
(Item 12)
The dialogue system according to any one of Items 1 to 11, wherein
the dialogue is a business negotiation with the user, and
the dialogue system proposes a product to the user based on the dialogue content of the business negotiation.
(Item 13)
The dialogue system according to Item 12, wherein the dialogue system presents a catchphrase for the product based on the dialogue content of the business negotiation.
(Item 14)
The dialogue system according to any one of Items 1 to 13, further comprising a database that stores a past history of the dialogue, wherein
the generation unit changes the scenario of the dialogue based on the past history of the dialogue.
(Item 15)
The dialogue system according to any one of Items 1 to 14, further comprising a database that stores a past history of the dialogue, wherein
the generation unit generates the verbal response of the dialogue agent by referring to the past history of the dialogue.
(Item 16)
The dialogue system according to any one of Items 1 to 15, wherein
the second acquisition unit acquires, from the dialogue, non-verbal information indicating an attribute of the user, and
the generation unit generates the verbal response or the non-verbal response in accordance with the attribute of the user.
(Item 17)
A dialogue control method in a dialogue system that conducts a dialogue with a user using a dialogue agent, wherein a computer executes:
a process of acquiring verbal information of the user from the dialogue;
a process of acquiring non-verbal information of the user from the dialogue;
a generation process of generating response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a process of controlling the dialogue agent based on the response content generated in the generation process.
(Item 18)
A program for a dialogue system that conducts a dialogue with a user using a dialogue agent, the program causing a computer to execute:
a process of acquiring verbal information of the user from the dialogue;
a process of acquiring non-verbal information of the user from the dialogue;
a generation process of generating response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a process of controlling the dialogue agent based on the response content generated in the generation process.
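The configuration recited in the first item above (a first acquisition unit, a second acquisition unit, a generation unit, and a control unit) can be illustrated by the following minimal sketch. It is offered only as one possible reading under stated assumptions: the class and function names, the two non-verbal cues chosen, the rule table inside `generate_response`, and the command string returned by `control_agent` are hypothetical and do not appear in the specification. Recognition of the cues from images or audio is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class NonVerbalInfo:
    expression: str       # facial expression, e.g. "smile" or "frown" (assumed labels)
    gaze_on_screen: bool  # whether the user's gaze is on the dialogue screen


@dataclass
class Response:
    verbal: str      # what the dialogue agent says
    non_verbal: str  # the dialogue agent's action, e.g. "nod" (assumed labels)


def acquire_verbal(utterance: str) -> str:
    """First acquisition unit: here the recognized text is used as is."""
    return utterance.strip()


def acquire_non_verbal(expression: str, gaze_on_screen: bool) -> NonVerbalInfo:
    """Second acquisition unit: wraps cues that are assumed to be recognized already."""
    return NonVerbalInfo(expression, gaze_on_screen)


def generate_response(verbal: str, non_verbal: NonVerbalInfo) -> Response:
    """Generation unit: toy rules that use both kinds of information."""
    if non_verbal.expression == "frown":
        return Response(f"Let me explain '{verbal}' another way.", "tilt_head")
    if not non_verbal.gaze_on_screen:
        return Response("Shall we take a short break?", "wave")
    return Response(f"I see. Tell me more about '{verbal}'.", "nod")


def control_agent(response: Response) -> str:
    """Control unit: returns the command a renderer of the agent would receive."""
    return f"say={response.verbal!r}; action={response.non_verbal}"
```

In this sketch the same utterance produces a different verbal response and a different action depending on the user's expression and gaze, which is the dependence on both kinds of information that the item describes.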

Although embodiments of the present invention have been described above, the present invention is not limited to these specific embodiments, and various modifications and applications are possible within the scope of the gist of the present invention as set forth in the claims.

Reference Signs List
1 Dialogue system
10 Terminal device
100 Server device
200, 300 Dialogue screen
201, 301, 1401 Virtual human (dialogue agent)
500 Computer
702 First acquisition unit
703 Second acquisition unit
704 Generation unit
714 Control unit
1501 Image generation unit
1701 Summarizing unit
1901 Catchphrase generation unit

Patent Document 1: JP 2022-093479 A

Claims

A dialogue system that conducts a dialogue with a user using a dialogue agent, the dialogue system comprising:
a first acquisition unit that acquires verbal information of the user from the dialogue;
a second acquisition unit that acquires non-verbal information of the user from the dialogue;
a generation unit that generates response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a control unit that controls the dialogue agent based on the response content generated by the generation unit.

The dialogue system according to claim 1, wherein
the response content of the dialogue agent includes the non-verbal response of the dialogue agent, and
the generation unit generates the non-verbal response of the dialogue agent in accordance with the non-verbal information of the user.

The dialogue system according to claim 2, wherein the generation unit changes the content of an action of the dialogue agent in accordance with the non-verbal information of the user.

The dialogue system according to claim 2, wherein the generation unit changes the timing of an action of the dialogue agent in accordance with the non-verbal information of the user.

The dialogue system according to any one of claims 1 to 4, wherein the non-verbal information of the user includes information on a facial expression, gaze, posture, or emotion obtained from an image of the user.

The dialogue system according to claim 5, wherein the non-verbal information of the user includes information on voice volume, voice intonation, or voice timbre obtained from the voice of the user.

The dialogue system according to claim 1, wherein the generation unit changes the response content of the dialogue agent in accordance with a scenario of the dialogue.

The dialogue system according to claim 1, wherein the generation unit changes the response content of the dialogue agent in accordance with a plurality of preset dialogue stages.

The dialogue system according to claim 8, wherein the generation unit changes the dialogue stage based on gaze information of the user.

The dialogue system according to claim 1, further comprising an image generating unit that generates an image related to the content of the dialogue based on at least one of the verbal information and the non-verbal information of the user, wherein
the dialogue system conducts the dialogue with the user using the dialogue agent and the image.

The dialogue system according to claim 1, further comprising a summarizing unit that summarizes the dialogue based on a dialogue log of the dialogue.

The dialogue system according to claim 1, wherein
the dialogue is a business negotiation with the user, and
the dialogue system proposes a product to the user based on the dialogue content of the business negotiation.

The dialogue system according to claim 12, wherein the dialogue system presents a catchphrase for the product based on the dialogue content of the business negotiation.

The dialogue system according to claim 7, further comprising a database that stores a past history of the dialogue, wherein
the generation unit changes the scenario of the dialogue based on the past history of the dialogue.

The dialogue system according to claim 1, further comprising a database that stores a past history of the dialogue, wherein
the generation unit generates the verbal response of the dialogue agent by referring to the past history of the dialogue.

The dialogue system according to claim 1, wherein
the second acquisition unit acquires, from the dialogue, non-verbal information indicating an attribute of the user, and
the generation unit generates the verbal response or the non-verbal response in accordance with the attribute of the user.

A dialogue control method in a dialogue system that conducts a dialogue with a user using a dialogue agent, wherein a computer executes:
a process of acquiring verbal information of the user from the dialogue;
a process of acquiring non-verbal information of the user from the dialogue;
a generation process of generating response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a process of controlling the dialogue agent based on the response content generated in the generation process.

A program for a dialogue system that conducts a dialogue with a user using a dialogue agent, the program causing a computer to execute:
a process of acquiring verbal information of the user from the dialogue;
a process of acquiring non-verbal information of the user from the dialogue;
a generation process of generating response content including a verbal response and a non-verbal response of the dialogue agent, based on the verbal information of the user and the non-verbal information of the user; and
a process of controlling the dialogue agent based on the response content generated in the generation process.