JP7026066B2 - Voice guidance system and voice guidance method

Info

Publication number: JP7026066B2
Application number: JP2019045443A
Authority: JP
Inventors: 晋資大竹; 辰徳大原
Original assignee: Hitachi Ltd; Hitachi Building Systems Co Ltd
Current assignee: Hitachi Ltd; Hitachi Building Systems Co Ltd
Priority date: 2019-03-13
Filing date: 2019-03-13
Publication date: 2022-02-25
Anticipated expiration: 2039-03-13
Also published as: CN111687831A; CN111687831B; JP2020149264A

Description

The present invention relates to a voice guidance system and a voice guidance method.

In recent years, voice guidance systems have become known that apply speech recognition and conversation technology to a user's spoken input and return an appropriate spoken answer. One example of such a voice guidance system is a conversation robot. For example, Patent Document 1 discloses a technique that allows a conversation robot to converse naturally with a plurality of speakers.

Patent Document 1: Japanese Unexamined Patent Publication No. 2012-76162

Conventionally developed voice guidance systems have the problem that they cannot converse with a plurality of guidance target persons at the same time.
For example, if another guidance target person B asks a question during a conversation with guidance target person A, the conversation robot answers guidance target person B while still holding the conversation state with guidance target person A, and therefore cannot give an appropriate answer.

Further, in a usage scene in which the system assists a conversation between a guidance target person and a non-guidance target person (guidance executor), the conversation robot may respond to an utterance (question) of the guidance target person, but if it responds to an utterance (answer) of the guidance executor, the conversation breaks down.
Furthermore, it would be convenient if the conversation robot answered in place of the non-guidance target person when that person cannot answer, but at present it is difficult for the conversation robot (voice guidance system) to determine whether or not it should answer.
One situation in which the conversation robot assists a conversation between the guidance target person and the non-guidance target person (guidance executor) is when the robot translates the conversation between them. During such translation, deciding whether or not to answer requires an even more complex judgment.

An object of the present invention is to provide a voice guidance system and a voice guidance method capable of responding appropriately when a plurality of speakers are present.

In order to solve the above problems, for example, the configurations described in the claims are adopted.
The present application includes a plurality of means for solving the above problems. One example is a voice guidance system that includes a camera, a microphone, and a speaker, and in which the speaker outputs guidance voice based on voice input to the microphone.
The system includes: a question reception unit that accepts a question by voice input to the microphone; a voice recognition unit that calculates features of the user's voice corresponding to the question accepted by the question reception unit and recognizes the user based on the calculated voice features; an image recognition unit that, when the question reception unit detects voice, calculates features of an image of the user captured by the camera and recognizes the user based on the calculated image or image features; a guidance unit that selects the user to be guided using the voice features calculated by the voice recognition unit and the image features calculated by the image recognition unit, and causes the speaker to output guidance voice for the selected user; and a translation unit that converts output voice into voice translated into a predetermined language. Based on the reception status of voice input to the microphone at the question reception unit, the guidance unit causes the speaker to output, instead of guidance voice, voice obtained by the translation unit translating the voice accepted by the question reception unit into the predetermined language.

According to the present invention, even when a plurality of users are near one another, each user can be recognized individually and given an appropriate answer. Guidance that supports a conversation between a plurality of users also becomes possible.
Problems, configurations, and effects other than those described above will be clarified by the following description of the embodiments.

FIG. 1 is a block diagram showing a configuration example of the entire voice guidance system in the first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the robot in the first embodiment of the present invention.
FIG. 3 is a block diagram showing a configuration example of the robot control device in the first embodiment of the present invention.
FIG. 4 is a block diagram showing a configuration example of the robot management server in the first embodiment of the present invention.
FIG. 5 is a diagram showing an example of the appearance of the robot in the first embodiment of the present invention.
FIG. 6 is a diagram showing a usage example in the first embodiment of the present invention.
FIG. 7 is a diagram showing the language selection screen in the first embodiment of the present invention.
FIG. 8 is a flowchart showing the flow of the entire service in the first embodiment of the present invention.
FIG. 9 is a flowchart showing a processing example of the translation service in the first embodiment of the present invention.
FIG. 10 is a sequence diagram (first half) of robot answer processing based on a conversation failure in the first embodiment of the present invention.
FIG. 11 is a sequence diagram (second half, continuing from FIG. 10) of robot answer processing based on a conversation failure in the first embodiment of the present invention.
FIG. 12 is a sequence diagram of a robot answer based on expiration of the answer timer in the first embodiment of the present invention.
FIG. 13 is a flowchart showing the flow of the entire service in the second embodiment of the present invention.
FIG. 14 is a flowchart showing a processing example of the conversation assistance service in the second embodiment of the present invention.
FIG. 15 is a flowchart showing a processing example of the translation service in the third embodiment of the present invention.
FIG. 16 is a sequence diagram (first half) of a robot answer at the time of a conversation failure in the third embodiment of the present invention.
FIG. 17 is a sequence diagram (second half, continuing from FIG. 16) of a robot answer at the time of a conversation failure in the third embodiment of the present invention.

<1. First Embodiment>
Hereinafter, a first embodiment of the present invention will be described with reference to FIGS. 1 to 12.
The voice guidance system of the first embodiment translates a conversation between a guidance target person (facility customer) and a guidance executor (facility staff). When the guidance executor cannot answer a question from the guidance target person, the voice guidance system of the first embodiment can answer on the executor's behalf.

[System configuration]
FIG. 1 shows the overall configuration of the voice guidance system of the first embodiment of the present invention.
The voice guidance system 1 is composed of a robot 100, a robot control device 200, and a robot management server 300. The robot 100 and the robot control device 200 are installed at a site 2 where the robot 100 is operated. The site 2 is a facility such as a shopping center.
The robot 100 provides guidance through conversation.
The robot control device 200 controls the robot 100.
The robot management server 300 monitors the operation status of the robot 100 and is operated by, for example, the business operator that provides the system. The robot management server 300 is connected via a network to the robot control device 200 installed at the site 2.

At a site 2 such as a commercial facility, the robot 100 of the voice guidance system 1 assists guidance by translating, in both directions, the conversation between the guidance executor and a guidance target person who speaks a different language from the guidance executor. When the guidance executor cannot answer, the robot 100 answers the guidance target person in place of the guidance executor.

FIG. 2 shows a configuration example of the robot 100.
The robot 100 is composed of a CPU (Central Processing Unit) 110 that controls the voice guidance processing, a storage device 120 that stores software and data, an input/output device 130, and a communication interface 140 that communicates with external devices.

The storage device 120 includes an input/output unit 121, a scenario execution unit 122, and a screen processing unit 123.
The input/output unit 121 holds data and instructions received from the robot control device 200 and passes them to each processing unit.
The scenario execution unit 122 issues voice output and screen display instructions according to scenario instructions received from the robot control device 200. The scenario execution unit 122 also acquires data from the input/output device 130.
The screen processing unit 123 accesses the robot control device 200 and displays a screen according to the screen display instruction received from the scenario execution unit 122. The screen processing unit 123 also accepts touch input from the input/output device 130.

The input/output device 130 includes a microphone array 131 made up of a plurality of microphones, a camera 132, a speaker 133, and a touch panel 134.
The microphone array 131 acquires the audio captured by the plurality of microphones as multi-channel data. Each microphone of the microphone array 131 captures sound arriving from a different direction. The camera 132 acquires visual data such as video and still images. The speaker 133 outputs sound. The touch panel 134 displays screens and acquires touch input data.

The robot 100 continuously transmits the data acquired by the microphone array 131 and the camera 132 to the robot control device 200 at regular intervals through the communication interface 140, and receives scenario execution instructions from the robot control device 200 through the communication interface 140.

FIG. 3 shows a configuration example of the robot control device 200. The robot control device 200 is composed of a CPU 210 that performs the processing of each unit, a storage device 220 that stores software and data, and a communication interface 230 that communicates with external devices.
The storage device 220 includes an input/output unit 221, a question reception unit 222, a voice recognition unit 223, an image recognition unit 224, a guidance unit 225, and a translation unit 226.
The input/output unit 221 processes input and output data exchanged with the robot 100 and the robot management server 300, holds the data, and passes it to each unit in the robot control device 200.

The question reception unit 222 performs question reception processing: from the multi-channel voice data received from the robot 100, it accepts the voice input of the microphone (channel) on which the recorded user's voice is loudest. At this time, the question reception unit 222 estimates the direction of arrival of the voice from the installation direction of that microphone.
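The channel selection described above can be illustrated with a minimal sketch. It assumes that loudness is measured as the RMS level of each channel and that each microphone's installation direction is known as a horizontal angle; the function names `rms` and `select_loudest_channel` and the angle convention are hypothetical, since the patent does not specify how loudness is computed.

```python
import math
from typing import Sequence, Tuple


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one channel (assumed loudness measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_loudest_channel(
    channels: Sequence[Sequence[float]],
    mic_angles_deg: Sequence[float],
) -> Tuple[int, float]:
    """Return (channel index, estimated arrival angle) for the loudest channel.

    The arrival direction is taken to be the installation direction of the
    microphone whose channel is loudest.
    """
    if not channels or len(channels) != len(mic_angles_deg):
        raise ValueError("need at least one channel and one angle per channel")
    levels = [rms(ch) for ch in channels]
    best = max(range(len(levels)), key=lambda i: levels[i])
    return best, mic_angles_deg[best]
```

A real microphone array would more likely combine channels (for example by beamforming) than pick a single one; the sketch only mirrors the simple selection rule stated in the text.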

The voice recognition unit 223 performs voice recognition processing. In this processing, it calculates feature values of the voice accepted by the question reception unit 222 and identifies the user based on the voice. The voice feature values are calculated in order to distinguish the voices of the multiple people around the robot 100.
The voice recognition unit 223 also converts the voice accepted by the question reception unit 222 into text.
The image recognition unit 224 processes the image data received from the robot 100, calculates feature values of the user's image, and identifies the user based on the image. When multiple people appear in the image, each user is recognized individually.

When the voice recognition unit 223 or the image recognition unit 224 recognizes a user, a user registered in advance can be recognized as a specific user. That is, the voice features and image features of the guidance executor 4 are registered in advance in the voice recognition unit 223 and the image recognition unit 224. With this registration, a user for whom the pre-registered voice features and/or image features are detected can be treated as the non-guidance target person (guidance executor 4), and a user for whom voice features or image features that are not registered in advance are detected can be treated as the guidance target person 3.
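The registration-based distinction above can be sketched as follows. This is only an illustration under stated assumptions: features are plain numeric vectors, matching uses cosine similarity against a fixed threshold, and the names `classify_user`, `cosine`, and the value `0.8` are hypothetical. The patent does not say how features are compared.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def matches(feature: Optional[Vector], registered: Sequence[Vector],
            threshold: float) -> bool:
    """True if the feature is close to any pre-registered feature."""
    if feature is None:
        return False
    return any(cosine(feature, r) >= threshold for r in registered)


def classify_user(voice: Optional[Vector], image: Optional[Vector],
                  registered_voice: Sequence[Vector],
                  registered_image: Sequence[Vector],
                  threshold: float = 0.8) -> str:
    """Registered voice and/or image features -> guidance executor;
    otherwise -> guidance target person."""
    if (matches(voice, registered_voice, threshold)
            or matches(image, registered_image, threshold)):
        return "guidance_executor"
    return "guidance_target"
```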

The guidance unit 225 is composed of a conversation continuation determination unit 225a, an answer availability determination unit 225b, a conversation failure determination unit 225c, and an answer output unit 225d. Based on the user identification results of the voice recognition unit 223 and the image recognition unit 224, it selects a user and carries out a conversation with the selected user.
The conversation continuation determination unit 225a determines whether the conversation between the guidance target person and the guidance executor is continuing.
The answer availability determination unit 225b determines whether the question of the guidance target person can be answered.
The conversation failure determination unit 225c determines whether the conversation between the guidance target person and the guidance executor has broken down.
The answer output unit 225d searches for and outputs an answer to the question of the guidance target person.
The translation unit 226 translates the utterances converted into text by the voice recognition unit 223. The source and target languages are determined based on the user's standing position and input on the touch panel 134.

FIG. 4 shows a configuration example of the robot management server 300.
The robot management server 300 is composed of a CPU 310 that controls the processing of each unit, a storage device 320 that stores software and data, and a communication interface 330 that communicates with external devices.
The storage device 320 has a robot management unit 321, which manages the state of each robot and handles tasks such as maintenance planning and repair notifications.

[Examples of robot shapes and usage patterns]
FIG. 5 shows an example of the appearance of the robot 100. The robot 100 of this embodiment is a signage-type robot composed of a tablet 101, which includes the camera 132, the speaker 133, and the touch panel 134, and a cradle 102, which includes the microphone array 131. The cradle 102 holds the tablet 101 and is connected to the tablet 101 it holds.

The microphone array 131 has a plurality of microphones arranged horizontally, each set to collect sound from a different direction (horizontal angle). The appearance of the robot 100 shown in FIG. 5 is one example, and other shapes may be used. For example, the camera 132 and the speaker 133 may be provided in the cradle 102.
The camera 132 is arranged on the front of the tablet 101 and photographs users in front of the tablet 101. The speaker 133 is also arranged on the front of the tablet 101.

The touch panel 134 includes a text display unit 134a and an avatar display unit 134b. The text display unit 134a displays the text of a translation result or of an answer from the voice guidance system. The avatar display unit 134b displays an avatar animation that moves in accordance with the text shown on the text display unit 134a, as well as images used for guidance. FIG. 5 shows an example in which a robot is displayed as the avatar: a line of conversation (here, “Hello”) appears on the text display unit 134a, so that the robot (avatar) appears to be speaking.

FIG. 6 shows an example of translation using the voice guidance system. The guidance target person 3 and the guidance executor 4 stand in front of the robot 100, facing it, and converse with each other while the robot 100 assists the guidance and translates, based on the voice and text translated by the voice guidance system 1. Here, the guidance target person 3 is a user of the site 2, and the guidance executor 4 is a person waiting at the site 2 to provide guidance (such as an employee of the site 2).

FIG. 7 shows a screen for selecting the languages used by the guidance target person 3 and the guidance executor 4.
Button 134c is the language selection button for the guidance target person 3. Button 134d is the language selection button for the guidance executor 4. Button 134e is a button for swapping the standing positions of the guidance target person 3 and the guidance executor 4.

Swapping the guidance target person 3 and the guidance executor 4 with the standing position swap button 134e swaps the positions of the guidance target person language selection button 134c and the guidance executor language selection button 134d on the touch panel 134.
The guidance target person 3 and the guidance executor 4 each choose the language to be used for their speech recognition and translation by selecting the language selection buttons 134c and 134d, respectively.

[Execution example of guidance service]
FIG. 8 is a flowchart showing an example of the execution procedure of the guidance service by the voice guidance system 1. Here, as shown in FIG. 6, the guidance target person 3 and the guidance executor 4 are standing close in front of the robot 100.

First, the robot control device 200 performs user determination (step S100). In this user determination, each of the users standing in front of the robot 100 is determined to be either the guidance target person 3 or the guidance executor 4, based on image, voice, and touch input.
This determination is made, for example, as follows.
First, image recognition processing in the image recognition unit 224 uses the pre-registered face image of the guidance executor to identify whether the guidance executor is standing on the right or on the left.
Next, detection of the voice arrival direction by the voice recognition unit 223 identifies whether the speaker is on the left or on the right, and from this it is estimated whether the speaker is the guidance executor or not (that is, the guidance target person).
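The two-stage user determination above (face recognition fixes the side on which the guidance executor stands, then the voice arrival direction fixes the side of the speaker) can be sketched as a small function. The angle convention (negative angles are to the left of the robot) and the name `speaker_role` are assumptions made for illustration only.

```python
def speaker_role(executor_side: str, arrival_angle_deg: float) -> str:
    """Estimate the speaker's role from where the voice came from.

    executor_side: "left" or "right", as found by face recognition of the
        pre-registered guidance executor.
    arrival_angle_deg: horizontal arrival angle of the voice; negative is
        assumed to mean the robot's left, zero or positive the right.
    """
    if executor_side not in ("left", "right"):
        raise ValueError("executor_side must be 'left' or 'right'")
    speaker_side = "left" if arrival_angle_deg < 0 else "right"
    if speaker_side == executor_side:
        return "guidance_executor"
    return "guidance_target"
```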

Returning to the flowchart of FIG. 8, the robot control device 200 then performs language determination (step S200). Here, the spoken language of the guidance target person 3 (first language) and the spoken language of the guidance executor 4 (second language) are determined.
This language determination is performed, for example, by language identification based on the voice acquired by the voice recognition unit 223. Alternatively, it is made from the touch input settings shown in FIG. 7.

Next, the robot control device 200 provides the translation service (step S300). Here, translation from the first language to the second language and from the second language to the first language is performed. In the following example, the first language is English and the second language is Japanese.
With the languages set in this way, when the guidance executor 4 cannot give an appropriate answer, the robot 100 answers on behalf of the guidance executor 4 in the first language (English).
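As a small illustration of the bidirectional translation described here, the translation direction can be derived from the speaker's role. The role strings, the language codes `"en"` and `"ja"`, and the function name are hypothetical placeholders that follow the English/Japanese example in the text.

```python
from typing import Tuple


def translation_direction(role: str,
                          first_language: str = "en",
                          second_language: str = "ja") -> Tuple[str, str]:
    """Return (source language, target language) for an utterance.

    The guidance target person speaks the first language, so their speech is
    translated first -> second; the guidance executor's speech goes the
    other way.
    """
    if role == "guidance_target":
        return first_language, second_language
    if role == "guidance_executor":
        return second_language, first_language
    raise ValueError(f"unknown role: {role}")
```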

FIG. 9 is a flowchart showing the flow of the translation service in step S300.
FIGS. 10 and 11 (the two figures are continuous) are sequence diagrams for the case where the flowchart of FIG. 9 is executed and the robot answers based on a conversation failure. FIG. 12 is a sequence diagram for the case where the flowchart of FIG. 9 is executed and the robot answers based on expiration of the answer timer. In these sequence diagrams, the parts corresponding to each step of the flowchart of FIG. 9 are given the same step numbers.

The details of the translation service S300 shown in FIG. 9 are described below.
First, the question reception unit 222 of the robot control device 200 checks for interrupts (step S1) and determines whether an interrupt has occurred (step S2). The interrupts handled by the question reception unit 222 here are an interrupt caused by voice input and an interrupt caused by expiration of the answer timer.
When there is no interrupt in step S2 (“no interrupt” in step S2), the question reception unit 222 returns to the check in step S1.

When it is determined in step S2 that there is an interrupt (“interrupt present” in step S2), the question reception unit 222 determines whether it is an interrupt caused by voice input or an interrupt caused by expiration of the answer timer (step S3).
When the interrupt is determined in step S3 to be caused by voice input (“voice” in step S3), the question reception unit 222 detects the direction of arrival of the voice from the multi-channel voice data and determines whether the speaker is the guidance target person 3 or the guidance executor 4 (step S4).
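Steps S1 to S3 amount to polling for an interrupt and dispatching on its kind. The sketch below models this with a standard-library queue; the event tuple layout `(kind, payload)` and the callback names are assumptions, and what the timer callback does is left to the caller.

```python
import queue
from typing import Any, Callable


def poll_once(events: "queue.Queue[tuple]",
              on_voice: Callable[[Any], None],
              on_timer: Callable[[], None]) -> str:
    """Run one pass of steps S1 to S3 and report which branch was taken."""
    try:
        kind, payload = events.get_nowait()  # S1: check for an interrupt
    except queue.Empty:
        return "none"                        # S2: no interrupt, caller loops
    if kind == "voice":                      # S3: voice input interrupt
        on_voice(payload)
    elif kind == "timer":                    # S3: answer timer expired
        on_timer()
    else:
        raise ValueError(f"unknown interrupt kind: {kind}")
    return kind
```

A caller would invoke `poll_once` repeatedly, which corresponds to returning to the check in step S1 whenever no interrupt is pending.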

When the speaker determined from the voice direction in step S4 is the guidance target person 3 (“customer” in step S4), the voice recognition unit 223 recognizes the speech in English, the translation unit 226 translates the result into Japanese, and the result is output as voice from the speaker 133 of the robot 100 (step S5). The translated text may also be displayed on the touch panel 134, either together with or instead of the voice output.

When translation of the utterance of the guidance target person 3 is finished, the conversation failure determination unit 225c increments the question count (step S6), and the conversation continuation determination unit 225a starts the answer timer (step S7).

When the speaker determined from the voice direction in step S4 is the guidance executor 4 (“site administrator” in step S4), the voice recognition unit 223 recognizes the speech in Japanese and the translation unit 226 translates the result into English. The speaker 133 and the touch panel 134 of the robot 100 then output the translated voice or text (step S9).

When translation of the utterance of the guidance executor 4 in step S9 is finished, the conversation failure determination unit 225c acquires an image from the robot 100, has the image recognition unit 224 perform face recognition and emotion determination, and determines whether the user's reaction is positive or negative (step S10).
When the user's reaction is positive (“positive” in step S10), the conversation failure determination unit 225c clears the customer question count (step S11), and the conversation continuation determination unit 225a clears the answer timer (step S12).

When the determination in step S10 finds that the user's reaction is negative ("negative" in step S10), the conversation failure detection unit 225c determines whether the customer question count is equal to or greater than a threshold (step S13). If the count is below the threshold ("below threshold" in step S13), the process proceeds to step S12, and the conversation continuation determination unit 225a clears the answer timer.

If the customer question count is equal to or greater than the threshold in step S13 ("threshold or more" in step S13), the conversation failure detection unit 225c regards the conversation as having failed, and the answerability determination unit 225b determines whether an answer is possible (step S14).

If the determination in step S14 finds that no answer is possible ("not possible" in step S14), the conversation failure detection unit 225c clears the customer question count (step S11), and the conversation continuation determination unit 225a clears the answer timer (step S12).
If the determination in step S14 finds that an answer is possible ("possible" in step S14), the answer output unit 225d searches for an answer (step S15) and outputs the search result through the speaker 133 and/or the touch panel 134 of the robot 100 (step S16).

After the robot answers, the conversation failure detection unit 225c clears the customer question count (step S11), and the conversation continuation determination unit 225a clears the answer timer (step S12).
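
The question counting and threshold check of steps S6 and S10 to S14 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the names `BreakdownDetector` and `Reaction` and the threshold value of 2 are hypothetical (the description gives no concrete threshold), and the emotion determination and the answerability check are reduced to plain inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Reaction(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


@dataclass
class BreakdownDetector:
    """Tracks customer questions in the manner of steps S6 and S10 to S14."""

    threshold: int = 2       # hypothetical value; not given in the description
    question_count: int = 0

    def on_customer_utterance(self) -> None:
        # Step S6: count each translated utterance of the guidance target person.
        self.question_count += 1

    def on_executor_utterance(self, reaction: Reaction, can_answer: bool) -> bool:
        """Return True when the robot should answer in place of the executor."""
        if reaction is Reaction.POSITIVE:
            self.question_count = 0      # step S11
            return False
        if self.question_count < self.threshold:
            return False                 # step S13, below threshold: count is kept
        # Conversation regarded as failed. The count is cleared whether or not
        # an answer is possible (step S11 follows both branches of step S14).
        self.question_count = 0
        return can_answer                # step S14
```

With a threshold of 2, the sketch stays silent after the first hesitant reply and lets the robot take over after the second, provided an answer is available.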

If the answer timer expires during the interrupt check (steps S1 and S2), the conversation continuation determination unit 225a raises a timer-expiration interrupt, and the question reception unit 222 detects it ("answer timer expired" in step S3). In this case, the answerability determination unit 225b determines whether the immediately preceding question of the guidance target person can be answered (step S14).
If an answer is possible ("possible" in step S14), the answer output unit 225d searches for an answer (step S15) and outputs the search result through the speaker 133 or the touch panel 134 of the robot 100 (step S16).

After the answer timer is started (step S7), and likewise after the answer timer is cleared (step S12), the conversation continuation determination unit 225a performs a usage state check that determines, based on face recognition by the image recognition unit 224, whether the user is in front of the robot 100 (step S8).
When the conversation continuation determination unit 225a determines that the user is present ("in use" in step S8), the process returns to the interrupt check in step S1. When it determines that the user is not present ("use ended" in step S8), the translation service ends.
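
The answer timer that is started in step S7, cleared in step S12, and examined during the interrupt check of steps S1 to S3 can be sketched as a small polled timer. The class name `AnswerTimer`, the `now` parameter, and the 10-second default are assumptions introduced for illustration; the description does not state the timer duration or how the interrupt is delivered.

```python
import time
from typing import Optional


class AnswerTimer:
    """Polled answer timer; a sketch of steps S7 (start) and S12 (clear)."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        # timeout_s is a hypothetical value; the description gives no duration.
        self.timeout_s = timeout_s
        self._deadline: Optional[float] = None

    def start(self, now: Optional[float] = None) -> None:
        # Step S7: start waiting for the guidance executor's answer.
        now = time.monotonic() if now is None else now
        self._deadline = now + self.timeout_s

    def clear(self) -> None:
        # Step S12: stop waiting.
        self._deadline = None

    def expired(self, now: Optional[float] = None) -> bool:
        # Called from the interrupt check; a cleared timer never expires.
        if self._deadline is None:
            return False
        now = time.monotonic() if now is None else now
        return now >= self._deadline
```

Passing `now` explicitly keeps the sketch deterministic in tests; in normal use the monotonic clock is read instead, so wall-clock adjustments do not affect the timeout.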

FIGS. 10 and 11 (the two figures form one continuous sequence diagram) are sequence diagrams for the case where the robot answers based on a conversation failure.
In this example, the guidance target person 3 first speaks in English, and the robot 100 transmits the voice input to the microphone array 131 to the question reception unit 222 of the robot control device 200. The question reception unit 222 detects this voice interrupt, the voice recognition unit 223 detects the voice direction and converts the speech to text, and the translation unit 226 translates the text from English into Japanese. At this time, the conversation failure detection unit 225c increments the question count.
The robot 100 then outputs the translation result from the translation unit 226, and the conversation continuation determination unit 225a starts the answer timer. This completes the translation processing for the utterance of the guidance target person 3, shown in the upper half of FIG. 10.

Next, when the guidance executor 4 speaks an answer, the robot 100 transmits the voice input to the microphone array 131 to the question reception unit 222 of the robot control device 200. The question reception unit 222 detects this voice interrupt, the voice recognition unit 223 detects the voice direction and converts the speech to text, and the translation unit 226 translates the text from Japanese into English.
The robot 100 then outputs the translation result from the translation unit 226. This completes the translation processing for the utterance of the guidance executor 4, shown in the lower half of FIG. 10.

In the present embodiment, the robot 100 additionally answers based on the facial expression of the guidance target person 3.
That is, after the voice output of the answer shown in FIG. 10, face recognition and emotion determination for the recognized face are performed on the image captured by the camera 132 of the robot 100, as shown in FIG. 11, and the conversation failure detection unit 225c determines whether the conversation has failed. When a conversation failure is detected, the answerability determination unit 225b determines whether the question of the guidance target person 3 can be answered; if so, the answer output unit 225d searches for an answer and causes the robot 100 to output it.

The conversation failure detection unit 225c then clears the question count, the conversation continuation determination unit 225a clears the answer timer, and the process ends after the usage state is confirmed. If the conversation is continuing, the process instead returns to the beginning of FIG. 10.

FIG. 12 is a sequence diagram for the case where the robot answers based on expiration of the answer timer.
In this example, as in FIG. 10, the guidance target person 3 first speaks in English, and the robot 100 transmits the voice input to the microphone array 131 to the question reception unit 222 of the robot control device 200. The question reception unit 222 detects this voice interrupt, the voice recognition unit 223 detects the voice direction and converts the speech to text, and the translation unit 226 translates the text from English into Japanese. At this time, the conversation failure detection unit 225c increments the question count.
The robot 100 then outputs the translation result from the translation unit 226, and the conversation continuation determination unit 225a starts the answer timer. Up to this point, the processing is the same as the translation processing for the utterance of the guidance target person 3 shown in the upper half of FIG. 10.

Thereafter, the answer timer expires in the conversation continuation determination unit 225a, and the question reception unit 222 detects the timer-expiration interrupt. The answerability determination unit 225b then determines whether an answer is possible; if so, the answer output unit 225d searches for an answer, and the robot 100 outputs the answer as voice and/or an image.
After the answer search by the answer output unit 225d, the conversation failure detection unit 225c clears the question count, the conversation continuation determination unit 225a also clears the answer count, and the process ends after the usage state is confirmed based on the face recognition result.

A specific example of the conversation and answers among the guidance target person 3, the guidance executor 4, and the robot 100 in the present embodiment is shown below.
Utterance of guidance target person 3: "Hello"
Translation by robot 100: "こんにちは" (Hello)
Utterance of guidance executor 4: "なにかお困りごとですか？" (Is something troubling you?)
Translation by robot 100: "Can you help you with something?"
Utterance of guidance target person 3: "I’m looking for coinlocker"
Translation by robot 100: "コインロッカーを探しています" (I am looking for a coin locker)
Utterance of guidance executor 4: "うーん・・・" (Hmm...)
Translation by robot 100: "Umm・・・"
Utterance of guidance target person 3: "Don’t you know?"
Utterance of guidance executor 4: "えーっと・・・" (Um...)
Translation by robot 100: "Umm・・・" (conversation failure detected)
Answer by robot 100: "I’m answer behalf of him. There are coinlockers at ～～～"
Utterance of guidance target person 3: "Oh! Thank you!!"
Translation by robot 100: "おお！ありがとう！" (Oh! Thank you!)
Guidance target person 3 leaves
Robot 100 determines that the conversation has ended

As described above, the voice guidance system 1 of the present embodiment allows the question of the guidance target person 3 and the answer of the guidance executor 4 to be exchanged appropriately while being translated.
In particular, distinguishing the guidance target person 3 from the guidance executor 4 by voice recognition and image recognition makes it possible to accept questions and output the answers to them appropriately.
Furthermore, when the conversation between the guidance target person 3 and the guidance executor 4 fails, that is, when the answer is not appropriate, the voice guidance system 1 can answer on the executor's behalf, so that guidance by conversation can still be achieved.

<2. Second Embodiment>
Next, a second embodiment of the present invention will be described with reference to FIGS. 13 and 14.
The configuration of the voice guidance system 1 of the second embodiment is the same as the configuration described with reference to FIGS. 1 to 7 for the first embodiment, and redundant description is omitted.
In the second embodiment, the voice guidance system 1 provides a conversation assistance service that assists the conversation between the guidance target person 3 and the guidance executor 4.

[Execution example of conversation assistance service]
FIG. 13 is a flowchart showing the flow of the conversation assistance service provided by the voice guidance system 1.
First, the robot control device 200 determines the spoken language (step S400).
Next, the robot control device 200 performs the conversation assistance service (step S500) based on the spoken language determined in step S400.

FIG. 14 is a flowchart showing the details of the conversation assistance service in step S500.
First, the question reception unit 222 of the robot control device 200 checks for an interrupt (step S21) and determines whether an interrupt has occurred (step S22). If there is no interrupt ("no interrupt" in step S22), the question reception unit 222 returns to the interrupt check in step S21.

If the determination in step S22 finds that an interrupt has occurred ("interrupt" in step S22), the question reception unit 222 determines whether the interrupt is due to voice input or due to expiration of the answer timer (step S23).
When the interrupt is determined in step S23 to be due to voice input ("voice" in step S23), the answerability determination unit 225b determines whether the recognized speech can be answered (step S24).

If an answer is possible ("possible" in step S24), the answerability determination unit 225b searches for an answer (step S25) and outputs an image of the answer (step S26). Here the answer is output only as an image; no voice is output.

After the image of the answer is output in step S26, the conversation failure detection unit 225c determines the reaction of the guidance target person 3 (step S27). If the reaction is positive ("positive" in step S27), the conversation failure detection unit 225c clears the customer question count (step S28), and the answer timer is cleared (step S29).

If the reaction determined in step S27 is negative ("negative" in step S27), the conversation failure detection unit 225c determines whether the customer question count is equal to or greater than the threshold (step S31). If the count is below the threshold ("below threshold" in step S31), the process proceeds to step S29, and the conversation continuation determination unit 225a clears the answer timer.

If the customer question count is equal to or greater than the threshold in step S31 ("threshold or more" in step S31), the conversation failure detection unit 225c regards the conversation as having failed, and the answer output unit 225d outputs the immediately preceding answer (the answer given as an image) by voice (step S32). The process then proceeds to step S28, and the conversation failure detection unit 225c clears the customer question count.

After the answer timer is cleared (step S29), the conversation continuation determination unit 225a performs a usage state check that determines, based on face recognition by the image recognition unit 224, whether the user is in front of the robot 100 (step S30).
When the conversation continuation determination unit 225a determines that the user is present ("in use" in step S30), the process returns to the interrupt check in step S21. When it determines that the user is not present ("use ended" in step S30), the translation service ends.
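
The image-first, voice-on-failure behaviour of steps S24 to S32 can be sketched as follows. This is a rough illustration, not the embodiment's implementation. The name `ConversationAssist`, the threshold of 2, and the tuple-based output are hypothetical. The description does not say where the question count is incremented in this flow or what happens when no answer is possible, so the sketch assumes the count rises once per displayed answer and that nothing is output for an unanswerable utterance.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ConversationAssist:
    """Shows answers as images and escalates to voice on conversation failure."""

    threshold: int = 2       # hypothetical value; not given in the description
    question_count: int = 0

    def handle(self, answer: Optional[str], negative: bool) -> List[Tuple[str, str]]:
        if answer is None:
            return []                           # assumption: unanswerable, no output
        outputs = [("image", answer)]           # step S26: image only, no voice
        self.question_count += 1                # assumption: one count per shown answer
        if not negative:
            self.question_count = 0             # step S28
        elif self.question_count >= self.threshold:
            outputs.append(("voice", answer))   # step S32: repeat the answer by voice
            self.question_count = 0             # step S28
        return outputs
```

The list of `(channel, text)` pairs stands in for the touch panel 134 and the speaker 133; a caller would dispatch each pair to the corresponding output device.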

As described above, in the present embodiment the robot 100 assists a conversation among multiple people by displaying images related to the conversation, and when the conversation fails it answers by voice, so that the conversation can continue.

<3. Third Embodiment>
Next, a third embodiment of the present invention will be described with reference to FIGS. 15 to 17.
The configuration of the voice guidance system 1 of the third embodiment is the same as the configuration described with reference to FIGS. 1 to 7 for the first embodiment, and redundant description is omitted.
In the third embodiment, the robot 100 answers an utterance (question) from the guidance target person 3, and when the conversation between the guidance target person 3 and the robot 100 fails, the voice guidance system 1 translates the guidance given by the guidance executor 4 and conveys it to the guidance target person 3.
As shown in FIG. 8, the overall service proceeds in the order of user determination (step S100), language determination (step S200), and translation service (step S300), and the translation service is executed according to the procedure described with the flowchart of FIG. 15.

[Execution example of guidance service]
FIG. 15 is a flowchart showing the flow of the translation service in the present embodiment.
FIGS. 16 and 17 (the two figures are continuous) are sequence diagrams for the case where the flowchart of FIG. 15 is executed and the guidance executor 4 answers based on a conversation failure. In these sequence diagrams, the parts corresponding to the steps of the flowchart of FIG. 15 are given the same step numbers. In the example of FIG. 15 as well, the guidance target person 3 speaks English and the guidance executor 4 speaks Japanese.

The translation service shown in FIG. 15 is described in detail below.
First, the question reception unit 222 of the robot control device 200 checks for interrupt processing (step S41) and determines whether there is any (step S42). The interrupts that the question reception unit 222 checks for here are an interrupt due to voice input and an interrupt due to expiration of the answer timer.
When there is no interrupt processing in step S42 ("no interrupt" in step S42), the question reception unit 222 returns to the check in step S41.

When it is determined in step S42 that there is interrupt processing ("interrupt" in step S42), the question reception unit 222 detects the direction of arrival of the voice from the multi-channel voice data (step S43). The voice recognition unit 223 then recognizes the input voice (step S44), and the image recognition unit 224 recognizes the input image (step S45) and performs face recognition and identification processing (step S46).

Here, the question reception unit 222 determines whether the speaker is the guidance target person 3 or the guidance executor 4 (step S47). When the speaker determined in step S47 is the guidance target person 3 ("customer (English speaker)" in step S47), the answerability determination unit 225b determines whether an answer is possible (step S48).

If the determination in step S48 finds that an answer is possible ("possible" in step S48), the answer output unit 225d searches for an answer to the question (step S49) and causes the robot 100 to output the retrieved answer as voice and/or an image (step S50).

The conversation failure detection unit 225c then has the image recognition unit 224 perform face recognition and emotion determination, and determines whether the user's reaction is positive or negative (step S51).
When the user's reaction is positive ("positive" in step S51), the conversation continuation determination unit 225a performs a usage state check that determines, based on face recognition by the image recognition unit 224, whether the user is in front of the robot 100 (step S52).
When the conversation continuation determination unit 225a determines that the user is present ("in use" in step S52), the process returns to the interrupt check in step S41. When it determines that the user is not present ("use ended" in step S52), the translation service ends.

The flow up to this point covers the case where the robot 100 answers a question from the guidance target person 3 and the guidance target person 3 reacts positively to the exchange; as long as such positive reactions continue, the conversation between the guidance target person 3 and the robot 100 continues.
If, however, the user's reaction determined in step S51 is negative, the conversation conducted by the robot 100 has failed.

That is, when the user's reaction determined in step S51 is negative ("negative" in step S51), the image recognition unit 224 determines whether the guidance executor 4 is present (step S53). When it is determined in step S48 that no answer is possible, the process likewise proceeds to step S53 to determine whether the guidance executor 4 is present.

When it is determined that the guidance executor 4 is present ("present" in step S53), the translation unit 226 translates the question (English) from the guidance target person 3 into Japanese (step S55), and the robot 100 outputs the translation result as voice and/or an image (step S56). When it is determined in step S53 that the guidance executor 4 is not present ("absent" in step S53), processing to call the guidance executor 4 is performed (step S54), and the process then proceeds to step S55. After the translation result is output, the process moves to the usage state determination in step S52.

When the speaker determined in step S47 is the guidance executor 4 ("site administrator (Japanese speaker)" in step S47), the translation unit 226 translates the answer (Japanese) from the guidance executor 4 into English (step S57), and the robot 100 outputs the translation result as voice and/or an image (step S58). After the translation result is output, the process moves to the usage state determination in step S52.
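
The branching of steps S47 to S58 can be summarized as a routing function that maps the identified speaker and the intermediate determinations to a list of actions. This is a sketch under stated assumptions rather than the embodiment's implementation: the names `Speaker`, `Action`, and `route` are hypothetical, and the answerability check, emotion determination, and presence check are passed in as booleans instead of being computed.

```python
from enum import Enum, auto
from typing import List


class Speaker(Enum):
    GUIDANCE_TARGET = auto()     # guidance target person 3 (English speaker)
    GUIDANCE_EXECUTOR = auto()   # guidance executor 4 (Japanese speaker)


class Action(Enum):
    ROBOT_ANSWER = auto()              # steps S49 and S50
    CALL_EXECUTOR = auto()             # step S54
    TRANSLATE_QUESTION_TO_JA = auto()  # steps S55 and S56
    TRANSLATE_ANSWER_TO_EN = auto()    # steps S57 and S58


def route(speaker: Speaker, can_answer: bool,
          reaction_negative: bool, executor_present: bool) -> List[Action]:
    """Return the actions for one utterance, in execution order."""
    if speaker is Speaker.GUIDANCE_EXECUTOR:
        return [Action.TRANSLATE_ANSWER_TO_EN]
    actions: List[Action] = []
    if can_answer:                          # step S48
        actions.append(Action.ROBOT_ANSWER)
        if not reaction_negative:           # step S51, positive reaction
            return actions
    # Reached on a negative reaction or when no answer is possible (step S53).
    if not executor_present:
        actions.append(Action.CALL_EXECUTOR)
    actions.append(Action.TRANSLATE_QUESTION_TO_JA)
    return actions
```

In every case the caller would then continue with the usage state check of step S52, which is outside this sketch.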

FIGS. 16 and 17 (the two figures form one continuous sequence diagram) are sequence diagrams for the case where the guidance executor 4 answers based on a failure of the conversation with the robot.
In this example, the guidance target person 3 first speaks in English, and the robot 100 transmits the voice input to the microphone array 131 to the question reception unit 222 of the robot control device 200. The question reception unit 222 detects this voice interrupt. At this time, the voice recognition unit 223 detects the voice direction, the image recognition unit 224 performs face recognition based on the image acquired by the camera 132 of the robot 100, and the question reception unit 222 identifies the speaker as the guidance target person 3.

When the speaker is identified as the guidance target person 3, the answerability determination unit 225b determines whether an answer is possible. If so, the answer output unit 225d searches for an answer, and the robot 100 outputs the retrieved answer as English voice and/or an image of English text.
This completes the processing in which the robot 100 answers, shown in the upper half of FIG. 16.

When this answer is output, the image recognition unit 224 performs face recognition of the guidance target person 3 on the image captured by the camera 132 of the robot 100, and the conversation failure detection unit 225c detects a conversation failure from the emotion determination for the guidance target person 3.
When the conversation failure detection unit 225c detects a conversation failure and the presence of the guidance executor 4 is confirmed, the question that the guidance target person 3 spoke in English is translated into Japanese, and the translation result is output.
This completes the processing performed when the robot 100 detects a conversation failure, shown in the lower half of FIG. 16.

Thereafter, in the present embodiment, the answer of the guidance executor 4 is translated.
That is, as shown in FIG. 17, the answer (Japanese utterance) of the guidance executor 4 is transmitted from the robot 100 to the question reception unit 222 of the robot control device 200, and the question reception unit 222 detects the voice interrupt. At this time, the voice recognition unit 223 detects the voice direction and performs voice recognition, the image recognition unit 224 performs image recognition, and the speaker (the guidance executor 4) is identified.

Once the guidance executor 4 is identified, the answer of the guidance executor 4 is translated into English, and the robot 100 outputs the translated answer as English voice and/or an image of English text.

As described above, in the present embodiment the robot 100 answers the question of the guidance target person 3, and when the conversation between the guidance target person 3 and the robot 100 fails, the answer of the guidance executor 4 is translated. The conversation between the guidance target person 3 and the robot 100 can therefore be continued appropriately, with the guidance executor 4 assisting the conversation.

<4. Modifications>
The present invention is not limited to the embodiments described above and includes various modifications.
For example, in the embodiments described above a tablet terminal that displays an avatar is used as the robot 100, but a robot of another form may be used. Also, in the embodiments described above the robot 100 performs input processing with the microphone and camera and output processing with the speaker, while data processing such as user identification, conversation processing, and translation processing is performed by the robot control device 200. Alternatively, some or all of the data processing may be performed within the robot 100.

Further, each of the embodiments so far has shown an example with one guidance target person 3 and one guidance executor 4, but there may be a plurality of guidance target persons 3 or guidance executors 4. For example, when there are a plurality of guidance target persons 3, each guidance target person 3 is identified by voice and image, so that the question of each guidance target person 3 can be answered.
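Handling several guidance target persons amounts to keeping one conversation context per identified person. The following is a minimal sketch under stated assumptions: `select_user`, `SessionStore`, and the string user IDs are hypothetical names introduced for illustration, and the actual voice and image feature matching is outside its scope. It selects a user only when the voice-based and image-based recognition results agree, then records the question in that user's own history.

```python
from dataclasses import dataclass, field
from typing import Optional


def select_user(voice_id: Optional[str], image_id: Optional[str]) -> Optional[str]:
    """Return the user ID only if both recognizers agree on it."""
    if voice_id is not None and voice_id == image_id:
        return voice_id
    return None


@dataclass
class SessionStore:
    """Keeps a separate question history for each identified user."""
    histories: dict[str, list[str]] = field(default_factory=dict)

    def record_question(self, user_id: str, question: str) -> list[str]:
        # Create the history on first contact, then append to it.
        history = self.histories.setdefault(user_id, [])
        history.append(question)
        return history


store = SessionStore()
for voice_id, image_id, question in [
    ("A", "A", "Where is the elevator?"),
    ("B", "B", "Where is the restroom?"),
    ("A", "A", "Which floor is the cafe on?"),
    ("A", "B", "(recognizers disagree, so nobody is selected)"),
]:
    user = select_user(voice_id, image_id)
    if user is not None:
        store.record_question(user, question)
```

Keeping the histories separate means that a question from person B is never interpreted against the conversation state of person A.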

Further, the embodiments described above have been described in detail to explain the present invention in an easy-to-understand manner, and the invention is not necessarily limited to configurations that include all of the described elements. In the configuration diagrams such as FIGS. 1 to 4, only the control lines and information lines considered necessary for explanation are shown, and not all control lines and information lines in an actual product are necessarily shown. In practice, almost all of the components may be considered to be interconnected. In addition, in the flowcharts shown in FIGS. 8, 9, 13, 14, and 15 and the sequence diagrams shown in FIGS. 10, 11, 12, 16, and 17, the execution order of some processing steps may be changed, or some processing steps may be executed simultaneously, as long as the processing results of the embodiments are not affected.

Further, the configurations described in the embodiments above may be implemented in software by having a processor interpret and execute programs that realize the respective functions. Information such as the programs that realize each function can be stored in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or an optical disc.

1 ... voice guidance system, 2 ... site, 3 ... guidance target person, 4 ... guidance executor, 100 ... robot, 110 ... CPU, 120 ... storage device, 121 ... input/output unit, 122 ... scenario execution unit, 123 ... screen processing unit, 130 ... input/output device, 131 ... microphone array, 132 ... camera, 133 ... speaker, 134 ... touch panel, 134a ... text display unit, 134b ... avatar display unit, 134c ... guidance target person language selection button, 134d ... guidance executor language selection button, 134e ... standing position switching button, 140 ... communication interface, 200 ... robot control device, 210 ... CPU, 220 ... storage device, 221 ... input/output device, 222 ... question reception unit, 223 ... voice recognition unit, 224 ... image recognition unit, 225 ... guidance unit, 225a ... conversation continuation determination unit, 225b ... answer availability determination unit, 225c ... conversation failure determination unit, 225d ... answer output unit, 226 ... translation unit, 230 ... communication interface of the robot control device 200, 300 ... robot management server, 310 ... CPU, 320 ... storage device, 321 ... robot management unit, 330 ... communication interface

Claims

A voice guidance system that includes a camera, a microphone, and a speaker, and that outputs, from the speaker, a guidance voice based on voice input to the microphone, the system comprising:
a question reception unit that accepts a question by voice input to the microphone;
a voice recognition unit that calculates features of the voice of the user corresponding to the question accepted by the question reception unit and recognizes the user based on the calculated voice features;
an image recognition unit that, when the question reception unit detects voice, calculates features of an image of the user captured by the camera and recognizes the user based on the calculated image features; and
a guidance unit that selects a user to be guided using the voice features calculated by the voice recognition unit and the image features calculated by the image recognition unit, and outputs, from the speaker, a guidance voice for the selected user,
wherein the guidance unit includes a translation unit that translates the output voice into a predetermined language, and
based on the reception status of the voice input to the microphone in the question reception unit, the guidance unit, instead of outputting the guidance voice, outputs from the speaker a voice obtained by the translation unit translating the voice accepted by the question reception unit into the predetermined language.

The voice guidance system according to claim 1, wherein the guidance unit selects a user for whom the user recognized by the voice recognition unit matches the user recognized by the image recognition unit, and conducts a conversation with the selected user using the microphone and the speaker.

The voice guidance system according to claim 1, wherein the question reception unit treats a user for whom pre-registered voice features or image features are detected as a non-guidance target person, and treats a user for whom the pre-registered voice features or image features are not detected as a guidance target person, and
the guidance unit answers only the questions of the guidance target person.

The voice guidance system according to claim 3, wherein the image recognition unit performs processing to detect a reaction of the guidance target person captured by the camera,
the voice guidance system further comprises a conversation failure determination unit that detects, based on the number of utterances of the non-guidance target person and the reaction of the guidance target person, that an utterance of the non-guidance target person is not the utterance expected by the guidance target person, and
an answer availability determination unit that determines whether the question accepted by voice by the question reception unit can be answered, and
when the conversation failure determination unit determines that the conversation has failed and the answer availability determination unit determines that an answer is possible, the guidance unit outputs an answer to the question of the guidance target person.

A voice guidance method for providing guidance by voice by outputting, from a speaker, a guidance voice based on voice input to a microphone, the method comprising:
a question reception process in which a question reception unit accepts a question by voice input to the microphone;
a voice recognition process in which a voice recognition unit calculates features of the voice of the user corresponding to the question accepted in the question reception process and recognizes the user based on the calculated voice features;
an image recognition process in which, when voice is detected in the question reception process, an image recognition unit calculates features of an image of the user captured by a camera and recognizes the user based on the calculated image features;
a guidance process in which a guidance unit selects a user to be guided using the voice features calculated in the voice recognition process and the image features calculated in the image recognition process, and outputs, from the speaker, a guidance voice for the selected user; and
a translation process in which the voice output from the speaker is translated into a predetermined language,
wherein, based on the reception status of the voice input to the microphone in the question reception process, in the guidance process, instead of outputting the guidance voice, a voice obtained by translating, in the translation process, the voice accepted in the question reception process into the predetermined language is output from the speaker.