JP2009223170A

JP2009223170A - Speech recognition system

Info

Publication number: JP2009223170A
Application number: JP2008069605A
Authority: JP
Inventors: Kazuhiko Shinosawa; 一彦篠沢; Keiko Miyashita; 敬宏宮下; Noriaki Mitsunaga; 法明光永; Masahiro Shiomi; 昌裕塩見; Takaaki Akimoto; 高明秋本; Norihiro Hagita; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2008-03-18
Filing date: 2008-03-18
Publication date: 2009-10-01

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of speech recognition while reducing the time required for speech recognition processing. SOLUTION: A speech recognition system 10 includes a server 20, which identifies an article 24 indicated by a person 16 by performing speech recognition on speech uttered by the person 16. The server 20 detects the position of the person 16, extracts from a speech recognition dictionary 126 the information needed to recognize words relating to the articles 24 located around the person 16, and creates a speech recognition local dictionary 205B. The speech uttered by the person 16 is then recognized using the local dictionary 205B. As a result, the time required for speech recognition processing is reduced and the accuracy of speech recognition is improved. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech recognition system, and more particularly to a speech recognition system that recognizes speech uttered by a person and identifies an article designated by that person.

No patent document disclosing an example of this type of speech recognition system has been found. However, speech recognition is used in many devices as an interface between people and machines; one device that employs speech recognition is the sorting facility disclosed in Patent Document 1.

According to Patent Document 1, this sorting facility picks products, inspects the picked products, and sorts the containers into which the products have been picked. The various spoken instructions given by picking workers and sorting workers are recognized by speech recognition and passed to the sorting facility.
Patent Document 1: JP 2006-36413 A [B65G 1/137]

However, speech recognition generally relies on a phoneme-based speech recognition dictionary, and in background art such as this sorting facility the vocabulary of that dictionary grows in order to handle the wide variety of things people say (see line 5 of paragraph 0019 of Patent Document 1). A larger vocabulary means that more phoneme symbol strings are registered in the dictionary and must be matched against the phoneme symbol string generated by speech recognition, so speech recognition processing takes longer. In addition, as the number of registered phoneme symbol strings grows, so does the number of candidate strings competing in the match, and the rate of correct recognition falls.
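The cost argument above can be made concrete with a small sketch. It assumes, purely for illustration, that "matching" means computing a Levenshtein (edit) distance between the recognized phoneme symbol string and every registered string; the patent does not specify the matching score, and the romanized phoneme entries below are hypothetical. Each registered string costs one comparison, so the work grows linearly with vocabulary size, and every extra entry is one more candidate that could outscore the correct word.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete pa
                           cur[j - 1] + 1,             # insert pb
                           prev[j - 1] + (pa != pb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def match(phonemes, dictionary):
    """Match a recognized phoneme string against every registered entry.

    Returns (best_word, comparisons); comparisons equals the number of
    registered entries, i.e. the work grows with vocabulary size."""
    best_word, best_dist, comparisons = None, None, 0
    for word, entry in dictionary.items():
        comparisons += 1
        d = edit_distance(phonemes, entry)
        if best_dist is None or d < best_dist:
            best_word, best_dist = word, d
    return best_word, comparisons


# Hypothetical dictionary: word -> phoneme symbol string.
full_dictionary = {
    "hon":   ("h", "o", "N"),
    "hako":  ("h", "a", "k", "o"),
    "kagi":  ("k", "a", "g", "i"),
    "kasa":  ("k", "a", "s", "a"),
    "kaban": ("k", "a", "b", "a", "N"),
}
small_dictionary = {w: full_dictionary[w] for w in ("hon", "kaban")}

print(match(("k", "a", "b", "a", "N"), full_dictionary))   # ('kaban', 5)
print(match(("k", "a", "b", "a", "N"), small_dictionary))  # ('kaban', 2)
```

Both calls find the same word, but the smaller dictionary needs fewer comparisons and offers fewer competing candidates.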

Therefore, the main object of the present invention is to provide a novel speech recognition system.

Another object of the present invention is to provide a speech recognition system that can reduce the time required for speech recognition processing and increase the rate of correct recognition.

The present invention adopts the following configurations to solve the above problems. Reference numerals in parentheses, supplementary explanations, and the like indicate correspondence with the embodiments described later in order to aid understanding of the invention, and do not limit the invention in any way.

A first invention is a speech recognition system that recognizes speech uttered by a person and identifies an article designated by the person, comprising: position detection means for detecting the position of the person; nearby article identification means for identifying articles near the person on the basis of the position detected by the position detection means; dictionary construction means for constructing a speech recognition dictionary relating to the articles identified by the nearby article identification means; and designated article identification means for performing speech recognition using the speech recognition dictionary constructed by the dictionary construction means, thereby identifying the article designated by the person.

In the first invention, a speech recognition system (10) recognizes speech uttered by a person and identifies the article the person designates. Position detection means (124, 200, 208) detects the position of the person who designates an article by speaking. Nearby article identification means (122, 200) identifies the articles near the person on the basis of the position detected by the position detection means. Dictionary construction means (200, 122, 205A) then constructs a speech recognition dictionary (205B) relating to the articles identified by the nearby article identification means. Finally, designated article identification means (200) recognizes the speech uttered by the person using the dictionary constructed by the dictionary construction means and identifies the article the person designates.
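The nearby article identification step can be sketched as a radius query. This is a minimal sketch that assumes the detected person position and the registered article positions are 2-D world coordinates and that "nearby" means "within a fixed radius"; the patent does not state the criterion, and the article IDs, positions, and radius below are hypothetical.

```python
import math


def nearby_articles(person_xy, article_positions, radius):
    """Return the sorted IDs of articles whose registered position lies
    within `radius` of the detected person position."""
    px, py = person_xy
    return sorted(
        article_id
        for article_id, (ax, ay) in article_positions.items()
        if math.hypot(ax - px, ay - py) <= radius
    )


# Hypothetical article positions (world coordinates, metres).
ARTICLE_POSITIONS = {
    "book-1": (0.5, 0.2),
    "book-2": (4.0, 1.0),
    "book-3": (1.0, 1.0),
}

print(nearby_articles((0.0, 0.0), ARTICLE_POSITIONS, radius=1.5))
# ['book-1', 'book-3']
```

The returned IDs are what the dictionary construction means would use to select entries for the local dictionary.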

According to the first invention, speech recognition is performed with a speech recognition dictionary that stores information only on articles near the person, so the time required for speech recognition is reduced and the rate of correct recognition is increased.

A second invention depends on the first invention, wherein the dictionary construction means constructs the first speech recognition dictionary on the basis of a second speech recognition dictionary covering all articles handled by the speech recognition system.

In the second invention, the dictionary construction means (200, 122, 205A) constructs the first speech recognition dictionary on the basis of the second speech recognition dictionary (126), which covers all articles handled by the speech recognition system.

According to the second invention, because the first speech recognition dictionary is constructed from the second speech recognition dictionary, it can be constructed quickly and easily.
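Because the first dictionary is derived from the second, building it amounts to copying a subset of existing entries rather than generating new phoneme strings. The sketch below assumes the second (master) dictionary maps each article ID to a list of (word, phoneme string) pairs; that layout and the sample entries are hypothetical.

```python
def build_local_dictionary(master, nearby_ids):
    """Build the first (local) dictionary from the second (master) one.

    Only entries for nearby articles are kept; no phoneme strings are
    regenerated, which is what keeps construction quick and simple."""
    wanted = set(nearby_ids)
    return {
        article_id: list(entries)  # shallow copy of the (word, phonemes) pairs
        for article_id, entries in master.items()
        if article_id in wanted
    }


# Hypothetical master dictionary: article ID -> [(word, phoneme string), ...]
MASTER = {
    "book-1": [("kagaku", ("k", "a", "g", "a", "k", "u"))],
    "book-2": [("rekishi", ("r", "e", "k", "i", "sh", "i"))],
    "book-3": [("kagami", ("k", "a", "g", "a", "m", "i"))],
}

local = build_local_dictionary(MASTER, ["book-1", "book-3"])
print(sorted(local))  # ['book-1', 'book-3']
```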

A third invention is a speech recognition system that recognizes speech uttered by a person and identifies an article designated by the person, comprising: position information storage means for storing, for each article handled by the speech recognition system, position information indicating where the article is located; speech recognition information storage means for storing, for each article handled by the speech recognition system, speech recognition information including words related to the article and the phoneme symbol strings of those words; word identification means for identifying a word related to an article by referring to the speech recognition information stored in the speech recognition information storage means on the basis of a phoneme symbol string obtained by speech recognition processing; article identification means for identifying the article designated by the person on the basis of the word identified by the word identification means; position detection means for detecting the position of the person; and nearby article identification means for identifying articles near the person on the basis of the position detected by the position detection means and the position information stored in the position information storage means, wherein the word identification means identifies the word by referring only to the speech recognition information containing words related to the articles identified by the nearby article identification means.

In the third invention, a speech recognition system (10) recognizes speech uttered by a person and identifies the article the person designates. Position information storage means (122, 205A) stores, for each article handled by the speech recognition system, position information indicating where the article is located. Speech recognition information storage means (126) stores, for each article handled by the speech recognition system, speech recognition information including words related to the article and the phoneme symbol strings of those words. Word identification means (200) identifies a word related to an article by referring to the speech recognition information stored in the speech recognition information storage means on the basis of the phoneme symbol string obtained by speech recognition processing. Article identification means (200, 205A) identifies the article designated by the person on the basis of the word identified by the word identification means. Position detection means (124, 200, 208) detects the position of the person, and nearby article identification means (200) identifies the articles near the person on the basis of the detected position and the position information stored in the position information storage means. The word identification means then identifies the word by referring only to the speech recognition information containing words related to the articles identified by the nearby article identification means.
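The third invention's flow (word identification restricted to nearby articles, followed by article identification from the winning word) might look like the following. This sketch assumes each speech recognition information record is an (article ID, word, phoneme string) triple and, as a stand-in for whatever matching the embodiment actually uses, scores candidates with Python's `difflib.SequenceMatcher.ratio()`; the records are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical speech recognition information: (article ID, word, phonemes).
SPEECH_INFO = [
    ("book-1", "kagaku",  ("k", "a", "g", "a", "k", "u")),
    ("book-2", "rekishi", ("r", "e", "k", "i", "sh", "i")),
    ("book-3", "kagami",  ("k", "a", "g", "a", "m", "i")),
]


def identify_article(phonemes, speech_info, nearby_ids):
    """Word identification over nearby articles only, then article
    identification from the identified word.

    Returns (article_id, word), or None if no nearby article has
    speech recognition information."""
    candidates = [rec for rec in speech_info if rec[0] in nearby_ids]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda rec: SequenceMatcher(None, phonemes, rec[2]).ratio())
    article_id, word, _ = best
    return article_id, word


print(identify_article(("k", "a", "g", "a", "m", "i"), SPEECH_INFO,
                       {"book-1", "book-3"}))
# ('book-3', 'kagami')
```

Records for articles outside the nearby set (here "book-2") are never scored at all, which is where the time saving comes from.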

According to the third invention, words related to articles are identified by speech recognition that refers only to the speech recognition information containing words related to articles near the person, so the time required for speech recognition processing is reduced and the rate of correct recognition is increased.

A fourth invention depends on the third invention, wherein the word identification means refers to the speech recognition information stored in the speech recognition information storage means by referring to sub speech recognition information storage means, which holds only the speech recognition information, extracted from the speech recognition information storage means, that contains words related to the articles identified by the nearby article identification means.

In the fourth invention, the word identification means (200) refers to the speech recognition information stored in the speech recognition information storage means by referring to the sub speech recognition information storage means, which holds only the speech recognition information, extracted from the speech recognition information storage means, that contains words related to the articles identified by the nearby article identification means.

According to the fourth invention, speech recognition is performed using the sub speech recognition information storage means, which stores only the speech recognition information containing words related to articles near the person, so the time required for speech recognition processing is reduced and the rate of correct recognition is increased.

According to this invention, the time required for speech recognition processing can be reduced.

The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments with reference to the drawings.

Referring to FIG. 1, the communication robot system (hereinafter sometimes simply called the "system") 10 of this embodiment includes a communication robot (hereinafter sometimes simply called the "robot") 12. The robot 12 can access a network 14 such as a wireless LAN. In cooperation with a server 20, the robot 12 identifies an article that a person 16 designates by voice, gaze, and pointing, and performs an action such as bringing the article to the person 16.

The person 16 wears a wireless tag 18 identifying who the person is and, although not shown, has motion-capture markers attached. The markers are typically placed on the top of the head, both shoulders, both elbows, and the tips of both index fingers, and they are captured, together with the whole of the person 16, by cameras 120 controlled by the server 20. In this embodiment three cameras 120 are provided; they image the person 16 from three directions and supply the video to the server 20.

The server 20 is connected to the network 14 such as a wireless LAN. Based on the camera video input as described above, it performs motion capture processing that detects the movement of the markers, and it can also locate the face of the person 16, for example by detecting skin-color regions.

In the system 10, as described above, the robot 12 identifies the article designated by the person 16 as the target object. In this embodiment, books 24 are used as an example of articles that can be targets. Each book 24 carries a wireless tag 18 indicating which book it is. The books 24 are stored on a bookshelf 26.

However, the articles that can be targets are not limited to the books of this embodiment; in a household system, any article in the home could be a target. Naturally, the system can be used not only at home but also in any place where it works alongside people (a company, office, factory, etc.).

All articles (books 24) handled by the system 10 are registered in an article dictionary 122 attached to the server 20. The article dictionary 122 is described later.

The server 20 determines the position of each person 16 handled by the system 10 as follows: the wireless tag 18 worn by the person 16 is read by a wireless tag reader 208 (see FIG. 4) through one of a plurality of antennas 124. That is, a rough position of the person 16 is obtained from which antenna 124 received the radio signal from the wireless tag 18 worn by the person 16.
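Antenna-based position detection can be sketched as follows: each antenna 124 has a known location, and a tag's position is approximated by the location of the antenna that read it. The patent only says the position follows from which antenna received the signal; choosing the antenna with the strongest received signal strength (RSSI) when several antennas read the same tag is an added assumption, as are the antenna IDs, coordinates, and RSSI values.

```python
# Hypothetical antenna locations in world coordinates (metres).
ANTENNA_POSITIONS = {
    "ant-1": (0.0, 0.0),
    "ant-2": (3.0, 0.0),
    "ant-3": (3.0, 4.0),
}


def locate_tags(reads, antenna_positions):
    """Approximate each tag's position by the location of the antenna
    that read it most strongly.

    `reads` is an iterable of (tag_id, antenna_id, rssi_dbm) tuples;
    a higher (less negative) RSSI is treated as a stronger read."""
    strongest = {}
    for tag_id, antenna_id, rssi in reads:
        if tag_id not in strongest or rssi > strongest[tag_id][1]:
            strongest[tag_id] = (antenna_id, rssi)
    return {tag: antenna_positions[ant] for tag, (ant, _) in strongest.items()}


reads = [
    ("person-A", "ant-1", -60),
    ("person-A", "ant-2", -45),
    ("person-B", "ant-3", -50),
]
print(locate_tags(reads, ANTENNA_POSITIONS))
# {'person-A': (3.0, 0.0), 'person-B': (3.0, 4.0)}
```

The resolution is bounded by the antenna spacing, which is why the position obtained this way is only rough.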

Although FIG. 1 shows one robot 12 for simplicity, two or more robots may be used. Likewise, there need not be only one person 16; since people can be distinguished by their wireless tags 18, there may be several.

In the embodiment shown in FIG. 1, the positions of the robot 12, the person 16, the articles 24, and so on are expressed in world coordinates of the space in which the system 10 is installed, whereas the robot 12 is controlled in robot coordinates. Although the details are not described, the robot 12 therefore converts between robot coordinates and world coordinates as needed in the processing described below.
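The patent leaves this conversion undescribed. The following is a minimal planar sketch assuming the robot's pose in world coordinates is known as (x, y, heading) and that both frames are 2-D and right-handed; the pose values in the example are hypothetical.

```python
import math


def world_to_robot(point_w, robot_pose):
    """Express a world-frame point (x, y) in the robot frame.

    robot_pose = (x, y, theta): robot position in world coordinates and
    heading theta in radians, measured from the world x-axis."""
    rx, ry, theta = robot_pose
    dx, dy = point_w[0] - rx, point_w[1] - ry
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy)


def robot_to_world(point_r, robot_pose):
    """Inverse of world_to_robot: robot-frame point -> world frame."""
    rx, ry, theta = robot_pose
    c, s = math.cos(theta), math.sin(theta)
    return (rx + c * point_r[0] - s * point_r[1],
            ry + s * point_r[0] + c * point_r[1])


pose = (1.0, 2.0, math.pi / 2)          # robot at (1, 2), facing world +y
p_r = world_to_robot((1.0, 3.0), pose)  # approximately (1.0, 0.0): 1 m straight ahead
p_w = robot_to_world(p_r, pose)         # approximately (1.0, 3.0) again
```

The two functions are exact inverses (up to floating-point rounding), so a target given in world coordinates can be turned into a robot-frame goal and back.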

The hardware configuration of the robot 12 is described with reference to FIG. 2, a front view showing the appearance of the robot 12 of this embodiment. The robot 12 includes a carriage 30, on the underside of which are provided two wheels 32 and one follower wheel 34 for moving the robot 12 autonomously. The two wheels 32 are driven independently by wheel motors 36 (see FIG. 3) and can move the carriage 30, that is, the robot 12, in any direction: forward, backward, left, or right. The follower wheel 34 is an auxiliary wheel that supports the wheels 32. The robot 12 can thus move autonomously within the space in which it is placed.
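Two independently driven wheels plus a follower wheel form a differential drive. The patent gives no kinematics, so the sketch below assumes the standard differential-drive model integrated with a simple Euler step; the wheel radius and track width are hypothetical values. Under this model, moving "left" or "right" means turning in place and then driving forward.

```python
import math


def step(pose, omega_left, omega_right, wheel_radius, track_width, dt):
    """Advance a differential-drive pose (x, y, theta) by one Euler step.

    omega_left / omega_right are wheel angular speeds (rad/s);
    wheel_radius and track_width are in metres, dt in seconds."""
    x, y, theta = pose
    v = wheel_radius * (omega_right + omega_left) / 2.0          # forward speed
    w = wheel_radius * (omega_right - omega_left) / track_width  # turn rate
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)


# Equal wheel speeds drive straight; opposite speeds turn on the spot.
print(step((0.0, 0.0, 0.0), 1.0, 1.0, wheel_radius=0.1, track_width=0.4, dt=1.0))
# (0.1, 0.0, 0.0)
print(step((0.0, 0.0, 0.0), -1.0, 1.0, wheel_radius=0.1, track_width=0.4, dt=1.0))
# (0.0, 0.0, 0.5)
```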

A cylindrical sensor mounting panel 38 is provided on the carriage 30, and a large number of infrared distance sensors 40 are mounted on it. These infrared distance sensors 40 measure the distance from the sensor mounting panel 38, that is, from the robot 12, to surrounding objects (people, obstacles, etc.).

In this embodiment infrared distance sensors are used as the distance sensors, but ultrasonic distance sensors, millimeter-wave radar, or the like can be used instead.

A torso 42 stands upright on the sensor mounting panel 38. A further infrared distance sensor 40 is provided at the upper front center of the torso 42 (the position corresponding to a person's chest) and mainly measures the distance to a person in front of the robot 12. The torso 42 also has a support column 44 extending from approximately the center of its upper side edge, and an omnidirectional camera 46 is mounted on the column 44. The omnidirectional camera 46 images the surroundings of the robot 12 and is distinct from the eye cameras 70 described later. A camera using a solid-state image sensor such as a CCD or CMOS sensor can be used as the omnidirectional camera 46. The positions of the infrared distance sensors 40 and the omnidirectional camera 46 are not limited to those described and can be changed as appropriate.

Upper arms 50R and 50L are attached to the upper ends of both sides of the torso 42 (positions corresponding to a person's shoulders) via shoulder joints 48R and 48L, respectively. Although not shown, each of the shoulder joints 48R and 48L has three degrees of freedom about mutually orthogonal axes. That is, the shoulder joint 48R can control the angle of the upper arm 50R about each of three orthogonal axes: one axis (the yaw axis) is parallel to the longitudinal direction (axis) of the upper arm 50R, and the other two (the pitch and roll axes) are orthogonal to it from different directions. Likewise, the shoulder joint 48L can control the angle of the upper arm 50L about three orthogonal axes: its yaw axis is parallel to the longitudinal direction (axis) of the upper arm 50L, and its pitch and roll axes are orthogonal to it from different directions.

Elbow joints 52R and 52L are provided at the distal ends of the upper arms 50R and 50L, respectively. Although not shown, each elbow joint has one degree of freedom and can control the angle of the forearm 54R or 54L about that axis (the pitch axis).

Hands 56R and 56L, corresponding to human hands, are provided at the distal ends of the forearms 54R and 54L, respectively. Although not shown in detail, the hands 56R and 56L can open and close, so the robot 12 can grip or pinch objects with them. The shape of the hands 56R and 56L is not limited to that of the embodiment; they may be given a shape and function closely resembling a human hand.

Although not shown, contact sensors 58 (shown collectively in FIG. 3) are provided on the front of the carriage 30, on the parts corresponding to the shoulders including the shoulder joints 48R and 48L, on the upper arms 50R and 50L, on the forearms 54R and 54L, and on the hands 56R and 56L. The contact sensor 58 on the front of the carriage 30 detects contact of a person or other obstacle with the carriage 30, so if the robot 12 touches an obstacle while moving, it detects the contact and can immediately stop driving the wheels 32, bringing the robot 12 to an abrupt stop. The other contact sensors 58 detect whether the respective parts have been touched. The positions of the contact sensors 58 are not limited to these parts; they may be placed at other suitable positions (those corresponding to a person's chest, abdomen, sides, back, and waist).

A neck joint 60 is provided at the upper center of the torso 42 (the position corresponding to a person's neck), and a head 62 is mounted on top of it. Although not shown, the neck joint 60 has three degrees of freedom and its angle can be controlled about each of the three axes. One axis (the yaw axis) points straight up (vertically) from the robot 12, and the other two (the pitch and roll axes) are orthogonal to it in different directions.

The head 62 has a speaker 64 at the position corresponding to a person's mouth. The robot 12 uses the speaker 64 to communicate by voice or sound with people around it. Microphones 66R and 66L are provided at the positions corresponding to a person's ears; hereinafter the right microphone 66R and the left microphone 66L may be referred to collectively as microphones 66. The microphones 66 pick up ambient sound, in particular the voice of the person the robot is communicating with. Eyeball parts 68R and 68L are provided at the positions corresponding to a person's eyes and contain eye cameras 70R and 70L, respectively. Hereinafter the right and left eyeball parts 68R and 68L may be referred to collectively as eyeball parts 68, and the right and left eye cameras 70R and 70L collectively as eye cameras 70.

The eye cameras 70 capture the face and other parts of a person approaching the robot 12, as well as other objects, and output the corresponding video signals. In this embodiment, the robot 12 detects the gaze direction (vector) of each of the left and right eyes of the person 16 from the video signals of the eye cameras 70. Specific gaze detection methods are disclosed in JP 2004-255074 A (using two cameras) and in JP 2006-172209 A and JP 2006-285531 A (using one camera); since the details are not important here, these publications are merely cited.

Alternatively, a well-known eye-mark recorder or the like may be used to detect the gaze vector of the person 16.

The eye cameras 70 can be the same type of camera as the omnidirectional camera 46 described above. For example, each eye camera 70 is fixed inside an eyeball part 68, and the eyeball part 68 is attached at a predetermined position in the head 62 via an eyeball support (not shown). Although not shown, the eyeball support has two degrees of freedom, and its angle can be controlled about each of these axes. For example, one of the two axes (the yaw axis) points toward the top of the head 62, and the other (the pitch axis) is orthogonal both to the first axis and to the direction in which the front (face) of the head 62 faces. Rotating the eyeball support about these two axes displaces the tip (front) of the eyeball part 68, and hence of the eye camera 70, moving the camera axis, that is, the gaze direction. The positions of the speaker 64, the microphones 66, and the eye cameras 70 described above are not limited to those parts and may be chosen as appropriate.

As described above, the robot 12 of this embodiment has a total of 17 degrees of freedom: independent two-axis drive of the wheels 32, three degrees of freedom for each shoulder joint 48 (6 for left and right), one for each elbow joint 52 (2 for left and right), three for the neck joint 60, and two for each eyeball support (4 for left and right).
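The total follows directly from summing the per-part counts listed above:

```latex
\[
\underbrace{2}_{\text{wheels 32}}
+ \underbrace{2 \times 3}_{\text{shoulders 48}}
+ \underbrace{2 \times 1}_{\text{elbows 52}}
+ \underbrace{3}_{\text{neck 60}}
+ \underbrace{2 \times 2}_{\text{eyeball supports}}
= 2 + 6 + 2 + 3 + 4 = 17
\]
```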

FIG. 3 is a block diagram showing the electrical configuration of the robot 12. Referring to FIG. 3, the robot 12 includes a CPU 80. The CPU 80, also called a microcomputer or processor, is connected via a bus 82 to a memory 84, a motor control board 86, a sensor input/output board 88, and an audio input/output board 90.

Although not shown, the memory 84 includes a ROM, an HDD, and a RAM. The ROM and HDD store in advance the control programs for the robot 12, for example a detection program for detecting the output of each sensor (sensor information) and a communication program for exchanging necessary data and commands with external computers. The RAM is used as a work memory or buffer memory.

Furthermore, in this embodiment the robot 12 is configured to speak and make gestures in order to communicate with the person 16, and an utterance/gesture dictionary 85A for these utterances and gestures is set in the memory 84.

The motor control board 86 is composed of, for example, a DSP, and controls the driving of the motors for each axis of the arms, the neck joint, the eyeball units, and so on. Specifically, the motor control board 86 receives control data from the CPU 80 and controls the rotation angles of the two motors that set the angles of the two axes of the right eyeball unit 68R (collectively shown as “right eyeball motor 92” in FIG. 3). Likewise, it receives control data from the CPU 80 and controls the rotation angles of the two motors that set the angles of the two axes of the left eyeball unit 68L (collectively shown as “left eyeball motor 94” in FIG. 3).

The motor control board 86 also receives control data from the CPU 80 and controls the rotation angles of a total of four motors (collectively shown as “right arm motor 96” in FIG. 3): three motors that set the angles of the three orthogonal axes of the shoulder joint 48R and one motor that sets the angle of the elbow joint 52R. Likewise, it receives control data from the CPU 80 and controls the rotation angles of a total of four motors (collectively shown as “left arm motor 98” in FIG. 3): three motors that set the angles of the three orthogonal axes of the shoulder joint 48L and one motor that sets the angle of the elbow joint 52L.

Further, the motor control board 86 receives control data from the CPU 80 and controls the rotation angles of the three motors that set the angles of the three orthogonal axes of the neck joint 60 (collectively shown as “head motor 100” in FIG. 3). It also receives control data from the CPU 80 and controls the rotation angles of the two motors that drive the wheels 32 (collectively shown as “wheel motor 36” in FIG. 3).

A hand actuator 108 is also coupled to the motor control board 86, which receives control data from the CPU 80 and controls the opening and closing of the hands 56R and 56L.

In this embodiment, all motors other than the wheel motor 36 are stepping motors (that is, pulse motors), to simplify control. However, DC motors may be used instead, as with the wheel motor 36. Moreover, the actuators that drive the body parts of the robot 12 are not limited to electrically powered motors and may be changed as appropriate; for example, air actuators may be used in another embodiment.

Like the motor control board 86, the sensor input/output board 88 is composed of a DSP; it takes in the signals from each sensor and passes them to the CPU 80. Specifically, data on the reflection time from each infrared distance sensor 40 is input to the CPU 80 through the sensor input/output board 88. The video signal from the omnidirectional camera 46 is input to the CPU 80 after the sensor input/output board 88 applies predetermined processing as needed, and the video signal from the eye camera 70 is input to the CPU 80 in the same way. Signals from the plurality of contact sensors 58 described above (collectively shown as “contact sensor 58” in FIG. 3) are also given to the CPU 80 via the sensor input/output board 88. The audio input/output board 90 is likewise composed of a DSP: speech generated according to speech synthesis data given by the CPU 80 is output from the speaker 64, and audio input from the microphone 66 is given to the CPU 80 via the audio input/output board 90.

The CPU 80 is also connected to a communication LAN board 102 via the bus 82. The communication LAN board 102 is composed of, for example, a DSP; it passes transmission data from the CPU 80 to the wireless communication device 104, which transmits the data to the server 20 via the network 14. The communication LAN board 102 also receives data via the wireless communication device 104 and passes it to the CPU 80. The transmission data may be, for example, a signal (command) from the robot 12 to the server 20, or operation history information (history data) about communications the robot 12 has carried out. History data is transmitted in addition to commands in order to reduce the required capacity of the memory 84 and to lower power consumption. In this embodiment, history data is transmitted to the server 20 each time a communication is performed, but it may instead be transmitted in units of a fixed time period or a fixed amount of data.
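The three transmission policies (every communication, a fixed amount, or a fixed time) can be sketched as a small buffer. This is a hypothetical illustration, not the robot's actual firmware: the class name, the `send` callback, and the parameters `max_records` and `max_age_s` are assumptions, and the time limit is only checked when a record arrives (a real system would likely also use a timer).

```python
import time
from typing import Callable, Optional


class HistoryUploader:
    """Buffers history records and flushes them via a send callback.

    max_records=1 reproduces the per-communication policy of the embodiment;
    larger values or a max_age_s give the fixed-amount / fixed-time variants.
    """

    def __init__(self, send: Callable[[list], None], max_records: int = 1,
                 max_age_s: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.send = send
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer: list = []
        self.first_ts: Optional[float] = None

    def add(self, record) -> None:
        if not self.buffer:
            self.first_ts = self.clock()  # age is measured from the oldest buffered record
        self.buffer.append(record)
        too_many = len(self.buffer) >= self.max_records
        too_old = (self.max_age_s is not None
                   and self.clock() - self.first_ts >= self.max_age_s)
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []  # new list, so batches already sent are not mutated
            self.first_ts = None


sent = []
uploader = HistoryUploader(sent.append, max_records=3)
for i in range(7):
    uploader.add({"communication": i})
uploader.flush()
print([len(batch) for batch in sent])  # [3, 3, 1]
```

Injecting the clock keeps the time-based policy testable without waiting in real time.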

Further, a wireless tag reader 106 is connected to the CPU 80 via the bus 82. Through an antenna (not shown), the wireless tag reader 106 receives radio waves carrying identification information transmitted from a wireless tag 18 (RFID tag). The wireless tag reader 106 amplifies the received radio signal, separates the identification signal from it, demodulates (decodes) the identification information, and passes it to the CPU 80. As shown in FIG. 1, the wireless tag 18 is worn by a person 16 who is, for example, at the reception desk of a company or in the living room of an ordinary home where the robot 12 is placed, and the wireless tag reader 106 detects wireless tags 18 within its communication range. The wireless tag 18 may be an active tag, or a passive tag powered by the radio waves transmitted from the wireless tag reader 106.

The hardware configuration of the server 20 will be described with reference to FIG. 4. As shown in FIG. 4, the server 20 includes a CPU 200. The CPU 200, also called a processor, is connected via a bus 202 to a memory 204, a camera control board 206, a wireless tag reader 208, a LAN control board 210, an input device control board 212, and a monitor control board 214.

The CPU 200 is responsible for overall control of the server 20. The memory 204 collectively represents ROM, RAM, HDD, and the like; it stores the programs that operate the server 20 and serves as a work area while the CPU 200 runs. The camera control board 206 controls the camera 120 connected to it.

Via the antenna 124 connected to it, the wireless tag reader 208 receives radio waves carrying identification information transmitted from the wireless tags 18 attached to persons 16 and articles (books) 24. The wireless tag reader 208 amplifies the received radio signal, separates the identification signal from it, demodulates (decodes) the identification information, and passes it to the CPU 200. Antennas 124 are placed throughout the company reception area, each room of the home, or other space where the robot 12 is deployed, so that radio waves can be received from the wireless tags 18 of every article (book) 24 and person 16 handled by the system 10. There are therefore multiple antennas 124, although FIGS. 1 and 4 show them collectively as one.

The LAN control board 210 controls the wireless communication device 216 connected to it, allowing the server 20 to access the external network 14 wirelessly. The input device control board 212 controls input from input devices connected to it, such as a keyboard or mouse. The monitor control board 214 controls output to a monitor connected to it.

The server 20 is also connected, through an interface (not shown), to an article dictionary 122 and a speech recognition dictionary 126 (see FIG. 1) attached to the server 20.

The memory 204 holds an article local dictionary 205A, a speech recognition local dictionary 205B, an utterance pattern dictionary 205C, an utterance dictionary 205D, a personal correct/incorrect DB 205E, and a speech recognition rate DB 205F.

The article local dictionary 205A holds content extracted from the article dictionary 122 (described later). When the robot 12 recognizes a person 16, the server 20 extracts from the article dictionary 122 only the information on the articles (books) 24 located near that person 16 and registers it in the article local dictionary 205A. The speech recognition local dictionary 205B holds content extracted from the speech recognition dictionary 126 (described later). Once the robot 12 has recognized the person 16 and the article local dictionary 205A has been created, the server 20 extracts from the speech recognition dictionary 126 the information needed to recognize the words registered in the article local dictionary 205A and registers it in the speech recognition local dictionary 205B. The article local dictionary 205A and the speech recognition local dictionary 205B are therefore rewritten dynamically as the position of the person 16 changes. Building the smaller speech recognition local dictionary 205B from the speech recognition dictionary 126 reduces the number of candidate words (phoneme symbol strings) the recognizer must consider, which shortens speech recognition processing time and raises the proportion of utterances recognized correctly.
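A minimal sketch of the extraction step, assuming the speech recognition dictionary is a list of entries with the three fields described later (text word, phoneme symbol string, article-name flag) and that "information needed" means the entries whose text word appears in the article local dictionary. The class, function, and sample entries (romanized phoneme strings, the title 料理の本) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionEntry:
    text: str       # word in text form
    phonemes: str   # phoneme symbol string used for matching
    is_name: bool   # True if the word is an article name (a book title)


def build_recognition_local_dict(global_dict: list[RecognitionEntry],
                                 local_words: list[str]) -> list[RecognitionEntry]:
    """Keep only the entries needed to recognize words in the article local dictionary."""
    wanted = set(local_words)
    return [entry for entry in global_dict if entry.text in wanted]


# Illustrative global dictionary; the phoneme strings are simple romanizations.
GLOBAL_DICT = [
    RecognitionEntry("化学の謎", "kagakunonazo", True),
    RecognitionEntry("白い", "shiroi", False),
    RecognitionEntry("赤い", "akai", False),
    RecognitionEntry("料理の本", "ryourinohon", True),
]

local = build_recognition_local_dict(GLOBAL_DICT, ["化学の謎", "白い"])
print([entry.text for entry in local])  # ['化学の謎', '白い']
```

The smaller result list is what the recognizer then searches, which is where the claimed speed and accuracy gains come from.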

The utterance pattern dictionary 205C is used to determine whether the content of an utterance by the person 16 matches a specific pattern. The utterance dictionary 205D stores the information the server 20 needs to decide what the robot 12 should say to the person 16. The personal correct/incorrect DB 205E records, for each ID of a person 16, whether the system 10 ultimately succeeded in identifying the article (book) 24 that the person 16 indicated. The speech recognition rate DB 205F stores, for each word registered in the speech recognition dictionary 126, its recognition rate, that is, the proportion of actual recognition attempts in which that word was recognized correctly.
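The recognition rate stored in the speech recognition rate DB 205F is a simple ratio of correct recognitions to attempts. The sketch below assumes each attempt is logged with a correct/incorrect flag; how correctness is determined is not specified here (the person's yes/no confirmation is one plausible source). The class and method names are hypothetical.

```python
class RecognitionRateDB:
    """Per-word recognition rate: correct recognitions divided by attempts."""

    def __init__(self) -> None:
        self.attempts: dict[str, int] = {}
        self.correct: dict[str, int] = {}

    def record(self, word: str, recognized_correctly: bool) -> None:
        self.attempts[word] = self.attempts.get(word, 0) + 1
        if recognized_correctly:
            self.correct[word] = self.correct.get(word, 0) + 1

    def rate(self, word: str) -> float:
        n = self.attempts.get(word, 0)
        # Words never attempted get 0.0 here; a real system might use a prior instead.
        return self.correct.get(word, 0) / n if n else 0.0


db = RecognitionRateDB()
for ok in (True, True, True, False):
    db.record("白い", ok)
print(db.rate("白い"))  # 0.75
```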

Next, the article dictionary 122 will be described with reference to FIG. 5. The article dictionary 122 shown in FIG. 5 assigns each article an ID such as a Ucode, and registers the necessary information for each article, such as its name, attributes, and position (coordinates), as character strings. Specifically, a Ucode is a 128-bit number, which can individually identify 340 trillion × 1 trillion × 1 trillion (about 3.4 × 10^38) articles. The IDs used in the article dictionary 122 need not be Ucodes, however; any suitable combination of numbers and symbols may be used.
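The size of the 128-bit ID space quoted above can be verified directly with Python's arbitrary-precision integers; the variable names are illustrative.

```python
UCODE_BITS = 128
ucode_space = 2 ** UCODE_BITS                  # number of distinct 128-bit IDs
trillion = 10 ** 12
stated = 340 * trillion * trillion * trillion  # "340 trillion x 1 trillion x 1 trillion"

print(ucode_space)                   # 340282366920938463463374607431768211456
print(f"{ucode_space / stated:.4f}")  # 1.0008
```

The stated figure is therefore a slight round-down of 2^128.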

The article dictionary 122 registers, by ID and character string, every article (for example, every household article) that the system 10 (the robot 12 and the server 20) must be able to identify; it is, so to speak, a global dictionary.

In the article dictionary 122, the information on one article (book) 24 is registered as one record. For example, in addition to the book's ID, each record stores fields such as “name”, “attribute”, “author”, and “publisher”. The “attribute” field stores words that supplement the description of the book, such as the color or appearance of its cover. The information in each field is stored as words in text form. However, entries in the “attribute” field enclosed in double quotation marks (") are stored as words in the phoneme symbol form used for speech recognition (phoneme symbol strings).

These phoneme symbol strings are generated while the server 20 performs speech recognition on the voice of the person 16 picked up by the microphone 66 of the robot 12. When a word could not be recognized because it was not registered in the speech recognition local dictionary 205B (speech recognition dictionary 126), but the book the person 16 was indicating could still be determined from the person's gaze or pointing, the phoneme symbol string corresponding to the unrecognized speech is stored as an “attribute” of that book's record.

The information appended in parentheses after a phoneme symbol string is the ID of the person 16, stored in association with that string. It indicates that the speech represented by the phoneme symbol string was uttered by the person 16 with that ID.
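One way to model such a record is sketched below, assuming attribute entries are either plain text words or phoneme strings tagged with the speaker's user ID. The exact storage layout of FIG. 5 is not reproduced here; the class names, field names, and the IDs `ucode-0001` and `user-0004` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Attribute:
    value: str                     # text word, or phoneme symbol string
    is_phonemes: bool = False      # True for the entries shown in double quotes
    user_id: Optional[str] = None  # speaker of a phoneme string; None for text words


@dataclass
class ArticleRecord:
    article_id: str                # e.g. a Ucode
    name: str
    author: str
    publisher: str
    position: tuple[float, float]  # stored location (not shown in FIG. 5)
    attributes: list[Attribute] = field(default_factory=list)

    def phoneme_attributes(self) -> list[tuple[str, Optional[str]]]:
        """Return (phoneme string, user ID) pairs registered for this article."""
        return [(a.value, a.user_id) for a in self.attributes if a.is_phonemes]


book = ArticleRecord(
    article_id="ucode-0001",       # hypothetical ID
    name="化学の謎", author="author-1", publisher="publisher-1",
    position=(3.0, 4.0),
    attributes=[Attribute("白い"), Attribute("bakenazo", True, "user-0004")],
)
print(book.phoneme_attributes())  # [('bakenazo', 'user-0004')]
```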

As mentioned above, the article dictionary 122 also stores where each article (book) 24 is located: each record of the article dictionary 122 includes a field, not shown in the figure, that stores the location of the article (book) 24.

Next, the speech recognition dictionary 126 will be described with reference to FIG. 10. A speech recognition dictionary generally consists of a word dictionary and a grammar dictionary; the speech recognition dictionary 126 is a word dictionary, and the grammar dictionary is not described here. As shown in FIG. 10, the speech recognition dictionary 126 consists of records, each with a field storing a word in text form, a field storing the corresponding word in phoneme symbol form (a phoneme symbol string), and a field indicating whether the text-form word is the name of an article, for example the title (name) of a book. This last field stores “1” if the text-form field holds an article name and “0” otherwise.

In speech recognition, the input speech is decomposed into phonemes, and a symbol representing each phoneme is generated. The sequence of these symbols corresponding to a spoken word is what the speech recognition dictionary 126 stores as a phoneme symbol string. The recognizer finds, in the speech recognition dictionary 126 (speech recognition local dictionary 205B), the phoneme symbol string closest to the one generated from the speech, and outputs the word stored with that string as the recognition result.
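The "closest phoneme symbol string" lookup can be illustrated with Levenshtein (edit) distance. This is a stand-in chosen for the sketch: the text does not say which distance the recognizer uses, and practical recognizers score hypotheses with acoustic models rather than string distance. One character is treated as one phoneme symbol, and the romanized dictionary entries are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two phoneme symbol strings (one char = one symbol)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def recognize(phonemes: str, dictionary: dict[str, str]) -> str:
    """Return the text word whose stored phoneme string is closest to the input.

    dictionary maps phoneme symbol string -> text word.
    """
    best = min(dictionary, key=lambda p: edit_distance(phonemes, p))
    return dictionary[best]


LOCAL_DICT = {"shiroi": "白い", "akai": "赤い", "kagakunonazo": "化学の謎"}
print(recognize("shiroy", LOCAL_DICT))  # 白い
```

With a local dictionary, `min` scans only the few entries relevant to nearby articles, which mirrors the speed argument made for dictionary 205B.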

As described above, in the system 10, when the person 16 indicates an article (book) 24 by voice, gaze, and pointing, the robot 12 and the server 20 cooperate to identify the indicated article (book) 24, and the robot 12 then, for example, carries it to the person 16. Hereinafter, this exchange between the person 16 and the system 10 may be referred to as communication.

In more detail, in the system 10, when the person 16 approaches the robot 12, the robot 12 recognizes the person 16 through the wireless tag 18. Attached to the server 20 are the article dictionary 122, in which every article (book) 24 handled by the system 10 is registered, and the speech recognition dictionary 126, in which the words used to identify articles (books) 24 by speech recognition are registered. When the robot 12 recognizes the person 16, it instructs the server 20 to create local versions of the article dictionary 122 and the speech recognition dictionary 126 (the article local dictionary 205A and the speech recognition local dictionary 205B).

On receiving the instruction to create the local dictionaries, the server 20 determines the position of the person 16 recognized by the robot 12 and creates the article local dictionary 205A by extracting from the article dictionary 122 only the records of articles (books) 24 within a predetermined range of that person 16, for example within a radius of 5 m. It then creates the speech recognition local dictionary 205B by extracting from the speech recognition dictionary 126 only the information needed to recognize the articles (books) 24 registered in the article local dictionary 205A.
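The position lookup and the 5 m radius filter might look like the following sketch. It assumes positions are 2-D floor coordinates in metres, that tag readings have already been resolved into a user-ID-to-position map, and that "within a radius" means Euclidean distance, boundary included. The function names, record keys, and IDs are hypothetical.

```python
import math


def person_position(tag_positions: dict[str, tuple[float, float]],
                    user_id: str) -> tuple[float, float]:
    """Position of the person whose wireless tag carries user_id."""
    return tag_positions[user_id]


def articles_near(article_dict: list[dict], center: tuple[float, float],
                  radius_m: float = 5.0) -> list[dict]:
    """Article records whose stored position lies within radius_m of center."""
    cx, cy = center
    return [rec for rec in article_dict
            if math.hypot(rec["position"][0] - cx, rec["position"][1] - cy) <= radius_m]


tags = {"user-0004": (1.0, 1.0), "user-0003": (20.0, 0.0)}
books = [
    {"name": "book A", "position": (3.0, 4.0)},   # about 3.6 m away
    {"name": "book B", "position": (10.0, 1.0)},  # 9 m away
    {"name": "book C", "position": (1.0, 6.0)},   # exactly 5 m away
]
local = articles_near(books, person_position(tags, "user-0004"))
print([rec["name"] for rec in local])  # ['book A', 'book C']
```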

The robot 12 then speaks to the recognized person 16, saying, for example, “Shall I bring you a book?” In response, the person 16 looks at and points to the article (book) 24 they want brought, and replies with something like “Bring me The Mystery of Chemistry (化学の謎).”

The robot 12 then recognizes the person 16's utterance “Bring me The Mystery of Chemistry,” estimates the person 16's line of sight, and estimates the direction in which the pointing finger points, thereby identifying the article (book) 24 the person 16 is indicating.

Once the article (book) 24 indicated by the person 16 has been identified, the server 20 decides what the robot 12 should say to confirm the identified article (book) 24 with the person 16, for example “Is it the white book?”, and the robot 12 says this while pointing to that article (book) 24 (the book 24 “The Mystery of Chemistry”).

At this point, the server 20 composes the utterance so that it includes a word that the server would like the person 16 to use to identify that article (book) 24 (the book “The Mystery of Chemistry”) the next time the person 16 asks the robot 12 for it.

For example, if the utterance confirming the identified article (book) 24 (“The Mystery of Chemistry”) is “Is it the white book?”, the system 10 expects that the next time the person 16 asks for “The Mystery of Chemistry,” they will point and say “Bring me that white book” instead of “Bring me The Mystery of Chemistry.” This expectation is based on the finding that people tend to imitate what a robot says. In this case, the word the person 16 is expected to use next time to identify the article (book) 24 is “white.” The server 20 selects “white” when no article (book) 24 near the person 16 other than “The Mystery of Chemistry” is white, and when the word “white” has a high speech recognition rate.
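The two selection conditions (the word singles out the target among nearby articles, and it is recognized reliably) can be sketched as follows. The numeric threshold `min_rate=0.8` is an assumption, since the text only says the rate must be "high", and preferring the highest-rate candidate is likewise a design choice of this sketch. Function and variable names are hypothetical.

```python
from typing import Optional


def choose_cue_word(target: str, nearby: dict[str, set[str]],
                    rates: dict[str, float], min_rate: float = 0.8) -> Optional[str]:
    """Pick a word for the robot's confirmation utterance about `target`.

    nearby maps each article near the person to its descriptive words.
    A candidate must (1) describe the target, (2) describe no other nearby
    article, and (3) have a recognition rate of at least min_rate.
    """
    others = set().union(*(words for name, words in nearby.items() if name != target))
    candidates = [w for w in sorted(nearby[target])
                  if w not in others and rates.get(w, 0.0) >= min_rate]
    # Prefer the most reliably recognized word; None if nothing qualifies.
    return max(candidates, key=lambda w: rates[w], default=None)


nearby = {"化学の謎": {"白い", "厚い"}, "料理の本": {"赤い", "厚い"}}
rates = {"白い": 0.92, "厚い": 0.95, "赤い": 0.90}
word = choose_cue_word("化学の謎", nearby, rates)
print(f"{word}本ですか？")  # 白い本ですか？
```

Here 厚い ("thick") has the higher rate but is rejected because another nearby book is also thick, so it would not single out the target.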

However, when the person 16 points and asks for the book, they may use a word of their own instead of “The Mystery of Chemistry,” for example “Bakenazo,” a contraction of the title 化学の謎 (化学 is colloquially read “bakegaku”), saying “Bring me Bakenazo.” In this case, if the word “Bakenazo” is not registered in the article dictionary 122 and the speech recognition dictionary 126, the system 10 can neither recognize the speech “Bakenazo” nor determine that “Bakenazo” refers to “The Mystery of Chemistry.”

Even in such a case, the system 10 can identify the article (book) 24 the person 16 is indicating from the person 16's gaze and pointing direction. Therefore, when an utterance contains a word that could not be recognized, such as “Bakenazo,” the system uses the result identified from gaze and pointing to register the unrecognized word in the article dictionary 122 and the speech recognition dictionary 126.
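A hypothetical sketch of this registration step, using plain dictionaries. It assumes the article has already been identified from gaze and pointing, and that the unrecognized utterance is available as a phoneme string. Because the text does not say what goes in the text-form field of the speech recognition dictionary for such a word, this sketch reuses the phoneme string there and marks it as an article name; both are assumptions.

```python
def register_unrecognized(phonemes: str, user_id: str, article_id: str,
                          article_dict: dict[str, dict],
                          recognition_dict: list[dict]) -> None:
    """Register a phoneme string that recognition could not match, after gaze and
    pointing have identified the article. Entries are tagged with the speaker's ID."""
    # Article dictionary 122: add a quoted (phoneme) attribute to the identified article.
    attrs = article_dict[article_id]["attributes"]
    if (phonemes, user_id) not in attrs:
        attrs.append((phonemes, user_id))
    # Speech recognition dictionary 126: make the string matchable next time.
    # No text form exists, so the phoneme string doubles as the text (an assumption),
    # and it is flagged as an article name because it names this book (an assumption).
    if not any(e["phonemes"] == phonemes and e["user_id"] == user_id
               for e in recognition_dict):
        recognition_dict.append({"text": phonemes, "phonemes": phonemes,
                                 "is_name": 1, "user_id": user_id})


articles = {"ucode-0001": {"name": "化学の謎", "attributes": [("白い", None)]}}
recog: list[dict] = []
register_unrecognized("bakenazo", "user-0004", "ucode-0001", articles, recog)
register_unrecognized("bakenazo", "user-0004", "ucode-0001", articles, recog)  # no duplicate
print(articles["ucode-0001"]["attributes"])  # [('白い', None), ('bakenazo', 'user-0004')]
print(len(recog))  # 1
```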

To confirm the article (book) 24 identified by the system 10, the robot 12 says, for example, “Is it the white book?”, and the person 16 answers the robot 12 with “Yes,” “No,” or the like. The server 20 performs speech recognition on the person 16's reply and determines whether the article (book) 24 identified by the system 10 is the one the person 16 indicated. If it is not, the system checks whether the next candidate article (book) 24 is the one the person 16 indicated. If it is, the robot 12 carries that article (book) 24 to the person 16.
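The confirm-then-next-candidate loop can be sketched as below. `ask_person` is a hypothetical stand-in for the robot's question plus speech recognition of the reply, returning True for "Yes". Logging the final outcome per user ID mirrors the role described for the personal correct/incorrect DB 205E, but exactly when that DB is updated is an assumption of this sketch.

```python
from typing import Callable, Iterable, Optional


def confirm_article(user_id: str, candidates: Iterable[str],
                    ask_person: Callable[[str], bool],
                    outcome_db: dict[str, list[bool]]) -> Optional[str]:
    """Ask about ranked candidates until the person says yes.

    ask_person(candidate) stands in for the robot's question (e.g. "Is it the
    white book?") plus speech recognition of the reply; it returns True for
    "Yes" and False for "No". The final outcome is logged per user ID.
    """
    for candidate in candidates:
        if ask_person(candidate):
            outcome_db.setdefault(user_id, []).append(True)
            return candidate          # the robot then carries this article
    outcome_db.setdefault(user_id, []).append(False)
    return None


db: dict[str, list[bool]] = {}
chosen = confirm_article("user-0004", ["book A", "book B", "book C"],
                         lambda c: c == "book B", db)
print(chosen, db)  # book B {'user-0004': [True]}
```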

Next, the operation of the robot 12 and the server 20 in the embodiment shown in FIG. 1 will be described with reference to the flowcharts in FIGS. 15 to 22.

In the first step S1 of FIG. 15, the CPU 80 of the robot 12 (FIG. 3) determines whether a person 16 (FIG. 1) has been recognized, based on sensor input from the sensor input/output board 88 (also shown in FIG. 3). Specifically, for example, it determines that the person 16 has been recognized when the infrared sensor 40 detects a human body and, at that time, the wireless tag reader 106 recognizes the wireless tag 18 worn by the person 16. At this point, the robot 12 reads the user ID of the person 16 (for example, (456…0000004)) from the wireless tag 18.

If the determination in step S1 is YES, then in the next step S3 the CPU 80 of the robot 12 sends the server 20, via the network 14, an instruction to create the local dictionaries, that is, the article local dictionary 205A and the speech recognition local dictionary 205B. Along with this instruction, it also sends the server 20 the user ID (456…0000004) of the person 16 determined in step S1 to have been recognized.

When the robot 12 sends the local dictionary creation instruction and the user ID, the CPU 200 of the server 20 determines in step S31 of FIG. 16 that the instruction has been received, and in step S33 creates an article local dictionary 205A and a speech recognition local dictionary 205B dedicated to the person 16 recognized by the robot 12, for use in communicating with that person.

The “create local dictionaries” process of step S33 is executed according to the procedure in the flowchart of FIG. 18. First, in step S81 of FIG. 18, the records of the articles (books) 24 near the person 16 identified by the user ID sent from the robot 12 are extracted from the article dictionary 122 to create the article local dictionary 205A.

At this time, the CPU 200 drives the wireless tag reader 208 to receive, via the antennas 124, radio waves from the wireless tags 18 worn by the persons 16, and determines the user ID and position of each person 16. From the user IDs and positions determined in this way for the persons 16, together with the user ID sent from the robot 12, it then determines the position of the person 16 recognized by the robot 12.

Having determined the position of the person 16 recognized by the robot 12, the CPU 200 refers to the field storing the position of the article (book) 24 in each record of the article dictionary 122, extracts from the article dictionary 122 shown in FIG. 5 all records of articles (books) 24 located near that person 16, and creates the article local dictionary 205A shown in FIG. 6.

こうして物品ローカル辞書２０５Ａを作成すると、次に、ＣＰＵ２００はステップＳ８３で、「物品ローカル辞書２０５Ａに登録されている単語」の一覧３００を作成する。図７は、この単語の一覧３００を示した図解図である。この単語の一覧３００は、音声認識辞書１２６から音声認識ローカル辞書２０５Ｂを作成する際に使用される。図６および図７からわかるように、単語の一覧３００に登録される「物品ローカル辞書２０５Ａに登録されている単語」とは、物品ローカル辞書２０５Ａの各レコードの「名称」、「属性」、「著者」、「出版社」のそれぞれに記憶されているテキスト形式の単語である。ただし、図７に示されるように、この単語一覧３００には、テキスト形式の文字列からなる単語のみではなく、先に説明した音素記号列（図７の例では、“bakenazo”、“saiai”）もユーザＩＤ（図７の例では、（456…0000004）、（456…0000003））とともに登録される。これらの音素記号列（およびユーザＩＤ）は、先述したように、物品辞書１２２に単語が登録されていなかったために、登録されていない単語に対応する音素記号列が後から登録されたものである。 After creating the article local dictionary 205A in this way, the CPU 200 creates, in step S83, a list 300 of “words registered in the article local dictionary 205A”. FIG. 7 is an illustrative view of this word list 300, which is used when the speech recognition local dictionary 205B is created from the speech recognition dictionary 126. As FIGS. 6 and 7 show, the “words registered in the article local dictionary 205A” that go into the word list 300 are the text-format words stored in the “name”, “attribute”, “author”, and “publisher” fields of each record of the article local dictionary 205A. As shown in FIG. 7, however, the word list 300 contains not only words consisting of text strings but also the phoneme symbol strings described earlier (“bakenazo” and “saiai” in the example of FIG. 7), each registered together with a user ID ((456…0000004) and (456…0000003) in the example of FIG. 7). As described earlier, these phoneme symbol strings (and user IDs) were registered afterwards because the corresponding words had not been registered in the article dictionary 122.

ステップＳ８５では、単語の一覧３００に登録されている音素記号列の中から、ロボット１２が認識した人間１６のユーザＩＤ以外のユーザＩＤが対応つけられている音素記号列を削除する。ロボット１２が認識した人間１６のユーザＩＤが（456…0000004）である場合には、図８に示したように、他の人間１６のユーザＩＤ（456…0000003）と対応つけられている“saiai”が削除される。ユーザＩＤが（456…0000003）である人間１６に特有の“saiai”という本２４（化学の目）の名称の呼び方は、ロボット１２が認識したユーザＩＤが（456…0000004）である人間１６とのコミュニケーションにおける音声認識には必要がないと考えられるからである。 In step S85, every phoneme symbol string in the word list 300 that is associated with a user ID other than that of the person 16 recognized by the robot 12 is deleted. If the user ID of the person 16 recognized by the robot 12 is (456…0000004), then, as shown in FIG. 8, “saiai”, which is associated with another person's user ID (456…0000003), is deleted. This is because “saiai”, the way the person 16 with user ID (456…0000003) refers to the title of the book 24 (“Chemical Eye”), is considered unnecessary for speech recognition when communicating with the person 16 with user ID (456…0000004) recognized by the robot 12.

次に、ＣＰＵ２００は、ステップＳ８７で、単語の一覧３００における冗長性を排除する。つまり、図７の例において、単語の一覧３００に登録されている「白色」や「ハードカバー」などの重複部分を図８に示すように削除する。 Next, in step S87, the CPU 200 eliminates redundancy in the word list 300: entries such as “white” and “hardcover” that appear more than once in the word list 300 of FIG. 7 are reduced to a single occurrence, as shown in FIG. 8.
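Steps S83 to S87 can be sketched as below. The dictionary-shaped records and the `phonemes` field holding (phoneme string, user ID) pairs are hypothetical stand-ins for the layout of FIG. 6 and FIG. 7; only the three operations (collect words, drop other users' phoneme strings, remove duplicates) follow the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    word: str
    is_phoneme: bool = False
    user_id: str | None = None  # only user-specific phoneme strings carry a user ID


def build_word_list(records: list[dict], current_user_id: str) -> list[WordEntry]:
    """Steps S83 to S87 over hypothetical dict records of the article local dictionary."""
    words: list[WordEntry] = []
    # S83: text words from the name, attribute, author and publisher fields,
    # plus phoneme strings that were registered later together with a user ID.
    for rec in records:
        words.append(WordEntry(rec["name"]))
        words.extend(WordEntry(a) for a in rec["attributes"])
        words.append(WordEntry(rec["author"]))
        words.append(WordEntry(rec["publisher"]))
        words.extend(WordEntry(p, True, uid) for p, uid in rec.get("phonemes", []))
    # S85: drop phoneme strings that belong to other users.
    words = [w for w in words
             if not (w.is_phoneme and w.user_id != current_user_id)]
    # S87: remove duplicates while keeping first-occurrence order.
    return list(dict.fromkeys(words))
```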

こうして単語の一覧３００が完成すると、次に、この単語の一覧３００を利用して音声認識辞書１２６から音声認識ローカル辞書２０５Ｂを作成する。 When the word list 300 is completed in this manner, the speech recognition local dictionary 205B is created from the speech recognition dictionary 126 using the word list 300.

まず、ステップＳ８９において、ＣＰＵ２００は、メモリ２０４内に設定したカウンタＣｔ１をインクリメントする。初期状態では、カウンタＣｔ１は「０」である。このカウンタＣｔ１は、単語の一覧３００に登録されている単語の数をカウントするものであり、単語の一覧３００のポインタとして機能する。したがって、カウンタＣｔ１のカウント値によって、単語の一覧３００内において異なる単語を指定する。カウンタＣｔ１のカウント値が、単語の一覧３００に登録されている単語の数「Ｌ」に等しくなるまで、以下の動作が各単語について実行されると理解されたい。 First, in step S89, the CPU 200 increments the counter Ct1 set in the memory 204; its initial value is “0”. The counter Ct1 counts the words registered in the word list 300 and serves as a pointer into the list, so each count value designates a different word in the word list 300. The following operations are performed for each word until the count value of Ct1 equals the number “L” of words registered in the word list 300.

ステップＳ９１では、単語の一覧３００に登録されている単語が音素記号列であるか否かを判断する。たとえば、単語の一覧３００に登録されている単語が図８に示すように、テキスト形式で表現された「地球温暖化」という単語であれば、ステップＳ９１でＮＯと判断される。すると、ステップＳ９３で、テキスト形式である「地球温暖化」という単語をキーとして、音声認識辞書１２６（図１０参照）を検索し、「地球温暖化」という単語を記憶した項目を含むレコードを抽出し、図１１に示すように音声認識ローカル辞書２０５Ｂに登録する。 In step S91, it is determined whether the word in the word list 300 is a phoneme symbol string. If, as shown in FIG. 8, the word is “global warming” expressed in text format, NO is determined in step S91. Then, in step S93, the speech recognition dictionary 126 (see FIG. 10) is searched using the text word “global warming” as a key, and the record containing a field that stores “global warming” is extracted and registered in the speech recognition local dictionary 205B, as shown in FIG. 11.

一方、単語の一覧３００に登録されている単語が図８に示すように、音素記号形式で表現された“bakenazo”という単語であれば、ステップＳ９１でＹＥＳと判断される。すると、ステップＳ９５で、音素記号列である“bakenazo”という単語をキーとして音声認識辞書１２６を検索し、“bakenazo”という単語を記憶した項目を含むレコードを抽出し、図１１に示すように音声認識ローカル辞書２０５Ｂに登録する。 If, on the other hand, the word is “bakenazo” expressed as a phoneme symbol string, as shown in FIG. 8, YES is determined in step S91. Then, in step S95, the speech recognition dictionary 126 is searched using the phoneme symbol string “bakenazo” as a key, and the record containing a field that stores “bakenazo” is extracted and registered in the speech recognition local dictionary 205B, as shown in FIG. 11.
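A minimal sketch of the Ct1 loop of steps S89 to S95, assuming each speech recognition dictionary record is a dict with hypothetical `text`, `phonemes`, and `is_article_name` keys (the text names an “article name” field but does not give the exact layout of FIG. 10):

```python
def build_sr_local_dictionary(word_list: list[tuple[str, bool]],
                              sr_dictionary: list[dict]) -> list[dict]:
    """Steps S89 to S95: copy matching records of the speech recognition dictionary."""
    local: list[dict] = []
    # One pass per word in the list, as the counter Ct1 does (1..L).
    for word, is_phoneme in word_list:
        # S91: a phoneme string is looked up by its phoneme field (S95),
        # a text word by its text field (S93).
        key = "phonemes" if is_phoneme else "text"
        for rec in sr_dictionary:
            if rec[key] == word and rec not in local:  # guard against double registration
                local.append(rec)
    return local
```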

このようにして、物品辞書１２２から物品ローカル辞書２０５Ａが作成され、音声認識辞書１２６から音声認識ローカル辞書２０５Ｂが作成される。 In this way, the article local dictionary 205A is created from the article dictionary 122, and the voice recognition local dictionary 205B is created from the voice recognition dictionary 126.

一方、図１５に戻って、ロボット１２において、ステップＳ３でローカル辞書の作成指示をサーバ２０に送信した後は、ステップＳ５で、ＣＰＵ８０は、メモリ８４内に設定している発話／ジェスチャ辞書８５Ａを用いて、スピーカ６４から、たとえば、「何か本を持って来ましょうか？」のような発話を行なわせる。その後、人間１６がたとえば「花火の匠持ってきて」のような発話をしたとすると、ロボット１２が人間１６の指示を認識し、ステップＳ７で、ＣＰＵ８０が、“ＹＥＳ”と判断する。このとき、人間１６は、「花火の匠持ってきて」という発話とともに、該当本に視線を向けつつ当該本を指差すことによって、どの本持ってきてほしいかを指示する。 Returning to FIG. 15, after the robot 12 transmits the local dictionary creation instruction to the server 20 in step S3, the CPU 80, in step S5, uses the utterance/gesture dictionary 85A set in the memory 84 to have the speaker 64 produce an utterance such as “Shall I bring you a book?” If the person 16 then says something like “Bring me Fireworks Master”, the robot 12 recognizes the instruction and the CPU 80 determines “YES” in step S7. Along with the utterance “Bring me Fireworks Master”, the person 16 indicates which book he or she wants by looking at that book and pointing to it.

ＣＰＵ８０は、ステップＳ７でＹＥＳと判断すると、次にステップＳ９において、人間１６が指示した「花火の匠持ってきて」という音声の情報とともに、音声認識ならびに視線と指差しの推定を行わせる指示をサーバ２０に送信する。 If YES is determined in step S7, the CPU 80, in step S9, transmits to the server 20 the speech information of the instruction “Bring me Fireworks Master” given by the person 16, together with an instruction to perform speech recognition and to estimate the line of sight and pointing direction.

サーバ２０では、ＣＰＵ２００が、図１６のステップＳ３５において、ロボット１２から音声認識ならびに視線と指差しの推定を行わせる指示を音声の情報とともに受信したと判断すると、ステップＳ３７における「音声認識による物品の推定」の処理とステップＳ３９における「視線と指差しによる物品の推定」の処理とを並列的に実行する。 When the CPU 200 of the server 20 determines in step S35 of FIG. 16 that it has received from the robot 12, together with the speech information, the instruction to perform speech recognition and to estimate the line of sight and pointing, it executes in parallel the “estimation of article by speech recognition” process of step S37 and the “estimation of article by line of sight and pointing” process of step S39.

「音声認識による物品の推定」の処理は、図１９のフロー図に示す手順にしたがって実行される。まず、ＣＰＵ２００は、ステップＳ１１１で、ロボット１２から送信されてきた「花火の匠持ってきて」という音声の情報を音声認識する。具体的には、「花火の匠持ってきて」という音声を音素に分割し、各音素に対応する音素記号を生成する。そして、図示しない文法辞書を参照して、生成した音素記号列“hanabinotakumi mottekite”のうち“hanabinotakumi”という音素記号列が目的語であると特定する。次に、この“hanabinotakumi”という音素記号列をキーとして音声認識ローカル辞書２０５Ｂ（図１１参照）を検索し、もっとも近い音素記号列の単語を記憶した項目を有するレコードを特定する。そして、特定したレコードに記憶されているテキスト形式の単語である「花火の匠」を得る。また、このとき、特定したレコードの「物品名称」の項目に記憶されている情報に基づいて、音声認識されたのが物品の名称であるか、つまり、本２４のタイトル（名称）であるか否かを判断することができる。この判断に基づいて後ほどフラグｔｗの設定を行うが、後ほど設定を行うのはフロー図の表現上の都合であり、ステップＳ１１１においてフラグｔｗの設定を行えばよい。 The “estimation of article by speech recognition” process follows the procedure shown in the flowchart of FIG. 19. First, in step S111, the CPU 200 performs speech recognition on the speech information “Bring me Fireworks Master” transmitted from the robot 12. Specifically, the speech is divided into phonemes, and a phoneme symbol is generated for each phoneme. Then, referring to a grammar dictionary (not shown), the CPU 200 identifies “hanabinotakumi” within the generated phoneme symbol string “hanabinotakumi mottekite” as the object. Next, it searches the speech recognition local dictionary 205B (see FIG. 11) using “hanabinotakumi” as a key, identifies the record whose field stores the closest phoneme symbol string, and obtains the text-format word “Fireworks Master” stored in that record. At this point, the information stored in the “article name” field of the identified record also tells whether the recognized word is the name of an article, that is, the title (name) of a book 24. The flag tw is set according to this determination later in the flowchart; this placement is only for convenience in drawing the flowchart, and the flag tw may equally be set in step S111.
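The text says only that the record with the “closest” phoneme symbol string is chosen and that a grammar dictionary isolates the object. The sketch below assumes Levenshtein edit distance as the closeness measure, replaces the grammar dictionary with hypothetical request suffixes, and treats a best match beyond a hypothetical `max_distance` as unrecognized, so that an unregistered word such as “hanataku” yields no match.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two phoneme symbol strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical stand-in for the grammar dictionary: known request endings.
REQUEST_SUFFIXES = (" mottekite", " onegai")


def recognize_object(phonemes: str, sr_local: list[dict], max_distance: int = 2):
    """Return (text word, is_article_name) for the closest entry, or None if nothing is close."""
    obj = phonemes
    for suffix in REQUEST_SUFFIXES:
        if obj.endswith(suffix):
            obj = obj[: -len(suffix)]
            break
    if not sr_local:
        return None
    best = min(sr_local, key=lambda rec: edit_distance(obj, rec["phonemes"]))
    if edit_distance(obj, best["phonemes"]) > max_distance:
        return None  # treated as an unrecognized word (step S113)
    return best["text"], best["is_article_name"]
```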

たとえば、人間１６が「花火の匠持ってきて」という代わりに「黒いの持ってきて」と指示した場合には、同様にしてステップＳ１１１における音声認識の結果として、「黒い」が得られる。なお、音声認識の方法としては特に限定されるものではなく既存の方法を採用することができる。 If, for example, the person 16 says “Bring me the black one” instead of “Bring me Fireworks Master”, the speech recognition in step S111 likewise yields “black”. The speech recognition method is not particularly limited; any existing method can be used.

また、人間１６が「花火の匠持ってきて」という代わりに、「花火の匠」という単語の人間１６独特の略語である「ハナタク」という単語を用いて、「ハナタク持ってきて」と発話した場合には、音声認識ローカル辞書２０５Ｂには「ハナタク」という単語に相当する“hanataku”という音素記号列は登録されていないので、音声認識をすることができない。 If instead the person 16 says “Bring me Hanataku”, using “Hanataku”, an abbreviation of the title “Hanabi no Takumi” (Fireworks Master) peculiar to that person, the word cannot be recognized, because the phoneme symbol string “hanataku” corresponding to it is not registered in the speech recognition local dictionary 205B.

このように、音声認識ができない単語があった場合には、ＣＰＵ２００は、ステップＳ１１３において、ＹＥＳと判断し、ステップＳ１１５において、音声認識した人間１６の発話内容は特定のパターンであったか否かを判断する。ここで発話内容の特定のパターンとは、たとえば、「○○持ってきて」や「○○お願い」などであり、人間１６が指示する物品を特定する単語（○○）が含まれるであろう発話パターンである。人間１６が発した発話の内容がこれらの特定のパターンに該当するか否かは、音声認識の処理における図示しない文法辞書を用いた発話内容の文法解析とともに、メモリ２０４に設定されている発話パターン辞書２０５Ｃを参照することによって判定することができる。 When there is a word that cannot be recognized in this way, the CPU 200 determines YES in step S113 and, in step S115, determines whether the recognized utterance of the person 16 matches a specific pattern. Such patterns are utterances like “Bring me ○○” or “○○, please”, which are likely to contain a word (○○) identifying the article the person 16 is indicating. Whether the utterance matches one of these patterns can be determined by grammatical analysis of the utterance with the grammar dictionary (not shown) used in speech recognition, together with reference to the utterance pattern dictionary 205C set in the memory 204.

ステップＳ１１５でＹＥＳと判断した場合は、音声認識できなかった単語であって、人間１６が指示した物品を特定するであろう単語があることを示すために、ステップＳ１１７で、メモリ２０４上に設定されているフラグｎｗに数値「１」を記憶させてフラグｎｗをオン状態にする。なお、フラグｎｗの初期状態はオフ状態（数値「０」）である。そして、ステップＳ１１９では、ステップＳ１１１における音声認識の処理で生成された、音声認識できなかった単語の音素記号列をメモリ２０４に設定されたワークエリアｗａに格納する。なお、ステップＳ１１５でＮＯと判断すると、ステップＳ１１７およびステップＳ１１９をスキップする。 If YES is determined in step S115, then in step S117 the value “1” is stored in the flag nw set in the memory 204, turning the flag on to indicate that there is an unrecognized word that probably identifies the article indicated by the person 16. The initial state of the flag nw is off (value “0”). In step S119, the phoneme symbol string of the unrecognized word, generated by the speech recognition process in step S111, is stored in the work area wa set in the memory 204. If NO is determined in step S115, steps S117 and S119 are skipped.
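Steps S115 to S119 can be sketched with regular expressions. The two patterns below are hypothetical romanized versions of “Bring me ○○” and “○○, please”; the actual contents of the utterance pattern dictionary 205C and the grammar analysis are not described in detail.

```python
from __future__ import annotations

import re

# Hypothetical utterance pattern dictionary 205C over romanized phoneme strings.
UTTERANCE_PATTERNS = (
    re.compile(r"^(?P<obj>\S+) mottekite$"),  # "Bring me XX"
    re.compile(r"^(?P<obj>\S+) onegai$"),     # "XX, please"
)


def check_unrecognized(phonemes: str) -> tuple[int, str | None]:
    """Steps S115 to S119: return (flag nw, content of work area wa)."""
    for pattern in UTTERANCE_PATTERNS:
        match = pattern.match(phonemes)
        if match:
            # S117 turns nw on; S119 stores the unrecognized phoneme string.
            return 1, match.group("obj")
    return 0, None  # S117 and S119 are skipped
```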

一方、ステップＳ１１３においてＮＯと判断すると、つまり、音声認識できない単語がなかったと判断すると、ステップＳ１２１において、ＣＰＵ２００は、音声認識した単語は物品の名称、つまり、本２４のタイトル（名称）であったかどうかを判断する。この音声認識した単語が本２４のタイトル（名称）であったかどうかは、音声認識ローカル辞書２０５Ｂの各レコードに含まれる「物品名称」の項目に記録されている情報に基づいて判断される。 On the other hand, if NO is determined in step S113, that is, if it is determined that there is no word that cannot be voice-recognized, in step S121, the CPU 200 determines whether the voice-recognized word is the name of the article, that is, the title (name) of the book 24. Judging. Whether or not the speech-recognized word is the title (name) of the book 24 is determined based on information recorded in the “article name” item included in each record of the speech recognition local dictionary 205B.

ステップＳ１２１でＹＥＳと判断すると、ステップＳ１２３で、本２４のタイトル（名称）が音声認識されたことを示すために、メモリ２０４に設定されたフラグｔｗに数値「１」を記憶させてフラグｔｗをオン状態にする。なお、フラグｔｗの初期状態はオフ状態（数値「０」）である。ステップＳ１２１でＮＯと判断するとステップＳ１２３をスキップする。 If YES is determined in step S121, then in step S123 the value “1” is stored in the flag tw set in the memory 204, turning the flag on to indicate that the title (name) of a book 24 has been recognized. The initial state of the flag tw is off (value “0”). If NO is determined in step S121, step S123 is skipped.

ステップＳ１２５では、ＣＰＵ２００は、ステップＳ１１１における音声認識の結果として得られたテキスト形式の単語、たとえば、「花火の匠」に基づいて物品ローカル辞書２０５Ａを参照し、人間１６が指示した物品（本）２４の候補としての物品（本）２４を選出する。具体的には、ステップＳ１１１で得られたテキスト形式の単語、たとえば、「花火の匠」をキーとして、物品ローカル辞書２０５Ａを検索し、「花火の匠」というテキスト形式の単語を項目に記憶しているレコードを特定する。図６の例では、物品ＩＤが「123…0000046」であるレコードが「名称」の項目に「花火の匠」という単語を記憶しているので、物品ＩＤが「123…0000046」である物品（本）２４が特定される。こうして特定された物品ＩＤの物品（本）２４が、人間１６が指示した物品の候補として選出され、当該物品（本）２４のＩＤが物品リストＡ（不図示）に登録される。 In step S125, the CPU 200 refers to the article local dictionary 205A based on the text-format word obtained by the speech recognition in step S111, for example “Fireworks Master”, and selects articles (books) 24 as candidates for the article (book) 24 indicated by the person 16. Specifically, it searches the article local dictionary 205A using that word, for example “Fireworks Master”, as a key, and identifies the records whose fields store it. In the example of FIG. 6, the record with article ID “123…0000046” stores “Fireworks Master” in its “name” field, so the article (book) 24 with article ID “123…0000046” is identified. The article (book) 24 thus identified is selected as a candidate for the article indicated by the person 16, and its ID is registered in the article list A (not shown).

ここで、ステップＳ１１１における音声認識の結果として得られた単語が、たとえば、「白い」であった場合は、図６の物品ローカル辞書２０５Ａの例の場合では、「属性」の項目に「白い」という単語を記憶しているレコードが複数存在するので、これらのレコードに含まれる物品ＩＤ、つまり、物品ＩＤが「123…0000035」、「123…0000091」、および「123…0000102」の物品（本）２４が、人間１６が指示した物品の候補として選出される。この場合、こうして選出された複数の物品のＩＤが、物品リストＡ（不図示）に登録される。 If the word obtained by the speech recognition in step S111 is, for example, “white”, then in the example article local dictionary 205A of FIG. 6 several records store “white” in their “attribute” field, so the articles (books) 24 with the IDs in those records, namely “123…0000035”, “123…0000091”, and “123…0000102”, are all selected as candidates for the article indicated by the person 16. In this case, the IDs of all the selected articles are registered in the article list A (not shown).

ステップＳ１２７では、こうして選出された物品（本）２４の候補が物品リストＡに存在するか否かを判断する。ステップＳ１２７でＹＥＳと判断すると、ステップＳ１２９で、ＣＰＵ２００は、音声認識の結果による物品（本）２４の候補が存在することを示すために、メモリ２０４に設定されているフラグｃａに数値「１」を記憶させてフラグｃａをオン状態にする。なお、フラグｃａの初期状態はオフ状態（数値「０」）である。ステップＳ１２７でＮＯと判断すると、ステップＳ１２９をスキップする。このようにして、音声認識によって、人間１６が指示した物品（本）２４の推定が行われる。 In step S127, it is determined whether any candidate article (book) 24 selected in this way exists in the article list A. If YES, then in step S129 the CPU 200 stores the value “1” in the flag ca set in the memory 204, turning the flag on to indicate that speech recognition has produced a candidate article (book) 24. The initial state of the flag ca is off (value “0”). If NO is determined in step S127, step S129 is skipped. In this way, the article (book) 24 indicated by the person 16 is estimated by speech recognition.
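Candidate selection in steps S125 to S129 is an exact-match search over the text fields of the article local dictionary. A sketch, assuming dict records with hypothetical `id`, `name`, `attributes`, `author`, and `publisher` keys (the IDs in the usage below are made up, not those of FIG. 6):

```python
def select_candidates(word: str, article_local: list[dict]) -> tuple[list[str], int]:
    """Steps S125 to S129: return (article list A, flag ca)."""
    list_a = [rec["id"] for rec in article_local
              if word in (rec["name"], rec["author"], rec["publisher"])
              or word in rec["attributes"]]
    ca = 1 if list_a else 0  # S127/S129: ca is on only when list A is non-empty
    return list_a, ca
```

A recognized title normally selects one record, while an attribute such as “white” can select several, which is why list A may hold more than one ID.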

図１６に戻って、ステップＳ３９の「視線と指差しによる物品の推定」の処理は、図２０のフロー図に示す手順にしたがって実行される。ＣＰＵ２００は、まず、ステップＳ１４１で、メモリ２０４内に設定したカウンタＣｔ２をインクリメントする。初期状態ではカウンタＣｔ２は「０」が設定されている。このカウンタＣｔ２は、人間１６の近傍に存在する物品の数をカウントするもので、物品ローカル辞書２０５Ａ（図６参照）のポインタとして機能する。したがって、カウンタＣｔ２のカウント値によって、物品ローカル辞書２０５Ａ内において、異なる物品を指定する。カウンタＣｔ２のカウント値が、物品ローカル辞書２０５Ａ内にリストアップしている物品の数「ｍ」に等しくなるまで、以下の動作が各物品について、実行されるものと理解されたい。 Returning to FIG. 16, the “estimation of article by line of sight and pointing” process of step S39 follows the procedure shown in the flowchart of FIG. 20. First, in step S141, the CPU 200 increments the counter Ct2 set in the memory 204; its initial value is “0”. The counter Ct2 counts the articles near the person 16 and serves as a pointer into the article local dictionary 205A (see FIG. 6), so each count value designates a different article in the article local dictionary 205A. The following operations are performed for each article until the count value of Ct2 equals the number “m” of articles listed in the article local dictionary 205A.

ステップＳ１４１に続いて、ＣＰＵ２００は、人間１６の視線を推定してそれの確信度を求める動作と、指差し方向を推定してそれの確信度を求める動作とを並行して実行するが、ここでは便宜上、まず視線を推定し次いで指差し方向を推定する順序で説明する。 Subsequent to step S141, the CPU 200 executes in parallel the operation of estimating the line of sight of the human 16 to obtain the certainty thereof and the operation of estimating the pointing direction and obtaining the certainty thereof. For the sake of convenience, description will be made in the order of estimating the line of sight first and then estimating the pointing direction.

図２０のステップＳ１４１〜ステップＳ１５３の処理は、図１６のステップＳ３５でロボット１２から音声認識ならびに視線と指差しの推定を行わせる指示を受信した後の一定の繰り返しの時間（ｔ１，ｔ２，ｔ３，…，ｔｍ）毎に実行されるが、実施例では、５０Ｈｚ（１秒間に５０回）で実行されるものとする。なお、ステップＳ１４１〜ステップＳ１５３の処理は、たとえば、図１５のステップＳ５でロボット１２が発話を行った際に、その旨を知らせる通知をロボット１２からサーバ２０に送信し、当該通知を受信したサーバ２０が、当該通知を受信した後の一定の繰り返しの時間（ｔ１，ｔ２，ｔ３，…，ｔｍ）毎に実行するようにしてもよい。 The processing of steps S141 to S153 in FIG. 20 is executed at each fixed repetition time (t1, t2, t3, …, tm) after the instruction to perform speech recognition and to estimate the line of sight and pointing is received from the robot 12 in step S35 of FIG. 16; in this embodiment it is executed at 50 Hz (50 times per second). Alternatively, when the robot 12 speaks in step S5 of FIG. 15, it may send the server 20 a notification to that effect, and the server 20 may then execute steps S141 to S153 at each fixed repetition time (t1, t2, t3, …, tm) after receiving the notification.

ステップＳ１４３では、ＣＰＵ２００は、ロボット１２から、人間１６の視線の方向を示す視線ベクトル情報を受信する。フロー図には明記しないが、ロボット１２では、ＣＰＵ８０は、たとえば眼カメラ７０からのカメラ映像を処理することによって、先に挙げた公開公報に記載したいずれかの方法に従って、人間１６の左右のそれぞれの眼の視線ベクトルを推定する。この左右それぞれの眼の視線方向は図１３において直線Ｌ１およびＬ２で示される。このようにして、ロボット１２は、所定の時間間隔で繰り返し各視線Ｌ１およびＬ２を推定し、この視線の方向を示す視線ベクトル情報をサーバ２０に送信する。 In step S143, the CPU 200 receives from the robot 12 line-of-sight vector information indicating the gaze direction of the person 16. Although not shown in the flowchart, the CPU 80 of the robot 12 estimates the line-of-sight vector of each of the left and right eyes of the person 16, for example by processing the images from the eye camera 70 according to one of the methods described in the publications cited earlier. The gaze directions of the left and right eyes are indicated by the straight lines L1 and L2 in FIG. 13. The robot 12 thus repeatedly estimates the lines of sight L1 and L2 at predetermined time intervals and transmits line-of-sight vector information indicating their directions to the server 20.

サーバ２０では、ステップＳ１４５において、ＣＰＵ２００が、カウンタＣｔ２がそのとき物品ローカル辞書２０５Ａ内でポイントしている物品と、各視線Ｌ１およびＬ２との距離を計算する。この計算の際には、ＣＰＵ２００は、ロボット１２から受信した視線ベクトル情報をロボット座標系からワールド座標系に座標変換して使用する。また、この距離の計算の際には、物品ローカル辞書２０５Ａの当該物品（本）２４のレコードに記録されている当該物品（本）２４の位置の情報が利用される。 In step S145, the CPU 200 of the server 20 calculates the distance between each of the lines of sight L1 and L2 and the article currently pointed to by the counter Ct2 in the article local dictionary 205A. For this calculation, the CPU 200 transforms the line-of-sight vector information received from the robot 12 from the robot coordinate system into the world coordinate system, and uses the position of the article (book) 24 recorded in that article's record in the article local dictionary 205A.
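The distance in step S145 is a point-to-line distance computed after the gaze vector has been moved into world coordinates. The sketch below assumes the robot pose is known as a position plus a heading angle about the vertical axis (`robot_yaw`), a simplification not stated in the text, and treats each line of sight as an infinite straight line, as the text describes.

```python
import math

Vec3 = tuple[float, float, float]


def robot_to_world(v: Vec3, robot_pos: Vec3, robot_yaw: float,
                   is_direction: bool = False) -> Vec3:
    """Rotate about the vertical axis by the robot's heading, then translate (points only)."""
    c, s = math.cos(robot_yaw), math.sin(robot_yaw)
    x, y, z = v
    wx, wy = c * x - s * y, s * x + c * y
    if is_direction:
        return (wx, wy, z)  # direction vectors are rotated but not translated
    return (wx + robot_pos[0], wy + robot_pos[1], z + robot_pos[2])


def point_line_distance(p: Vec3, origin: Vec3, direction: Vec3) -> float:
    """Distance from point p (an article position) to the line origin + t * direction."""
    d = [pi - oi for pi, oi in zip(p, origin)]
    t = sum(a * b for a, b in zip(d, direction)) / sum(u * u for u in direction)
    closest = [oi + t * ui for oi, ui in zip(origin, direction)]
    return math.dist(p, closest)
```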

一方、指差し方向を推定するためには、ステップＳ１４７において、まず、ＣＰＵ２００は、人間１６が指差し動作をした腕を特定する。具体的には、モーションキャプチャのデータを参照して、たとえば、人間１６の指先と肩の高さとの差が小さい側の腕を指差し腕として推定する。なぜなら、指差し動作をする場合には、腕を持ち上げる動作をまずすると考えられるからである。このようにして、ステップＳ１４７でどちらの腕を用いて指差し動作をするかを推定した後、ＣＰＵ２００は、次のステップＳ１４９において、指差し方向を推定する。 To estimate the pointing direction, on the other hand, the CPU 200 first identifies, in step S147, the arm with which the person 16 is pointing. Specifically, referring to the motion capture data, it estimates as the pointing arm, for example, the arm whose fingertip height differs least from its shoulder height, because a person who points is considered to raise the arm first. After estimating in step S147 which arm is used for pointing, the CPU 200 estimates the pointing direction in the next step S149.

この実施例では、図１３に示すように、指差し腕の指先と顔の中心（重心）とを通る直線Ｌ３、および指差し腕の指先とその腕の肘とを通る直線Ｌ４を想定する。そして、モーションキャプチャのデータを参照して、その直線Ｌ３およびＬ４を推定する。次のステップＳ１５１において、各直線Ｌ３およびＬ４と各物品との間の距離を計算する。 In this embodiment, as shown in FIG. 13, a straight line L3 passing through the fingertip of the pointing arm and the center (center of gravity) of the face and a straight line L4 passing through the fingertip of the pointing arm and the elbow of the arm are assumed. Then, the straight lines L3 and L4 are estimated with reference to the motion capture data. In the next step S151, the distance between each straight line L3 and L4 and each article is calculated.
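Steps S147 and S149 can be sketched from motion-capture marker positions. The marker names (`face_center`, `left_elbow`, and so on) and the (origin, direction) representation of L3 and L4 are hypothetical; the arm rule follows the text's example of choosing the arm whose fingertip height is closest to its shoulder height.

```python
Point = tuple[float, float, float]
Line = tuple[Point, tuple[float, ...]]  # (origin, direction)


def pointing_arm(markers: dict[str, Point]) -> str:
    """Step S147: the arm whose fingertip height (z) is closest to its shoulder height."""
    gap = {side: abs(markers[f"{side}_fingertip"][2] - markers[f"{side}_shoulder"][2])
           for side in ("left", "right")}
    return min(gap, key=gap.get)


def pointing_lines(markers: dict[str, Point]) -> tuple[str, Line, Line]:
    """Step S149: L3 runs from the face centre through the fingertip, L4 from the elbow through it."""
    side = pointing_arm(markers)
    tip = markers[f"{side}_fingertip"]
    face = markers["face_center"]
    elbow = markers[f"{side}_elbow"]
    l3 = (face, tuple(t - f for t, f in zip(tip, face)))
    l4 = (elbow, tuple(t - e for t, e in zip(tip, elbow)))
    return side, l3, l4
```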

上述のステップＳ１４３〜Ｓ１４５およびステップＳ１４７〜Ｓ１５１は、繰返し時間（ｔ１，ｔ２，ｔ３，．．．，ｔｍ）毎に行われる。そして、各繰返しの時間（ｔ１，ｔ２，ｔ３，．．．，ｔｍ）毎に、線Ｌ１，Ｌ２，Ｌ３，およびＬ４との距離が最小になる物品を求める。各線において、最小になった物品に対して高い確信度（図１４でいえば「○」印）を付与する。このようにして、たとえば図１４に示すような確信度表を作成する。 Steps S143 to S145 and steps S147 to S151 described above are performed at each repetition time (t1, t2, t3, …, tm). At each repetition time, the article with the minimum distance to each of the lines L1, L2, L3, and L4 is found, and for each line a high certainty (the “○” mark in FIG. 14) is given to that nearest article. In this way, a certainty table such as the one shown in FIG. 14 is created.

このように直線毎に最短距離を持つ物品を算出することによって確信度表を作成するようにすれば、１つの物品について２以上の直線について確信度（○）が付与されることがある。このことによって、後にステップＳ１５５で説明するような物品リストＢ（不図示）を作成することができるのである。 Because the certainty table is created by finding the nearest article for each line separately, a single article may receive certainty marks (○) from two or more lines. This is what makes it possible to create the article list B (not shown) described later for step S155.
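A sketch of how the certainty table of FIG. 14 can be built, assuming the per-time-step distances have already been computed into nested dicts (the input layout and article names are hypothetical). Because the nearest article is chosen per line, one article can collect several marks in the same time step.

```python
def certainty_table(distances: list[dict[str, dict[str, float]]]) -> dict[str, set[tuple[int, str]]]:
    """distances[t][line][article_id] is the distance from line L1..L4 to the article at time t.

    Returns, for each article, the set of (time index, line) cells that carry a ○ mark.
    """
    marks: dict[str, set[tuple[int, str]]] = {}
    for t, per_line in enumerate(distances):
        for line, per_article in per_line.items():
            nearest = min(per_article, key=per_article.get)  # one ○ per line per time step
            marks.setdefault(nearest, set()).add((t, line))
    return marks
```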

この図１４の確信度表において、視線Ｌ１およびＬ２のそれぞれについて評価される確信度は「視線確信度」ということができ、指差し方向線Ｌ３およびＬ４のそれぞれについて評価される確信度が「指差し方向確信度」であるということができる。 In the certainty table of FIG. 14, the certainty evaluated for each of the lines of sight L1 and L2 can be called the “line-of-sight certainty”, and the certainty evaluated for each of the pointing direction lines L3 and L4 can be called the “pointing direction certainty”.

図１４に示す例で説明すると、「123…0000001」のＩＤを持つ物品、実施例でいえば図１に示す「地球温暖化」という名称の本についていえば、時間ｔ１に一方の視線Ｌ２とこの本との間の距離が最小になったものの、その他の時間区間ではどの線も当該本に最接近することはなかったと判断できる。次の、「123…0000046」のＩＤを持つ物品、実施例でいえば図１に示す「花火の匠」という名称の本についていえば、時間ｔ１を除いて、各時間にどれかの線がこの物品に最接近したことがわかる。このようにして、図１４に示す確信度表がステップＳ１４５およびステップＳ１５１で作成される。 In the example of FIG. 14, for the article with ID “123…0000001”, which in this embodiment is the book titled “Global Warming” shown in FIG. 1, the distance between one line of sight, L2, and this book was the minimum at time t1, but no line came closest to the book in any other time interval. For the next article, with ID “123…0000046”, which in this embodiment is the book titled “Fireworks Master” shown in FIG. 1, some line came closest to it at every time except t1. In this way, the certainty table shown in FIG. 14 is created in steps S145 and S151.

ステップＳ１５５においてＣＰＵ２００は、図１４に示す確信度表を参照して、そのとき人間１６が指示したと考える物品（本）２４を特定する。具体的には、確信度評価（図１４で言えば丸印）が単に多い順や、繰返し時間で視線（Ｌ１またはＬ２）と指差し(Ｌ３またはＬ４)の両方に○が入っている回数が多い順などに従って、物品のＩＤのリストである物品リストＢ（不図示）を作成する。 In step S155, the CPU 200 refers to the certainty table shown in FIG. 14 and identifies the article (book) 24 that the person 16 is considered to have indicated. Specifically, it creates the article list B (not shown), a list of article IDs ordered, for example, simply by the number of certainty marks (circles in FIG. 14), or by the number of repetition times at which both a line of sight (L1 or L2) and a pointing line (L3 or L4) carry a ○.

この確信度評価について、たとえば、図１４に示す例で説明すると、「123…0000001」のＩＤを持つ「地球温暖化」という名称の本についていえば、確信度評価は「１」（１つの○印が付与された。）であり、「123…0000046」のＩＤを持つ「花火の匠」という本の確信度は「３」ということになる。したがって、この場合には、物品リストＢには、ＩＤ「123…0000046」、ＩＤ「123…0000001」の順で登録される。 In the example shown in FIG. 14, the book titled “Global Warming” with ID “123…0000001” has a certainty score of “1” (one ○ mark), while the book “Fireworks Master” with ID “123…0000046” has a certainty score of “3”. In this case, therefore, ID “123…0000046” and ID “123…0000001” are registered in the article list B in that order.

ただし、確信度（○印）の数が同じ場合であるとか、確信度（○印）の数が所定の閾値より小さい場合など、判断に迷う場合には、たとえば、図１４に示す各繰り返しの時間の全区間の半分以上で確信度が付与されているような物品を対象物として特定すればよい。 When the decision is ambiguous, however, for example when articles have the same number of certainty marks (○) or when the number of marks is below a predetermined threshold, an article that received a certainty mark in at least half of all the repetition time intervals shown in FIG. 14 may be identified as the target.
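The orderings described for step S155 and the half-interval rule for ambiguous cases can be sketched as follows. The input is assumed (hypothetically) to map each article ID to the set of (time index, line) cells carrying a ○ mark; the threshold test and tie handling of the actual system are not specified, so these functions only illustrate the three rules named in the text.

```python
GAZE_LINES = {"L1", "L2"}
POINTING_LINES = {"L3", "L4"}
Marks = dict[str, set[tuple[int, str]]]


def rank_by_mark_count(marks: Marks) -> list[str]:
    """Article list B ordered simply by the number of ○ marks (stable for ties)."""
    return sorted(marks, key=lambda a: len(marks[a]), reverse=True)


def rank_by_agreement(marks: Marks, n_steps: int) -> list[str]:
    """Article list B ordered by time steps where both a gaze line and a pointing line marked it."""
    def agreeing_steps(article: str) -> int:
        cells = marks[article]
        return sum(1 for t in range(n_steps)
                   if any((t, ln) in cells for ln in GAZE_LINES)
                   and any((t, ln) in cells for ln in POINTING_LINES))
    return sorted(marks, key=agreeing_steps, reverse=True)


def half_interval_fallback(marks: Marks, n_steps: int) -> list[str]:
    """Ambiguous case: articles marked in at least half of all time steps."""
    return [a for a, cells in marks.items() if 2 * len({t for t, _ in cells}) >= n_steps]
```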

このようにして、ステップＳ３７の「音声認識による物品の推定」の処理で物品リストＡが作成され、ステップＳ３９の「視線と指差しによる物品の推定」の処理で物品リストＢが作成されると、次に、ＣＰＵ２００は、図１６のステップＳ４１で、人間１６が指示した物品の候補の一覧である候補物品一覧Ｃ（不図示）を作成する。この候補物品一覧Ｃを作成する処理は、図２１のフロー図に示す手順で実行される。 In this manner, the article list A is created by the process of “estimating articles by voice recognition” in step S37, and the article list B is created by the process of “estimating articles by line of sight and pointing” in step S39. Next, in step S41 of FIG. 16, the CPU 200 creates a candidate article list C (not shown), which is a list of article candidates designated by the human 16. The process of creating the candidate article list C is executed according to the procedure shown in the flowchart of FIG.

まず、ＣＰＵ２００は、ステップＳ１６１で、メモリ２０４に設定されているフラグｃａがオン状態であるか、つまり、音声認識の結果により推定された物品が存在するか否か、言い換えれば、物品リストＡに物品のＩＤが登録されているか否かを判断する。 First, in step S161, the CPU 200 determines whether or not the flag ca set in the memory 204 is on, that is, whether or not there is an article estimated from the result of speech recognition, in other words, in the article list A. It is determined whether or not the article ID is registered.

ステップＳ１６１でＮＯと判断すると、音声認識の結果によって推定された物品は存在しないので、ステップＳ１６３で、視線確信度および指差し方向確信度に基づいて候補物品一覧Ｃを作成する。つまり、物品リストＢの内容をそのままに候補物品一覧Ｃを作成する。 If NO is determined in step S161, there is no article estimated based on the result of the voice recognition. Therefore, in step S163, a candidate article list C is created based on the line-of-sight certainty and the pointing direction certainty. That is, the candidate article list C is created with the contents of the article list B as they are.

一方、ステップＳ１６１でＹＥＳと判断すると、次に、ＣＰＵ２００は、ステップＳ１６５で、フラグｔｗがオン状態であるか、つまり、音声認識により本２４のタイトル（名称）（物品の名称）が認識されたか否か、言い換えれば、音声認識の結果として得られた物品リストＡに登録されている物品のＩＤは、人間１６の発話内容に含まれる本２４のタイトル（名称）に基づいて決定されたか否かを判断する。 On the other hand, if “YES” is determined in the step S161, the CPU 200 next determines whether the flag tw is in an on state in step S165, that is, whether the title (name) (name of the article) of the book 24 is recognized by voice recognition. No, in other words, whether the ID of the article registered in the article list A obtained as a result of the speech recognition is determined based on the title (name) of the book 24 included in the utterance content of the person 16 Judging.

ステップＳ１６５でＹＥＳと判断すると、ステップＳ１６７で、ＣＰＵ２００は、音声認識の結果を視線確信度および指差し方向確信度よりも優先させて候補物品一覧Ｃを作成する。つまり、物品リストＡに登録されている物品のＩＤが上位となるように先に候補物品一覧Ｃに登録し、その後に、物品リストＢに登録されている物品のＩＤが下位となるように登録する。これは、人間１６が名指しした本２４のタイトル（名称）を音声認識して物品（本）２４を推定したほうが、視線や指差しに基づいて物品（本）２４を推定するよりも確実であると考えられるからである。なお、物品リストＡと物品リストＢとに重複する物品のＩＤが存在する場合には、物品リストＢに登録されている物品のＩＤを候補物品一覧Ｃに登録しない。 If YES is determined in step S165, then in step S167 the CPU 200 creates the candidate article list C, giving the speech recognition result priority over the line-of-sight and pointing direction certainties. That is, the IDs of the articles in the article list A are registered first, at the top of the candidate article list C, and the IDs in the article list B are registered after them. This is because estimating the article (book) 24 by recognizing the title (name) of the book 24 that the person 16 named is considered more reliable than estimating it from the line of sight or pointing. If the same article ID appears in both the article list A and the article list B, the copy from the article list B is not registered in the candidate article list C.

一方、ステップＳ１６５でＮＯと判断すると、ＣＰＵ２００は、ステップＳ１６９で、視線確信度および指差し方向確信度を音声認識の結果よりも優先させて候補物品一覧Ｃを作成する。つまり、物品リストＢに登録されている物品のＩＤが上位となるように先に候補物品一覧Ｃに登録し、その後に、物品リストＡに登録されている物品のＩＤが下位となるように登録する。なお、物品リストＡと物品リストＢとに重複する物品のＩＤが存在する場合には、物品リストＡに登録されている物品のＩＤを候補物品一覧Ｃに登録しない。 If NO is determined in step S165, on the other hand, then in step S169 the CPU 200 creates the candidate article list C, giving the line-of-sight and pointing direction certainties priority over the speech recognition result. That is, the IDs of the articles in the article list B are registered first, at the top of the candidate article list C, and the IDs in the article list A are registered after them. If the same article ID appears in both lists, the copy from the article list A is not registered in the candidate article list C.
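Steps S161 to S169 reduce to a priority merge of two lists with de-duplication. A minimal sketch, with the flags ca and tw passed as integers as in the text and article IDs as plain strings (an assumption):

```python
def build_candidate_list(list_a: list[str], list_b: list[str], ca: int, tw: int) -> list[str]:
    """Steps S161 to S169: merge article lists A and B into the candidate article list C."""
    if not ca:
        return list(list_b)  # S163: no speech-recognition candidates, use list B as is
    # S167 (a title was recognized): A before B; S169 (otherwise): B before A.
    first, second = (list_a, list_b) if tw else (list_b, list_a)
    seen = set(first)
    # IDs already registered from the higher-priority list are not registered again.
    return list(first) + [i for i in second if i not in seen]
```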

このようにして候補物品一覧Ｃが作成されると、次に、ＣＰＵ２００は、図１６のステップＳ４３で、候補物品一覧Ｃから第１候補である物品（本）２４を選出する。はじめのステップＳ４３では、候補物品一覧Ｃの先頭に登録されている物品（本）２４のＩＤを選出する。２回目以降のステップＳ４３では、候補物品一覧Ｃの２番目以降に登録されている物品（本）２４のＩＤを選出する。 Once the candidate article list C has been created, the CPU 200, in step S43 of FIG. 16, selects the first-candidate article (book) 24 from the candidate article list C. The first time step S43 is executed, it selects the ID of the article (book) 24 at the top of the candidate article list C; on the second and subsequent executions, it selects the IDs of the articles (books) 24 registered second, third, and so on.

In step S45, the CPU 200 determines whether an article (ID) was selected in step S43, that is, whether a candidate article (book) 24 exists. If YES in step S45, then in step S47 the CPU 200 acquires the position information of the article selected in step S43. Specifically, the CPU 200 searches the article local dictionary 205A using as the key the ID of the article (book) 24 selected in step S43 (hereinafter the "selected article (book) 24"), thereby obtaining the position of the article (book) 24 indicated by that ID.

Next, in step S49, the CPU 200 determines the content of the utterance with which the robot 12 will ask the person 16 to confirm whether the "selected article (book) 24" is the article (book) 24 that the person 16 designated. This utterance contains a word (hereinafter a "specific word") that identifies the article (book) 24 that the server 20 presumes the person 16 designated. Referring to FIG. 6, suppose the article (book) 24 that the server 20 estimates the person 16 designated is the book titled "Hanabi no Takumi" (literally "Fireworks Master") whose article ID is "123…0000046". In this case, step S49 decides on an utterance such as "That's the Hanabi no Takumi book, right?", "That's the black book, right?", or "That's the hardcover book, right?". In these examples, "Hanabi no Takumi", "black", and "hardcover" are the respective specific words. This utterance-content determination is executed according to the procedure shown in the flowchart of FIG. 22.

Referring to FIG. 22, first, in step S181, the CPU 200 refers to the personal correctness DB 205E held in the memory 204 and determines whether this is the first time the robot 12 is communicating with the recognized person 16, that is, whether this is the first time the system is identifying an article (book) 24 designated by that person 16. The personal correctness DB 205E is a record of the results of communication between the robot 12 and persons 16: for each communication, it records the ID of the person 16 who communicated with the robot 12 together with a success or failure entry indicating whether the system 10 correctly estimated the article (book) 24 that the person 16 designated. If the ID of the person 16 is not recorded in the personal correctness DB 205E, the person 16 is judged to be meeting the robot for the first time.

If YES in step S181, that is, if the person 16 is judged to be meeting the robot for the first time, then in step S183 the CPU 200 sets the specific word for the article (book) 24 to the title (name) of the book 24. The title (name) is obtained by searching the article local dictionary 205A for the ID of the article (book) 24 selected in step S43 of FIG. 16 (the "selected article (book) 24").

Then, in step S203, the CPU 200 uses the utterance dictionary 205D held in the memory 204 to determine utterance content such as "That's the Hanabi no Takumi book, right?" based on the title (name) of the book 24, for example "Hanabi no Takumi", which is the specific word determined in step S183. In step S203, the CPU 200 also generates the utterance content information that the robot 12 follows to speak the determined utterance.

If NO in step S181, that is, if the ID of the person 16 is recorded in the personal correctness DB 205E, then in step S185 the CPU 200 again refers to the personal correctness DB 205E, calculates the success rate of communication with that person 16, and determines whether the success rate is at or above a threshold, for example 70%. If NO in step S185, then, as described above, the specific word for the article (book) 24 is set to the title (name) of the book 24 in step S183, and the utterance content is determined in step S203.
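The branch taken in steps S181 and S185 can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the dictionary `history` is a hypothetical stand-in for the personal correctness DB 205E that maps a person ID to a list of booleans (True meaning the article was correctly identified), and the 70% threshold is the example value from the text.

```python
SUCCESS_THRESHOLD = 0.70  # example threshold from the text ("for example, 70%")


def use_title_as_specific_word(history, person_id, threshold=SUCCESS_THRESHOLD):
    """Decide whether step S183 (use the book title as the specific word) applies.

    history: hypothetical stand-in for the personal correctness DB 205E,
    mapping a person ID to a list of booleans (True = article correctly identified).
    """
    outcomes = history.get(person_id)
    if not outcomes:
        # S181: YES -> first meeting, so use the title.
        return True
    success_rate = sum(outcomes) / len(outcomes)
    # S185: NO (rate below threshold) -> use the title; YES -> go on to S187.
    return success_rate < threshold


history = {"p1": [True, True, True, False], "p2": [True, False]}
print(use_title_as_specific_word(history, "p0"))  # True  (first meeting)
print(use_title_as_specific_word(history, "p1"))  # False (75% >= 70%)
print(use_title_as_specific_word(history, "p2"))  # True  (50% < 70%)
```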

If YES in step S185, then in step S187 the CPU 200 creates, from the article local dictionary 205A, a list D (not shown) of words relating to the "selected article (book) 24". For example, referring to FIG. 6, if the selected article (book) 24 is the book 24 titled "Hanabi no Takumi" with article ID "123…0000046", the words registered in the record with that article ID in the article local dictionary 205A, namely "Hanabi no Takumi", "black", "hardcover", "Shiro Kondo", and "ATR Publishing Co., Ltd.", are entered in word list D.

After word list D has been created, in step S189 the CPU 200 deletes from word list D every word that is identical to a word relating to another article (book) 24 located near the person 16. The articles (books) 24 near the person 16 are those registered in the article local dictionary 205A. If the selected article (book) 24 is the book 24 "Hanabi no Takumi" with article ID "123…0000046" (FIG. 6), word list D contains, as noted above, the words "hardcover" and "ATR Publishing Co., Ltd.".

Suppose the specific word were set to "hardcover" or "ATR Publishing Co., Ltd." and the robot 12 said "That's the hardcover book, right?" or "That's the ATR Publishing book, right?". As the article local dictionary 205A in FIG. 6 shows, there are several hardcover books and several books from ATR Publishing Co., Ltd. near the person 16, so the person 16 could not tell from the utterance which article (book) 24 the robot 12 means. Therefore, every word that also appears in a record of the article local dictionary 205A other than the record of the selected article (book) 24 is deleted from word list D.
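Steps S187 and S189 amount to keeping only the words of the selected article's record that no other nearby article shares. In this minimal sketch, the dictionary `local_dict` is a hypothetical stand-in for the article local dictionary 205A (article ID mapped to the words in its record), and the IDs and the second book's entries are invented sample data, not the contents of FIG. 6.

```python
def distinctive_words(local_dict, selected_id):
    """Steps S187 and S189: word list D minus words shared with other nearby articles.

    local_dict: hypothetical stand-in for the article local dictionary 205A,
    mapping each nearby article ID to the words in its record.
    """
    words_d = list(local_dict[selected_id])  # S187: words of the selected article
    shared = set()
    for article_id, words in local_dict.items():
        if article_id != selected_id:
            shared.update(words)
    return [w for w in words_d if w not in shared]  # S189: drop shared words


# Hypothetical sample data.
local_dict = {
    "book-46": ["Hanabi no Takumi", "black", "hardcover", "Shiro Kondo",
                "ATR Publishing Co., Ltd."],
    "book-47": ["Another Title", "white", "hardcover", "ATR Publishing Co., Ltd."],
}
print(distinctive_words(local_dict, "book-46"))
# ['Hanabi no Takumi', 'black', 'Shiro Kondo']
```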

Next, in step S191, the CPU 200 deletes from word list D every word consisting of fewer than six phonemes, that is, every word that is decomposed into fewer than six phonemes in speech recognition processing, because words with few phonemes are difficult to recognize correctly. The number of phonemes in a word can be determined by counting the phoneme symbols in the word's phoneme symbol string, which is found by looking the word up in the speech recognition local dictionary 205B. Note, however, that the phoneme symbol strings shown for the speech recognition local dictionary 205B in FIG. 11 are written in romanized letters for convenience of explanation; the letters shown in FIG. 11 are not themselves phoneme symbols. Then, in step S193, the CPU 200 further deletes from word list D every word beginning with the hiragana "う" (u), because words beginning with "う" are difficult to recognize correctly.
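The two filters of steps S191 and S193 can be sketched as below. The dictionary `pron_dict` is a hypothetical stand-in for the speech recognition local dictionary 205B; it assumes each entry is a space-separated phoneme string and that a word beginning with "う" starts with the phoneme "u". The romanized entries are invented examples.

```python
MIN_PHONEMES = 6  # S191: words with fewer phonemes are dropped


def drop_hard_words(words_d, pron_dict):
    """Steps S191 and S193 applied to word list D.

    pron_dict: hypothetical stand-in for the speech recognition local dictionary
    205B, mapping a word to a space-separated phoneme string (assumed format).
    """
    kept = []
    for word in words_d:
        phonemes = pron_dict[word].split()
        if len(phonemes) < MIN_PHONEMES:  # S191: too few phonemes
            continue
        if phonemes[0] == "u":  # S193: assumes a word starting with "う" begins with "u"
            continue
        kept.append(word)
    return kept


# Hypothetical romanized entries.
pron = {
    "hanabi no takumi": "h a n a b i n o t a k u m i",  # 14 phonemes: kept
    "kuro": "k u r o",                                  # 4 phonemes: dropped (S191)
    "utsukushii hon": "u ts u k u sh i: h o N",         # starts with "u": dropped (S193)
}
print(drop_hard_words(list(pron), pron))  # ['hanabi no takumi']
```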

Next, in step S195, the CPU 200 determines whether any word remains in word list D. If NO in step S195, list D contains no word that can serve as the specific word, so in step S197 the demonstrative adjective "あの" ("that") is chosen as the specific word. Then, in step S203, the CPU 200 uses the utterance dictionary 205D to determine utterance content such as "That book, right?" based on "that", the specific word determined in step S197. Alternatively, the utterance may combine the demonstrative adjective with one of the words registered in word list D as it was created in step S187, for example "That white book, right?"; in this example, however, the phrase "white book" alone cannot uniquely identify the article (book) 24. In step S203, a demonstrative pronoun may also be used, giving an utterance such as "The white one there (this one), right?" or simply "That one (this one), right?".

If YES in step S195, then in step S199 the CPU 200 refers to the speech recognition rate DB 205F held in the memory 204, acquires the speech recognition rate of each word registered in word list D, and sorts the words in list D in descending order of recognition rate. In step S201, the word at the top of list D, that is, the word with the highest speech recognition rate, is chosen as the specific word. In step S203, the utterance dictionary 205D is used to determine the utterance content based on the specific word determined in step S201. In this way, the system 10 determines what the robot 12 will say to confirm with the person 16 the article (book) 24 that the system 10 estimated the person 16 designated.
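Steps S195 to S201 reduce to "fall back to a demonstrative if the list is empty, otherwise take the word with the highest recognition rate". In this minimal sketch, the dictionary `recognition_rate` is a hypothetical stand-in for the speech recognition rate DB 205F, and words missing from it are assumed to have a rate of 0.0.

```python
FALLBACK_WORD = "あの"  # S197: demonstrative adjective "that"


def choose_specific_word(words_d, recognition_rate):
    """Steps S195 to S201: pick the specific word from the filtered word list D.

    recognition_rate: hypothetical stand-in for the speech recognition rate
    DB 205F, mapping a word to its rate; unknown words are assumed to rate 0.0.
    """
    if not words_d:  # S195: NO -> no usable word remains
        return FALLBACK_WORD
    # S199: sort by recognition rate, highest first.
    ranked = sorted(words_d, key=lambda w: recognition_rate.get(w, 0.0), reverse=True)
    return ranked[0]  # S201: top word becomes the specific word


rates = {"hanabi no takumi": 0.62, "kondo shiro": 0.88}  # hypothetical rates
print(choose_specific_word(["hanabi no takumi", "kondo shiro"], rates))  # kondo shiro
print(choose_specific_word([], rates))  # あの
```

Using `max` would also work; sorting mirrors the text, which describes ordering list D before taking its top entry.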

When the robot 12 uses a specific word with a high speech recognition rate in the utterance that confirms, with the person 16, the article (book) 24 that the system 10 estimated the person 16 designated, the person 16 may imitate it. If the person 16 then uses this specific word the next time he or she designates that article (book) 24, the system 10 can identify the article (book) 24 by speech recognition more easily.

Returning to FIG. 16, once the utterance content has been determined in step S49, in step S51 the CPU 200 transmits to the robot 12 the position information of the "selected article (book) 24" acquired in step S47 and the utterance content information generated in step S203 (FIG. 22).

When the robot 12 receives the position information of the "selected article (book) 24" and the utterance content information, the CPU 80 determines YES in step S11 of FIG. 15. In the next step S13, the CPU 80 refers to the utterance/gesture dictionary 85A held in the memory 84, points at the "selected article (book) 24" based on its position information, and speaks based on the utterance content information. At this time, the CPU 80 converts the received position information of the "selected article (book) 24" from the world coordinate system to the robot coordinate system before using it.
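The world-to-robot conversion is not specified in detail here; a common choice is a planar rigid transform, sketched below. This assumes the robot knows its own pose (x, y, heading theta) in world coordinates and that its x axis points in the direction it faces; `world_to_robot` and its parameters are hypothetical names.

```python
import math


def world_to_robot(px, py, robot_x, robot_y, robot_theta):
    """Express world point (px, py) in the robot frame (planar rigid transform).

    Assumes robot_theta is the robot's heading in radians, measured in the
    world frame, and that the robot's x axis points where the robot faces.
    """
    dx, dy = px - robot_x, py - robot_y
    c, s = math.cos(robot_theta), math.sin(robot_theta)
    # Rotate the world-frame offset by -theta into the robot frame.
    return (c * dx + s * dy, -s * dx + c * dy)


# Robot at (1, 0) facing world +y; a book at (1, 2) is 2 units straight ahead.
x, y = world_to_robot(1.0, 2.0, 1.0, 0.0, math.pi / 2)
print(round(x, 6), round(y, 6))  # 2.0 0.0
```

The pointing direction in the robot frame would then follow from `math.atan2(y, x)`.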

If the "selected article (book) 24" is the book 24 titled "Hanabi no Takumi", then in step S13 the robot points with a gesture at the book placed second from the right in FIG. 1 and says something like "That's the Hanabi no Takumi book, right?".

In response to this pointing and utterance by the robot 12, the person 16 replies with something like "Yes, it is" or "No, it isn't". In the robot 12, the CPU 80 captures the voice of the person 16 input through the microphone 66 and determines in step S15 that the person 16 has responded. Then, in step S17, it transmits the captured voice information of the person 16 to the server 20, together with an instruction to recognize the speech and determine whether its content is affirmative or negative.

In the server 20, when the CPU 200 receives the instruction to determine by speech recognition whether the response is affirmative or negative, together with the voice information to be recognized, it determines YES in step S53 of FIG. 16. Next, in step S55, the CPU 200 performs speech recognition processing on the received voice information to determine whether its content indicates affirmation or denial, and in step S57 transmits the result to the robot 12.

When the robot 12 receives the result indicating whether the recognized response was affirmative or negative, the CPU 80 determines YES in step S19, and then, in step S21, determines whether the speech recognition result from the server 20 was affirmative. If NO in step S21, the CPU 80 returns to step S11 and determines whether further position information and utterance content information for a "selected article (book) 24" have been received from the server 20.

If YES in step S21, the robot 12 moves toward the corresponding article (book) 24, grasps it, and carries it to the position of the person 16. Since the coordinates of the position of the article (book) 24 are already known, the CPU 80 of the robot 12 controls the wheel motor 36 to move the robot 12 to the position of the article (book) 24, then controls the actuator 108 (FIG. 3) to open and close the hand 56R (or 56L; FIG. 2) so as to grasp the article (book) 24, and, while holding it, again controls the wheel motor 36 to move the robot 12 to the position of the person 16. In this way, in step S23, the "selected article (book) 24" chosen by the server 20 in step S43 of FIG. 16 is carried to the person 16.

Meanwhile, in the server 20, after transmitting to the robot 12 in step S57 of FIG. 16 the result of whether the speech of the person 16 was affirmative or negative, the CPU 200 determines in step S61 of FIG. 17 whether that result was affirmative. If NO in step S61, the CPU 200 returns to step S43 of FIG. 16 and selects the next candidate article (book) 24 from the candidate article list C created from the speech recognition result, the line-of-sight certainty, and the pointing-direction certainty.

At this time, if no further article (book) 24 ID is registered in candidate article list C, it is determined in step S45 that no article was selected (NO in step S45), and in step S59 the CPU 200 transmits to the robot 12 a notification that there is no candidate article.
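The server-side loop over candidate article list C (steps S43, S45, S61, and S59) can be sketched as a plain iteration. In this minimal sketch, the callable `ask_person` is a hypothetical placeholder bundling the robot's pointing and utterance and the server's affirmative/negative recognition (steps S49 to S57); position lookup and messaging are omitted.

```python
def confirm_candidates(candidates, ask_person):
    """Offer candidates in order until the person confirms one (S43 to S61).

    ask_person(article_id) -> bool is a hypothetical placeholder for the
    robot's pointing/utterance and the server's yes/no recognition (S49 to S57).
    Returns the confirmed article ID, or None when list C is exhausted (S59).
    """
    for article_id in candidates:   # S43: next candidate in list C
        if ask_person(article_id):  # S61: affirmative answer
            return article_id
    return None                     # S45: NO -> notify "no candidate article"


print(confirm_candidates(["b1", "b2", "b3"], lambda a: a == "b2"))  # b2
print(confirm_candidates(["b1"], lambda a: False))                # None
```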

When the robot 12 receives the notification that there is no candidate article, the CPU 80 determines YES in step S25 of FIG. 15 and ends the processing, treating the article (book) 24 designated by the person 16 as not identifiable.

Returning to FIG. 17, if YES in step S61, then in step S63 the CPU 200 stores the value "1" in the flag sc held in the memory 204, turning the flag sc on, to indicate that the article (book) 24 designated by the person 16 has been identified. The initial state of the flag sc is off (value "0").

Next, in step S65, the CPU 200 determines whether the flag nw held in the memory 204 is on. As described earlier, the flag nw is set on when speech recognition of the utterance of the person 16 leaves a word that could not be recognized and that word is likely to identify the article designated by the person 16.

If YES in step S65, then in step S67 the CPU 200 registers the word that could not be recognized in the article dictionary 122, updating the dictionary. Specifically, the phoneme symbol string of the unrecognized word, which was stored in the work area wa of the memory 204 in step S119 of FIG. 19, is registered in the article dictionary 122. For example, suppose the person 16 says "Bring me Hanataku", "Hanataku" cannot be recognized, but estimation based on line of sight and pointing reveals that the person 16 meant the book "Hanabi no Takumi". In this case the work area wa holds the phoneme symbol string "hanataku" for the word "Hanataku", so, as shown in FIG. 9, this phoneme symbol string is stored, together with the ID of the person 16 who uttered "Hanataku", in the "attribute" field of the record for the book 24 "Hanabi no Takumi" whose article ID is "123…0000046" in the article dictionary 122.

Next, in step S69, the CPU 200 registers the unrecognized word as a new record in the speech recognition dictionary 126, updating the dictionary. Continuing the example of step S67, as shown in FIG. 12, a record is added to the speech recognition dictionary 126 containing a field that stores "Hanabi no Takumi", the name of the article (book) 24 selected in step S43 of FIG. 16 (the "selected article (book) 24"), and a field that stores "hanataku", the phoneme symbol string held in the work area wa of the memory 204. The "article name" field of this record is set to "1", indicating that the word is the name of an article (the title of the book 24).

Once the word "Hanabi no Takumi" and the phoneme symbol string "hanataku" have been registered in the speech recognition dictionary 126 in this way, then from the next time on, when the person 16 says, for example, "Bring me Hanataku", "Hanataku" can be recognized, and "Hanabi no Takumi", the name of the article (book) 24, is obtained as the recognition result.
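The dictionary updates of steps S67 and S69, and their effect on later recognition, can be sketched with plain Python containers. This is an illustrative sketch only: the record layouts are hypothetical stand-ins for the article dictionary 122 and the speech recognition dictionary 126, and exact string matching on the phoneme string stands in for real acoustic decoding.

```python
def register_unknown_word(article_dict, recog_dict, article_id, phonemes, person_id):
    """Learn an unrecognized word after the article has been confirmed.

    article_dict / recog_dict: hypothetical stand-ins for the article
    dictionary 122 and the speech recognition dictionary 126.
    """
    record = article_dict[article_id]
    # S67: store the phoneme string and the speaker's ID as an article attribute.
    record["attributes"].append((phonemes, person_id))
    # S69: add a record mapping the new pronunciation to the article name.
    recog_dict.append({"word": record["name"], "phonemes": phonemes, "article_name": 1})


def recognize(recog_dict, phonemes):
    """Return the word whose registered phoneme string matches exactly (simplified)."""
    for entry in recog_dict:
        if entry["phonemes"] == phonemes:
            return entry["word"]
    return None


articles = {"book-46": {"name": "Hanabi no Takumi", "attributes": []}}  # hypothetical
recog = [{"word": "Hanabi no Takumi", "phonemes": "hanabinotakumi", "article_name": 1}]
print(recognize(recog, "hanataku"))  # None (not yet learned)
register_unknown_word(articles, recog, "book-46", "hanataku", "person-1")
print(recognize(recog, "hanataku"))  # Hanabi no Takumi
```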

In step S71, the CPU 200 updates the personal correctness DB 205E held in the memory 204. That is, it determines whether the flag sc held in the memory 204 is on. If it is on, which indicates that the article (book) 24 designated by the person 16 was identified, information indicating that the communication with that person 16 succeeded is stored in the personal correctness DB 205E. If the flag sc is off, information indicating that the communication with that person 16 failed is stored.

If NO in step S65, steps S67 and S69 are skipped.

As described above, the system 10 detects the position of the person 16 who designates an article (book) 24 and identifies the articles (books) 24 located near that person 16. It also extracts from the speech recognition dictionary 126 only the records for words related to the articles (books) 24 near the person 16, creating the speech recognition local dictionary 205B. Using this speech recognition local dictionary 205B, it recognizes the speech uttered by the person 16 and identifies the article (book) 24 that the person 16 designates. In other words, by limiting the words expected to appear in the utterance of the person 16 to those relating to the articles (books) 24 in the space near the person 16, the present invention optimizes the set of words in the dictionary used for speech recognition (the speech recognition local dictionary 205B). Compared with performing speech recognition using the speech recognition dictionary 126, which holds records for words related to all articles (books) 24 handled by the system 10, this shortens the time required for speech recognition processing and improves recognition accuracy.
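The extraction of the local dictionary can be sketched as a proximity filter. This section does not state how "near the person" is decided, so this minimal sketch assumes a fixed Euclidean radius (a hypothetical 2.0 units) on planar positions; the record layout with an `article_id` field and the sample data are likewise hypothetical.

```python
import math


def build_local_dictionary(recog_dict, article_positions, person_pos, radius=2.0):
    """Keep only recognition records for articles within `radius` of the person.

    recog_dict: hypothetical records, each tagged with the article it belongs to.
    article_positions: article ID -> (x, y) in world coordinates.
    radius: assumed proximity threshold (not specified in the text).
    """
    px, py = person_pos
    nearby = {
        article_id
        for article_id, (x, y) in article_positions.items()
        if math.hypot(x - px, y - py) <= radius
    }
    return [rec for rec in recog_dict if rec["article_id"] in nearby]


positions = {"book-46": (1.0, 0.5), "book-99": (8.0, 6.0)}  # hypothetical layout
recog = [
    {"article_id": "book-46", "word": "Hanabi no Takumi", "phonemes": "hanabinotakumi"},
    {"article_id": "book-99", "word": "Another Title", "phonemes": "anazataitoru"},
]
local = build_local_dictionary(recog, positions, person_pos=(0.0, 0.0))
print([rec["word"] for rec in local])  # ['Hanabi no Takumi']
```

The recognizer would then search only `local`, which is the source of the speed and accuracy gains claimed above.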

In the embodiment described above, the time required for speech recognition is shortened by creating, from the speech recognition dictionary 126, a separate speech recognition local dictionary 205B. Alternatively, instead of building the speech recognition local dictionary 205B separately, the records in the speech recognition dictionary 126 for words related to the articles (books) 24 near the person 16 may be flagged, and speech recognition may be performed by referring only to the flagged records. This likewise shortens the time required for speech recognition processing and improves recognition accuracy.

FIG. 1 is an illustrative view showing an outline of a communication robot system according to an embodiment of the present invention.
FIG. 2 is an illustrative view showing the appearance of the robot shown in FIG. 1, viewed from the front.
FIG. 3 is an illustrative view showing the electrical configuration of the robot shown in FIG. 1.
FIG. 4 is an illustrative view showing the electrical configuration of the server shown in FIG. 1.
FIG. 5 is an illustrative view showing an example of the article dictionary used in the embodiment of FIG. 1.
FIG. 6 is an illustrative view showing an example of the article local dictionary used in the embodiment of FIG. 1.
FIG. 7 is an illustrative view showing an example of a list of words used in the embodiment of FIG. 1.
FIG. 8 is an illustrative view showing an example of a list of words used in the embodiment of FIG. 1.
FIG. 9 is an illustrative view showing an example of the article dictionary used in the embodiment of FIG. 1.
FIG. 10 is an illustrative view showing an example of the speech recognition dictionary used in the embodiment of FIG. 1.
FIG. 11 is an illustrative view showing an example of the speech recognition local dictionary used in the embodiment of FIG. 1.
FIG. 12 is an illustrative view showing an example of the speech recognition dictionary used in the embodiment of FIG. 1.
FIG. 13 is an illustrative view showing a person's line of sight and pointing direction.
FIG. 14 is an illustrative view showing an example of the certainty table used in the embodiment of FIG. 1.
FIG. 15 is a flowchart showing the operation of the robot in the embodiment of FIG. 1.
FIG. 16 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 17 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 18 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 19 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 20 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 21 is a flowchart showing the operation of the server in the embodiment of FIG. 1.
FIG. 22 is a flowchart showing the operation of the server in the embodiment of FIG. 1.

Explanation of symbols

10 ... Communication robot system
12 ... Communication robot
14 ... Network
18 ... Wireless tag
20 ... Server
24 ... Article (book)
80 ... CPU
120 ... Camera
124 ... Antenna
200 ... CPU
208 ... Wireless tag reader

Claims

A speech recognition system that recognizes speech uttered by a person and identifies an article designated by the person, comprising:
position detecting means for detecting the position of the person;
nearby article identifying means for identifying articles located near the person based on the position of the person detected by the position detecting means;
dictionary construction means for constructing a first speech recognition dictionary relating to the articles identified by the nearby article identifying means; and
designated article identifying means for identifying the article designated by the person by performing speech recognition using the first speech recognition dictionary constructed by the dictionary construction means.

The speech recognition system according to claim 1, wherein the dictionary construction means constructs the first speech recognition dictionary based on a second speech recognition dictionary covering all articles handled by the speech recognition system.

A speech recognition system that recognizes speech uttered by a person and identifies an article designated by the person, comprising:
position information storage means for storing, for each article handled by the speech recognition system, position information indicating the position of the article;
speech recognition information storage means for storing, for each article handled by the speech recognition system, speech recognition information including words related to the article and the phoneme symbol strings of those words;
word identifying means for identifying a word related to an article by referring to the speech recognition information stored in the speech recognition information storage means, based on a phoneme symbol string obtained by speech recognition processing;
article identifying means for identifying the article designated by the person based on the word identified by the word identifying means;
position detecting means for detecting the position of the person; and
nearby article identifying means for identifying articles located near the person based on the position of the person detected by the position detecting means and the position information stored in the position information storage means,
wherein the word identifying means identifies a word by referring only to the speech recognition information containing words related to the articles identified by the nearby article identifying means.

The speech recognition system according to claim 3, further comprising sub speech recognition information storage means for extracting from the speech recognition information storage means, and storing, only the speech recognition information that includes words related to the articles identified by the nearby article identifying means, wherein the word identifying means refers to the speech recognition information stored in the sub speech recognition information storage means.