JP2008126329A - Speech recognition robot and method for controlling speech recognition robot

Info

Publication number: JP2008126329A
Application number: JP2006311482A
Authority: JP
Inventors: Ryo Murakami (村上 涼)
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-11-17
Filing date: 2006-11-17
Publication date: 2008-06-05

Abstract

Problem to be solved: To provide a speech recognition robot with improved performance in recognizing speech uttered by a speaker, and a method for controlling the speech recognition robot.
Solution: The speech recognition robot includes a sound source specifying unit that specifies the direction from which a voice originates; a voice acquisition unit; a speech recognition unit that recognizes the content of the voice; a measurement unit that measures the sound pressure of the acquired voice; an imaging unit that captures an image in the direction from which the acquired voice originated; a face detection unit that detects a person's face in the captured image; an extraction unit that extracts the lips from the detected face; a specifying unit that identifies the center point of the extracted lips; and a holding unit that holds the voice acquisition unit and can change its posture through at least one of an extension/retraction motion and a change in joint angle. The posture of the holding unit is changed so that the relative position between the center point of the speaker's lips and the voice acquisition unit is corrected based on the measured sound pressure of the voice.
Selected drawing: FIG. 7

Description

The present invention relates to a speech recognition robot that recognizes input voice information and responds to its content, and to a method for controlling such a speech recognition robot.

Speech recognition robots that recognize the content of speech uttered by a human and respond by outputting an appropriate spoken answer are generally known. Such a robot acquires the human utterance as voice data and performs speech recognition by analyzing that data, so the uttered voice must be acquired accurately.

To that end, techniques are used such as providing a noise-removing filter in the sound receiving unit (a microphone or the like) that picks up the voice, or giving the sound receiving unit directivity so that it acquires only sound from a source located in a specific direction (see, for example, Patent Documents 1 and 2).
Patent Document 1: JP 2006-171719 A
Patent Document 2: JP 2006-251266 A

However, although noise removal using a filter as described above may allow the voice to be acquired with reasonable accuracy when the relative position between the speech recognition robot and the person speaking is fixed, the loudness (sound pressure) of the voice acquired by the robot is unstable when that relative position is not fixed. Furthermore, applying noise removal to a voice with insufficient sound pressure can lower speech recognition performance.

In addition, when all or part of the speech recognition robot is movable, the robot must for safety reasons be kept at least a certain distance from the human speaker, which makes it particularly difficult to obtain sufficient speech recognition performance.

The present invention has been made to solve these problems, and its object is to provide a speech recognition robot with improved performance in recognizing speech uttered by a speaker, and a method for controlling the speech recognition robot.

A speech recognition robot according to the present invention comprises: a sound source specifying unit that specifies the direction from which a voice originates; a voice acquisition unit that acquires the voice; a speech recognition unit that recognizes the content of the voice acquired by the voice acquisition unit; a measurement unit that measures the sound pressure of the acquired voice; an imaging unit that captures an image in the direction from which the acquired voice originated and creates image data of the captured image; a face detection unit that detects a person's face in the created image data; an extraction unit that extracts the lips from the detected face; a specifying unit that identifies the center point of the extracted lips; and a holding unit that holds the voice acquisition unit and can change its posture through at least one of an extension/retraction motion and a change in joint angle. The robot is characterized in that the imaging unit captures an image in the direction specified by the sound source specifying unit, the face detection unit detects a person's face in the captured image, the center point of the lips extracted from the detected face is identified, the distance between the identified lip center point and the position of the voice acquisition unit is calculated, the position of the voice acquisition unit relative to the lip center point is determined based on the calculated distance and the sound pressure measured by the measurement unit, and the posture of the holding unit is changed so as to move the voice acquisition unit to the determined position.

With such a speech recognition robot, the position of the voice acquisition unit can be set so that the distance between the speaker's lips and the voice acquisition unit yields a predetermined, appropriate sound pressure at the voice acquisition unit. Because the voice passed to the speech recognition unit is thereby held at a constant sound pressure, the performance of the speech recognition unit in recognizing the speaker's voice can be improved.

Such a speech recognition robot may also store a target sound pressure, obtain the optimum distance between the lip center point and the voice acquisition unit using the difference between the measured sound pressure and the target sound pressure as a parameter, and set the position of the voice acquisition unit accordingly. Such a robot can position the voice acquisition unit according to the loudness of the speaker's voice and acquire the voice at the target sound pressure, which further improves speech recognition performance.

The voice acquisition unit may be a directional microphone, and the posture of the holding unit may be changed so that the lip center point lies on the line extending from the tip of the voice acquisition unit along its direction of directivity. This allows the voice acquisition unit to be positioned with its directivity taken into account, so that even when the speaker's voice has a fairly low sound pressure, the voice acquisition unit picks it up with higher sensitivity and a voice of sufficient sound pressure is easier to obtain.

The sound source specifying unit preferably comprises one or more directional microphones. Such a sound source specifying unit can specify more accurately the direction of the voice relative to the robot.

The extraction unit may obtain the contour of the lips and take the centroid defined by that contour as the center point. That is, the lip contour is obtained in a plane, the centroid of the region enclosed by the contour drawn on that plane is determined, and that centroid is taken as the lip center point. In this way the lip center point can be obtained easily.

Such a speech recognition robot is preferably arranged to change its imaging direction so that the detected face remains approximately at the center of the captured image. This keeps the detected speaker within the imaging area, which reduces the chance of losing track of the speaker and the need to detect the speaker anew.
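
A minimal sketch of this centering behavior, assuming a simple proportional correction in which the horizontal offset of the face from the image center is converted into a head rotation angle. The function name, the pixel and field-of-view parameters, the linear pixel-to-angle mapping, and the sign convention are all assumptions made for illustration; the source does not specify how the imaging direction is updated.

```python
def pan_correction_deg(face_x_px, image_width_px, horizontal_fov_deg):
    """Return the pan angle (degrees) that would move the face toward the image center.

    Assumption: pixels map linearly onto the horizontal field of view, which is
    only a rough approximation for real lenses. Positive values mean the face is
    to the right of center in the image.
    """
    if image_width_px <= 0:
        raise ValueError("image width must be positive")
    offset_px = face_x_px - image_width_px / 2.0
    return offset_px / image_width_px * horizontal_fov_deg


# Hypothetical example: a face at x = 480 in a 640-pixel-wide image, 60 degree field of view.
print(pan_correction_deg(480, 640, 60.0))
```

Applying a correction of this kind on every frame would keep the face near the center as long as the speaker moves slowly relative to the frame rate.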

It is more preferable that such a speech recognition robot further includes moving means and is configured to be movable within a predetermined area. Because such a robot can move toward and away from the speaker, it can more readily acquire the voice at an appropriate sound pressure.

Furthermore, the holding unit is preferably an arm including drive-controlled joints, with the voice acquisition unit held at the tip of the arm. Driving the arm as the holding unit then allows the distance between the speaker and the voice acquisition unit to be kept at an appropriate value.

The present invention also provides a method for controlling a speech recognition robot. The method is characterized by comprising the steps of: specifying the direction from which a voice originates; acquiring the voice via a voice acquisition unit; recognizing the content of the acquired voice; measuring the sound pressure of the acquired voice; capturing an image in the direction from which the acquired voice originated and creating image data of the captured image; detecting a person's face in the created image data; extracting the lips from the detected face; identifying the center point of the extracted lips; calculating the distance between the identified lip center point and the voice acquisition unit; determining the position of the voice acquisition unit relative to the lip center point based on the calculated distance and the sound pressure measured by the measurement unit; and moving the voice acquisition unit to the determined position.

With such a control method, the position of the voice acquisition unit can be set so that the distance between the speaker's lips and the voice acquisition unit yields a predetermined, appropriate sound pressure at the voice acquisition unit. Because the voice passed to the speech recognition unit is thereby held at a constant sound pressure, the performance of the speech recognition unit in recognizing the speaker's voice can be improved.

In such a control method, the voice acquisition unit may have directivity, and the method may further comprise a step of setting the position of the voice acquisition unit so that the lip center point lies on the line extending from the tip of the voice acquisition unit along its direction of directivity. Because the voice acquisition unit can then be positioned with its directivity taken into account, the voice can be acquired at high sensitivity even when the speaker's voice has a fairly low sound pressure.

As described above, the present invention can provide a speech recognition robot with improved performance in recognizing speech uttered by a speaker, and a control method for controlling the speech recognition robot.

Embodiment 1 of the Invention
A speech recognition robot according to a first embodiment of the present invention is described below with reference to FIGS. 1 to 7.

FIG. 1 shows a room R in which a person who is a speaker is present and in which a speech recognition robot 1 is placed. As shown in FIG. 2, the speech recognition robot 1 of FIG. 1 is a humanoid robot built like a human upper body, comprising a torso 10 fixed to the ground and a head 11, a right arm 12, and a left arm 13 connected to the torso 10. Each component is described in detail below.

The torso 10 is installed on the floor of the room R and houses a control unit 100 that controls the motion and other functions of the speech recognition robot 1. The control unit 100 is a control computer, including an arithmetic processing unit, memory, and the like, that recognizes the content of the input signal from the microphone serving as the voice acquisition unit (described later), selects appropriate response data, and then outputs that response data as speech. The detailed configuration of the control unit 100 is described later.

The head 11 includes imaging units 111 and 112 for imaging a predetermined range in front of the speech recognition robot 1, sound source specifying units 113 and 114 for picking up sounds in the surroundings, and a speaker 115 for uttering words to the outside. The imaging units 111 and 112 are optical cameras, mounted on the left and right of the front of the head 11, each of which acquires optical information over a predetermined range, creates imaging data, and outputs it to the control unit 100.

The sound source specifying units 113 and 114 each consist of a plurality of directional microphones, each able to pick up sound from a particular direction, arranged horizontally. They can roughly determine the direction, relative to the speech recognition robot 1, from which a voice uttered nearby has arrived. The sound source specifying units 113 and 114 are mounted on the left and right sides of the head 11; they determine the direction relative to the speech recognition robot 1 from which a voice uttered around the robot originated and output it to the control unit 100 described later.
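
The source does not say how the direction is derived from the individual microphones. One simple possibility, shown here purely as an assumption, is to take the bearing of whichever directional microphone reports the highest level. The function name and the dictionary layout (bearing in degrees mapped to measured level) are hypothetical.

```python
def loudest_direction(levels_by_bearing):
    """Return the bearing (degrees, relative to the robot) with the highest level.

    Assumption: each directional microphone is labeled by the bearing it points
    toward, and a coarse estimate is simply the bearing of the loudest one.
    """
    if not levels_by_bearing:
        raise ValueError("at least one microphone reading is required")
    return max(levels_by_bearing, key=levels_by_bearing.get)


# Hypothetical readings from four horizontally arranged microphones.
readings = {-90: 0.1, -45: 0.4, 0: 0.9, 45: 0.3}
print(loudest_direction(readings))
```

An estimate of this kind is only as fine as the angular spacing of the microphones, which is consistent with the "rough" direction finding described above.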

The speaker 115, mounted on the lower front of the head 11, outputs the response data created by the control unit 100 to the outside in a predetermined direction and at a predetermined volume.

The head 11 is connected to the torso 10 so that it can rotate left and right in a plane parallel to the floor. By rotating the head 11, the robot can change the imaged range as the situation requires and image its surroundings.

For the right arm 12 and the left arm 13, an arithmetic processing unit (not shown) in the control unit 100 controls, according to a predetermined control program, the amount by which the joints in each arm are driven; the arms take a desired position and posture once the drive angle of each joint is determined. Hand portions 12a and 13a capable of gripping objects are provided at the tips of the right arm 12 and the left arm 13, and the hand portion 12a of the right arm 12 grips a directional microphone 12b with its tip tilted slightly away from the torso 10. The microphone 12b serves as the voice acquisition unit that acquires voices uttered around the speech recognition robot 1; in this embodiment, the right arm 12 gripping the microphone 12b corresponds to the holding unit. In each of the right arm 12 and the left arm 13, the upper arm member and the lower arm member, and the lower arm member and the hand portion, are connected by joint members that can be driven by commands from the control unit 100, so the arms can take any posture within the range of motion of the joint members.

As shown in FIG. 3, the control unit 100 includes: a speech recognition unit 101 that creates voice data from the voice acquired by the microphone 12b and recognizes its content; a measurement unit 102 that measures the sound pressure (sound intensity) of the voice from the created voice data; a face detection unit 103 that detects a person's face in the images captured by the imaging units 111 and 112 provided in the head 11; a determination unit 104 that determines the position and orientation of each face detected by the face detection unit 103; an extraction unit 105 that partially extracts the lips from the detected face; a specifying unit 106 that identifies the center point of the extracted lips; a speech synthesis unit 107 that creates the response data to be output as speech; and a storage area 108 that stores predetermined programs and data.

The speech recognition unit 101 converts the voice input through the microphone 12b into voice data, such as a WAVE file, via an A/D converter (not shown), divides the voice data into syllables, and replaces the syllables with words using a word database stored in the storage area 108. It then analyzes the words contained in the voice data and their order, and selects, from the many sentences stored in the storage area 108, the sentence closest to the analyzed voice data. If the degree of similarity between the selected sentence and the voice data is at or above a predetermined value, the unit treats the analyzed voice data as having the same content as the selected sentence and outputs a signal indicating that the acquired voice equals the selected sentence. If even the closest sentence falls short of the predetermined degree of similarity, the unit concludes that no matching sentence is stored in the storage area and outputs a signal indicating that the content of the acquired voice could not be recognized.
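
The closest-sentence selection with a similarity threshold can be sketched as follows. The source does not define the similarity measure; this sketch substitutes the word-sequence ratio from Python's standard `difflib.SequenceMatcher` as a stand-in. The function name, the example sentences, and the threshold of 0.8 are assumptions for illustration.

```python
from difflib import SequenceMatcher


def match_sentence(recognized_words, stored_sentences, threshold=0.8):
    """Return the stored sentence closest to the recognized word sequence.

    Returns None when even the closest sentence falls below `threshold`,
    which corresponds to the "could not be recognized" outcome.
    Assumption: similarity is the SequenceMatcher ratio over word lists.
    """
    best_sentence, best_score = None, 0.0
    for sentence in stored_sentences:
        score = SequenceMatcher(None, recognized_words, sentence.split()).ratio()
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence if best_score >= threshold else None


stored = ["what time is it", "turn on the light"]
print(match_sentence(["what", "time", "is", "it"], stored))  # exact match is returned
print(match_sentence(["open", "the", "door"], stored))       # below threshold, so None
```

Comparing word lists rather than raw strings mirrors the description above, in which both the words and their order are analyzed.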

The measurement unit 102 obtains the sound pressure (loudness) of the voice from the amplitude of the voice data created by the speech recognition unit 101, and does so as a time series for the incoming voice. The resulting sound pressure data is stored in the storage area 108.
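
One common way to turn sample amplitudes into a time series of levels is the root-mean-square (RMS) amplitude over short frames. The sketch below assumes that approach; the source only says that the sound pressure is obtained from the amplitude of the data. The function name and the non-overlapping frame layout are assumptions.

```python
import math


def frame_rms_levels(samples, frame_len):
    """Return one RMS amplitude per non-overlapping frame of `samples`.

    Assumption: RMS amplitude per frame stands in for the measured sound
    pressure. A trailing partial frame, if any, is ignored.
    """
    if frame_len <= 0:
        raise ValueError("frame length must be positive")
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        levels.append(math.sqrt(mean_square))
    return levels


# A loud frame followed by a silent one.
print(frame_rms_levels([1, -1, 1, -1, 0, 0, 0, 0], 4))  # [1.0, 0.0]
```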

The face detection unit 103 estimates edges corresponding to facial contours in the image data obtained by the imaging units 111 and 112, and detects the region enclosed by each estimated contour as a person's face.

The face detection unit 103 can also estimate which direction each detected face is pointing as seen from the robot, based on the positions of the eyes in that face, that is, their distance and direction relative to the robot. Specifically, as shown in FIG. 4, the center positions E10 and E11 of the eyes in the person's face (right eye E1 and left eye E2) are identified, along with the midpoint M of the line segment connecting them. Then, in the plane that contains this line segment and is parallel to the floor, the direction D from the midpoint M perpendicular to the segment is obtained, and this direction D is taken as the gaze direction of the eyes (right eye E1, left eye E2), that is, the direction in which the face containing these eyes is pointing. The direction of each detected face is obtained in this way, and a signal describing each direction is output together with the position of each face relative to the robot.
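
The construction of the midpoint M and the perpendicular direction D can be sketched in a top-down (floor-parallel) 2D coordinate system. A segment has two perpendiculars, and the source does not say how the forward one is chosen; the sketch assumes the eyes are labeled anatomically (the person's own right and left) and that the coordinate system is the usual right-handed one, so that rotating the right-to-left vector clockwise by 90 degrees points forward. The function name and coordinate convention are assumptions.

```python
import math


def face_direction(right_eye, left_eye):
    """Return (midpoint M, unit direction D) for a face seen from above.

    right_eye, left_eye: (x, y) eye centers in a floor-parallel plane.
    Assumption: right-handed axes and anatomical eye labels, so D is the
    right-to-left vector rotated clockwise by 90 degrees.
    """
    dx = left_eye[0] - right_eye[0]
    dy = left_eye[1] - right_eye[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("eye centers must be distinct")
    midpoint = ((right_eye[0] + left_eye[0]) / 2.0,
                (right_eye[1] + left_eye[1]) / 2.0)
    direction = (dy / norm, -dx / norm)
    return midpoint, direction


# A person at y = 2 looking toward +y: right eye at x = +1, left eye at x = -1.
print(face_direction((1.0, 2.0), (-1.0, 2.0)))  # ((0.0, 2.0), (0.0, 1.0))
```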

The determination unit 104 determines which faces are directed at the robot itself, from the position relative to the robot and the facing direction of each face in the captured image as detected by the face detection unit 103. Specifically, as shown in FIG. 5, taking the robot's own position P (for example, the center point of the head 11) as the reference, it combines the center position of each face, determined from the positions of the eyes (right and left), with the direction that face is pointing, and judges whether the result includes the robot's own position. The facing direction of each face is given a predetermined width: specifically, a small angle (for example, 5 degrees) to the left and right of each direction, in the plane parallel to the floor. In this way it is judged whether each face is turned toward the robot.
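
This test can be sketched as an angle comparison in the floor-parallel plane: the face counts as turned toward the robot when the angle between its facing direction and the line from the face to the position P is within the tolerance. The function name and argument layout are assumptions; the 5 degree default follows the example value in the text.

```python
import math


def faces_robot(face_center, face_dir, robot_pos, tolerance_deg=5.0):
    """Return True if the robot position lies within +/- tolerance of the facing direction.

    face_center, robot_pos: (x, y) positions in a floor-parallel plane.
    face_dir: (x, y) facing direction of the face (need not be unit length).
    """
    to_robot = (robot_pos[0] - face_center[0], robot_pos[1] - face_center[1])
    dist = math.hypot(to_robot[0], to_robot[1])
    norm = math.hypot(face_dir[0], face_dir[1])
    if dist == 0 or norm == 0:
        raise ValueError("positions must differ and direction must be non-zero")
    cos_angle = (to_robot[0] * face_dir[0] + to_robot[1] * face_dir[1]) / (dist * norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= tolerance_deg


# A face 2 m away looking straight at the robot, and one looking sideways.
print(faces_robot((0.0, 2.0), (0.0, -1.0), (0.0, 0.0)))  # True
print(faces_robot((0.0, 2.0), (1.0, 0.0), (0.0, 0.0)))   # False
```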

The extraction unit 105 partially extracts the lips from a face in the image data, detected by the face detection unit 103, that the determination unit 104 has judged to be turned toward the robot. If several faces have been judged to be turned toward the robot, it selects the one nearest the robot and partially extracts the lips from the selected face.

As a concrete method of extracting the lips, the extraction unit 105 recognizes as lips the part of the region inside the facial contour that approximately matches any of a plurality of lip data stored in advance, and extracts that part as the lips in the face.

The specifying unit 106 identifies, from the contour of the lips extracted by the extraction unit 105, the centroid defined by that contour as the center point of the lips. Specifically, the contour of the extracted lips is obtained in a plane, the centroid of the region enclosed by the contour drawn on that plane is determined, and that centroid is taken as the lip center point. The lip center point identified by the specifying unit 106 is stored in the storage area 108 as coordinates expressing its position relative to the position of the speech recognition robot 1 as the reference point.
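
If the lip contour is available as a closed polygon, the centroid of the enclosed region follows from the standard shoelace formulas. The sketch below assumes such a polygonal contour; the function name and the point format are illustrative, and the source does not state which centroid computation is used.

```python
def polygon_centroid(points):
    """Return the (x, y) centroid of the region enclosed by a simple polygon.

    points: list of (x, y) vertices in order around the contour (either
    orientation). Uses the shoelace formulas for area and centroid.
    """
    n = len(points)
    if n < 3:
        raise ValueError("a contour needs at least three points")
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("contour encloses no area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


# A 4 x 2 rectangle standing in for a lip contour.
print(polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))  # (2.0, 1.0)
```

Because this is the centroid of the enclosed area rather than the average of the contour points, it is not biased by uneven spacing of points along the contour.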

The speech synthesis unit 107 reads, from the many response sentence data stored in advance in the storage area, the response sentence data that best corresponds to the content of the acquired voice as recognized by the speech recognition unit 101, converts it into an audio file, and outputs it to the outside through the speaker 115. The speech recognition robot 1 configured in this way images the people located near its front, identifies among the several people in the captured image the person to whom it should respond, recognizes the content of the voice uttered by that person, and outputs speech appropriate to that content.

The storage area 108 stores one or more programs for operating the joint members of the speech recognition robot 1, including those of the arms and neck, and many kinds of response sentences for speech output. It also stores the length, directivity, and other properties of the microphone 12b gripped by the right arm 12, and newly stores additional information such as the lip center point described above. The storage area 108 further stores a target sound pressure, which is the target value for the loudness (sound pressure) of the voice acquired from the speaker by the microphone 12b, and a basic distance between the lips and the microphone; from these, using a predetermined formula, the actual lip-to-microphone distance that brings the measured sound pressure to the target sound pressure is calculated. Specifically, the lip-to-microphone distance that makes the loudness of the acquired voice equal to the target sound pressure is calculated from the relationship between the measured sound pressure and the lip-to-microphone distance given by Equation 1.
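
Equation 1 itself is not reproduced in this text. As an illustration only, the sketch below assumes the free-field inverse-distance law for a small sound source, under which sound pressure is inversely proportional to distance, so that the product of pressure and distance stays constant. This may differ from the patent's Equation 1; the function and parameter names are hypothetical.

```python
def required_distance(current_distance, measured_pressure, target_pressure):
    """Return the lip-to-microphone distance expected to yield the target pressure.

    Assumption (not from the source): pressure is proportional to 1 / distance,
    so measured_pressure * current_distance == target_pressure * new_distance.
    All quantities must be positive; pressures share any one linear unit.
    """
    if current_distance <= 0 or measured_pressure <= 0 or target_pressure <= 0:
        raise ValueError("distance and pressures must be positive")
    return current_distance * measured_pressure / target_pressure


# Measured pressure is half the target at 0.6 m, so the microphone should halve the distance.
print(required_distance(0.6, 0.5, 1.0))
```

In a real room, reflections and the directivity of the microphone mean the relationship is only approximately inverse, so a result like this would normally be refined by measuring again after the move.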

The voice recognition robot 1 configured in this way uses the difference between the measured sound pressure and the target sound pressure as a parameter to determine a position of the microphone 12b at which the distance between the microphone 12b and the center point 201a of the lips 201 of the speaker 200 takes a predetermined appropriate value. In addition to the condition on this distance, the robot takes into account the direction of the directivity of the microphone 12b relative to the lips 201, and changes its own posture, including that of the right arm 12 holding the microphone 12b. The posture is changed mainly by having an arithmetic processing unit (not shown) of the control unit 100 drive the joints of the right arm 12 holding the microphone 12b. The procedure is described with reference to the schematic diagram in FIG. 6 and the flowchart in FIG. 7. FIG. 6 is a schematic diagram showing a speaker 200 near the voice recognition robot 1 uttering a voice. As shown in FIG. 6, the voice recognition robot 1 adjusts the position of the tip of the microphone 12b and the inclination of the microphone 12b by changing the angle θ1 formed by the upper arm and the lower arm of the right arm 12 and the angle θ2 formed by the lower arm and the hand portion 12a that grips the microphone 12b. FIG. 7 is a flowchart showing the procedure from when the voice recognition robot 1 detects the speaker's face until it changes its posture. Details follow.
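How θ1 and θ2 jointly set the tip position and the inclination of the microphone can be illustrated with a planar sketch. The geometry here is assumed for illustration and is not taken from the patent: the arm moves in a single vertical plane, the upper arm is held in a fixed direction, and the link lengths and the function name `mic_tip_pose` are hypothetical.

```python
import math

def mic_tip_pose(theta1_deg, theta2_deg,
                 upper_arm_dir_deg=-90.0,  # assumed: upper arm hangs straight down
                 upper_len=0.30, lower_len=0.25, hand_len=0.15):  # assumed lengths [m]
    """Return ((x, y) of the microphone tip, inclination in degrees), shoulder at origin.

    theta1 is the interior angle between upper arm and lower arm, theta2 the
    interior angle between lower arm and hand/microphone. 180 degrees means
    the two links are in a straight line.
    """
    lower_dir = upper_arm_dir_deg + (180.0 - theta1_deg)  # absolute direction of lower arm
    hand_dir = lower_dir + (180.0 - theta2_deg)           # absolute direction of hand + microphone
    x = y = 0.0
    for length, direction in ((upper_len, upper_arm_dir_deg),
                              (lower_len, lower_dir),
                              (hand_len, hand_dir)):
        x += length * math.cos(math.radians(direction))
        y += length * math.sin(math.radians(direction))
    return (x, y), hand_dir
```

In this toy model, changing θ1 mostly moves the tip, while θ2 mostly changes the inclination, which is why both angles have to be chosen together.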

First, with power supplied to the voice recognition robot 1 and the robot ready to acquire voices from its surroundings, a speaker 200 near the robot speaks to it. The voice recognition robot 1 acquires the uttered voice with the microphone 12b, which serves as the voice acquisition unit, and the sound source specifying units 113 and 114 specify the direction from which the voice came (the direction relative to the robot) (step 101). The voice recognition unit 101 converts the voice acquired by the microphone 12b into voice data, and the measurement unit 102 starts measuring the sound pressure of the acquired voice from that voice data (step 102). At this time, the head 11 is rotated so that its front faces the direction specified by the sound source specifying units 113 and 114, and imaging starts.

The image data captured by the imaging units 111 and 112 is input to the control unit 100, and the face detection unit 103 determines whether a human face can be detected in the image data (step 103). If at least one face is detected, the discriminating unit 104 determines the orientation of each detected face and judges whether any face is turned toward the voice recognition robot 1 (step 104). If there are faces turned toward the voice recognition robot 1, the distance from the robot to each of them is obtained and the nearest face is selected (step 105). If no face is turned toward the voice recognition robot 1, the robot judges that it is not being spoken to and returns to the state of waiting to acquire a voice.
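The selection logic of steps 103 to 105 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `DetectedFace` record, the yaw-angle representation of face orientation, and the 20-degree threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class DetectedFace:
    distance_m: float  # distance from the robot to the face
    yaw_deg: float     # assumed orientation measure; 0 means looking straight at the robot

def select_speaker_face(faces: Sequence[DetectedFace],
                        max_yaw_deg: float = 20.0) -> Optional[DetectedFace]:
    """Return the nearest face turned toward the robot, or None if there is none."""
    # Step 104: keep only faces turned toward the robot.
    facing = [f for f in faces if abs(f.yaw_deg) <= max_yaw_deg]
    if not facing:
        return None  # not being spoken to; the caller returns to waiting for a voice
    # Step 105: among those, pick the closest one.
    return min(facing, key=lambda f: f.distance_m)
```

Returning `None` corresponds to the branch in which the robot judges that nobody is addressing it.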

Once the nearest face turned toward the voice recognition robot 1 has been identified, the portion corresponding to the lips is extracted from the region that the face occupies in the image data (step 106). The center point 201a of the extracted lips is then specified and its position coordinates (x_t, y_t, z_t) are calculated, after which the distance from the voice recognition robot 1 to the point on the lips at those coordinates is obtained by triangulation using the images from the imaging units 111 and 112 (step 107). This specifies the position of the speaker's lip center point 201a relative to the voice recognition robot 1.
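The patent does not spell out the triangulation in step 107. One common form, shown here only as a sketch, assumes a rectified pinhole stereo pair in which the two imaging units are separated horizontally by a known baseline. The function name, the camera parameters, and the choice of the left camera as the reference frame are all assumptions.

```python
import math

def triangulate_lip_center(u_left, u_right, v, focal_px, baseline_m, cx, cy):
    """Return (x, y, z) of the lip center in the left camera frame, in metres.

    u_left, u_right: horizontal pixel coordinate of the lip center in each image.
    v: vertical pixel coordinate (same in both images for a rectified pair).
    focal_px: focal length in pixels; (cx, cy): principal point in pixels.
    Axes: x to the right, y downward, z forward along the optical axis.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity  # depth from similar triangles
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)

def distance_to(point):
    """Euclidean distance from the camera origin to the triangulated point."""
    return math.dist((0.0, 0.0, 0.0), point)
```

A smaller disparity means a more distant speaker, so the depth estimate becomes less precise as the speaker moves away.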

In parallel with this operation of continuing to image the speaker's face and specifying the position of the lip center point, the robot continuously acquires the voice coming from the direction specified by the sound source specifying units 113 and 114, that is, from the speaker's position, and the voice recognition unit 101 creates voice data from the acquired voice (step 201). The measurement unit 102 measures the sound pressure of the acquired voice from the created voice data and stores the measured sound pressure as a time series (step 202).
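Steps 201 and 202 can be sketched as below. How the measurement unit 102 derives sound pressure from the voice data is not stated, so this sketch assumes an RMS over each frame of samples scaled by a hypothetical calibration factor `pa_per_unit` (pascals per sample unit). The class and function names are illustrative only.

```python
import math
from collections import deque

REFERENCE_PRESSURE_PA = 20e-6  # standard reference pressure for dB SPL in air

def rms_pressure_pa(samples, pa_per_unit):
    """RMS sound pressure of one frame, using an assumed microphone calibration factor."""
    if not samples:
        raise ValueError("frame is empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) * pa_per_unit

def spl_db(p_rms_pa):
    """Convert an RMS pressure in pascals to a sound pressure level in dB SPL."""
    if p_rms_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(p_rms_pa / REFERENCE_PRESSURE_PA)

class SoundPressureLog:
    """Keeps the most recent (timestamp, pressure) pairs as a time series."""
    def __init__(self, max_frames=100):
        self.history = deque(maxlen=max_frames)

    def record(self, timestamp_s, samples, pa_per_unit):
        pressure = rms_pressure_pa(samples, pa_per_unit)
        self.history.append((timestamp_s, pressure))
        return pressure
```

Keeping a bounded history lets later steps use the most recent measurements without the log growing while the speaker keeps talking.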

From the time series of the speaker's sound pressure obtained in this way and the relative distance from the voice recognition robot 1 to the speaker's lip center point, the target distance between the microphone tip and the lip center point that yields the target sound pressure is calculated based on Equation 1 (step 108). Then a target position (X_t, Y_t, Z_t) of the tip of the microphone 12b is determined such that the distance between the microphone tip and the lip center point equals the target distance and the sound pressure acquired by the microphone 12b equals the target sound pressure, and the relationship between the joint angles θ1 and θ2 of the right arm 12 that brings the tip of the microphone 12b to that position is calculated (step 109).
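Equation 1 itself is not reproduced in this text, so the following sketch of step 108 substitutes a stand-in relationship: the free-field inverse-distance law, under which sound pressure is inversely proportional to distance from the source (p_measured × r_current = p_target × r_target). Treat that law, the function names, and the choice to place the tip on the line from the lips toward the current microphone position as assumptions, not as the patent's formula.

```python
import math

def target_distance_m(measured_pa, current_distance_m, target_pa):
    """Lip-to-microphone distance expected to yield target_pa, assuming p ~ 1/r."""
    if measured_pa <= 0 or target_pa <= 0 or current_distance_m <= 0:
        raise ValueError("pressures and distance must be positive")
    return current_distance_m * measured_pa / target_pa

def target_tip_position(lip_center, mic_tip, distance_m):
    """Point at distance_m from the lip center, on the line toward the current tip."""
    offset = [m - l for m, l in zip(mic_tip, lip_center)]
    norm = math.sqrt(sum(c * c for c in offset))
    if norm == 0:
        raise ValueError("microphone tip coincides with the lip center")
    return tuple(l + c * distance_m / norm for l, c in zip(lip_center, offset))
```

For example, if the measured pressure is half the target at 0.4 m, this model asks for the tip to move in to 0.2 m.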

Among the θ1 and θ2 that satisfy this relationship, the values are then chosen so that the speaker's lip center point lies in the direction of the directivity of the microphone 12b as seen from its tip (step 110). Any method may be used to point the tip of the microphone 12b at the lip center point. For example, the coordinates of the tip and of the rear end of the microphone 12b may be obtained, and the position of the tip may be set so that the vector joining these coordinates, taken with the tip as its starting point, points toward the coordinates of the lip center point.

To realize the joint angles θ1 and θ2 of the right arm 12 determined in this way, the joint members of the right arm are driven so that the tip of the microphone 12b reaches the target position (X_t, Y_t, Z_t) near the speaker's lips (step 111). Because this target position changes with the loudness of the speaker's voice, the speaker's position, and so on, the flow from step 101 to step 111 is repeated until the voice acquired by the microphone 12b has been interrupted for a certain time (step 112). During this repetition, the head 11 is rotated so that the region extracted as the selected face stays at approximately the center of the image data, so that the front of the head 11 (the face of the voice recognition robot 1) always faces the person speaking to the robot. As a result, the relative positional relationship between the speaker's lips and the microphone tip is kept constant despite such changes, and the sound pressure of the voice acquired from the speaker stays constant, which improves the recognition accuracy of the voice recognition unit 101.

When the voice acquired by the voice recognition unit 101 has been interrupted for a certain time, the robot determines whether to continue voice recognition (step 113). To continue, the procedure is repeated from step 101; to end, termination processing is performed according to a predetermined procedure.

As described above, in the voice recognition robot and its control method, the distance between the lips of the speaker and the microphone that acquires the voice is set so that the sound pressure of the voice acquired by the microphone takes a predetermined appropriate value. Furthermore, the joints of the right arm, which serves as the holding unit for the microphone, are driven so that the lip center point lies on the line extending from the tip of the microphone in the direction of its directivity. The microphone can thus be positioned with its directivity taken into account.

In the present embodiment the position of the microphone is determined in consideration of the directivity of the microphone held by the right arm, but the present invention is not limited to this. For example, the drive amounts (joint angles) of the joint members of the right arm may be determined from the microphone-to-lip distance alone, without considering directivity. In that case, the position of the microphone may be specified according to the allowable joint range of the right arm, the distance between the speaker and the voice recognition robot, and the like.

When the drive amounts (joint angles) of the right arm are determined based on the directivity of the microphone, it may not be possible to find angles θ1 and θ2 for which the lip center point lies exactly in the direction of the directivity as seen from the tip of the microphone 12b. In that case, the lip center point may instead be placed within a predetermined angular range of that direction (for example, within a predetermined angle of the axis that indicates the directivity). In this way, even when θ1 and θ2 cannot be determined exactly, the microphone 12b can be moved to a position that takes its directivity into account to some extent. The robot thereby keeps operating with the front of its head 11 (the face of the voice recognition robot 1) turned toward the face of the person speaking to it.
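The tip-and-rear-end vector idea from step 110, combined with the angular tolerance described here, can be sketched as a simple geometric test. It assumes that the directivity axis coincides with the line from the rear end of the microphone to its tip. The function names and the 10-degree default are illustrative assumptions.

```python
import math

def axis_error_deg(mic_rear, mic_tip, lip_center):
    """Angle in degrees between the microphone axis and the line from tip to lip center."""
    axis = [t - r for t, r in zip(mic_tip, mic_rear)]        # rear end -> tip
    to_lip = [l - t for l, t in zip(lip_center, mic_tip)]    # tip -> lip center
    norm_axis = math.sqrt(sum(c * c for c in axis))
    norm_lip = math.sqrt(sum(c * c for c in to_lip))
    if norm_axis == 0 or norm_lip == 0:
        raise ValueError("degenerate geometry: zero-length vector")
    cos_angle = sum(a * b for a, b in zip(axis, to_lip)) / (norm_axis * norm_lip)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_angle))

def is_aimed_at_lips(mic_rear, mic_tip, lip_center, tolerance_deg=10.0):
    """True if the lip center lies within tolerance_deg of the microphone axis."""
    return axis_error_deg(mic_rear, mic_tip, lip_center) <= tolerance_deg
```

A candidate pair (θ1, θ2) would be accepted when the resulting microphone pose passes `is_aimed_at_lips`; a tolerance of zero corresponds to the exact alignment of step 110.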

The operation of turning the front of the head 11 of the voice recognition robot 1 (the face of the voice recognition robot 1) toward the face of the person speaking to the robot continues until the speaker moves a predetermined distance or more away from the voice recognition robot 1. That is, the robot may determine whether the relative distance between the selected speaker's lip center position and the voice recognition robot 1 is equal to or greater than a predetermined distance and, as long as the speaker remains within that distance, continue acquiring the voice and performing voice recognition. When it determines that the speaker has moved the predetermined distance or more away, the robot may end voice acquisition and voice recognition and enter a standby state ready for the next voice.

Embodiment 2 of the Invention
Next, a voice recognition robot according to a second embodiment of the present invention will be described with reference to FIGS. 8 and 9. In this embodiment, components identical to those described in Embodiment 1 are given the same reference numerals and their description is omitted.

Like the voice recognition robot described in the preceding embodiment, the voice recognition robot 1' shown in FIG. 8 includes a body 10 and a head 11, a right arm 12, and a left arm 13 connected to the body 10. It is a humanoid robot that additionally includes a waist 20 connected to the body 10, and a right leg 21 and a left leg 22 connected to the waist 20. The voice recognition robot 1' walks on two legs by moving the right leg 21 and the left leg 22 alternately; the right leg 21 and the left leg 22 correspond to the moving means of the present invention.

The right leg 21 and the left leg 22 each include a hip joint, a thigh, a knee joint, a lower leg, an ankle joint, and a foot. These members are connected via joints (not shown) and can be driven by a plurality of motors (also not shown). The motors that drive the joints are driven according to a predetermined control program by an arithmetic processing unit (not shown) included in the control unit 100. By determining the drive angle of each joint, the robot can take a desired posture and can move to a predetermined position by bipedal walking.

The voice recognition robot 1' configured in this way points the directivity of the microphone 12b at the lips 201 of the speaker 200 and changes its posture so that the distance between the lips 201 and the microphone 12b takes a predetermined appropriate value. If changing posture alone cannot bring that distance to an appropriate value, the robot moves itself so that the required relative positional relationship between the lips 201 and the microphone 12b is satisfied. The posture is changed mainly by having an arithmetic processing unit (not shown) of the control unit 100 drive the joints of the right arm 12 holding the microphone 12b, and the robot moves by driving the right leg 21 and the left leg 22. As in the preceding embodiment, the voice recognition robot 1' adjusts the position of the tip of the microphone 12b and the inclination of the microphone 12b by changing the angle θ1 formed by the upper arm and the lower arm of the right arm 12 and the angle θ2 formed by the lower arm and the hand portion 12a that grips the microphone 12b. The procedure is described in detail below with reference to the flowchart in FIG. 9.

First, when power is supplied, the voice recognition robot 1' stands still (balanced upright), ready to acquire voices from its surroundings. When a speaker 200 near the robot speaks to it in this state, the voice recognition robot 1' acquires the uttered voice with the microphone 12b, which serves as the voice acquisition unit, and the sound source specifying units 113 and 114 specify the direction from which the voice came (the direction relative to the robot) (step 301). The voice recognition unit 101 converts the voice acquired by the microphone 12b into voice data, and the measurement unit 102 starts measuring the sound pressure of the acquired voice from that voice data (step 302). At this time, the head 11 is rotated so that its front faces the direction specified by the sound source specifying units 113 and 114 and imaging starts, and the legs are driven to move the robot so that the front of the head 11 coincides with the front of the robot's body, correcting its own position (step 303).

The image data captured by the imaging units 111 and 112 is input to the control unit 100, and the face detection unit 103 determines whether a human face can be detected in the image data (step 304). If at least one face is detected, the discriminating unit 104 determines the orientation of each detected face and judges whether any face is turned toward the voice recognition robot 1' (step 305). If there are faces turned toward the voice recognition robot 1', the distance from the robot to each of them is obtained and the nearest face is selected (step 306). If no face is turned toward the voice recognition robot 1', the robot judges that it is not being spoken to, stops moving its legs, and returns to the stationary state in which it can acquire a voice again.

Once the nearest face turned toward the voice recognition robot 1' has been identified, the portion corresponding to the lips is extracted from the region that the face occupies in the image data (step 307). The center point 201a of the extracted lips is then specified and its position coordinates (x_t, y_t, z_t) are calculated, after which the distance from the voice recognition robot 1' to the point on the lips at those coordinates is obtained by triangulation using the images from the imaging units 111 and 112 (step 308). This specifies the position of the speaker's lip center point relative to the voice recognition robot 1'.

Next, from the relative position of the voice recognition robot 1' and the speaker's lip center point, the robot determines whether the tip of the microphone 12b held by its right arm can reach a position sufficiently close to the speaker's lip center point (step 309). As one example of such a determination, the region that the microphone 12b can reach through joint-driven movement of the right arm 12 is calculated from a specific position of the robot (for example, its standing position), the point in that region closest to the speaker's lip center point is found, and if the distance between that point and the lip center point is not within a predetermined distance, the robot judges that it is not close enough to the speaker.

When this procedure determines that the voice recognition robot 1' is not close enough to the speaker, the robot drives its legs to change its standing position (step 401). The destination is chosen so that a specific part of the robot (for example, its standing position) moves a predetermined distance closer to the speaker's lip center point. The process then returns to step 308 and the position of the lip center point relative to the robot is specified again. These steps are repeated until it is determined that, from the robot's position, the microphone can reach a position sufficiently close to the speaker's lip center point.
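The loop formed by steps 309, 401, and 308 can be sketched as follows. The reachable region is approximated very crudely as a sphere of radius `max_reach_m` around the shoulder, which overestimates what a real arm with joint limits can reach. That approximation, the world-frame coordinates, and all names and parameters are assumptions made for illustration.

```python
import math

def closest_approach_m(shoulder, lip_center, max_reach_m):
    """Smallest tip-to-lip distance if the tip can reach anywhere within max_reach_m."""
    return max(0.0, math.dist(shoulder, lip_center) - max_reach_m)

def approach_speaker(stance_xy, shoulder_height_m, lip_center,
                     max_reach_m, required_gap_m, step_m, max_steps=100):
    """Step toward the speaker until the microphone can get within required_gap_m.

    Returns the final (x, y) standing position. Raises RuntimeError if the
    speaker cannot be approached closely enough within max_steps.
    """
    x, y = stance_xy
    for _ in range(max_steps):
        shoulder = (x, y, shoulder_height_m)
        # Step 309: is a point sufficiently close to the lips reachable from here?
        if closest_approach_m(shoulder, lip_center, max_reach_m) <= required_gap_m:
            return (x, y)
        # Step 401: move a predetermined distance toward the lip center (horizontally).
        dx, dy = lip_center[0] - x, lip_center[1] - y
        horizontal = math.hypot(dx, dy)
        if horizontal == 0:
            break  # already directly below or above the lips; walking cannot help
        stride = min(step_m, horizontal)
        x += dx / horizontal * stride
        y += dy / horizontal * stride
        # Step 308 would re-measure lip_center here; it is held fixed in this sketch.
    raise RuntimeError("could not get close enough to the speaker")
```

A real controller would re-triangulate the lip center after every step, since the speaker may move while the robot walks.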

Meanwhile, in parallel with the operation of continuing to image the speaker's face and specifying the position of the lip center point, the robot continuously acquires the voice coming from the direction specified by the sound source specifying units 113 and 114, that is, from the speaker's position, and the voice recognition unit 101 creates voice data from the acquired voice (step 402). The measurement unit 102 measures the sound pressure of the acquired voice from the created voice data and stores the measured sound pressure as a time series (step 403).

When it is determined in this way that the tip of the microphone 12b held by the right arm of the voice recognition robot 1' can reach a position sufficiently close to the speaker's lip center point, the target distance between the microphone tip and the lip center point that yields the target sound pressure is calculated based on Equation 1 from the time series of the speaker's sound pressure and the relative distance from the voice recognition robot 1' to the speaker's lip center point (step 310). Then a target position (X_t, Y_t, Z_t) of the tip of the microphone 12b is determined such that the distance between the microphone tip and the lip center point equals the target distance and the sound pressure acquired by the microphone 12b equals the target sound pressure, and the relationship between the joint angles θ1 and θ2 of the right arm 12 that brings the tip of the microphone 12b to that position is calculated (step 311).

Then it is determined whether, among the θ1 and θ2 that satisfy this relationship, values can be found for which the speaker's lip center point lies in the direction of the directivity of the microphone 12b as seen from its tip (step 312).

If such values of θ1 and θ2 can be found, the joint members of the right arm are driven to realize them, so that the tip of the microphone 12b reaches the target position (X_t, Y_t, Z_t) near the speaker's lips (step 313).

If the values of θ1 and θ2 cannot be found, the robot judges that it is too far from the speaker, corrects its standing position (step 404), and returns to the step of determining whether the values of θ1 and θ2 can be found (step 312).

Because the target position of the tip of the microphone 12b changes with the loudness of the speaker's voice, the speaker's position, and so on, the flow from step 301 to step 313 is repeated until the voice acquired by the microphone 12b has been interrupted for a certain time (step 314). During this repetition, the head 11 is rotated so that the region extracted as the selected face is at approximately the center of the image data, and the legs are then driven to correct the robot's position so that the front of the head 11 coincides with the front of the body. As a result, the relative positional relationship between the speaker's lips and the microphone tip is kept constant despite such changes, and the sound pressure of the voice acquired from the speaker stays constant, which improves the recognition accuracy of the voice recognition unit 101.

When the voice acquired by the voice recognition unit 101 has been interrupted for a certain time, the robot determines whether to continue voice recognition (step 315). To continue, the procedure is repeated from step 301; to end, termination processing is performed according to a predetermined procedure.

As described above, in the voice recognition robot and its control method according to this embodiment, the distance between the lips of the speaker and the microphone that acquires the voice is set so that the sound pressure of the voice acquired by the microphone takes a predetermined appropriate value. Furthermore, the joints of the right arm, which serves as the holding unit for the microphone, are driven so that the lip center point lies on the line extending from the tip of the microphone in the direction of its directivity, and the robot autonomously corrects its standing position. This makes it possible to position the microphone, with its directivity taken into account, at a distance from the speaker that is appropriate for voice recognition.

In the preceding embodiment, legs (a right leg and a left leg) for walking are described as the moving means of the voice recognition robot, but the present invention is not limited to this. General moving means such as rotationally driven wheels may be incorporated into the voice recognition robot instead. Further, when the moving means changes the robot's position, the robot need not have a function for recognizing its own position; a position recognition station provided outside the robot may recognize the robot's position and be used to control it.

Further, the voice recognition robot according to the present invention is not limited to a humanoid robot as described above, nor to one that moves or changes posture by joint driving. For example, the holding unit that holds the microphone serving as the voice acquisition unit is not limited to a joint-driven arm as described above, and may be replaced with a simple rotating member or telescopic member.

As described above, according to the voice recognition robot and the control method according to the present invention, the microphone serving as the voice acquisition unit can be positioned at a relative distance from the speaker's lips that is appropriate for attaining the target sound pressure, so that more accurate voice recognition can be performed.
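The text does not state how the appropriate distance is derived from the sound pressure. As one possible illustration, assuming free-field propagation from a point source (an assumption not made in this document), the sound pressure level rises by 20 log10 of the distance ratio as the microphone approaches. With measured level L_m (in dB) at the current distance d and target level L_t, the target distance d_t would then be:

```latex
L_t - L_m = 20 \log_{10}\!\left(\frac{d}{d_t}\right)
\quad\Longrightarrow\quad
d_t = d \cdot 10^{(L_m - L_t)/20}
```

For example, under this assumption, a level of 54 dB measured at d = 1.0 m and a target of 60 dB give d_t = 10^{-0.3} m, or about 0.50 m. In a reverberant room the real relationship deviates from this law, so the value would serve only as a starting point to be corrected by further measurement.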

Brief description of the drawings

FIG. 1 is an overall schematic diagram showing a voice recognition robot according to a first embodiment of the present invention installed in a room.
FIG. 2 is a schematic diagram of the voice recognition robot shown in FIGS. 1 and 2.
FIG. 3 is a block diagram conceptually showing the inside of the control unit provided in the voice recognition robot shown in FIGS. 1 and 2.
FIG. 4 is a diagram showing how the face detection unit provided in the voice recognition robot shown in FIGS. 1 and 2 determines the orientation of each person's face.
FIG. 5 is a diagram showing how the discrimination unit provided in the voice recognition robot shown in FIGS. 1 and 2 discriminates a face that is turned toward the robot.
FIG. 6 is a schematic diagram showing the voice recognition robot shown in FIGS. 1 and 2 pointing the microphone at the speaker.
FIG. 7 is a flowchart showing the procedure by which the voice recognition robot shown in FIG. 1 identifies, based on the acquired voice, the speaker to whom it should respond and brings the voice acquisition unit closer to that speaker.
FIG. 8 is a schematic diagram of a voice recognition robot according to a second embodiment of the present invention.
FIG. 9 is a flowchart showing the procedure by which the voice recognition robot shown in FIG. 8 identifies, based on the acquired voice, the speaker to whom it should respond and brings the voice acquisition unit closer to that speaker.

Explanation of symbols

1, 1' ... voice recognition robot
10 ... body
11 ... head
12 ... right arm (holding unit)
12b ... microphone (voice acquisition unit)
13 ... left arm
21 ... right leg (moving means)
22 ... left leg (moving means)
100 ... control unit
101 ... voice recognition unit
102 ... measurement unit
103 ... face detection unit
104 ... discrimination unit
105 ... extraction unit
106 ... identification unit
107 ... voice synthesis unit
108 ... storage area
111, 112 ... imaging units
113, 114 ... sound source identification units
115 ... loudspeaker
200 ... speaker
201 ... speaker's lips
201a ... center point of the lips

Claims

A voice recognition robot comprising:
a sound source identification unit that identifies the direction in which a voice is generated;
a voice acquisition unit that acquires the generated voice;
a voice recognition unit that recognizes the content of the voice acquired by the voice acquisition unit;
a measurement unit that measures the sound pressure of the acquired voice;
an imaging unit that captures an image in the direction in which the acquired voice was generated and creates image data of the captured image;
a face detection unit that detects a human face present in the created image data;
an extraction unit that extracts the lips from the detected face;
an identification unit that identifies the center point of the extracted lips; and
a holding unit that holds the voice acquisition unit and whose posture can be changed by at least one of a telescopic operation and a change in joint angle,
wherein the imaging unit captures an image in the direction, identified by the sound source identification unit, in which the voice was generated,
the face detection unit detects the face of a person present in the image captured by the imaging unit,
the center point of the lips extracted from the detected face is identified,
the distance between the identified center point of the lips and the position of the voice acquisition unit is calculated,
the relative position of the voice acquisition unit with respect to the center point of the lips is determined based on the calculated distance and the sound pressure of the voice measured by the measurement unit, and
the posture of the holding unit is changed so as to move the voice acquisition unit to a predetermined position.

The voice recognition robot according to claim 1, wherein a target sound pressure is stored, an optimum distance between the center point of the lips and the voice acquisition unit is obtained using the difference between the measured sound pressure of the voice and the target sound pressure as a parameter, and the position of the voice acquisition unit is determined accordingly.

The voice recognition robot according to claim 1, wherein the voice acquisition unit is a microphone having directivity, and the posture of the holding unit is changed so that the center point of the lips lies on a line extending from the tip of the voice acquisition unit in the direction of the directivity.

The voice recognition robot according to any one of claims 1 to 3, wherein the sound source identification unit includes one or more microphones having directivity.

The voice recognition robot according to claim 1, wherein the extraction unit obtains the contour of the lips and uses the center of gravity defined by the contour as the center point.

The voice recognition robot according to claim 1, wherein the direction of imaging is changed so that the detected face continues to be located at the center of the captured image.

The voice recognition robot according to any one of claims 1 to 6, wherein the voice recognition robot further includes a moving unit, and is configured to be movable within a predetermined area.

The voice recognition robot according to claim 1, wherein the holding unit is an arm unit including a joint that is driven and controlled, and the voice acquisition unit is held at the tip of the arm unit.

A method for controlling a voice recognition robot, comprising the steps of:
identifying the direction in which a voice is generated;
acquiring the generated voice via a voice acquisition unit;
recognizing the content of the acquired voice;
measuring the sound pressure of the acquired voice;
capturing an image in the direction in which the acquired voice was generated and creating image data of the captured image;
detecting a human face present in the created image data;
extracting the lips from the detected face;
identifying the center point of the extracted lips;
calculating the distance between the identified center point of the lips and the voice acquisition unit;
determining the relative position of the voice acquisition unit with respect to the center point of the lips based on the calculated distance and the measured sound pressure of the voice; and
moving the voice acquisition unit to a predetermined position.

The method for controlling a voice recognition robot according to claim 9, wherein the voice acquisition unit has directivity, the method further comprising a step of determining the position of the voice acquisition unit so that the center point of the lips lies on a line extending from the tip of the voice acquisition unit in the direction of the directivity.