JP2008299135A

JP2008299135A - Speech synthesis device, speech synthesis method and program for speech synthesis

Info

Publication number: JP2008299135A
Application number: JP2007145930A
Authority: JP
Inventors: Reishi Kondou; 玲史近藤; Satoshi Nakazawa; 聡中澤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-05-31
Filing date: 2007-05-31
Publication date: 2008-12-11

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis device that lets a user who has listened to synthesized speech easily understand the positional relationship with the object associated with that speech, the state of the object, and the intention of the utterance.

SOLUTION: A synthesis parameter determination section 21 determines a speech synthesis parameter according to the environment around an object 01 or according to the positional relationship indicated by at least one of a direction C, a distance r, and a relative angle, or a combination of them. A speech synthesis section 22 synthesizes speech associated with the object 01 according to the speech synthesis parameter determined by the synthesis parameter determination section 21. The synthesis parameter determination section 21 determines the speech synthesis parameter by, for example, preparing, based on a predetermined control policy, a parameter determination table in which the speech synthesis parameter value corresponding to each input parameter value is registered, and referring to that table according to the value of each input parameter.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to speech synthesis technology, and in particular to speech synthesis technology that synthesizes speech associated with an object that has no physical vocal organ, such as a robot or a game character.

For example, Patent Document 1 describes a technique in which, when the speaking object and the viewer are both facing the same direction, the speech synthesis output is passed through a low-pass filter (LPF) to suppress the high-frequency range, making the voice less clear than when the two are facing each other.

Patent Document 2 describes a technique for changing sound attributes such as pitch, length, and intensity when reading out the content of a web page, according to how far the current reading position is through the whole page.

Patent Document 3 describes a technique for generating stereophonic sound by converting distance and azimuth into an elevation angle and a horizontal angle, so that the listener can accurately grasp the distance to a guidance target.

In addition, stereophonic techniques that map the position of an object such as a musical instrument to stereophonic position information are generally known.

Patent Document 1: JP 2006-109295 A
Patent Document 2: JP 2006-171544 A
Patent Document 3: JP 2005-333621 A

However, because synthesized speech often does not correspond directly to physical phenomena such as mouth movements, the user may be unable to associate the synthesized speech with the object it is tied to.

For example, when a robot with a speech synthesis function is moving on a plane, the robot utters synthesized speech, and a human user on the same plane can hear it. However, because the robot may have no mouth, or may output the synthesized speech from a speaker whose sound radiation characteristics differ from those of a human mouth, it is difficult for the user to grasp positional information from the robot's voice.

A further problem is that, if the synthesized speech sounds the same regardless of the positional relationship between the robot and the user, the intention behind the communication becomes unclear. For example, if the robot speaks in the same tone both when it wants to express joy because it and the user are facing each other and when it wants to express loneliness because it and the user are far apart, what the robot wants to convey becomes unclear.

These problems are not limited to robots; they can arise in the same way for synthesized speech associated with any object whose location is not always fixed, such as a toy, a mobile phone, or a personal computer. For synthesized speech associated with an object in a virtual space, such as a video game character, the association is even less clear than for an object with a physical entity as described above, because the virtual object has no physical entity.

Accordingly, an object of the present invention is to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating synthesized speech from which a user who hears it can easily recall the positional relationship with the object associated with the speech, the situation of the object, and the intention of the utterance.

A speech synthesizer according to the present invention synthesizes speech associated with an object and comprises: a synthesis parameter determination unit that changes a speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the positional relationship between the object and a user who is the subject observing the object; and a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

The synthesis parameter determination unit may change the speech synthesis parameter according to the positional relationship between the user and the object expressed as the angle between the line connecting the user and the object and a predetermined direction reference set by a predetermined method.

The synthesis parameter determination unit may change the speech synthesis parameter according to the positional relationship between the user and the object expressed by any one or a combination of: the positional direction of the user and the object, given as the angle between the line connecting them and a predetermined direction reference set by a predetermined method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

The speech synthesizer may comprise: a synthesis parameter determination unit that changes a synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the environment around the object; and a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

The speech synthesizer may comprise: a synthesis parameter determination unit that changes a synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to both the positional relationship between the object and a user who is the subject observing the object and the environment around the object; and a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

The synthesis parameter determination unit may change the speech synthesis parameter according to the environment around the object as indicated by at least one of: the ambient brightness, the ambient sound level, whether another object is present nearby, and the type of another object that is in contact with or held by the object.

The synthesis parameter determination unit may change any of voice quality, speech rate, and voice volume as the speech synthesis parameter.

The synthesis parameter determination unit may determine the level of detail of the content uttered by the object according to the positional relationship between the user and the object, and the speech synthesis unit may include: an utterance text editing unit that summarizes text indicating the object's utterance content according to the level of detail determined by the synthesis parameter determination unit; and a text-to-speech synthesis unit that generates synthesized speech whose utterance content is the text summarized by the utterance text editing unit.

The synthesis parameter determination unit may determine the level of detail of the content uttered by the object according to the positional relationship between the user and the object, and the speech synthesis unit may include: an utterance text generation unit that generates text expressing given items as utterance content according to the level of detail determined by the synthesis parameter determination unit; and a text-to-speech synthesis unit that generates synthesized speech whose utterance content is the text generated by the utterance text generation unit.

The synthesis parameter determination unit may change the synthesized speech parameter on the assumption that the synthesized speech is output from a position independent of the object.

The synthesis parameter determination unit may change the synthesized speech parameter on the assumption that the synthesized speech is output from the vicinity of the user.

The direction in which the user is facing may be used as the direction reference.

The user's gaze direction may be used as the direction reference.

The synthesis parameter determination unit may change the voice quality depending on whether the object is close to the front direction of the user.

The synthesis parameter determination unit may change the sound volume depending on whether the object is close to the front direction of the user.

The synthesis parameter determination unit may change the synthesized speech parameter according to the environment around the object as identified from a positional relationship expressed in a polar coordinate system that uses an absolute direction as the direction reference.

The synthesis parameter determination unit may change the synthesized speech parameter according to the surrounding environment only when the positional relationship satisfies a predetermined condition.

The synthesis parameter determination unit may change the synthesized speech parameter more drastically when the object associated with the synthesized speech is a virtual object that exists only in a virtual space.

The synthesis parameter determination unit may change the level of detail depending on whether the object is close to the front direction of the user.

The synthesis parameter determination unit may change the level of detail and the speech rate depending on whether the object is close to the front direction of the user.

A speech synthesis method according to the present invention is a method for synthesizing speech associated with an object, in which a speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, is changed according to the positional relationship between the object and a user who is the subject observing the object, and synthesized speech is generated as the speech associated with the object according to the changed speech synthesis parameter.

In the speech synthesis method, the speech synthesis parameter may be changed according to the positional relationship between the user and the object expressed by any one or a combination of: the positional direction of the user and the object, given as the angle between the line connecting them and a predetermined direction reference set by a predetermined method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

In the speech synthesis method, the speech synthesis parameter may be changed according to the environment around the object, and synthesized speech may be generated as the speech associated with the object according to the changed speech synthesis parameter.

In the speech synthesis method, the level of detail of the content uttered by the object may be determined according to the positional relationship between the user and the object, text indicating the object's utterance content may be summarized according to the determined level of detail, and synthesized speech whose utterance content is the summarized text may be generated.

In the speech synthesis method, the level of detail of the content uttered by the object may be determined according to the positional relationship between the user and the object, text expressing given items as utterance content may be generated according to the determined level of detail, and synthesized speech whose utterance content is the generated text may be generated.

A speech synthesis program according to the present invention is a program for synthesizing speech associated with an object, and causes a computer to execute: a parameter determination process that changes a speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the positional relationship between the object and a user who is the subject observing the object; and a synthesis process that generates synthesized speech as the speech associated with the object according to the changed speech synthesis parameter.

The speech synthesis program may cause the computer, in the parameter determination process, to change the speech synthesis parameter according to the positional relationship between the user and the object expressed by any one or a combination of: the positional direction of the user and the object, given as the angle between the line connecting them and a predetermined direction reference set by a predetermined method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

The speech synthesis program may cause the computer to execute: a parameter determination process that changes the speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the environment around the object; and a speech synthesis process that generates synthesized speech as the speech associated with the object according to the changed speech synthesis parameter.

The speech synthesis program may cause the computer, in the parameter determination process, to determine the level of detail of the content uttered by the object according to the positional relationship between the user and the object and, in the speech synthesis process, to summarize text indicating the object's utterance content according to the determined level of detail and generate synthesized speech whose utterance content is the summarized text.

The speech synthesis program may cause the computer, in the parameter determination process, to determine the level of detail of the content uttered by the object according to the positional relationship between the user and the object and, in the speech synthesis process, to generate text expressing given items as utterance content according to the determined level of detail and generate synthesized speech whose utterance content is the generated text.

The first effect is that, because the synthesis parameter determination unit changes the synthesized speech parameters according to the positional relationship between the user and the object or the surrounding environment, the object associated with the generated synthesized speech is expressed accurately, making it easier for the user to understand the connection between the synthesized speech and the object.

The second effect is that the positional relationship with, and the situation of, the object associated with the generated synthesized speech, as well as the intention of the communication conveyed by the speech, can be presented to the user in an easily understood way.

The third effect is that, even when the object associated with the synthesized speech is not directly within the user's field of view, the user can be made to recall the positional relationship with the object, its situation, and the intention of the communication, just as if the object were in view.

Embodiments of the present invention are described below with reference to the drawings.

Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of a speech synthesizer according to the first embodiment of the present invention. As shown in FIG. 1, the speech synthesizer according to this embodiment includes a direction input unit 11, a distance input unit 12, a synthesis parameter determination unit 21, and a speech synthesis unit 22.

Here, the positional relationship in this embodiment is defined as follows. FIG. 2 is an explanatory diagram defining the positional relationship between the object 01 and the user 02 in this embodiment. The user 02 is the subject who observes the object 01. The object 01 is the object to which the speech synthesized by this speech synthesizer is attributed as its source; here it is assumed to be a robot with a physical body. Objects with a physical body include not only robots but also toys, mobile phones, and the like, and these are called physical objects. In the example shown in FIG. 2, the synthesized speech associated with the object 01 is output from the same position as the object 01 by a synthesized speech output unit 03, such as a speaker device provided in the object 01. The speech synthesizer is realized, for example, as a speech output device for a physical object that is built into the object 01, or as a device built into the speaker control device of a room in which the physical object is exhibited.

As shown in FIG. 2, in this embodiment the positional relationship between the object 01 and the user 02 is defined by a direction C and a distance r. The direction C is the angle between a direction reference C0, which is an arbitrary angle in space defined in advance for the user 02, and the direction in which the object 01 is seen from the user 02. The distance r is the distance between the user 02 and the object 01. FIG. 2 shows an example in which the direction directly in front of the user 02 is used as the direction reference 05.

The direction input unit 11 inputs the direction C. The distance input unit 12 inputs the distance r. The processing unit 13 (the ambient condition input unit 13 in the description that follows) inputs ambient conditions describing the surrounding environment that is taken into account when the object 01 utters synthesized speech.

The synthesis parameter determination unit 21 determines the speech synthesis parameter used to generate the synthesized speech associated with the object 01, based on the direction C input by the direction input unit 11, the distance r input by the distance input unit 12, and the ambient conditions input by the processing unit 13. Here, a speech synthesis parameter is a parameter indicating what kind of synthesized speech is to be generated. For example, it may be information that directly gives the value of an attribute of the generated synthesized speech, or information that indicates a categorized feature of such an attribute.

The speech synthesis unit 22 synthesizes speech according to the speech synthesis parameter determined by the synthesis parameter determination unit 21.

Next, the operation of this embodiment is described with reference to FIG. 3. FIG. 3 is a flowchart showing an operation example of the speech synthesizer according to this embodiment.

First, the direction input unit 11, the distance input unit 12, and the ambient condition input unit 13 input the direction C, the distance r, and the ambient conditions, respectively (step S101).

The direction input unit 11 and the distance input unit 12 operate, for example, as follows. The positions of the object 01 and the user 02 are recognized from image data captured by a camera and placed on a grid (graph paper), giving the plane coordinates (x1, y1) of the object 01 and the plane coordinates (x2, y2) of the user 02. Instead of capturing images with a camera, the coordinates may be determined using GPS or a gyroscope (gyro sensor).

The direction input unit 11 may then obtain the direction C by the following formula (1), using the angle C0 that the direction reference 05 makes with the x axis of the grid.

C = arctan((y1 - y2) / (x1 - x2)) - C0 ... Formula (1)

The distance input unit 12 may obtain the distance r by the following formula (2).

r = sqrt((x1 - x2)^2 + (y1 - y2)^2) ... Formula (2)
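The direction and distance computation can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `direction_and_distance` is hypothetical, the distance is taken to be the ordinary Euclidean distance between the two plane coordinates, and `math.atan2` is used in place of the plain arctangent of formula (1) so that all four quadrants and the case x1 = x2 are handled, which the source does not discuss.

```python
import math


def direction_and_distance(obj_xy, user_xy, c0_rad=0.0):
    """Return (C, r) for object coordinates (x1, y1) and user coordinates (x2, y2).

    c0_rad is the angle C0 that the direction reference makes with the x axis.
    """
    x1, y1 = obj_xy
    x2, y2 = user_xy
    # atan2 is a swapped-in, quadrant-aware variant of the arctan in formula (1).
    c = math.atan2(y1 - y2, x1 - x2) - c0_rad
    # Wrap the result into [-pi, pi] so that 0 always means "on the reference".
    c = math.atan2(math.sin(c), math.cos(c))
    # Euclidean distance between the user and the object (assumed form of r).
    r = math.hypot(x1 - x2, y1 - y2)
    return c, r
```

With the user at the origin and the direction reference along the y axis (`c0_rad = math.pi / 2`), an object at (0, 1) lies on the reference, so C is approximately 0 and r is 1.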

Here, the direction C and the distance r, based on the position coordinates of the object 01 and the user 02 in an orthogonal coordinate system, are used as the information indicating their positional relationship, but other coordinate systems or other measures of positional relationship may be used. The approach is not limited to a plane and can be applied in the same way in higher dimensions, for example using solid angles. For example, as described later, the relative angle R between the front direction of the user 02 and the front direction of the object 01 may be used. Whereas the direction C indicates the positional direction of the object 01 and the user 02 as an angle from the direction reference C0, the relative angle R indicates the angle at which the object 01 and the user 02 face each other.
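As a rough illustration of the relative angle R, the sketch below takes the front directions of the user and the object as headings measured from a common axis and returns their wrapped difference. The function name `relative_angle` and the sign convention are assumptions made for this example; the source does not define how R is computed.

```python
import math


def relative_angle(user_heading_rad, object_heading_rad):
    """Wrapped difference between two front directions, in [-pi, pi].

    Assumed convention: 0 means both face the same way; a magnitude of pi
    means they face opposite ways (for example, toward each other).
    """
    d = object_heading_rad - user_heading_rad
    return math.atan2(math.sin(d), math.cos(d))
```

Unlike the direction C, this value depends only on how the two are oriented, not on where they stand.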

The ambient condition input unit 13 inputs, as ambient conditions, information indicating the ambient brightness, the ambient sound power (loudness), whether another object is present nearby, the type of another object that is in contact with or held by the object 01, and so on. The ambient conditions are input, for example, by equipping the object 01 with an optical sensor such as a photodiode, a sound sensor such as a microphone, contact or non-contact sensors such as a microswitch or a pyroelectric element, and a holding status table that stores information on which other objects are held, and then detecting the inputs from these sensors or referring to the holding status table.

Next, when the predetermined input parameters (here, the direction C, the distance r, and the ambient conditions) are input by the direction input unit 11, the distance input unit 12, and the ambient condition input unit 13, the synthesis parameter determination unit 21 determines the speech synthesis parameter according to the positional relationship between the object 01 and the user 02 indicated by those input parameters and the environment around the object 01 (step S102). Depending on the control policy for the synthesized speech parameters, only the direction C, only the distance r, or only the ambient conditions may be input. Alternatively, the relative angle R between the front direction of the object 01 and the front direction of the user 02 may be input. It suffices for the synthesis parameter determination unit 21 to determine the speech synthesis parameter according to the positional relationship between the object 01 and the user 02 indicated by at least one of the direction C, the distance r, and the relative angle R, or a combination of them, or according to the environment around the object 01.

The synthesis parameter determination unit 21 may, for example, include a parameter determination table in which the values of the synthesized speech parameters corresponding to the values of each input parameter are registered, and may determine the synthesized speech parameters by referring to the parameter determination table according to the value of each input parameter.

そして、音声合成部２２は、合成パラメータ決定部２１が決定した合成音声パラメータに従って、オブジェクト０１に対応づける音声を合成し、出力する（ステップＳ１０３）。 Then, the speech synthesizer 22 synthesizes and outputs the speech associated with the object 01 according to the synthesized speech parameter determined by the synthesis parameter determiner 21 (step S103).

Next, a method for determining the synthesized speech parameters is described using specific examples. FIGS. 4 to 6 are explanatory diagrams showing examples of parameter determination tables. These tables assume that the closer the object 01 is to the front of the user 02, the stronger the interest of the user 02 in the object 01, and they control (vary) the parameters so that the utterance content of the object 01 is conveyed clearly to the user 02. Conversely, they assume that the interest of the user 02 decreases as the object 01 moves away from the front of the user 02, and they control the parameters so that the utterance of the object 01 does not unduly disturb the attention of the user 02. In addition, the parameters are controlled according to brightness, which is one of the ambient conditions of the object, so that the atmosphere at the position of the object is conveyed. A method of obtaining the direction C with reference to the front of the user 02 is described later.

FIG. 4 is an example of a parameter determination table for determining the speech synthesis parameters that indicate voice quality and speech rate based on the direction C. FIG. 4 shows, for example, that when the direction C measured from the direction reference C0, taken as the front of the user 02 (that is, the direction in which the object 01 is located as seen from the user 02), is -45 to 45 degrees, the voice quality is set to "clear" and the speech rate to "normal". When the direction C is 45 to 90 degrees or -90 to -45 degrees, the voice quality is set to "somewhat ambiguous" and the speech rate to "somewhat fast". When the direction C is 90 to 180 degrees or -180 to -90 degrees, the voice quality is set to "mumbled (unclear)" and the speech rate to "fast".

FIG. 5 is an example of a parameter determination table for determining the speech synthesis parameter that indicates power based on the distance r. FIG. 5 shows, for example, that the power is set to "large" when the distance r between the object 01 and the user 02 is 0 to 50 cm, to "medium" when it is 50 to 150 cm, and to "small" when it exceeds 150 cm.

FIG. 6 is an example of a parameter determination table for determining the speech synthesis parameter that indicates the atmosphere of the utterance based on the ambient brightness. FIG. 6 shows, for example, that the atmosphere is set to "joyful" when the surroundings are bright and to "frightened" when they are dark. The ambient conditions can also be used as an element restricted by the direction C or the distance r, for example by using them to convey the state of the object 01 only when the object 01 is far away from the user 02.

For example, suppose that the direction reference C0, taken as the line-of-sight direction of the user 02, is the x axis of the coordinate system, that the object is at (x1, y1) = (20 cm, 40 cm), that the user 02 is at (x2, y2) = (0 cm, 0 cm), and that the ambient condition is observed to be bright. The direction C is then calculated as 60 degrees and the distance as r = 44.7 cm. Based on the parameter determination tables shown in FIGS. 4 to 6, the following speech synthesis parameters are selected: voice quality "somewhat ambiguous", speech rate "somewhat fast", power "small", and atmosphere "joyful". These speech synthesis parameters are sent to the speech synthesis unit 22, which generates and outputs synthesized speech according to them.

As a result, because the object 01 is located diagonally at 60 degrees from the user 02 at a distance of 44.7 cm, the interest of the user is regarded as weaker than it would be near the front, and the utterance is made "somewhat ambiguous" and "somewhat fast" with "small" power; however, because the position of the robot is bright, a "joyful atmosphere" is conveyed. Synthesized speech with characteristics that reflect the positional relationship between the user and the object is thus output.
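The worked example above can be traced with a minimal sketch. It assumes the bands described for FIGS. 4 to 6; the function names and the handling of values that fall exactly on a band boundary are illustrative assumptions. With these inputs, `atan2` gives roughly 63.4 degrees, which the text states as 60 degrees. Note also that, with the FIG. 5 bands as described, r = 44.7 cm falls in the 0 to 50 cm band ("large"), whereas the worked example lists "small"; the sketch simply follows the table.

```python
import math

def direction_and_distance(obj_xy, user_xy):
    """Return (direction C in degrees, distance r), with the x axis as reference C0."""
    dx = obj_xy[0] - user_xy[0]
    dy = obj_xy[1] - user_xy[1]
    return math.degrees(math.atan2(dy, dx)), math.hypot(dx, dy)

def voice_and_rate(c_deg):
    """FIG. 4 style lookup: direction C -> (voice quality, speech rate)."""
    a = abs(c_deg)
    if a <= 45:          # boundary handling is an assumption
        return "clear", "normal"
    if a <= 90:
        return "somewhat ambiguous", "somewhat fast"
    return "mumbled", "fast"

def power(r_cm):
    """FIG. 5 style lookup: distance r in cm -> power."""
    if r_cm <= 50:
        return "large"
    if r_cm <= 150:
        return "medium"
    return "small"

def atmosphere(brightness):
    """FIG. 6 style lookup: ambient brightness -> atmosphere of the utterance."""
    return "joyful" if brightness == "bright" else "frightened"

c, r = direction_and_distance((20.0, 40.0), (0.0, 0.0))
quality, rate = voice_and_rate(c)
params = {
    "voice_quality": quality,
    "speech_rate": rate,
    "power": power(r),
    "atmosphere": atmosphere("bright"),
}
print(round(c, 1), round(r, 1), params)
```

Keeping each table as its own function mirrors the independent tables in the text; a combined table keyed on several inputs would replace these with a single lookup.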

In this example, parameter determination tables such as those shown in FIGS. 4 to 6 are used as the means of obtaining the speech synthesis parameters from the direction C, the distance r, and the ambient conditions, but the invention is not limited to such tables. Also, although the input parameters described in each table are independent in the above example, a combination of input parameter values may be associated with a synthesized speech parameter. For example, the power may be determined by the combination of the direction C and the distance r. It is also possible, for example, to perform control according to the ambient conditions only when the object is close to the user 02 or only when it is outside the field of view of the user. Further, instead of a table lookup, the parameters may be determined from the result of a computation on the input parameter values.

As an example of the ambient conditions, control of the speech synthesis parameters according to brightness has been described. However, the ambient sound power may, for example, be input as an ambient condition, and the base sound power may be determined according to the ambient sound power so that the utterance is louder than the ambient sound. Also, for example, the ambient conditions may include whether another object exists near the object 01, whether another object is in contact with or held by the object 01, and, if such an object exists, information about it (for example, its type, weight, or level). Control may then be performed according to that information so that the voice sounds frightened, sounds as if it is carrying something heavy, or sounds happy.
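One way to read the ambient sound policy is as a simple margin rule. The sketch below assumes that sound power is expressed in decibels; the name `base_power_db`, the parameters `margin_db` and `floor_db`, and their default values are illustrative assumptions and do not come from the specification.

```python
def base_power_db(ambient_db, margin_db=6.0, floor_db=50.0):
    """Choose a base output level that stays above the ambient sound level.

    ambient_db: measured ambient sound power (assumed to be in dB).
    margin_db:  how far above the ambient level the speech should sit (assumed).
    floor_db:   minimum level used in quiet surroundings (assumed).
    """
    return max(floor_db, ambient_db + margin_db)

print(base_power_db(40.0))  # quiet room: the floor applies
print(base_power_db(70.0))  # noisy room: ambient level plus margin
```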

また、音声合成パラメータとして、声質、発話速度、パワー、雰囲気を用いているが、これらに限るものではなく、話者、ピッチ周波数、アクセント強度などを制御するようにしてもよい。 Further, although voice quality, speech speed, power, and atmosphere are used as speech synthesis parameters, the present invention is not limited to these, and the speaker, pitch frequency, accent strength, and the like may be controlled.

In the present embodiment, the speech synthesis unit 22 may take the form of recording and playback, a codec such as CELP, text-to-speech synthesis, or the like, depending on the system to which it is applied. Some forms, such as text-to-speech synthesis, can vary all of the above parameters, while others, such as a combination of recording and playback with speech rate conversion, can vary at most the speech rate. A unit that can handle all of the parameters is used here, but a configuration that ignores the parameters it cannot handle is also possible.
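A configuration that ignores unsupported parameters could be sketched as a filter in front of the synthesizer. The backend names and the capability sets below are assumptions chosen to match the two cases mentioned in the text (a text-to-speech backend that handles every parameter, and a playback backend that can vary only the speech rate).

```python
# Hypothetical capability sets; the specification does not enumerate them.
SUPPORTED = {
    "text_to_speech": {"voice_quality", "speech_rate", "power", "atmosphere"},
    "recorded_playback": {"speech_rate"},
}

def usable_parameters(params, backend):
    """Drop the parameters that the chosen backend cannot apply."""
    supported = SUPPORTED[backend]
    return {name: value for name, value in params.items() if name in supported}

requested = {
    "voice_quality": "somewhat ambiguous",
    "speech_rate": "somewhat fast",
    "power": "large",
    "atmosphere": "joyful",
}
print(usable_parameters(requested, "recorded_playback"))
```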

The synthesized speech parameters may also be determined by the following method. FIG. 7 is an explanatory diagram showing an example of the position of the synthesized speech output unit 03 in this example. As shown in FIG. 7, in this example the synthesized speech associated with the object 01 is emitted from a separate position, regardless of the position of the object 01. Here, the synthesized speech is assumed to be output from a position sufficiently close to the user 02.

In this example, because the synthesized speech is output from a position sufficiently close to the user 02, when a plurality of objects 01 exist, the positions from which the synthesized speech of the respective objects is output are very close to one another or exactly the same.

図８は、本例におけるパラメータ決定テーブルの例を示す説明図である。図８は、距離ｒに基づいてパワーを決定するためのパラメータ決定テーブルの例を示している。図８に示すように、ユーザ０２から見てオブジェクト０１が近い場合には合成音声のパワーを大きく、遠い場合には合成音声のパワーを小さくするように制御してもよい。 FIG. 8 is an explanatory diagram showing an example of a parameter determination table in this example. FIG. 8 shows an example of a parameter determination table for determining power based on the distance r. As illustrated in FIG. 8, the power of the synthesized speech may be increased when the object 01 is close as viewed from the user 02, and the synthesized speech power may be decreased when the object 01 is far away.

これにより、ユーザから見てオブジェクトが近い場合には合成音声のパワーを大きく、遠い場合には合成音声のパワーを小さくすることで、より臨場感のある合成音声を得ることができる。また、図９は、オブジェクト０１とユーザ０２との位置関係の例を示す説明図である。図９に示すように、オブジェクト０１がユーザ０２の視野の外に存在するような場合であっても、対応する合成音声の特徴から、大まかな位置を推定することが可能になる。 As a result, the synthesized speech power can be increased when the object is close to the user, and the synthesized speech power can be decreased when the object is far away, thereby obtaining a more realistic synthesized speech. FIG. 9 is an explanatory diagram showing an example of the positional relationship between the object 01 and the user 02. As shown in FIG. 9, even when the object 01 exists outside the field of view of the user 02, it is possible to estimate a rough position from the characteristics of the corresponding synthesized speech.

また、図８に示すような距離ｒに基づくパラメータ決定テーブルを使う代わりに、入力された距離ｒを用いて、次の式（３）によってパワーｐ（ｒ）を求めることも可能である。ここで、Ｋは負の定数とする。 Further, instead of using the parameter determination table based on the distance r as shown in FIG. 8, it is also possible to obtain the power p (r) by the following equation (3) using the input distance r. Here, K is a negative constant.

p(r) = r^K ... Formula (3)

Setting K = -(2/3) gives an effect that matches human perception.
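Formula (3) can be evaluated directly. The sketch below is a minimal illustration; the function name is an assumption, and the unit of r and any scaling of the resulting power are left open because the text does not specify them.

```python
def power_from_distance(r, k=-2.0 / 3.0):
    """Formula (3): p(r) = r ** K, with K a negative constant."""
    if r <= 0:
        raise ValueError("distance r must be positive")
    if k >= 0:
        raise ValueError("K must be negative")
    return r ** k

for r in (1.0, 8.0, 27.0):
    print(r, power_from_distance(r))
```

Because K is negative, the power falls monotonically as the distance grows, which is the same tendency as the table in FIG. 8 but without band boundaries.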

As another way of determining parameters based on the distance r, the speech synthesis parameters may be determined according to a parameter determination table such as that shown in FIG. 10, with the aim of making the intention of the information transmission easier to understand rather than making the position estimable. FIG. 10 is an explanatory diagram showing another example of a parameter determination table for determining power based on the distance r. As shown in FIG. 10, it is also possible to increase the power when the object is far away so that the information gets through, and to decrease the power when the object is close so that the speech does not feel noisy.

なお、ここでは、距離ｒに対応させてパワーを制御する例だけを示したが、上述のように方向Ｃや周囲条件を組み合わせて声質や雰囲気も制御するようにしてもよい。 Here, only the example of controlling the power in correspondence with the distance r is shown, but the voice quality and the atmosphere may be controlled by combining the direction C and the ambient conditions as described above.

次に、方向基準Ｃ０の設定例を示す。まず、方向基準Ｃ０を絶対方向として定義する例を示す。図１１は、本例における方向基準Ｃ０を示す説明図である。図１１に示すように、方向基準Ｃ０を、ユーザ０２とオブジェクト０１が存在する部屋の中に絶対方向として定義してもよい。図１１に示す例では、ユーザ０２の位置から部屋の長辺に並行な方向を方向基準Ｃ０と定義している。 Next, an example of setting the direction reference C0 is shown. First, an example in which the direction reference C0 is defined as an absolute direction is shown. FIG. 11 is an explanatory diagram showing the direction reference C0 in this example. As shown in FIG. 11, the direction reference C0 may be defined as an absolute direction in a room where the user 02 and the object 01 exist. In the example shown in FIG. 11, the direction parallel to the long side of the room from the position of the user 02 is defined as the direction reference C0.

The direction C can thus be obtained easily by measuring the angle between the long side of the room and the line segment connecting the user 02 and the object 01. In addition, by defining the direction reference C0 as an absolute direction, the positional relationship between the object 01 and the user 02 can be expressed in a polar coordinate system with the user position as the origin. Based on the positional relationship expressed in the polar coordinate system, control can be performed so that, for example, the voice sounds cold when the object 01 is north of the user 02, or sounds frightened when the object 01 is closer than the user 02 to a certain place. Such control conveys the situation around the object and the intention of the information transmission more clearly. This example is also an example of controlling the speech synthesis parameters according to the environment around the object 01 as characterized by the positional relationship between the user 02 and the object 01.

また、方向基準Ｃ０を次のように定めてもよい。本例は、オブジェクト０１とユーザ０２との位置関係に、ユーザ０２の目線方向を加味する例である。図１２は、本例における音声合成装置の構成例および方向基準Ｃ０を示す説明図である。図１２に示すように、音声合成装置は、ユーザ視野検出手段１６と、ユーザ位置検出手段１５と、オブジェクト位置検出手段１４と通信可能に接続されていてもよい。なお、図１２において音声合成装置は、オブジェクト０１に含まれているものとする。 Further, the direction reference C0 may be determined as follows. In this example, the direction of the eyes of the user 02 is added to the positional relationship between the object 01 and the user 02. FIG. 12 is an explanatory diagram showing a configuration example of the speech synthesizer and the direction reference C0 in this example. As shown in FIG. 12, the speech synthesizer may be communicably connected to the user visual field detection means 16, the user position detection means 15, and the object position detection means 14. In FIG. 12, it is assumed that the speech synthesizer is included in the object 01.

オブジェクト位置検出手段１４は、オブジェクト０１に装着され、オブジェクト０１の位置を検出する。また、ユーザ位置検出手段１５は、ユーザ０２に装着され、ユーザ０２の位置を検出する。オブジェクト位置検出手段１４およびユーザ位置検出手段１５は、例えば、ＧＰＳ受信機によって実現される。 The object position detection unit 14 is attached to the object 01 and detects the position of the object 01. The user position detection unit 15 is attached to the user 02 and detects the position of the user 02. The object position detection unit 14 and the user position detection unit 15 are realized by a GPS receiver, for example.

The user visual field detection means 16 includes, for example, a camera device and image analysis means, and detects the line-of-sight direction of the user 02. Specifically, a camera device attached to glasses worn by the user 02 captures the position of the irises of the user 02, and the image analysis means analyzes the image data and obtains the deviation of the iris position from a reference value, thereby detecting the line-of-sight direction of the user. Which actual direction the line of sight points in may be determined with reference to, for example, the front direction of the face measured with a gyro sensor attached to the glasses. Alternatively, for example, the image analysis means may analyze image data captured by fixed camera devices that can capture the user 02 from at least two directions, detect features indicating the front of the face and the iris position in the image data, identify the front direction of the face in the image and the line-of-sight direction relative to it, and calculate the actual line-of-sight direction from those directions and the positions of the camera devices that captured the images.

方向入力手段１１および距離入力手段１２は、これら検出手段によって検出されるオブジェクト位置を示す情報、ユーザ位置を示す情報、ユーザ視線方向を示す情報を、例えば、入力ポートや通信ネットワークを介してそれぞれ受け取って、方向Ｃや距離ｒを算出すればよい。その際、方向入力手段１１は、ユーザ視野検出手段１６によって検出された視線方向を、方向基準Ｃ０として定義する。このような場合には、方向基準Ｃ０はユーザの向いている方向に依存するものとなる。 The direction input unit 11 and the distance input unit 12 receive information indicating the object position detected by these detection units, information indicating the user position, and information indicating the user's line-of-sight direction, for example, via an input port or a communication network. Thus, the direction C and the distance r may be calculated. At this time, the direction input unit 11 defines the line-of-sight direction detected by the user visual field detection unit 16 as the direction reference C0. In such a case, the direction reference C0 depends on the direction in which the user is facing.
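The calculation performed by the direction input means 11 and the distance input means 12 could look like the following sketch. It assumes that the detected positions have already been converted to planar x/y coordinates (a GPS receiver would report latitude and longitude, so a projection step is implied) and that the line-of-sight direction is given as an angle in the same coordinate system. All names are illustrative assumptions.

```python
import math

def wrap_degrees(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def direction_c(user_xy, object_xy, gaze_deg):
    """Direction C: bearing of the object as seen from the user, measured
    from the direction reference C0 (here, the user's line of sight)."""
    bearing = math.degrees(
        math.atan2(object_xy[1] - user_xy[1], object_xy[0] - user_xy[0])
    )
    return wrap_degrees(bearing - gaze_deg)

def distance_r(user_xy, object_xy):
    """Distance r between the user and the object."""
    return math.hypot(object_xy[0] - user_xy[0], object_xy[1] - user_xy[1])

# The user looks along +y (90 degrees); the object is straight ahead, 2 units away.
print(direction_c((0.0, 0.0), (0.0, 2.0), 90.0), distance_r((0.0, 0.0), (0.0, 2.0)))
```

Recomputing both values whenever a new position or gaze sample arrives is what lets the synthesized speech follow the user's movements.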

これにより、ユーザ０２が動きまわり、位置や視線方向が刻一刻と変化することにリアルタイムに対応して、合成音声の特徴を変化させることも可能である。ユーザは、必要に応じて自らの向きや位置を変えることで、情報の受け取り方を変えることができる。 Thereby, it is also possible to change the characteristics of the synthesized speech in response to the fact that the user 02 moves around and the position and line-of-sight direction change every moment. The user can change how information is received by changing his / her orientation and position as necessary.

The reference is not limited to the line-of-sight direction; a more stable direction may be used by detecting the direction in which the face is turned (its front direction), the direction in which the body is turned (its front direction), or the like. The front direction of the face or body may be determined, for example, by measurement with a gyro sensor attached to an item (for example, glasses or an earphone microphone) worn on a part of the user from which the front of the face or body can be identified, such as both hands, both ears, or both eyes, or by detecting features indicating the front of the face or body in image data captured from two directions.

図１３は、本例におけるパラメータ決定テーブルの例を示す説明図である。図１３は、ユーザ０２の正面方向を方向基準Ｃ０とする方向Ｃに基づいて声質を決定するためのパラメータ決定テーブルの例を示している。なお、本例は、距離ｒによって、オブジェクト０１とユーザ０２との距離が十分近傍であることが示されていることを前提としている。図１３に示すように、ユーザ０２から見てオブジェクト０１が正面付近に存在する場合、つまり正面からの角度が０度付近の場合は、声質を「明瞭」とし、逆に、側面方向や背面方向に存在する場合は、声質を「ささやき声」とするように制御してもよい。 FIG. 13 is an explanatory diagram illustrating an example of a parameter determination table in the present example. FIG. 13 shows an example of a parameter determination table for determining voice quality based on the direction C with the front direction of the user 02 as the direction reference C0. This example is based on the assumption that the distance r indicates that the distance between the object 01 and the user 02 is sufficiently close. As shown in FIG. 13, when the object 01 is present near the front as viewed from the user 02, that is, when the angle from the front is near 0 degrees, the voice quality is “clear”, and conversely, the side direction and the rear direction If it exists, the voice quality may be controlled to be a “whispering voice”.

By having the object speak to the user in a whisper when it is anywhere other than near the front, the user's attention is drawn to it and the user 02 can be expected to turn to face it. Intermediate stages between clear speech and whispering may also be provided. A similar effect may be achieved by varying the power as shown in FIG. 14, using a large power in the side and rear directions so that the user directs attention toward the object. FIG. 14 shows an example of a parameter determination table for determining power based on the direction C with the front direction of the user 02 as the direction reference C0.

また、ユーザ０２の正面方向だけでなく、オブジェクト０１にも正面方向がある場合には、オブジェクト０１とユーザ０２との位置関係に、お互いの向き（目線、顔、体の正面方向）による相対角度Ｒを加味してもよい。なお、オブジェクト０１の正面方向をオブジェクト０１が管理していない場合には、ユーザ０２の正面方向を検出する方法と同様の方法を用いて、オブジェクト０１の正面方向を検出すればよい。 When the object 01 has a front direction as well as the front direction of the user 02, the relative angle between the object 01 and the user 02 depends on the direction of each other (line of sight, face, front of the body). R may be added. When the object 01 does not manage the front direction of the object 01, the front direction of the object 01 may be detected using a method similar to the method of detecting the front direction of the user 02.

そして、方向入力手段１１が、オブジェクト位置を示す情報と、ユーザ位置を示す情報とともに、ユーザ０２の正面方向を示す情報と、オブジェクト０１の正面方向を示す情報とを入力ポートや通信ネットワークを介して受け取り、方向Ｃと相対角度Ｒとを求めればよい。 Then, the direction input unit 11 sends information indicating the object position, information indicating the user position, information indicating the front direction of the user 02, and information indicating the front direction of the object 01 via an input port or a communication network. The direction C and the relative angle R may be obtained.

In such a case, the synthesis parameter determination unit 21 may, for example, regard the object 01 and the user 02 as moving along together if they are close to each other and facing almost the same direction, even when the object 01 is toward the rear of the user 02, and control the speech synthesis parameters to give a joyful atmosphere. Also, for example, if the two are turned away from each other even though they are close, control may be performed to give an indifferent atmosphere. It is also possible to perform the whisper control described above only when the object is facing the user. Controlling the synthesized speech parameters based on the positional relationship indicated by the direction C, the distance r, and the relative angle R in this way enables more realistic utterances.

FIG. 15 is an explanatory diagram showing an example of a parameter determination table in this example. FIG. 15 shows an example of a parameter determination table for determining the pitch frequency level, the intonation strength, and the atmosphere (pattern) based on the direction C, the distance r, and the relative angle R. FIG. 15(a) is an example of the table referred to when the distance r is less than 100 cm, and FIG. 15(b) is an example of the table referred to when the distance r is 100 cm or more; however, determining parameters from a combination of the three elements is not limited to registering the table in two parts in this way. In FIG. 15, for example, when the distance r is short (less than 100 cm), the relative angle R is in the facing direction (0 to +90 degrees or 0 to -90 degrees), and the direction C is in the front direction (0 to +45 degrees or 0 to -45 degrees), the pitch frequency is set somewhat high, the intonation is strong, and the atmosphere is very joyful. Also, for example, when the distance r is long (100 cm or more), the relative angle R is opposite to the facing direction (+90 to +180 degrees or -90 to -180 degrees), and the direction C is in the rear direction (+90 to +180 degrees or -90 to -180 degrees), the pitch frequency is set to medium, the intonation is weak, and the atmosphere is stern (a distant feeling).
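A combined lookup of this kind can be sketched as a table keyed on three bands. Only the two cells spelled out in the text are filled in; the remaining cells of FIG. 15 are not described here and are deliberately left out. The band names, the "side" band for directions between 45 and 90 degrees, and the handling of values that fall exactly on a boundary are assumptions.

```python
def distance_band(r_cm):
    return "near" if r_cm < 100 else "far"

def relative_band(r_angle_deg):
    # |R| up to 90 degrees is treated as the facing direction (boundary assumed).
    return "facing" if abs(r_angle_deg) <= 90 else "opposite"

def direction_band(c_deg):
    a = abs(c_deg)
    if a <= 45:
        return "front"
    if a < 90:
        return "side"   # band not described in the text; assumed
    return "back"

# Only the cells given in the text; every other cell is unknown.
TABLE = {
    ("near", "facing", "front"): {
        "pitch": "somewhat high", "intonation": "strong", "atmosphere": "very joyful",
    },
    ("far", "opposite", "back"): {
        "pitch": "medium", "intonation": "weak", "atmosphere": "stern (distant)",
    },
}

def lookup(r_cm, r_angle_deg, c_deg):
    """Return the registered parameters, or None when the cell is not known."""
    key = (distance_band(r_cm), relative_band(r_angle_deg), direction_band(c_deg))
    return TABLE.get(key)

print(lookup(50, 10, -20))
print(lookup(200, 170, -120))
```

A single dictionary keyed on all three bands is equivalent to the two-part table of FIG. 15(a) and FIG. 15(b); the split by distance is only one way of laying it out.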

In the above example, the case where the object 01 is a physical object has been described, but the present invention can also be applied when the object 01 is a virtual object that exists only in a virtual space, such as a video game character. In that case, however, the object 01 is assumed to have a concept of position, and position information, in the virtual space.

このような場合には、オブジェクト０１と同じ仮想空間上に、ユーザ０２の仮想的な位置と方向基準Ｃ０とを設定し、仮想空間上におけるオブジェクト０１とユーザ０２との方向Ｃや距離ｒを求めればよい。方向基準Ｃ０は、例えば、ある絶対方向やユーザ０２の顔の向きや進行方向に設定すればよい。また、例えば、オブジェクト０１やユーザ０２の進行方向をそれぞれの正面向きとして、お互いの正面向きによる相対角度Ｒを求めることも可能である。 In such a case, the virtual position of the user 02 and the direction reference C0 are set in the same virtual space as the object 01, and the direction C and the distance r between the object 01 and the user 02 in the virtual space can be obtained. That's fine. The direction reference C0 may be set to, for example, a certain absolute direction, the direction of the face of the user 02, or the traveling direction. Further, for example, it is also possible to obtain the relative angle R according to the front direction of each other, with the traveling direction of the object 01 or the user 02 as the front direction.

The position of the object 01 and the position of the user 02 may or may not be presented to the user, for example by projection onto a video display. The virtual position of the user 02 may be moved in response to button operations on a controller or mouse operations by the user 02. As with the methods already described, the parameters in this example may be controlled so that, for example, the closer the object 01 is to the front of the user 02, the more clearly the utterance content of the object 01 is conveyed to the user 02, or so that the atmosphere at the position of the object is conveyed according to brightness, which is one of the ambient conditions of the object. Also, for example, to give a sense of distance, the sound power may be made large when the object 01 is close to the user 02 and small when it is far away. Conversely, so that the utterance content can be heard well, the sound power may be made large when the object 01 is far from the user 02 and small when it is close. Also, for example, based on the positional relationship expressed in the polar coordinate system, the voice may be made to sound cold, or the sound power may be controlled for purposes such as making use of surroundings that are located at a fixed position. Also, for example, when the object 01 is anywhere other than near the front, the voice quality may be controlled to become a whisper, or the sound power may be controlled so as to attract attention. Also, for example, based on a positional relationship that additionally takes the relative angle R into account, control may be performed to give a joyful atmosphere or an indifferent atmosphere.

なお、仮想オブジェクトに対応づけた合成音声の合成音声パラメータを制御する場合には、パラメータの変化幅を大きくして、より極端に変化させてもよい。 In addition, when controlling the synthesized speech parameter of the synthesized speech associated with the virtual object, the parameter change range may be increased to change it more drastically.

なお、本例における音声合成装置は、例えば、仮想オブジェクトを動作させるためのプログラムの一機能として音声合成機能が記述されたプログラムを読み込み、そのプログラムに従って動作するＣＰＵ等によって実現される。 Note that the speech synthesizer in this example is realized by, for example, a CPU or the like that reads a program describing a speech synthesis function as one function of a program for operating a virtual object and operates according to the program.

これにより、位置の提示有無に関わらず、合成音声の特徴（声質やパワーや）によって、ユーザが仮想空間上での位置や、状態伝達の意図を容易に推定できるようになる。 This makes it possible for the user to easily estimate the position in the virtual space and the intention of state transmission based on the characteristics (voice quality and power) of the synthesized speech regardless of whether or not the position is presented.

As described above, according to the present embodiment, the synthesized speech parameters are determined, by a determination method defined in advance on the basis of a predetermined control policy, from the surrounding environment and from the positional relationship between the object 01 and the user 02 indicated by at least one of the direction C, the distance r, and the relative angle R, or a combination thereof. Depending on the control policy, this allows the object associated with the generated synthesized speech to be expressed more accurately. It also allows the positional relationship with that object, its situation, and the intention of the information conveyed by the synthesized speech to be presented more clearly. Moreover, even when the object associated with the synthesized speech is not directly in the user's field of view, the user can be reminded of the positional relationship with the object, its situation, and the intention of the information conveyed by the synthesized speech, just as if the object were in view.

Embodiment 2.
Next, a second embodiment of the present invention will be described. FIG. 16 is a block diagram showing a configuration example of the speech synthesizer according to the second embodiment. As shown in FIG. 16, in the speech synthesizer according to the present embodiment, the speech synthesis unit 22 includes an utterance text editing unit 221 and a text-to-speech synthesis unit 222. In addition, the synthesis parameter determination unit 21 determines at least the level of detail of the utterance content as a speech synthesis parameter.

The utterance text editing unit 221 edits the input text (the text indicating the utterance content) based on the level of detail determined by the synthesis parameter determination unit 21. Specifically, it summarizes the utterance content according to that level of detail.

The text-to-speech synthesis unit 222 generates synthesized speech from the content edited by the utterance text editing unit 221.

Next, the operation of the present embodiment will be described with reference to FIG. 17. FIG. 17 is a flowchart showing an operation example of the speech synthesizer according to the present embodiment. The operation of step S101 in FIG. 17 is the same as that of step S101 in the first embodiment shown in FIG. 2, so its description is omitted.

When predetermined input parameters (for example, the direction C, the distance r, the relative angle R, and ambient conditions) are input by the direction input means 11, the distance input means 12, and the ambient condition input means 13, the synthesis parameter determination unit 21 determines, based on those input parameters, a synthesized speech parameter indicating at least the level of detail of the utterance content (step S201).

The synthesis parameter determination unit 21 may, for example, refer to a parameter determination table such as that shown in FIG. 18 to determine a detail level parameter that takes one of three values: "entire input", "summarize slightly", or "summarize heavily". FIG. 18 is an example of a parameter determination table for determining the speech synthesis parameter indicating the level of detail based on the direction C. In the table of FIG. 18, when the object 01 is in front of the user 02, the user 02 is regarded as being highly interested in the object 01 and detailed information is conveyed; conversely, when the user is facing a completely different direction, the user is regarded as having little interest and only an outline is conveyed.
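
A table lookup of this kind can be sketched as below. The three detail levels come from the description; the angle boundaries (45 and 90 degrees) and all identifiers are assumptions made for illustration, since the actual contents of FIG. 18 are not reproduced in the text.

```python
from enum import Enum


class Detail(Enum):
    ALL_INPUT = "entire input"
    SLIGHT_SUMMARY = "summarize slightly"
    HEAVY_SUMMARY = "summarize heavily"


# Each row: (upper bound of |C| in degrees, detail level).
# The boundaries are hypothetical; only the tendency (front -> detailed,
# facing away -> outline only) is taken from the description.
DIRECTION_TABLE = [
    (45.0, Detail.ALL_INPUT),
    (90.0, Detail.SLIGHT_SUMMARY),
    (180.0, Detail.HEAVY_SUMMARY),
]


def detail_from_direction(c_deg):
    """Look up the detail level for a direction C in [-180, 180] degrees."""
    angle = abs(c_deg)
    if angle > 180.0:
        raise ValueError("direction C must be within [-180, 180] degrees")
    for upper_bound, detail in DIRECTION_TABLE:
        if angle <= upper_bound:
            return detail
```

Keeping the policy in a data table rather than in branching code makes it easy to swap in a different control policy without touching the lookup logic.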

Next, the utterance text editing unit 221 edits (summarizes) the input text according to the detail level parameter (step S202). The text may be summarized, for example, as disclosed in the non-patent document "Inderjeet Mani (translated by Manabu Okumura et al.), 'Automatic Translation', Kyoritsu Shuppan, June 2003, pp. 14-21": the text is analyzed to generate an internal representation, the internal representation is transformed into a summary representation according to the detail level parameter, and the summary representation is converted back into natural language.

Then, the text-to-speech synthesis unit 222 synthesizes and outputs the speech associated with the object 01 from the content edited by the utterance text editing unit 221 (step S203).

The above example uses the utterance text editing unit 221, which edits a given text indicating the utterance content. However, the configuration is not limited to this; the level of detail and the utterance content may be taken as inputs, and the utterance text may be generated from them. For example, suppose the utterance content is given as information that the route from home and the time required are "bus / 15 minutes + train / 1 hour + waiting / 15 minutes + train / 30 minutes + bus / 30 minutes", together with a level of detail indicating that only an outline is to be explained. In that case, the utterance "I came from home by bus and train" may be generated. In addition to the level of detail, a politeness level such as the "desu/masu" style, the "de aru" style, or the "da" style may be given, and the utterance content may be generated in the tone indicated by that politeness level.
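
The route example can be sketched as a small generator that turns structured content plus a detail level into text. The data layout, the detail level names ("outline", "full"), and the wording of the detailed sentence are assumptions for illustration; only the route data and the outline sentence follow the description.

```python
# Route from home as (mode, minutes) pairs, taken from the example in the text.
ROUTE = [("bus", 15), ("train", 60), ("waiting", 15), ("train", 30), ("bus", 30)]


def generate_utterance(route, detail):
    """Generate utterance text from structured route data.

    `detail` is a hypothetical two-valued parameter: "outline" or "full".
    """
    if detail == "outline":
        # Keep each means of transport once, in order of first use,
        # and drop waiting time and durations.
        modes = []
        for mode, _minutes in route:
            if mode != "waiting" and mode not in modes:
                modes.append(mode)
        return "I came from home by " + " and ".join(modes) + "."
    if detail == "full":
        # Enumerate every leg with its duration.
        legs = [f"{mode} for {minutes} minutes" for mode, minutes in route]
        return "I came from home: " + ", ".join(legs) + "."
    raise ValueError(f"unknown detail level: {detail}")


print(generate_utterance(ROUTE, "outline"))
```

A politeness level could be handled in the same way, as a second parameter that selects among sentence templates.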

Furthermore, an approximate time length may be specified for the whole utterance. For example, when visitors are guided through a museum collection by speech synthesis, a painting in the direction the user is facing can be explained at length, while the explanation of a painting in a direction the user is not facing can be finished quickly.

This is not limited to text playback. In speech synthesis by playback of recordings, the intention of the information conveyed by the synthesized speech can be presented more clearly by skipping through the utterance and playing back only parts of it, or, when one utterance is composed of multiple partial recordings, by selectively playing back only the important ones, such as headings.
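
For recorded playback, the selection of partial recordings could look like the following sketch. The `Segment` type, the clip identifiers, and the mode names are hypothetical; the sketch only illustrates the two strategies named above (intermittent playback and headings-only playback).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    clip: str         # hypothetical identifier of a partial recording
    is_heading: bool  # True for important parts such as headings


def select_segments(segments, mode):
    """Choose which partial recordings to play back.

    `mode` is an assumed parameter: "full", "headings_only", or "skip".
    """
    if mode == "full":
        return list(segments)
    if mode == "headings_only":
        # Play back only the important partial recordings.
        return [s for s in segments if s.is_heading]
    if mode == "skip":
        # Intermittent playback: keep every other partial recording.
        return list(segments[::2])
    raise ValueError(f"unknown mode: {mode}")
```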

The level of detail need not be controlled only according to the direction C; the same effect can be obtained by controlling it according to the distance r, for example as shown in FIG. 19.

In addition, rather than treating the input parameters described in each table independently, values that combine the input parameters may be associated with the synthesized speech parameter (level of detail). For example, the system can be designed so that the level of detail is determined by the combination of the direction C and the distance r. Alternatively, instead of a table lookup, a parameter (for example, the utterance time) may be determined from the result of a calculation on the values of the input parameters.
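
As an illustration of computing a parameter instead of looking it up, the sketch below derives an utterance time from the direction C and the distance r. The formula, the function name, and every constant are assumptions; the specification does not give a concrete calculation.

```python
def utterance_seconds(c_deg, r_m, t_min=5.0, t_max=60.0, r_max=10.0):
    """Compute an utterance time in seconds from direction C and distance r.

    Hypothetical rule: the more directly the user faces the object and the
    closer the object is, the longer the explanation.
    """
    # facing is 1.0 when the object is straight ahead and 0.0 when directly behind.
    facing = 1.0 - min(abs(c_deg), 180.0) / 180.0
    # nearness is 1.0 at zero distance and 0.0 at r_max or beyond.
    nearness = 1.0 - min(max(r_m, 0.0), r_max) / r_max
    return t_min + (t_max - t_min) * facing * nearness
```

A continuous formula like this avoids abrupt changes at table boundaries, at the cost of being harder to tune cell by cell.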

FIG. 20 is an explanatory diagram showing an example of the parameter determination table in this example. It is an example of a parameter determination table for determining the speech synthesis parameter indicating the level of detail based on the distance r and the direction C. In the example of FIG. 20, even when the distance r is short (0 to 50 cm), if the direction C is toward the front (0 to +45 degrees or 0 to -45 degrees), the object is controlled to speak briefly and concisely; conversely, if the direction C is toward the back (+90 to +180 degrees or -90 to -180 degrees), it is controlled to speak in detail. This is an example of control in which, when a person who appears to have little interest is nearby, the speech is kept within earshot for as long as possible in order to attract that person's interest.
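
The combined lookup can be sketched as follows. The near-distance band (0 to 50 cm) and the front and back angle ranges follow the description; the value for the intermediate angles and the fallback for distances beyond 50 cm are assumptions, because the remaining cells of FIG. 20 are not described in the text.

```python
def detail_from_distance_and_direction(r_cm, c_deg, default="medium"):
    """Determine the detail level from distance r (cm) and direction C (degrees).

    `default` is a hypothetical fallback for table cells that the text
    does not describe.
    """
    angle = abs(c_deg)
    if r_cm < 0 or angle > 180:
        raise ValueError("r must be >= 0 and C must be within [-180, 180]")
    if r_cm <= 50:
        if angle <= 45:
            # Near and in front: speak briefly and concisely.
            return "brief"
        if angle >= 90:
            # Near but behind: speak in detail to attract interest.
            return "detailed"
    # Assumed: intermediate angles and farther distances use the fallback.
    return default
```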

FIG. 21 is a flowchart showing another operation example of the speech synthesizer according to the present embodiment. As shown in FIG. 21, the text-to-speech synthesis unit 222 may synthesize the speech of the object 01 from the content edited (or generated) by the utterance text editing unit 221 and in accordance with speech synthesis parameters such as voice quality and power determined by the synthesis parameter determination unit 21 (step S301). The other operations are the same as those shown in FIG. 3 and FIG. 17.

As described above, according to the present embodiment, the level of detail of the utterance content is determined based on the positional relationship between the object 01 and the user 02 and on the ambient conditions. Detailed content or only an outline can thus be output (uttered or played back) according to the degree of interest of the user 02, which makes the intention of the information conveyed by the synthesized speech easier to grasp.

The present invention is applicable, for example, to spoken dialogue in robots and toys. It is also applicable to transportation systems in which position and situation change from moment to moment, and to video game characters, virtual reality systems, and the like in which the object has no physical body. It is particularly suitable for such systems when they use text-to-speech synthesis, which generally offers a high degree of freedom in setting speech synthesis parameters.

FIG. 1 is a block diagram showing a configuration example of the speech synthesizer according to the first embodiment.
FIG. 2 is an explanatory diagram defining the positional relationship between the object 01 and the user 02.
FIG. 3 is a flowchart showing an operation example of the speech synthesizer according to the first embodiment.
FIG. 4 is an explanatory diagram showing an example of a parameter determination table.
FIG. 5 is an explanatory diagram showing an example of a parameter determination table.
FIG. 6 is an explanatory diagram showing an example of a parameter determination table.
FIG. 7 is an explanatory diagram showing an example of the position of the synthesized speech output unit 03.
FIG. 8 is an explanatory diagram showing an example of a parameter determination table.
FIG. 9 is an explanatory diagram showing an example of the positional relationship between the object 01 and the user 02.
FIG. 10 is an explanatory diagram showing an example of a parameter determination table.
FIG. 11 is an explanatory diagram showing the direction reference C0.
FIG. 12 is an explanatory diagram showing a configuration example of the speech synthesizer and the direction reference C0.
FIG. 13 is an explanatory diagram showing an example of a parameter determination table.
FIG. 14 is an explanatory diagram showing an example of a parameter determination table.
FIG. 15 is an explanatory diagram showing an example of a parameter determination table.
FIG. 16 is a block diagram showing a configuration example of the speech synthesizer according to the second embodiment.
FIG. 17 is a flowchart showing an operation example of the speech synthesizer according to the second embodiment.
FIG. 18 is an explanatory diagram showing an example of a parameter determination table.
FIG. 19 is an explanatory diagram showing an example of a parameter determination table.
FIG. 20 is an explanatory diagram showing an example of a parameter determination table.
FIG. 21 is a flowchart showing another operation example of the speech synthesizer according to the second embodiment.

Explanation of symbols

11 Direction input means
12 Distance input means
13 Ambient condition input means
14 Object position detection means
15 User position detection means
16 User visual field detection means
21 Synthesis parameter determination unit
22 Speech synthesis unit
221 Utterance text editing unit
222 Text-to-speech synthesis unit

Claims

A speech synthesizer that synthesizes speech associated with an object, comprising:
a synthesis parameter determination unit that changes a speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to a positional relationship between the object and a user who observes the object; and
a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

The speech synthesizer according to claim 1, wherein the synthesis parameter determination unit changes the speech synthesis parameter according to a positional relationship between the user and the object indicated by the angle formed by a line connecting the user and the object and a predetermined direction reference determined according to a predetermined setting method.

The speech synthesizer, wherein the synthesis parameter determination unit changes the speech synthesis parameter according to a positional relationship between the user and the object indicated by any one of, or a combination of: the positional direction of the user and the object, indicated by the angle between a line connecting the user and the object and a predetermined direction reference determined according to a predetermined setting method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

A speech synthesizer that synthesizes speech associated with an object, comprising:
a synthesis parameter determination unit that changes a synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the environment around the object; and
a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

A speech synthesizer that synthesizes speech associated with an object, comprising:
a synthesis parameter determination unit that changes a synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to the positional relationship between the object and a user who observes the object and according to the environment around the object; and
a speech synthesis unit that generates synthesized speech as the speech associated with the object according to the speech synthesis parameter changed by the synthesis parameter determination unit.

The speech synthesis apparatus according to claim 4, wherein the synthesis parameter determination unit changes the speech synthesis parameter according to the environment around the object indicated by at least one of the brightness of the surroundings, the volume of the surrounding sound, whether other objects exist nearby, and the type of another object that the object is in contact with or possesses.

The speech synthesis device according to any one of claims 1 to 5, wherein the synthesis parameter determination unit changes any of voice quality, speech rate, and voice volume as a speech synthesis parameter.

The speech synthesizer according to any one of claims 1 to 3, wherein
the synthesis parameter determination unit determines the level of detail of the content spoken by the object according to the positional relationship between the user and the object, and
the speech synthesis unit includes:
an utterance text editing unit that summarizes the text indicating the utterance content of the object according to the level of detail determined by the synthesis parameter determination unit; and
a text-to-speech synthesis unit that generates synthesized speech having the text summarized by the utterance text editing unit as its utterance content.

The speech synthesizer according to any one of claims 1 to 3, wherein
the synthesis parameter determination unit determines the level of detail of the content spoken by the object according to the positional relationship between the user and the object, and
the speech synthesis unit includes:
an utterance text generation unit that generates, according to the level of detail determined by the synthesis parameter determination unit, text expressing given items as utterance content; and
a text-to-speech synthesis unit that generates synthesized speech having the text generated by the utterance text generation unit as its utterance content.

The speech synthesis apparatus according to any one of claims 1 to 4, wherein the synthesis parameter determination unit changes the synthesis speech parameter on the assumption that the synthesized speech is output from a position independent of the object.

The speech synthesis apparatus according to claim 10, wherein the synthesis parameter determination unit changes the synthesis speech parameter on the assumption that synthesized speech is output from the vicinity of the user.

The speech synthesis apparatus according to claim 2, wherein a direction in which the user is facing is used as the direction reference.

The speech synthesizer according to claim 2 or 3, wherein a user's line-of-sight direction is used as the direction reference.

The speech synthesis apparatus according to claim 3, wherein the synthesis parameter determination unit changes the voice quality according to whether or not the object is close to the front direction of the user.

The speech synthesis apparatus according to claim 3, wherein the synthesis parameter determination unit changes the sound volume according to whether or not the object is close to the front direction of the user.

The speech synthesis apparatus according to claim 3, wherein the synthesis parameter determination unit changes the synthesized speech parameter according to the environment surrounding the object specified by the positional relationship expressed in the polar coordinate system by using an absolute direction as the direction reference.

The speech synthesis apparatus according to claim 5, wherein the synthesis parameter determination unit changes the synthesized speech parameter according to the surrounding environment only when the positional relationship satisfies a predetermined condition.

The speech synthesis apparatus according to claim 1, wherein the synthesis parameter determination unit changes the synthesis speech parameter more drastically when the object associated with the synthesized speech is a virtual object that exists only in the virtual space.

The speech synthesis device according to claim 8 or 9, wherein the synthesis parameter determination unit changes the degree of detail according to whether or not the object is close to the front direction of the user.

The speech synthesis device according to claim 8 or 9, wherein the synthesis parameter determination unit changes the level of detail and the speech rate according to whether or not the object is close to the front direction of the user.

A speech synthesis method for synthesizing speech associated with an object,
According to the positional relationship between the user who observes the object and the object, the speech synthesis parameter indicating what kind of synthesized speech is generated as the speech associated with the object is changed,
A synthesized speech as a speech associated with the object is generated according to the changed speech synthesis parameter.

The speech synthesis method according to claim 21, wherein the speech synthesis parameter is changed according to a positional relationship between the user and the object indicated by any one of, or a combination of: the positional direction of the user and the object, indicated by the angle between a line connecting the user and the object and a predetermined direction reference determined according to a predetermined setting method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

A speech synthesis method for synthesizing speech associated with an object,
Depending on the environment around the object, change the speech synthesis parameters,
A synthesized speech as a speech associated with the object is generated according to the changed speech synthesis parameter.

According to the positional relationship between the user and the object, determine the level of detail of the content spoken by the object,
Summarizing text with the utterance content of the object according to the determined level of detail,
The speech synthesis method according to any one of claims 21 to 23, wherein a synthesized speech having the summarized text as an utterance content is generated.

According to the positional relationship between the user and the object, determine the level of detail of the content spoken by the object,
According to the determined level of detail, generate a text in which a given item is expressed as utterance content,
The speech synthesis method according to any one of claims 21 to 23, wherein a synthesized speech having the generated text as an utterance content is generated.

A speech synthesis program for synthesizing speech associated with an object, the program causing a computer to execute:
a parameter determination process for changing a speech synthesis parameter, which indicates what kind of synthesized speech is to be generated as the speech associated with the object, according to a positional relationship between the object and a user who observes the object; and
a speech synthesis process for generating synthesized speech as the speech associated with the object according to the changed speech synthesis parameter.

The speech synthesis program according to claim 26, causing the computer, in the parameter determination process, to change the speech synthesis parameter according to a positional relationship between the user and the object indicated by any one of, or a combination of: the positional direction of the user and the object, indicated by the angle between a line connecting the user and the object and a predetermined direction reference determined according to a predetermined setting method; the distance between the user and the object; and the relative angle between the front direction of the user and the front direction of the object.

A speech synthesis program for synthesizing speech associated with an object, the program causing a computer to execute:
a parameter determination process for changing, according to the environment around the object, a speech synthesis parameter indicating what kind of synthesized speech is to be generated as the speech associated with the object; and
a speech synthesis process for generating synthesized speech as the speech associated with the object according to the changed speech synthesis parameter.

On the computer,
In the parameter determination process, according to the positional relationship between the user and the object, the detail level of the content spoken by the object is determined,
29. The speech synthesis process, according to the determined level of detail, summarizes the text indicating the utterance content of the object, and generates synthesized speech using the summarized text as the utterance content. The speech synthesis method according to any one of the above.

On the computer,
In the parameter determination process, according to the positional relationship between the user and the object, the detail level of the content spoken by the object is determined,
27. The speech synthesis process generates a text in which a given item is expressed as utterance content according to the determined level of detail, and generates a synthesized speech using the generated text as the utterance content. 28. The speech synthesis method according to any one of 28.