JP2014130211A

JP2014130211A - Speech output device, speech output method, and program

Info

Publication number: JP2014130211A
Application number: JP2012287362A
Authority: JP
Inventors: Kumi Hatada; 久美幡田
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2014-07-10

Abstract

PROBLEM TO BE SOLVED: To impart a suitable expression to a synthesized voice of a conversation sentence when outputting the synthesized voice reading the conversation sentence aloud through voice synthesis.
SOLUTION: In document read-aloud processing (S115), specified document data is acquired from an information processing server. In document segmentation processing (S125), the specified document data is analyzed to specify each of the sentences included in the document represented by the specified document data, and in sentence structure determination processing (S130), each specified sentence is analyzed to specify whether it is a base sentence, a conversation sentence, or a mixed sentence. In expression specification processing (S135), an object explanation sentence is specified according to the sentence structure of each specified sentence, and an expression of the specific conversation sentence is specified based on the result of that specification. In speech synthesis processing (S145), speech synthesis is carried out so as to impart the specified expression to the specific conversation sentence, and the synthesized speech is output.

Description

The present invention relates to a speech output device, a speech output method, and a program that output synthesized speech based on text data.

Conventionally, text reading devices that read input text data aloud using well-known speech synthesis techniques have been known (see Patent Document 1).
The text reading device (speech output device) described in Patent Document 1 identifies the conversation sentences contained in the input text data from the quotation brackets in that data, and identifies the attributes of each identified conversation sentence based on an analysis of its context. The device then performs speech synthesis so that the synthesized speech has the identified attributes. The attributes here are, for example, the gender and age of the character who utters the conversation sentence.

Patent Document 1: Japanese Patent Laid-Open No. 05-313685

A text reading device is required to give the synthesized speech that reads a conversation sentence aloud an expression suited to that conversation sentence.
However, although the text reading device described in Patent Document 1 identifies the attributes of a conversation sentence, it does not identify the expression of the conversation sentence. The device therefore cannot give the synthesized speech that reads a conversation sentence aloud an expression suited to that conversation sentence.

In other words, the conventional technique has the problem that, when outputting synthesized speech that reads text data aloud, it is difficult to give an appropriate expression to the synthesized speech of a conversation sentence.
An object of the present invention is therefore to give an appropriate expression to the synthesized speech of a conversation sentence when outputting synthesized speech that reads the conversation sentence aloud.

The speech output device of the first invention, made to achieve the above object, comprises text acquisition means, sentence structure specifying means, expression estimation means, and speech synthesis means.
The text acquisition means acquires text data representing the character string that constitutes a designated text. The sentence structure specifying means specifies whether at least a part of each sentence contained in the text represented by the acquired text data is a conversation sentence or a narrative sentence (narration outside dialogue). The expression estimation means takes one of the conversation sentences specified by the sentence structure specifying means as a specific conversation sentence and estimates the expression of that specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence that explains the specific conversation sentence. The speech synthesis means outputs synthesized speech synthesized so that the expression estimated by the expression estimation means is reflected in the specific conversation sentence.

With such a speech output device, the expression of the specific conversation sentence can be estimated from the target explanatory sentence, and the synthesized speech that reads the specific conversation sentence aloud can be given an expression suited to that sentence.

That is, the speech output device of the first invention can give an appropriate expression to the synthesized speech of a conversation sentence when outputting synthesized speech that reads the conversation sentence aloud.
The "expression" in the first invention is a concept that includes at least the emotion, mood, scene, and situation of a conversation sentence.

Some sentences are mixed sentences, in which a conversation sentence and a narrative sentence are contained in a single sentence. The narrative sentence contained in a mixed sentence is usually likely to explain the conversation sentence contained in the same mixed sentence.
For this reason, when the sentence structure specifying means finds a mixed sentence in the text data, the expression estimation means of the first invention may take the conversation sentence contained in the mixed sentence as the specific conversation sentence and the narrative sentence contained in the mixed sentence as the target explanatory sentence.

Such a speech output device estimates the expression of the conversation sentence in a mixed sentence from the meaning of the narrative sentence in that mixed sentence, so the estimation accuracy is high.
That is, the speech output device of the first invention can give an appropriate expression to a conversation sentence contained in a mixed sentence.

A sentence consisting solely of a narrative sentence often explains a conversation sentence contained in the sentence immediately before or after it.
For this reason, if a conversation sentence exists in at least one of the sentence immediately before and the sentence immediately after a narrative sentence, the expression estimation means of the first invention may take that conversation sentence as the specific conversation sentence and the narrative sentence as the target explanatory sentence.

With such a speech output device of the first invention, even for a sentence that consists only of a conversation sentence, the target explanatory sentence that explains the conversation sentence can be identified, and the expression of the conversation sentence can be estimated from that target explanatory sentence.
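The two selection rules above (the narrative part of a mixed sentence, and narrative-only sentences adjacent to a conversation sentence) might be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Sentence` class, its field names, and `target_explanations` are hypothetical, and applying the mixed-sentence rule before the adjacency rule is an assumption, since this passage does not say how the two rules combine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    """One sentence with its structure already determined (hypothetical type)."""
    kind: str            # "narrative", "conversation", or "mixed"
    speech: str = ""     # conversational part, if any
    narration: str = ""  # narrative part, if any


def target_explanations(sentences: List[Sentence], index: int) -> List[str]:
    """Return the narrative text assumed to explain the conversation at `index`."""
    current = sentences[index]
    if current.kind == "mixed":
        # Rule 1: the narrative part of a mixed sentence explains its own speech.
        return [current.narration]
    if current.kind != "conversation":
        return []
    # Rule 2: a narrative-only sentence directly before or after explains the speech.
    targets = []
    for j in (index - 1, index + 1):
        if 0 <= j < len(sentences) and sentences[j].kind == "narrative":
            targets.append(sentences[j].narration)
    return targets


story = [
    Sentence("narrative", narration="He frowned."),
    Sentence("conversation", speech="Go away."),
    Sentence("mixed", speech="Fine,", narration="she sighed."),
    Sentence("narrative", narration="Rain fell."),
]
print(target_explanations(story, 1))
print(target_explanations(story, 2))
```

A conversation sentence with no narrative neighbour returns an empty list here; how such cases are handled is outside this sketch.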

Furthermore, the expression estimation means of the first invention may tally the expressions that correspond to the meanings of the individual words contained in the target explanatory sentence and take the expression with the highest count as the expression of the specific conversation sentence.

With such a speech output device, the expression with the highest count in the tally is taken as the expression of the specific conversation sentence, so a more appropriate expression can be estimated for each conversation sentence. In particular, the speech output device of the present invention can estimate the expression best suited to a conversation sentence even when that conversation sentence has a plurality of target explanatory sentences.
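The tally-and-take-the-maximum rule can be illustrated with a short sketch. The word-to-expression lexicon, the expression labels, and the function name below are all hypothetical; the source does not specify the dictionary used or how ties are broken (here the expression counted first wins a tie).

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical lexicon mapping a word to the expression its meaning suggests.
WORD_EXPRESSIONS: Dict[str, str] = {
    "laughed": "joy",
    "smiled": "joy",
    "shouted": "anger",
    "glared": "anger",
    "sobbed": "sadness",
}


def estimate_expression(
    target_sentences: List[List[str]],
    lexicon: Dict[str, str] = WORD_EXPRESSIONS,
) -> Optional[str]:
    """Tally the expression of every known word and return the most frequent one."""
    tally: Counter = Counter()
    for words in target_sentences:       # several target explanatory sentences are allowed
        for word in words:
            expression = lexicon.get(word)
            if expression is not None:
                tally[expression] += 1
    if not tally:
        return None                      # no word carried an expression
    return tally.most_common(1)[0][0]


result = estimate_expression([["she", "smiled", "and", "laughed"], ["he", "glared"]])
print(result)
```

Pooling the counts across all target explanatory sentences is one way to handle a conversation sentence that has several of them.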

A narrative sentence is also likely to explain a conversation sentence contained in the same paragraph.
The sentence structure specifying means of the present invention may divide the text contained in the text data into specific paragraphs, which are paragraphs serving as structural units, and specify whether at least a part of each sentence contained in each specific paragraph is a conversation sentence or a narrative sentence. In this case, the expression estimation means of the present invention may take a narrative sentence contained in the same specific paragraph as the specific conversation sentence as the target explanatory sentence.

Such a speech output device estimates the expression of a conversation sentence from the narrative sentences contained in the specific paragraph, so the accuracy of estimating the expression can be further improved.
The invention according to the present application may also be realized as a speech output method that outputs synthesized speech generated by speech synthesis.

In this case, the speech output method of the present invention comprises: a text acquisition procedure of acquiring text data representing the character string that constitutes a designated text; a sentence structure specifying procedure of specifying whether at least a part of each sentence contained in the text represented by the acquired text data is a conversation sentence or a narrative sentence; an expression estimation procedure of taking one of the specified conversation sentences as a specific conversation sentence and estimating the expression of that specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence that explains the specific conversation sentence; and a speech synthesis procedure of outputting synthesized speech synthesized so that the estimated expression is reflected in the specific conversation sentence.

With such a speech output method, the procedures can be executed by a plurality of different devices. As a result, the speech output method of the present invention allows a system comprising a plurality of devices to operate as the speech output device according to claim 1.

Furthermore, the invention according to the present application may be realized as a program (third invention) to be executed by a computer.
In this case, the program of the present invention causes a computer to execute: a text acquisition procedure of acquiring text data representing the character string that constitutes a designated text; a sentence structure specifying procedure of specifying whether at least a part of each sentence contained in the text represented by the acquired text data is a conversation sentence or a narrative sentence; an expression estimation procedure of taking one of the specified conversation sentences as a specific conversation sentence and estimating the expression of that specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence that explains the specific conversation sentence; and a speech synthesis procedure of outputting synthesized speech synthesized so that the estimated expression is reflected in the specific conversation sentence.

When the program of the present invention is realized in this way, it can be used, for example, by recording it on a computer-readable recording medium such as a DVD-ROM, CD-ROM, or hard disk and loading it into a computer and starting it as needed, or by having a computer obtain it over a communication line and start it as needed. By causing the computer to execute each procedure, the computer can be made to function as the speech output device according to claim 1.

FIG. 1 is a block diagram showing the overall configuration.
FIG. 2 is a flowchart showing the procedure of the text reading process in the first embodiment.
FIG. 3 is a flowchart showing the procedure of the paragraph division process.
FIG. 4 is a flowchart showing the procedure of the sentence division process in the first embodiment.
FIG. 5 is a flowchart showing the procedure of the sentence structure determination process.
FIG. 6 is a flowchart showing the procedure of the expression specifying process in the first embodiment.
FIG. 7 is a flowchart showing the procedure of the sentence type specifying process.
FIG. 8 is a flowchart showing the procedure of the expression assignment process.
FIG. 9 is a flowchart showing the procedure of the expression estimation process.
FIG. 10 is a flowchart showing the procedure of the expression candidate derivation process.
FIG. 11 shows (A) the result of the paragraph division process and (B) the result of the sentence structure determination process in the first embodiment.
FIG. 12 shows (A) the result of the sentence type specifying process and (B), (C) the results of the expression specifying process in the first embodiment.
FIG. 13 is a flowchart showing the procedure of the speech synthesis process.
FIG. 14 is a flowchart showing the procedure of the text reading process in the second embodiment.
FIG. 15 is a flowchart showing the procedure of the sentence division process in the second embodiment.
FIG. 16 is a flowchart showing the procedure of the expression specifying process in the second embodiment.
FIG. 17 shows (A) the result of the sentence structure determination process in the second embodiment and (B), (C) the results of the expression specifying process in the second embodiment.

Embodiments of the present invention are described below with reference to the drawings.
[First embodiment]
<Speech synthesis system>
The speech synthesis system 1 shown in FIG. 1 outputs the content of text data WT designated by the user as synthesized speech having characteristics designated by the user. The speech synthesis system 1 analyzes the text data WT designated by the user and performs speech synthesis by at least extracting, from a plurality of pre-registered voice data SD, the voice data SD desired by the user of the speech synthesis system 1.

To realize this, the speech synthesis system 1 includes at least one information processing server 10 and at least one speech output terminal 60. The speech synthesis system 1 of this embodiment includes a plurality of speech output terminals 60.
<Information processing server>
The information processing server 10 includes a communication unit 12, a control unit 20, and a storage unit 30. It is a server that stores at least text data WT representing the character strings that constitute texts, voice data SD containing at least the voice feature quantities of speech input in advance, and expression data ET representing the tendencies of speech to which an expression has been given.

The communication unit 12 allows the information processing server 10 to communicate with external devices over a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.
The control unit 20 is built around a well-known computer having at least a ROM 22 that stores processing programs and data whose contents must be retained even when power is turned off, a RAM 24 that temporarily stores processing programs and data, and a CPU 26 that executes various processes according to the processing programs stored in the ROM 22 and the RAM 24. The control unit 20 controls the communication unit 12 and the storage unit 30.

The storage unit 30 is a non-volatile storage device whose contents can be read and written, for example a hard disk drive or flash memory.
The storage unit 30 stores the text data WT, the voice data SD, and the expression data ET.

The text data WT is, for example, data obtained by converting a book into text data, and is prepared in advance for each book. The books here are, for example, novels.
The voice data SD is data in which a voice parameter PV_i and tag data TG_i are associated with each other for each speaker.

A voice parameter PV is prepared for each waveform of a sound uttered by a person and is at least one feature quantity representing that voice waveform i. These feature quantities are the speech features used in so-called formant synthesis and are prepared for each speaker and for each phoneme. The feature quantities in the voice parameter PV include at least the fundamental frequency F0, the mel-frequency cepstrum (MFCC), the phoneme length, and the power of each phoneme in the uttered speech, together with their time differences.

The tag data TG represents the nature of the sound represented by the voice parameter PV. It includes at least speaker feature information, which represents the characteristics of the speaker, and expression information, which represents the speaker's expression when the speech was uttered. The speaker feature information includes, for example, the speaker's gender and age. In addition to information representing emotion as an expression, the expression information may include information representing the scene, mood, or atmosphere at the time of utterance, and information needed to estimate the speaker's expression.
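As a rough picture of how the voice data SD pairs a voice parameter PV with tag data TG, the following sketch models both as simple records. Every class name, field name, and unit is hypothetical; the time differences of the features are omitted for brevity, and the source does not describe a concrete storage layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceParameter:
    """Feature quantities (PV) for one phoneme uttered by one speaker."""
    f0: float              # fundamental frequency F0 (assumed to be in Hz)
    mfcc: List[float]      # mel-frequency cepstrum values
    phoneme_length: float  # assumed to be in seconds
    power: float


@dataclass
class TagData:
    """Tag data (TG): speaker features plus the expression at utterance time."""
    gender: str
    age: int
    expression: str


@dataclass
class VoiceDataEntry:
    """One PV/TG pair of the voice data SD."""
    speaker_id: str
    phoneme: str
    pv: VoiceParameter
    tg: TagData


def select_entries(entries: List[VoiceDataEntry], gender: str, expression: str) -> List[VoiceDataEntry]:
    """Pick the entries whose tags match the requested speaker trait and expression."""
    return [e for e in entries if e.tg.gender == gender and e.tg.expression == expression]


entries = [
    VoiceDataEntry("spk1", "a", VoiceParameter(220.0, [1.0, 0.5], 0.08, 0.6), TagData("female", 30, "joy")),
    VoiceDataEntry("spk2", "a", VoiceParameter(120.0, [0.8, 0.4], 0.09, 0.5), TagData("male", 45, "anger")),
]
print([e.speaker_id for e in select_entries(entries, "female", "joy")])
```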

The expression data ET represents the tendency of the voice parameters PV that appears for each expression. The expression data ET in this embodiment is generated by clustering the voice data SD, and for each cluster it associates representative tag data TD, a representative parameter DPV, and a variance parameter CPV_V.

The representative tag data TD is the tag data TG that represents the cluster among the group of tag data TG contained in that cluster.
The representative parameter DPV is the difference between the average parameter CPV_A of the cluster and the reference parameter NPV. The average parameter CPV_A is the mean of the voice parameters PV contained in the cluster. The reference parameter NPV is the mean of all voice parameters PV that satisfy a specific condition, namely that the expression in the tag data TG indicates a natural (neutral) state.
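The per-cluster statistics of the expression data ET can be written out as a small numeric sketch. It assumes each voice parameter PV is a fixed-length vector of floats, that the difference is taken as CPV_A minus NPV, and that the per-cluster variance CPV_V is a population variance; none of these choices is stated in the source, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def variance_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise population variance (an assumed reading of CPV_V)."""
    n = len(vectors)
    means = mean_vector(vectors)
    return [
        sum((x - mu) ** 2 for x in column) / n
        for column, mu in zip(zip(*vectors), means)
    ]


def representative_parameter(
    cluster: Sequence[Sequence[float]],
    neutral: Sequence[Sequence[float]],
) -> Vector:
    """DPV = CPV_A - NPV (the sign convention is an assumption)."""
    cpv_a = mean_vector(cluster)   # average parameter of the cluster
    npv = mean_vector(neutral)     # reference parameter from neutral-expression voices
    return [a - b for a, b in zip(cpv_a, npv)]


cluster_pv = [[220.0, 1.5], [240.0, 2.5]]   # e.g. [F0, power] for an expressive cluster
neutral_pv = [[200.0, 1.0], [210.0, 2.0]]   # voices tagged with a neutral expression
print(representative_parameter(cluster_pv, neutral_pv))
print(variance_vector(cluster_pv))
```

Storing the offset from a neutral baseline rather than the raw cluster mean would let the same expression offset be applied to any selected speaker's voice, which is one plausible reason for defining DPV as a difference.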

The variance parameter CPV_V is the variance of the voice parameters PV contained in each cluster.
<Speech output terminal>
The speech output terminal 60 includes a communication unit 61, an information receiving unit 62, a display unit 63, a sound input unit 64, a sound output unit 65, a storage unit 66, and a control unit 70. The speech output terminal 60 in this embodiment may be, for example, a well-known portable terminal or a well-known information processing device such as a personal computer. Portable terminals include well-known e-book readers and portable information terminals such as mobile phones and tablets.

The communication unit 61 allows the speech output terminal 60 to exchange information with external devices over a communication network. The information receiving unit 62 receives information input via an input device (not shown). The display unit 63 displays images based on signals from the control unit 70.

The sound input unit 64 is a device that converts sound into an electric signal and inputs it to the control unit 70, for example a microphone. The sound output unit 65 is a well-known device that outputs sound and includes, for example, a PCM sound source and a speaker. The storage unit 66 is a non-volatile storage device whose contents can be read and written, and it stores various processing programs and data.

The control unit 70 is built around a well-known computer having at least a ROM 72, a RAM 74, and a CPU 76.
The ROM 72 of the control unit 70 stores a processing program with which the control unit 70 executes the text reading process. This process analyzes the text data WT designated by the user and, based on the result of that analysis, extracts the voice data SD desired by the user from a plurality of pre-registered voice data SD and performs speech synthesis.

When the text data WT is analyzed, the type of text data is identified from, for example, the file extension when the text data WT is read from the storage unit 30, the character symbols (character codes) corresponding to that type of text data are identified, and the text is then analyzed. For example, character symbols contained in the text data such as line feed (LF) and carriage return (CR) are distinguished from line-break symbols.
<Text reading process>
The text reading process executed by the control unit 70 of the speech output device is started when a start command is input via the information receiving unit 62 of the speech output terminal 60.

When started, the text reading process first acquires information input via the information receiving unit 62 (hereinafter "text designation information"), as shown in FIG. 2 (S110). The text designation information acquired in S110 designates the text data WT that the user wants read aloud by speech synthesis.

Next, the text data WT corresponding to the text designation information acquired in S110 (hereinafter "designated text data") is acquired from the storage unit 30 of the information processing server 10 (S115).
The designated text data is then analyzed, and a paragraph division process that identifies each paragraph of the text represented by the designated text data is executed (S120). For each paragraph identified in the paragraph division process (S120), a sentence division process that identifies each sentence contained in the paragraph is executed (S125). Each sentence identified in the sentence division process (S125) is then analyzed, and a sentence structure determination process is executed to specify its sentence structure, that is, whether it is a narrative sentence, a conversation sentence, or a mixed sentence in which a narrative sentence and a conversation sentence coexist in one sentence (S130).
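The three-way classification made in S130 could look roughly like the sketch below. It assumes that conversational parts are delimited by the Japanese quotation brackets 「 and 」, which matches the bracket-based identification mentioned for the prior art but is not confirmed as the rule used in S130; the function name and the returned labels are hypothetical.

```python
import re

# A quoted span: opening bracket, any non-closing characters, closing bracket.
QUOTED = re.compile(r"「[^」]*」")


def sentence_structure(sentence: str) -> str:
    """Classify a sentence as "narrative", "conversation", or "mixed"."""
    if not QUOTED.search(sentence):
        return "narrative"                 # no quoted speech at all
    outside = QUOTED.sub("", sentence).strip()
    # Text left over outside the brackets means narration shares the sentence.
    return "mixed" if outside else "conversation"


for s in ["彼は笑った。", "「やあ」", "「やあ」と彼は言った。"]:
    print(s, sentence_structure(s))
```

Nested or unbalanced brackets are not handled; a production classifier would need a rule for them.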

Next, an expression specifying process is executed (S135). According to the sentence structure of each sentence specified in the sentence structure determination process, this process identifies the target explanatory sentence, which is the explanatory sentence for a conversation sentence, and specifies the expression of that conversation sentence based on the result.

Output property information input via the information receiving unit 62 is then acquired (S140). The output property information corresponds to the speaker feature information and is, for example, the gender and age of the speaker as properties of the sound to be output as synthesized speech.

Next, a speech synthesis process that generates and outputs synthesized speech reading the content of the designated text data aloud is executed (S145).
The text reading process then ends.
<Paragraph division processing>
When the paragraph division process started in S120 of the text reading process begins, it acquires the first character symbol of the designated text data, as shown in FIG. 3 (S210). A character symbol here is a symbol used in text; specifically, it includes not only characters themselves but also marks such as quotation brackets and exclamation marks.

Subsequently, it is determined whether the acquired character symbol is a closing bracket (」) (S215). General text uses 」 to mark the end of a conversation sentence. If the determination in S215 finds that the acquired character symbol is not a closing bracket (S215: NO), it is determined whether the acquired character symbol is a full stop (S220). If the determination in S220 finds that it is not a full stop (S220: NO), the process proceeds to S240, described later.

On the other hand, if the acquired character symbol is a full stop (S220: YES) or a closing bracket (S215: YES), the next character symbol along the flow of the text in the designated text data is acquired (S225).

Then, it is determined whether the character symbol acquired in S225 is a line feed symbol (S230). If it is (S230: YES), the acquired character string, that is, the string of characters acquired since S235 was last executed, is treated as the text of one paragraph and stored in association with paragraph identification information that identifies the paragraph (hereinafter, "paragraph ID") (S235). The process then proceeds to S240.

The information stored in S235, which associates a paragraph ID with the text of the paragraph that the ID identifies, is called paragraph information.
On the other hand, if the determination in S230 finds that the acquired character symbol is not a line feed symbol (S230: NO), the process proceeds to S240 without executing S235.

In S240, it is determined whether the designated text data contains a next character symbol along the flow of the text. If it does (S240: YES), that character symbol is acquired (S245), and the process returns to S215.

On the other hand, if no next character symbol is found (S240: NO), the processing of S215 through S245 is regarded as complete up to the last character symbol of the text in the designated text data; the paragraph division process ends, and the process proceeds to S125 of the text reading process.
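
As an illustration only, the boundary rule described above (a paragraph ends where a line feed immediately follows a closing bracket or a full stop) can be sketched in Python. The function name `split_paragraphs`, the integer paragraph IDs, and the flushing of trailing text as a final paragraph are assumptions made for this sketch and are not taken from the embodiment. The sketch also tests every character rather than reproducing the exact step order of FIG. 3.

```python
def split_paragraphs(text: str) -> dict[int, str]:
    """Split text where a line feed directly follows a closing bracket or full stop."""
    paragraphs: dict[int, str] = {}
    buffer: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        buffer.append(ch)
        # Roughly S215/S220: is this a closing bracket or a full stop?
        if ch in ("」", "。") and i + 1 < len(text):
            # Roughly S225/S230: look at the next symbol and test for a line feed.
            if text[i + 1] == "\n":
                # Roughly S235: store the accumulated string under a new paragraph ID.
                paragraphs[len(paragraphs) + 1] = "".join(buffer)
                buffer = []
                i += 1  # skip the line feed itself
        i += 1
    # Assumption of this sketch: text left over at the end forms a last paragraph.
    tail = "".join(buffer).strip()
    if tail:
        paragraphs[len(paragraphs) + 1] = tail
    return paragraphs


if __name__ == "__main__":
    sample = "彼は言った。\n「おはよう」\n途中の\n改行。"
    for paragraph_id, body in split_paragraphs(sample).items():
        print(paragraph_id, repr(body))
```

Note that a line feed in the middle of a sentence (after a character other than 」 or 。) does not start a new paragraph in this sketch, which matches the rule stated in S215 through S235.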

In other words, the paragraph division process of this embodiment analyzes the text represented by the designated text data and identifies each paragraph that makes up the text. As a result, paragraph information that groups the sentences contained in the text data by paragraph is generated, as shown in FIG. 11(A).
<Sentence division processing>
When the sentence division process started in S125 of the text reading process begins, the first character symbol of the first paragraph identified from the paragraph information is acquired, as shown in FIG. 4 (S310).

Then, it is determined whether the acquired character symbol is an opening bracket (「) (S315). General text uses 「 to mark the beginning of a conversation sentence. If the acquired character symbol is not an opening bracket (S315: NO), the process proceeds to S335, described later.

On the other hand, if the acquired character symbol is an opening bracket (S315: YES), the next character symbol along the flow of the text in the designated text data is acquired (S320). It is then determined whether the character symbol acquired in S320 is a closing bracket (S325). If it is a closing bracket (S325: YES), the process returns to S320.

On the other hand, if the character symbol is not a closing bracket (S325: NO), it is determined whether the character symbol acquired in S320 is a line feed symbol (hereinafter, "line feed") (S330). If it is not a line feed (S330: NO), the process proceeds to S335.

In S335, it is determined whether the acquired character symbol is a full stop. If it is a full stop (S335: YES), or if the determination in S330 finds that the character symbol acquired in S320 is a line feed (S330: YES), the process proceeds to S340.

In S340, the string of character symbols acquired since S340 was last executed is treated as one sentence and stored in association with sentence identification information that identifies the sentence (hereinafter, "sentence ID") and with the paragraph ID. The process then proceeds to S345.

The information stored in S340, which associates a paragraph ID, a sentence ID, and the string of character symbols making up the sentence identified by that paragraph ID and sentence ID, is called sentence information.
If the determination in S335 finds that the acquired character symbol is not a full stop (S335: NO), the process proceeds to S345 without executing S340.

In S345, it is determined whether the paragraph on which S315 through S340 have been executed contains a next character symbol along the flow of the text. If it does (S345: YES), that character symbol is acquired (S360), and the process returns to S315.

On the other hand, if no next character symbol is found (S345: NO), it is determined whether a next paragraph identified from the paragraph information exists (S350). If a next paragraph exists (S350: YES), that paragraph is acquired (S355). The first character symbol of the paragraph acquired in S355 is then acquired (S360), and the process returns to S315.

If the determination in S350 finds no next paragraph (S350: NO), the processing of S315 through S340 is regarded as complete up to the last character symbol of the text in the designated text data; the sentence division process ends, and the process proceeds to S130 of the text reading process.

In other words, the sentence division process of this embodiment analyzes the text represented by the designated text data, identifies each sentence contained in each paragraph, and generates sentence information that associates a paragraph ID, a sentence ID, and the string of character symbols making up the sentence identified by them.
<Sentence structure determination processing>
When the sentence structure determination process started in S130 of the text reading process begins, the first sentence along the flow of the text in the designated text data is acquired, as shown in FIG. 5 (S410).

Subsequently, it is determined whether the first character symbol of the acquired sentence is an opening bracket and its last character symbol is a closing bracket (S415). If both are brackets (S415: YES), "type 1", which indicates that the acquired sentence is a conversation sentence, is added to the sentence information for that sentence as the identification information of its sentence structure type (S420). The process then proceeds to S460, described later.

On the other hand, if the first and last character symbols of the acquired sentence are not brackets (S415: NO), it is determined whether the acquired sentence contains any brackets (S425). If it contains none (S425: NO), "type 3", which indicates that the acquired sentence is a narrative sentence (text outside any dialogue), is added to the sentence information for that sentence as the identification information of its sentence structure type (S430). The process then proceeds to S460, described later.

If the determination in S425 finds brackets in the acquired sentence (S425: YES), morphological analysis of the acquired sentence is executed (S435). Any well-known method may be used for this morphological analysis.

Subsequently, based on the result of the morphological analysis in S435, it is determined whether the word immediately after the closing bracket in the acquired sentence (hereinafter, "determination target word") is an attached word (S440). An attached word here is a word that attaches to an independent word; specifically, it includes at least one of a particle and an auxiliary verb.

If the determination in S440 finds that the determination target word is an attached word (S440: YES), "type 2", which indicates that the acquired sentence is a mixed sentence, is added to the sentence information for that sentence as the identification information of its sentence structure type (S445). The process then proceeds to S460, described later.

On the other hand, if the determination target word is not an attached word (S440: NO), new sentence information with the sentence structure type "type 1" is generated for the part of the acquired sentence inside the brackets (S450). New sentence information with the sentence structure type "type 3" is likewise generated for the part outside the brackets (S455). The process then proceeds to S460.

In other words, S450 and S455 split the acquired sentence into the part inside the brackets and the part outside the brackets and generate new sentence information for each. The sentence information for the part inside the brackets is given "type 1" as its sentence structure type, and the sentence information for the part outside the brackets is given "type 3".
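
The decision sequence of S415 through S455 can be illustrated with a short Python sketch. It is a simplified reading, not the embodiment's implementation. The description leaves the morphological analyzer open, so a fixed, hypothetical tuple of quotative particles (`QUOTE_PARTICLES`) stands in for the attached-word test of S440. The names `SentenceInfo` and `classify`, and the handling of unpaired brackets as narrative text, are likewise assumptions of the sketch.

```python
from dataclasses import dataclass

OPEN, CLOSE = "「", "」"
# Hypothetical stand-in for the morphological analysis of S435/S440:
# particles treated as an "attached word" when they directly follow 」.
QUOTE_PARTICLES = ("と", "って")


@dataclass
class SentenceInfo:
    text: str
    structure_type: int  # 1 = conversation, 2 = mixed, 3 = narrative


def classify(sentence: str) -> list[SentenceInfo]:
    # S415/S420: the whole sentence is enclosed in brackets -> type 1.
    if sentence.startswith(OPEN) and sentence.endswith(CLOSE):
        return [SentenceInfo(sentence, 1)]
    start = sentence.find(OPEN)
    end = sentence.find(CLOSE, start + 1) if start != -1 else -1
    # S425/S430: no brackets -> type 3.
    # (Assumption: an unpaired bracket is also treated as narrative text.)
    if end == -1:
        return [SentenceInfo(sentence, 3)]
    # S440/S445: an attached word right after the closing bracket -> type 2.
    if sentence[end + 1:].startswith(QUOTE_PARTICLES):
        return [SentenceInfo(sentence, 2)]
    # S450/S455: otherwise split into a type 1 part and a type 3 part.
    inside = sentence[start:end + 1]
    outside = sentence[:start] + sentence[end + 1:]
    return [SentenceInfo(inside, 1), SentenceInfo(outside, 3)]


if __name__ == "__main__":
    for s in ["「おはよう」", "彼は笑った。", "「行こう」と彼は言った。", "彼は振り向いた。「誰だ」"]:
        print(s, [(info.text, info.structure_type) for info in classify(s)])
```

A real system would replace the particle tuple with the part-of-speech output of a morphological analyzer, since S440 covers auxiliary verbs as well as particles.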

In S460, it is determined whether a next sentence along the flow of the text in the designated text data exists. If it does (S460: YES), that sentence is acquired (S465), and the process returns to S415.

On the other hand, if no next sentence exists (S460: NO), the sentence structure determination process ends, and the process proceeds to S135 of the text reading process.
In other words, the sentence structure determination process determines whether each sentence contained in the designated text data is a conversation sentence, a narrative sentence, or a mixed sentence. As shown in FIG. 11(B), the sentence structure type indicating which of these applies is then added to the sentence information for each sentence.
<Expression identification processing>
When the expression identification process started in S135 of the text reading process begins, a sentence type identification process is executed, as shown in FIG. 6 (S510). Based on the result of the sentence structure determination process executed earlier, this process assigns a sentence structure type to the part inside the brackets and to the part outside the brackets of each mixed sentence.

Specifically, as shown in FIG. 7, the sentence type identification process acquires the sentence information for the first sentence along the flow of the text in the designated text data (S610). It then acquires the sentence structure type from the sentence information acquired in S610 (S615) and determines whether that type is "type 2" (S620).

If the determination in S620 finds that the sentence structure type is "type 2" (S620: YES), the part inside the brackets of the sentence corresponding to that sentence information is set as a conversation sentence (hereinafter, "in-bracket conversation sentence") (S625). The part outside the brackets of that sentence is set as a narrative sentence (hereinafter, "out-of-bracket narrative sentence") (S630).

In other words, as shown in FIG. 12(A), S625 and S630 add detailed sentence structure type information to a mixed sentence: information indicating a conversation sentence is attached to the part inside the brackets, and information indicating a narrative sentence is attached to the part outside the brackets.

The process then proceeds to S635.
On the other hand, if the determination in S620 finds that the sentence structure type is not "type 2" (S620: NO), the process proceeds to S635 without executing S625 and S630.

In S635, it is determined whether a next sentence along the flow of the text in the designated text data exists. If it does (S635: YES), the process returns to S610, and the sentence information for that next sentence is acquired (S610).

On the other hand, if no next sentence exists (S635: NO), the sentence type identification process ends.
The expression identification process then executes an expression assignment process (S520). For a specific conversation sentence, which is one of the conversation sentences, this process derives an expression frequency, which indicates how likely each expression is to be conveyed by that specific conversation sentence, based on an explanation target sentence, that is, a narrative sentence that describes the specific conversation sentence.

Specifically, as shown in FIG. 8, the expression assignment process acquires the character string of the first sentence, along the flow of the text in the designated text data, whose sentence structure type was identified as "type 2" in the preceding sentence type identification process (hereinafter, "estimation target sentence") (S715). It then acquires the character string set as the out-of-bracket narrative sentence in the estimation target sentence acquired in S715 (hereinafter, "estimation use sentence") (S720).

Subsequently, an expression estimation process is executed (S725). It takes the in-bracket conversation sentence of the estimation target sentence (hereinafter, "expression estimation target sentence") as the specific conversation sentence, analyzes the estimation use sentence as the explanation target sentence, and derives the expression frequency for the specific conversation sentence from the result.

Specifically, as shown in FIG. 9, the expression estimation process performs morphological analysis on the estimation use sentence (S910) and acquires, for each word identified by the analysis, the corresponding word expression information (S915). Word expression information associates each word in advance with the expression that the word represents, and is stored beforehand in the word expression database 100.

Then, according to the word expression information acquired in S915, the number of occurrences of each expression is tallied per expression to derive the expression frequency (S920), and the process returns to the expression assignment process.
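
The tallying in S910 through S920 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption: the dictionary `WORD_EXPRESSIONS` is a tiny hypothetical stand-in for the word expression database 100, its entries and expression labels are invented for illustration, and plain substring matching replaces the morphological analysis of S910.

```python
from collections import Counter

# Hypothetical stand-in for the word expression database 100:
# word stem -> expression label. The entries are illustrative only.
WORD_EXPRESSIONS = {
    "笑": "joy",
    "嬉し": "joy",
    "怒": "anger",
    "泣": "sadness",
}


def estimate_expression_frequency(explanatory_sentence: str) -> Counter:
    """Tally how often each expression is suggested by the sentence (cf. S920)."""
    frequency: Counter = Counter()
    for word, expression in WORD_EXPRESSIONS.items():
        # Substring counting stands in for morphological analysis (S910)
        # followed by the database lookup (S915).
        hits = explanatory_sentence.count(word)
        if hits:
            frequency[expression] += hits
    return frequency


if __name__ == "__main__":
    print(estimate_expression_frequency("と彼は嬉しそうに笑った。"))
```

Using a `Counter` keeps the result in the same shape as the per-expression tally that the description calls the expression frequency.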

After returning from S920, the expression assignment process determines whether a next estimation target sentence along the flow of the text in the designated text data exists (S730). If it does (S730: YES), the process returns to S715, and that next estimation target sentence is acquired.

On the other hand, if no next estimation target sentence exists (S730: NO), the expression assignment process ends, and the process proceeds to S530 of the expression identification process.
In S530 of the expression identification process, an expression candidate derivation process is executed. It takes conversation sentences other than in-bracket conversation sentences as specific conversation sentences and derives, based on explanation target sentences, the expression frequency indicating how likely each expression is to be conveyed by each specific conversation sentence.

Specifically, as shown in FIG. 10, the expression candidate derivation process acquires, based on the sentence information, all sentences contained in the same paragraph, that is, all sentences associated with the same paragraph ID (S810). It then determines whether more than one sentence was acquired in S810 (S815).

If the determination in S815 finds that the paragraph contains only one sentence (S815: NO), the process proceeds to S865, described later. On the other hand, if the paragraph contains multiple sentences (S815: YES), it is determined, based on the sentence information for each of those sentences, whether any of them has been given "type 2" or "type 3" as its sentence structure type (S820).

If the determination in S820 finds no sentence given "type 2" or "type 3" (S820: NO), the process proceeds to S865, described later. On the other hand, if such a sentence exists (S820: YES), one sentence given "type 2" or "type 3" is acquired, following the flow of the text in the designated text data (S825). The sentence acquired in S825 is called the second estimation use sentence.

Then, an expression estimation process that derives the expression frequency for a specific conversation sentence is executed, based on the result of analyzing the second estimation use sentence acquired in S825 as the explanation target sentence (S830). The expression estimation process executed in S830 is the same as the one executed in S725, so a detailed description is omitted here.

Subsequently, it is determined whether a sentence exists immediately before the second estimation use sentence along the flow of the text in the designated text data (S835). If one exists (S835: YES), it is determined whether that immediately preceding sentence contains a conversation sentence (S840).

If the determination in S840 finds that the immediately preceding sentence contains a conversation sentence (S840: YES), that conversation sentence is taken as the specific conversation sentence, and the result of the expression estimation process in S830 is added to the expression frequency for that specific conversation sentence (S845). The process then proceeds to S850.

On the other hand, if no sentence exists immediately before the second estimation use sentence (S835: NO), or if the immediately preceding sentence contains no conversation sentence (S840: NO), the process proceeds to S850 without executing S845.

In S850, it is determined whether a sentence exists immediately after the second estimation use sentence. If one exists (S850: YES), it is determined whether that immediately following sentence contains a conversation sentence (S855).

If the determination in S855 finds that the immediately following sentence contains a conversation sentence (S855: YES), that conversation sentence is taken as the specific conversation sentence, and the result of the expression estimation process in S830 is added to the expression frequency for that specific conversation sentence (S860). The process then proceeds to S820.

On the other hand, if no sentence exists immediately after the second estimation use sentence (S850: NO), or if the immediately following sentence contains no conversation sentence (S855: NO), the process proceeds to S820 without executing S860.

As noted above, if a paragraph contains only one sentence (S815: NO), or if the paragraph contains no sentence given "type 2" or "type 3" as its sentence structure type (S820: NO), the process proceeds to S865. In S865, it is determined whether a next paragraph exists along the flow of the text in the designated text data.

If the determination in S865 finds that a next paragraph exists (S865: YES), the process returns to S810; if no next paragraph exists (S865: NO), the expression candidate derivation process ends, and the process proceeds to S540 of the expression identification process.
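
The neighbour-based accumulation of S810 through S865 can be sketched as follows. This is a simplified reading under stated assumptions, not the embodiment's implementation: the `Sentence` record, the `estimate` helper with its two-entry dictionary, and the function name `derive_candidates` are hypothetical. Neighbouring sentences are looked up only within the same paragraph, and a sentence of type 1 or type 2 is taken to "contain a conversation sentence".

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

# Hypothetical miniature of the word expression database; illustrative only.
WORD_EXPRESSIONS = {"笑": "joy", "怒": "anger"}


@dataclass
class Sentence:
    paragraph_id: int
    text: str
    structure_type: int  # 1 = conversation, 2 = mixed, 3 = narrative
    expression_freq: Counter = field(default_factory=Counter)


def estimate(text: str) -> Counter:
    """Stand-in for the expression estimation process (S830)."""
    freq: Counter = Counter()
    for word, expression in WORD_EXPRESSIONS.items():
        freq[expression] += text.count(word)
    return +freq  # unary plus drops zero counts


def derive_candidates(sentences: list[Sentence]) -> None:
    by_paragraph: dict[int, list[Sentence]] = defaultdict(list)
    for s in sentences:  # S810: group sentences by paragraph ID
        by_paragraph[s.paragraph_id].append(s)
    for group in by_paragraph.values():
        if len(group) < 2:  # S815: a single-sentence paragraph is skipped
            continue
        for i, s in enumerate(group):
            if s.structure_type not in (2, 3):  # S820/S825
                continue
            freq = estimate(s.text)  # S830
            for j in (i - 1, i + 1):  # S835-S845 and S850-S860
                if 0 <= j < len(group) and group[j].structure_type in (1, 2):
                    group[j].expression_freq += freq


if __name__ == "__main__":
    data = [
        Sentence(1, "「やった」", 1),
        Sentence(1, "彼は笑った。", 3),
        Sentence(2, "「……」", 1),
    ]
    derive_candidates(data)
    for s in data:
        print(s.text, dict(s.expression_freq))
```

With counts held in a `Counter`, the expression with the largest tally for a sentence can afterwards be read with `most_common(1)`.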

In S540 of the expression identification process, the expression of each specific conversation sentence is identified based on the expression frequency for that sentence. Specifically, in S540 of this embodiment, the expression frequency for each specific conversation sentence is a tally of the number of occurrences of each expression, as shown in FIG. 12(B), so the expression with the largest tallied value is identified as the expression of the specific conversation sentence associated with that expression frequency. As shown in FIG. 12(C), the identified expression is then assigned to its corresponding specific conversation sentence.
<Speech synthesis processing>
When the speech synthesis process started in S145 of the text reading process begins, the first sentence along the flow of the text in the designated text data is acquired as the output wording, as shown in FIG. 13 (S1010).

Subsequently, the voice parameters PV that correspond to each phoneme needed to output the acquired output wording as synthesized speech, and that are associated with the representative tag data TD most similar to the speaker characteristic information in the output property information acquired in S140, are acquired from the information processing server 10 (S1015).

Subsequently, it is determined whether an expression is assigned to the acquired output wording (S1020). If no expression is assigned (S1020: NO), the process proceeds to S1035, which is described in detail later.

On the other hand, if the determination in S1020 finds that an expression is assigned to the acquired output wording (S1020: YES), expression data ET including the representative tag data TD most similar to the assigned expression is acquired from the information processing server 10 (S1025). Then, the voice parameters PV acquired in S1015 are adjusted based on the expression data ET acquired in S1025 so that a synthesized sound suited to the acquired output wording is output (S1030).

Subsequently, speech synthesis is performed based on the voice parameters PV adjusted in S1030 (S1035). The speech synthesis in S1035 uses a well-known formant synthesis technique. When S1035 is reached because no expression is assigned to the output wording (S1020: NO), S1025 and S1030 are not executed, and formant synthesis is performed based on the voice parameters PV acquired in S1015.
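The branch at S1020 and the adjustment in S1030 can be sketched as follows. This is only an illustration: the source does not describe the contents of the voice parameters PV or the expression data ET, so the fields (`pitch_hz`, `rate`, `volume`), the multiplicative scaling, and all class and function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceParams:
    """Hypothetical stand-in for the voice parameters PV."""
    pitch_hz: float
    rate: float
    volume: float


@dataclass(frozen=True)
class ExpressionData:
    """Hypothetical stand-in for the expression data ET (assumed scale factors)."""
    pitch_scale: float
    rate_scale: float
    volume_scale: float


def params_for_synthesis(pv: VoiceParams, et: Optional[ExpressionData]) -> VoiceParams:
    """Return the parameters that would be handed to the synthesizer in S1035."""
    if et is None:
        # S1020: NO -> S1025 and S1030 are skipped; PV from S1015 is used unchanged.
        return pv
    # S1030: adjust PV based on ET (multiplicative scaling is an assumption).
    return VoiceParams(
        pitch_hz=pv.pitch_hz * et.pitch_scale,
        rate=pv.rate * et.rate_scale,
        volume=pv.volume * et.volume_scale,
    )


base = VoiceParams(pitch_hz=200.0, rate=1.0, volume=1.0)
angry = ExpressionData(pitch_scale=1.5, rate_scale=1.25, volume_scale=2.0)
adjusted = params_for_synthesis(base, angry)
```

In this sketch an output wording without an assigned expression is represented by passing `None`, which leaves the parameters untouched, mirroring the S1020: NO path.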

Subsequently, it is determined whether a next sentence exists, following the flow of the text in the designated text data (S1040). If a next sentence exists (S1040: YES), that sentence is acquired as the output wording (S1045), and the process returns to S1015.

On the other hand, if the determination in S1040 finds no next sentence (S1040: NO), the speech synthesis process, and with it the text reading process, ends.
[Effects of the first embodiment]
As described above, the speech synthesis system 1 can estimate the expression of a specific conversation sentence based on the target explanatory sentence, and can give the synthesized sound that reads the specific conversation sentence aloud an expression suited to that sentence.

In other words, when outputting a synthesized sound that reads a conversation sentence aloud by speech synthesis, the speech synthesis system 1 can give an appropriate expression to that synthesized sound.
In a mixed sentence, in which a conversation sentence and a narrative sentence are contained in a single sentence, the narrative sentence is highly likely to explain the conversation sentence contained in the same mixed sentence.

Therefore, the speech synthesis system 1 treats the conversation sentence in a mixed sentence as the specific conversation sentence and the narrative sentence in that mixed sentence as the target explanatory sentence. Because the expression of the conversation sentence is estimated from the narrative sentence in the same mixed sentence, the estimation accuracy is high.

That is, the speech synthesis system 1 can give an appropriate expression to a conversation sentence contained in a mixed sentence.
In addition, a sentence consisting solely of a narrative sentence often explains a conversation sentence contained in the sentence immediately before or after it.

Therefore, the speech synthesis system 1 treats the conversation sentences contained in the sentences immediately before and after a narrative sentence as specific conversation sentences, and treats that narrative sentence as the target explanatory sentence. As a result, even for a sentence consisting solely of a conversation sentence, the target explanatory sentence that explains it can be identified.
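The two pairing rules described so far (a mixed sentence explains its own conversation part, and a narrative sentence explains the conversation in the adjacent sentences) can be sketched as below. The data layout, the structure labels, and the function name are hypothetical; treating an adjacent mixed sentence as containing a conversation sentence is also an assumption of this sketch.

```python
from collections import defaultdict


def find_target_explanations(sentences):
    """Map each sentence ID that contains a conversation sentence to the IDs
    of the sentences whose narrative text is used to explain it.

    `sentences` is a list of (sentence_id, structure) pairs in document order,
    where structure is "narrative", "conversation", or "mixed" (assumed labels).
    """
    targets = defaultdict(list)
    for idx, (sid, kind) in enumerate(sentences):
        if kind == "mixed":
            # The narrative part of a mixed sentence explains its own conversation part.
            targets[sid].append(sid)
        elif kind == "narrative":
            # A narrative-only sentence explains conversation in the adjacent sentences.
            for j in (idx - 1, idx + 1):
                if 0 <= j < len(sentences) and sentences[j][1] in ("conversation", "mixed"):
                    targets[sentences[j][0]].append(sid)
    return dict(targets)


example = [(1, "conversation"), (2, "narrative"), (3, "mixed"), (4, "conversation")]
pairs = find_target_explanations(example)
```

Here sentence 4 receives no target explanatory sentence because neither it nor a neighbor contains narrative text; the same-paragraph rule would have to be layered on top to cover such cases.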

In addition, a sentence consisting solely of a narrative sentence is highly likely to explain a conversation sentence in the same paragraph.
Therefore, by estimating the expression of a specific conversation sentence using the narrative sentences in the same paragraph as target explanatory sentences, the speech synthesis system 1 can further improve the estimation accuracy.

In particular, the speech synthesis system 1 of the present embodiment takes, as the expression frequency, the result of totaling the number of appearances of each expression for each specific conversation sentence, and specifies the expression with the highest total as the expression of the specific conversation sentence associated with that expression frequency.
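A minimal sketch of this tally-and-pick step is shown below. The function names and the expression labels are hypothetical, and the tie-breaking behavior (the expression seen first wins) is an assumption, since the source does not say how ties are resolved.

```python
from collections import Counter


def tally_expressions(candidates):
    """Build an expression frequency: appearances counted per expression."""
    return Counter(candidates)


def most_frequent_expression(frequency):
    """Return the expression with the highest total, or None if nothing was counted.

    Ties go to the expression counted first (an assumption of this sketch).
    """
    if not frequency:
        return None
    return max(frequency.items(), key=lambda item: item[1])[0]


# Expression candidates derived from the target explanatory sentences
# of one specific conversation sentence (illustrative labels only).
frequency = tally_expressions(["joy", "anger", "anger"])
chosen = most_frequent_expression(frequency)
```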

For this reason, the speech synthesis system 1 can estimate a more appropriate expression for each specific conversation sentence. Furthermore, even when a single conversation sentence has a plurality of target explanatory sentences, the speech synthesis system 1 can estimate the most suitable expression for that conversation sentence.
[Second Embodiment]
The speech synthesis system of the second embodiment differs from the speech synthesis system 1 of the first embodiment mainly in the content of some of the processes that make up the text reading process. Therefore, in the present embodiment, configurations and processes identical to those of the first embodiment are given the same reference signs and their description is omitted. The description focuses on the parts of the text reading process that differ from the first embodiment.
<Text reading process>
When the text reading process of the present embodiment is started by an activation command input via the information receiving unit 62 of the voice output terminal 60, it first acquires text designation information, as shown in FIG. 14 (S110).

Subsequently, the designated text data corresponding to the text designation information acquired in S110 is acquired from the storage unit 30 of the information processing server 10 (S115). The designated text data is then analyzed, and a sentence division process that specifies each sentence making up the text represented by the designated text data is executed (S125).

Then, each sentence specified in the sentence division process (S125) is analyzed, and a sentence structure determination process that specifies the sentence structure of each sentence is executed (S130).
Subsequently, an expression specifying process is executed (S135). In accordance with the sentence structure of each sentence specified in the sentence structure determination process, this process specifies the target explanatory sentence, which is a sentence explaining a conversation sentence, and specifies the expression of that conversation sentence based on the result.

Further, the output property information input via the information receiving unit 62 is acquired (S140). Subsequently, a speech synthesis process is executed that generates and outputs synthesized speech reading the content of the designated text data aloud (S145).

The text reading process then ends.
That is, the text reading process of the present embodiment omits the paragraph division process (S120) from the text reading process of the first embodiment.
<Sentence division processing>
Accordingly, as shown in FIG. 15, when the sentence division process of the present embodiment is started, it acquires the first character symbol, following the flow of the text in the designated text data (S310).

If the acquired character symbol is not an opening bracket (S315: NO), the process proceeds to S335, described later. If it is an opening bracket (S315: YES), the next character symbol, following the flow of the text in the designated text data, is acquired (S320).

If the character symbol acquired in S320 is not a closing bracket (S325: NO), the process returns to S320. If it is a closing bracket (S325: YES), it is determined whether the character symbol acquired in S320 is a line feed (S330).

If the determination in S330 finds that the character symbol is not a line feed (S330: NO), the process proceeds to S335.
In S335, it is determined whether the acquired character symbol is a full stop. If the acquired character symbol is a full stop (S335: YES), or if the determination in S330 finds that the character symbol acquired in S320 is a line feed (S330: YES), the process proceeds to S340.

In S340, the string of character symbols acquired since the previous execution of S340 is treated as one sentence and stored in association with a sentence ID. The process then proceeds to S345.
That is, the sentence information of the present embodiment associates a sentence ID with the string of character symbols making up the sentence identified by that sentence ID.

If the determination in S335 finds that the acquired character symbol is not a full stop (S335: NO), the process proceeds to S345 without executing S340.
In S345, it is determined whether the designated text data contains a next character symbol, following the flow of the text. If it does (S345: YES), that character symbol is acquired (S360), and the process returns to S315.

On the other hand, if the determination in S345 finds no next character symbol (S345: NO), the processing of S315 to S340 is regarded as complete up to the last character symbol of the text in the designated text data. The sentence division process ends, and the process proceeds to S130 of the text reading process.

That is, the sentence division process of the present embodiment omits S350 and S355 from the sentence division process of the first embodiment. As a result, the sentence information generated by the sentence division process of the present embodiment associates a sentence ID with the string of character symbols making up the sentence identified by that sentence ID.
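The character-by-character scan of S310 to S360 can be sketched as follows. Several points are assumptions rather than statements from the source: the brackets are taken to be `「` and `」` and the full stop to be `。`; the line-feed check of S330 is read as looking at the character that follows the closing bracket; sentence IDs start at 1; and any text left over at the end is stored as a final sentence. The function name is hypothetical.

```python
def split_sentences(text, open_br="「", close_br="」", full_stop="。"):
    """Return {sentence_id: sentence_text} for the given text."""
    sentences = {}
    buf = []

    def store():
        # S340: store the characters collected since the last store as one sentence.
        sentence = "".join(buf).strip()
        buf.clear()
        if sentence:
            sentences[len(sentences) + 1] = sentence

    i, n = 0, len(text)
    while i < n:                      # S345/S360: advance while characters remain
        ch = text[i]
        buf.append(ch)
        if ch == open_br:             # S315: YES
            i += 1
            while i < n and text[i] != close_br:   # S320/S325 loop
                buf.append(text[i])
                i += 1
            if i < n:
                buf.append(text[i])   # the closing bracket itself
                # S330 (assumed lookahead): a line feed right after the bracket
                # ends the sentence.
                if i + 1 < n and text[i + 1] == "\n":
                    i += 1
                    store()
        elif ch == full_stop:         # S335: YES (full stops inside brackets are skipped above)
            store()
        i += 1
    store()                           # assumption: keep any trailing remainder
    return sentences


sample = "「おはよう。」と彼は言った。\n「元気？」\n空は青い。"
result = split_sentences(sample)
```

Because the inner loop consumes everything up to the closing bracket, a full stop inside a quotation does not split the sentence, which keeps a mixed sentence such as the first one in `sample` intact.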

When the sentence structure determination process (S130) of the present embodiment is executed, the sentence structure type of each sentence is added to the sentence information for that sentence, as shown in FIG. 17(A).
<Expression specifying process>
Next, as shown in FIG. 16, when the expression specifying process of the present embodiment is started, it executes a sentence type specifying process (S510). Based on the result of the previously executed sentence structure determination process, this process assigns a sentence structure type to the text inside the brackets and to the text outside the brackets in each mixed sentence.

The content of the sentence type specifying process in the present embodiment is the same as in the first embodiment. Accordingly, in the sentence type specifying process of the present embodiment as well, as shown in FIG. 17(B), detailed sentence structure type information is added for each mixed sentence: the portion inside the brackets is marked as a conversation sentence, and the portion outside the brackets is marked as a narrative sentence.

Subsequently, an expression assignment process is executed (S520). This process derives the expression frequency, which represents how likely each expression is to be conveyed in a specific conversation sentence, based on the target explanatory sentence, that is, the narrative sentence explaining that specific conversation sentence.

The content of the expression assignment process in the present embodiment is the same as in the first embodiment, so a detailed description is omitted here.
Then, the expression of each specific conversation sentence is specified based on the expression frequency corresponding to that sentence (S540). Specifically, in S540 of the present embodiment, the expression frequency for each specific conversation sentence is the total number of appearances of each expression, and the expression with the highest total is specified as the expression of the specific conversation sentence associated with that expression frequency. In S540 of the present embodiment as well, as shown in FIG. 17(C), the specified expression is assigned to each corresponding specific conversation sentence.
[Effects of the second embodiment]
Like the speech synthesis system 1 of the first embodiment, the speech synthesis system of the second embodiment can estimate the expression of a specific conversation sentence based on the target explanatory sentence, and can give the synthesized sound that reads the specific conversation sentence aloud an expression suited to that sentence.

In other words, the speech synthesis system of the present embodiment can also give an appropriate expression to the synthesized sound of a conversation sentence when outputting a synthesized sound that reads the conversation sentence aloud by speech synthesis.
[Other Embodiments]
Embodiments of the present invention have been described above. However, the present invention is not limited to these embodiments and can be implemented in various forms without departing from the gist of the present invention.

In the expression candidate derivation process of the first embodiment, the conversation sentences in both the sentence immediately before and the sentence immediately after a narrative sentence are treated as specific conversation sentences, and the narrative sentence is treated as the target explanatory sentence. In the present invention, however, only the conversation sentence in the sentence immediately before the narrative sentence may be treated as the specific conversation sentence, or only the conversation sentence in the sentence immediately after it. In other words, it suffices that a conversation sentence contained in at least one of the sentence immediately before or the sentence immediately after the narrative sentence is treated as the specific conversation sentence.

A form in which part of the configuration of the above embodiments is omitted, to the extent that the problem can still be solved, is also an embodiment of the present invention. A form configured by appropriately combining the above embodiments and modifications is also an embodiment of the present invention. Furthermore, any form conceivable without departing from the essence of the invention specified by the wording of the claims is also an embodiment of the present invention.

The reference signs used in the description of the above embodiments are also used in the claims as appropriate. They are used to make the invention according to each claim easier to understand, and are not intended to limit the technical scope of the invention according to each claim.
[Correspondence between the embodiments and the claims]
The function obtained by executing S110 and S115 of the text reading process of the above embodiments corresponds to the sentence acquisition means in the claims, and the function obtained by executing S120 to S130 corresponds to the sentence structure specifying means in the claims.

Further, the function obtained by executing S135 of the text reading process of the above embodiments corresponds to the expression estimation means in the claims, and the function obtained by executing S140 and S145 corresponds to the speech synthesis means in the claims.

DESCRIPTION OF SYMBOLS
1 ... Speech synthesis system
10 ... Information processing server
12 ... Communication unit
20 ... Control unit
22 ... ROM
24 ... RAM
26 ... CPU
30 ... Storage unit
60 ... Voice output terminal
61 ... Communication unit
62 ... Information receiving unit
63 ... Display unit
64 ... Sound input unit
65 ... Sound output unit
66 ... Storage unit
70 ... Control unit
72 ... ROM
74 ... RAM
76 ... CPU
100 ... Word expression database

Claims

1. A voice output device comprising:
sentence acquisition means for acquiring text data representing the character string that makes up a designated text;
sentence structure specifying means for specifying whether at least part of each sentence included in the text represented by the text data acquired by the sentence acquisition means is a conversation sentence or a narrative sentence;
expression estimation means for taking one of the conversation sentences specified by the sentence structure specifying means as a specific conversation sentence, and estimating the expression of the specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence explaining the specific conversation sentence; and
speech synthesis means for outputting synthesized sound produced by speech synthesis such that the expression estimated by the expression estimation means is reflected in the specific conversation sentence.

2. The voice output device according to claim 1, wherein,
when the specification by the sentence structure specifying means finds that the text data contains a mixed sentence in which a conversation sentence and a narrative sentence are contained in a single sentence, the expression estimation means takes the conversation sentence contained in the mixed sentence as the specific conversation sentence and the narrative sentence contained in the mixed sentence as the target explanatory sentence.

3. The voice output device according to claim 1, wherein,
when the specification by the sentence structure specifying means finds that a conversation sentence is present in at least one of the sentence immediately before or the sentence immediately after a narrative sentence, the expression estimation means takes that conversation sentence as the specific conversation sentence and that narrative sentence as the target explanatory sentence.

4. The voice output device according to claim 3, wherein
the expression estimation means totals the expressions corresponding to the meanings of the words contained in the target explanatory sentence, and takes the expression with the highest total as the expression of the specific conversation sentence.

5. The voice output device according to any one of claims 1 to 4, wherein
the sentence structure specifying means divides the text represented by the text data into specific paragraphs, each being a paragraph serving as a structural unit, and specifies whether at least part of each sentence included in each specific paragraph is a conversation sentence or a narrative sentence, and
the expression estimation means takes a narrative sentence contained in the same specific paragraph as the specific conversation sentence as the target explanatory sentence.

6. A voice output method comprising:
a sentence acquisition procedure for acquiring text data representing the character string that makes up a designated text;
a sentence structure specifying procedure for specifying whether at least part of each sentence included in the text represented by the text data acquired in the sentence acquisition procedure is a conversation sentence or a narrative sentence;
an expression estimation procedure for taking one of the conversation sentences specified in the sentence structure specifying procedure as a specific conversation sentence, and estimating the expression of the specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence explaining the specific conversation sentence; and
a speech synthesis procedure for outputting synthesized sound produced by speech synthesis such that the expression estimated in the expression estimation procedure is reflected in the specific conversation sentence.

7. A program causing a computer to execute:
a sentence acquisition procedure for acquiring text data representing the character string that makes up a designated text;
a sentence structure specifying procedure for specifying whether at least part of each sentence included in the text represented by the text data acquired in the sentence acquisition procedure is a conversation sentence or a narrative sentence;
an expression estimation procedure for taking one of the conversation sentences specified in the sentence structure specifying procedure as a specific conversation sentence, and estimating the expression of the specific conversation sentence based on the result of analyzing the meaning of a target explanatory sentence, which is at least one narrative sentence explaining the specific conversation sentence; and
a speech synthesis procedure for outputting synthesized sound produced by speech synthesis such that the expression estimated in the expression estimation procedure is reflected in the specific conversation sentence.