JP2019109278A - Speech synthesis system, statistical model generation device, speech synthesis device, and speech synthesis method

Info

Publication number: JP2019109278A
Application number: JP2017240349A
Authority: JP
Inventors: 慶華孫; Keika Son; 直之神田; Naoyuki Kanda
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2017-12-15
Filing date: 2017-12-15
Publication date: 2019-07-04
Anticipated expiration: 2037-12-15
Also published as: JP6806662B2

Abstract

To synthesize speech with a reading (pronunciation) specified by a user in an ETE (End-To-End) type speech synthesis system. SOLUTION: A speech synthesis system generates a statistical model used for speech synthesis on the basis of learning data in which utterance text and speech data are associated with each other; generates, as the learning data, data in which utterance text having some of its words replaced with phonetic symbol strings is associated with the speech data; stores a user dictionary, which is data including information that associates words with phonetic symbol strings for those words; replaces the words in a target text (the text to be synthesized) that are included in the user dictionary with the phonetic symbol strings associated with them in the user dictionary; and performs speech synthesis processing based on the statistical model on the target text after the replacement. SELECTED DRAWING: Figure 3

Description

The present invention relates to a speech synthesis system, a statistical model generation device, a speech synthesis device, and a speech synthesis method.

In recent years, speech synthesis technology has been introduced in a variety of fields, such as automated telephone answering, announcements by public transportation operators and local governments, and reading information aloud in smartphone and personal computer applications. Technologies such as speech recognition, machine translation, and dialogue generation have also improved dramatically of late, and their practical application to speech translation, service robots, and the like is advancing rapidly.

Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio".
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. "Tacotron: A fully end-to-end text-to-speech synthesis model".
Vincent Pollet, Enrico Zovato, Sufian Irhimeh, Pier Batzu. "Unit selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets", Interspeech 2017.

There are various speech synthesis methods, among which text-to-speech synthesis is attracting attention as a key technology. A typical text-to-speech method includes front-end processing, which converts text into a phonetic symbol string, and back-end processing, which generates a speech waveform from an intermediate language. Recently, systems that apply a statistical method such as a DNN (Deep Neural Network) to each of the front-end and back-end processing have also been put to practical use.

More recently, so-called ETE (End-To-End) type speech synthesis methods, which generate speech directly from input text without going through an intermediate language, have also appeared. In an ETE type method, the relationship between the linguistic features of the utterance text in a speech corpus and the acoustic features of the speech waveforms is prepared in advance as a statistical model using a statistical method such as a DNN. At synthesis time, a sequence having the acoustic features corresponding to the text to be synthesized is generated based on the statistical model, and speech is synthesized from it.

In practice, there is considerable demand for synthesizing speech with a reading (pronunciation) specified by the user, for example for personal names and other proper nouns, and the need for a user dictionary function is no smaller in ETE type speech synthesis. However, the statistical model in an ETE type method (a DNN model or the like) is built as one large model, and retraining it on a huge amount of data requires a large amount of computing resources. Retraining the statistical model every time the contents of the user dictionary are added to the learning data or a new word is added is therefore not necessarily practical.

The present invention has been made in view of this background, and its object is to provide a speech synthesis system and the like based on an ETE type speech synthesis method that can synthesize speech with a reading (pronunciation) specified by the user.

One aspect of the present invention for solving the above problem is a speech synthesis system configured using an information processing device, comprising: a model learning unit that generates a statistical model used for speech synthesis based on learning data in which utterance text and speech data are associated with each other; a text replacement unit that generates, as the learning data, data in which utterance text having all or some of its words replaced with phonetic symbol strings is associated with speech data; a storage unit that stores a user dictionary, which is data including information that associates words with phonetic symbol strings for those words; a user dictionary application unit that, for a target text that is the text to be synthesized, replaces each word in the target text that is included in the user dictionary with the phonetic symbol string for that word in the user dictionary; and a speech synthesis processing unit that generates synthesized speech by performing speech synthesis processing based on the statistical model on the target text after the replacement.
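The replacement performed by the user dictionary application unit can be illustrated with a minimal Python sketch. It assumes the user dictionary is a plain mapping from a word to its phonetic symbol string and uses longest-first substring matching; the function name `apply_user_dictionary` and the matching strategy are illustrative assumptions, not the method of the embodiment, which may rely on morphological analysis instead.

```python
import re


def apply_user_dictionary(target_text, user_dictionary):
    """Replace every dictionary word found in target_text with its phonetic symbol string.

    user_dictionary maps word -> phonetic symbol string (assumed format).
    Longer words are tried first so that a longer entry wins over a shorter
    entry that is a prefix of it.
    """
    if not user_dictionary:
        return target_text
    words = sorted(user_dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in words))
    return pattern.sub(lambda match: user_dictionary[match.group(0)], target_text)


# Hypothetical dictionary with a single entry taken from the station-name example.
user_dictionary = {"新宿": "シンシ゛ュク"}
converted = apply_user_dictionary("次の停車駅は新宿です。", user_dictionary)
```

The converted text, not the original target text, is what would be passed on to the synthesis step based on the statistical model.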

According to the present invention, it is possible to provide a speech synthesis system and the like based on an ETE type speech synthesis method that can synthesize speech with a reading (pronunciation) specified by the user.

FIG. 1 is a diagram showing a schematic configuration of the speech synthesis system.
FIG. 2 is a block diagram of an information processing device shown as an example of the hardware used to implement the speech synthesis system.
FIG. 3 is a diagram explaining the configuration of the speech synthesis system.
FIG. 4 is a diagram explaining details of the text replacement unit.
FIG. 5 is a diagram explaining details of the replacement word extraction unit.
FIG. 6 is a diagram explaining details of the speech feature extraction unit.
FIG. 7 is a diagram explaining details of the user dictionary application unit.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 shows a schematic configuration of the system described in the present embodiment, which performs speech synthesis by an ETE (End-To-End) type speech synthesis method (hereinafter referred to as the speech synthesis system 1).

The speech synthesis system 1 generates a statistical model 60 in advance by learning the relationship between the linguistic features of the utterance text 51 in a speech corpus 50 and the acoustic features of the speech data 52, using a statistical method (machine learning or the like) such as a DNN (Deep Neural Network). It then generates, based on the statistical model 60, a sequence having the acoustic features corresponding to the text (a sentence or phrase) to be synthesized (hereinafter referred to as the input text 700 (target text)), and synthesizes speech from that sequence.

In the present embodiment, a DNN is used as an example of the statistical method, but the statistical method is not necessarily limited to this; other statistical methods, such as a hidden Markov model (HMM), may be used. The following description also takes as an example the case where the text is written in Japanese, but the text may be written in another language, and a single text may contain a mixture of languages.

As shown in the figure, the speech synthesis system 1 includes the speech corpus 50; a statistical model generation unit 100, which generates the statistical model 60 based on the speech corpus 50; and a speech synthesis unit 200, which generates a sequence having the acoustic features corresponding to the input text 700, synthesizes a speech waveform based on the generated sequence, and generates (outputs) synthesized speech 800.

The speech corpus 50 includes utterance text 51 and speech data 52 (speech waveform data, encoded speech data, or the like) associated with the utterance text 51. The speech corpus 50 is used as learning data when the statistical model generation unit 100 generates the statistical model 60.

Using the statistical model 60, the speech synthesis unit 200 synthesizes speech in accordance with the pronunciation, utterance style, and the like specified for the input text 700. Utterance style here refers to vocal characteristics such as a conversational tone or an emotional delivery, and to characteristics determined by elements such as intonation, loudness, rhythm, speed, and pause length.

FIG. 2 is a block diagram of an information processing device 10 (a computer or computing resource) shown as an example of the hardware used to implement the speech synthesis system 1. As shown in the figure, the information processing device 10 includes a processor 11, a main storage device 12, an auxiliary storage device 13, an input device 14, an output device 15, and a communication device 16. These are communicably connected to one another via communication means such as a bus (not shown).

Not all components of the information processing device 10 need to be implemented in hardware; for example, some or all of them may be implemented by virtual resources such as a cloud server of a cloud system.

The processor 11 is configured using, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a DSP (Digital Signal Processor), or an FPGA (Field Programmable Gate Array). The various functions of the speech synthesis system 1 are realized by the processor 11 reading and executing programs stored in the main storage device 12.

The main storage device 12 is a device that stores programs and data, such as a ROM (Read Only Memory), a RAM (Random Access Memory), or a nonvolatile semiconductor memory (NVRAM (Non-Volatile RAM)).

The auxiliary storage device 13 is, for example, a hard disk drive, an SSD (Solid State Drive), an optical storage device (CD (Compact Disc), DVD (Digital Versatile Disc), or the like), a storage system, a reader/writer for recording media such as IC cards, SD memory cards, and FDs (flexible disks), or a storage area of a cloud server. Programs and data stored in the auxiliary storage device 13 are loaded into the main storage device 12 as needed. Data managed by the speech synthesis system 1, such as the speech corpus 50, is managed, for example, in a database of a DBMS (DataBase Management System) that uses the auxiliary storage device 13 as its data storage area.

The input device 14 is an interface (or user interface) for inputting the speech corpus 50 and the input text 700, such as a keyboard, mouse, touch panel, card reader, microphone, or amplifier. The information processing device 10 may also be configured to accept input of information from another device via the communication device 16.

The output device 15 is an interface (or user interface) that outputs various kinds of information, and includes an audio output device (speaker, amplifier, or the like) that outputs the synthesized speech. The information processing device 10 may further include, as the output device 15, an interface that provides the user with various information such as processing progress and results (for example, an audio output device (speaker or the like), a screen display device (liquid crystal monitor, LCD (Liquid Crystal Display), graphics card, or the like), or a printing device). The information processing device 10 may also be configured to output information to another device via the communication device 16.

The communication device 16 is a wired or wireless communication interface that enables communication with other devices via communication means such as a LAN or the Internet. Examples include a NIC (Network Interface Card), various wireless communication modules, a USB (Universal Serial Bus) module, a serial communication module, and a modem.

The statistical model generation unit 100, the speech synthesis unit 200, the speech corpus 50, and the statistical model 60 need not all be implemented on common hardware, and may be distributed across multiple communicably connected pieces of hardware. For example, the speech corpus 50 and the statistical model generation unit 100 may be configured on resources independent of those of the statistical model 60 and the speech synthesis unit 200. The statistical model 60 and the speech synthesis unit 200 are incorporated, for example, into a device such as a car navigation device, smartphone, mobile phone, or personal computer.

Likewise, the components of the statistical model generation unit 100 described below need not be implemented on common hardware and may be distributed across multiple communicably connected pieces of hardware. The same applies to the components of the speech synthesis unit 200 described below.

The speech corpus 50 and the statistical model 60 may also be placed on resources on a communication network, such as a cloud server, with the statistical model generation unit 100 and the speech synthesis unit 200 accessing them through a wired or wireless communication network.

The statistical model generation unit 100 and the speech synthesis unit 200 may also be placed on independent hardware, with the statistical model 60 generated by the statistical model generation unit 100 provided to the speech synthesis unit 200 via a physical recording medium (an optical storage device (CD (Compact Disc), DVD (Digital Versatile Disc), or the like), hard disk drive, SSD, IC card, SD memory card, or the like) or via a wired or wireless communication network.

FIG. 3 is a diagram explaining in detail the configuration of the speech synthesis system 1 shown in FIG. 1. As shown in the figure, the statistical model generation unit 100 has the functions of a text replacement unit 110 and a model learning unit 120.

The text replacement unit 110 replaces part or all of the utterance text 51 in the speech corpus 50 with phonetic symbol strings in text data format.

FIG. 4 is a diagram explaining details of the text replacement unit 110 of FIG. 3. As shown in the figure, the text replacement unit 110 includes a replacement word extraction unit 111, a speech feature extraction unit 112, a phonetic symbol string generation unit 113, and a text replacement processing unit 115.

The replacement word extraction unit 111 extracts, from the utterance text 51 of the speech corpus 50, the words (for example, proper nouns) to be replaced with phonetic symbol strings. In the following description, the term "word" covers every form that can be registered in the user dictionary 212 described later, including a single character, a single word, a group of words, a phrase, a clause, and a sentence.

Words to be replaced are extracted from the utterance text 51 in this way because, if every word in every utterance text 51 of the speech corpus 50 were replaced with a phonetic symbol string, the speech synthesis unit 200 would no longer be able to synthesize speech from text that contains no phonetic symbol strings (hereinafter referred to as normal text). Accordingly, when the aim is to realize a user dictionary function, for example, the replacement word extraction unit 111 extracts as replacement targets the words that are likely to be registered in the user dictionary.

FIG. 5 is a diagram explaining details of the replacement word extraction unit 111 shown in FIG. 4. As shown in the figure, the replacement word extraction unit 111 has the functions of a morphological analysis unit 1111 and a word extraction unit 1112.

The replacement word extraction unit 111 also stores word extraction data 106. The word extraction data 106 contains the information that the word extraction unit 1112 uses as the criterion for deciding whether to extract a word as a target for replacement with a phonetic symbol string. The contents of the word extraction data 106 are set, for example, by the user. The replacement word extraction unit 111 may instead extract words automatically using, for example, a statistical method (machine learning or the like).

The morphological analysis unit 1111 performs morphological analysis to divide the utterance text 51 into the smallest units of the language (morphemes, in the case of Japanese). For example, when the utterance text 51 is "外国人参政権" (suffrage for foreign nationals), the morphological analysis unit 1111 divides it into the four words "外国" (foreign country), "人" (person), "参政" (political participation), and "権" (right).

The word extraction unit 1112 extracts, from the words obtained through division by the morphological analysis unit 1111, the words to be replaced with phonetic symbol strings. The extraction method is not necessarily limited; for example, a rule-based or statistics-based method can be used. For example, when building a speech synthesis system 1 for railway announcements, the replacement word extraction unit 111 extracts station names from the utterance text 51 on the assumption that station names will be registered in the user dictionary 212 described later. Specifically, when the input text 700 is "新宿さんが新宿で電車を降りました。" (Shinjuku-san got off the train at Shinjuku.) and the word extraction data 106 is "notation = 新宿, reading = しんじゅく, attribute = station name", the word extraction unit 1112 extracts only the occurrence of "新宿" that is used as a station name. If only station names are replaced in this way, however, the accent types may become biased. To allow words likely to be registered in the user dictionary to be read out with high quality overall, the replacement word extraction unit 111 may therefore select words from the utterance text 51 of the entire speech corpus 50 with attention to phonological balance (the phoneme sequences of the selected words are not skewed) and prosodic balance (accent types, positions within the sentence, and so on are not skewed), extracting words other than station names if necessary.
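The rule-based variant of this extraction can be sketched in Python as follows. The sketch assumes that morphological analysis has already produced tokens carrying a surface form, a reading, and an attribute label; the `Morpheme` and `ExtractionRule` types, the attribute labels, and the hand-written token list are all hypothetical stand-ins, since the source does not specify the data format of the word extraction data 106.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str    # written form, e.g. "新宿"
    reading: str    # reading in hiragana
    attribute: str  # label assumed to come from the analyzer, e.g. "station_name"


@dataclass(frozen=True)
class ExtractionRule:
    surface: str
    reading: str
    attribute: str


def extract_replacement_targets(morphemes, rules):
    """Return the indices of morphemes that match a rule on all three fields."""
    rule_keys = {(r.surface, r.reading, r.attribute) for r in rules}
    return [
        i
        for i, m in enumerate(morphemes)
        if (m.surface, m.reading, m.attribute) in rule_keys
    ]


# Hand-written analysis of the start of "新宿さんが新宿で電車を降りました。" (assumed labels).
morphemes = [
    Morpheme("新宿", "しんじゅく", "person_name"),
    Morpheme("さん", "さん", "suffix"),
    Morpheme("が", "が", "particle"),
    Morpheme("新宿", "しんじゅく", "station_name"),
    Morpheme("で", "で", "particle"),
]
rules = [ExtractionRule("新宿", "しんじゅく", "station_name")]
targets = extract_replacement_targets(morphemes, rules)
```

Only the second "新宿" is selected, because the first one carries a different attribute. Balancing the selection by phoneme sequence or accent type would need additional logic that this sketch leaves out.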

FIG. 6 is a diagram explaining details of the speech feature extraction unit 112 shown in FIG. 4. The speech feature extraction unit 112 extracts, for example, pronunciation (phonemes), utterance style, and prosody from the speech data 52 of the speech corpus 50 as speech features.

As shown in the figure, the speech feature extraction unit 112 includes a pronunciation (phoneme) extraction unit 1121, an utterance style extraction unit 1122, and a prosodic feature extraction unit 1123. The pronunciation (phoneme) extraction unit 1121 extracts pronunciation (phoneme) information from the speech data 52 of the speech corpus 50. The utterance style extraction unit 1122 extracts utterance style information from the speech data 52 of the speech corpus 50. The prosodic feature extraction unit 1123 extracts prosodic feature information from the speech data 52 of the speech corpus 50. Extracting information on utterance style and prosody as well as pronunciation (phonemes) and incorporating it into the phonetic symbol string in this way makes it possible to tune the utterance style and intonation of the synthesized speech. The speech feature extraction unit 112 may also extract further information as speech features 104. The processing performed by the speech feature extraction unit 112 can be carried out automatically by an information processing device using, for example, speech recognition or text analysis technology. Part or all of this processing may also be performed manually.

Pronunciation (phoneme) information can be extracted from the utterance text 51 by language processing technology, for example. In some cases, however, language processing alone cannot determine the pronunciation uniquely, such as whether the word "明日" (tomorrow) is pronounced "あした" (ashita) or "あす" (asu). In such cases, the correct pronunciation may be extracted from the speech data 52 using, for example, speech recognition technology. For example, when the utterance text 51 corresponding to certain speech data 52 is "明日は晴れです。" (It will be sunny tomorrow.), the speech feature extraction unit 112 applies speech recognition to the speech data 52 and extracts the pronunciation (phoneme) information "アスワハレデス" (asu wa hare desu).
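One simple way to picture this disambiguation is to take the candidate readings that language processing proposes for a word and keep the one that actually appears in the recognized pronunciation. The Python sketch below does exactly that with a substring check; the function name and the candidate list are assumptions made for illustration, and a practical system would more likely align the recognized phoneme sequence with the text rather than rely on substring matching.

```python
def choose_reading(candidate_readings, recognized_pronunciation):
    """Return the first candidate reading contained in the recognized pronunciation.

    Returns None when no candidate is found, so the caller can fall back to
    another strategy (for example, manual correction).
    """
    for reading in candidate_readings:
        if reading in recognized_pronunciation:
            return reading
    return None


# Candidate readings for "明日" (assumed to come from a pronunciation lexicon).
candidates = ["アシタ", "アス"]
# Pronunciation assumed to be returned by speech recognition for the speech data.
recognized = "アスワハレデス"
chosen = choose_reading(candidates, recognized)
```

Here the check selects "アス", the reading that matches what was actually spoken.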

Returning to FIG. 4, the phonetic symbol string generation unit 113 generates a phonetic symbol string 105 based on the speech features 104 extracted by the speech feature extraction unit 112 (that is, it expresses the speech features 104 as a phonetic symbol string). The notation used for the phonetic symbol string 105 is not necessarily limited; in the case of Japanese, for example, the phonetic symbol string generation unit 113 generates, based on the speech features 104 extracted by the speech feature extraction unit 112, the symbols for Japanese text-to-speech synthesis (JEITA IT-4006) specified by JEITA (Japan Electronics and Information Technology Industries Association).

Using codes that are not included in the character codes used for ordinary text (symbol codes, special codes, and the like) can increase the precision of the reading (pronunciation) specification. For example, the phonetic symbol string generation unit 113 converts the pronunciation "アスワハレデス" extracted by the speech feature extraction unit 112 into the phonetic symbol string "アス'ワハレ'テ゛ス%.", which contains codes not included in the character codes used for ordinary text. In addition to the pronunciation (phoneme) information, information such as accent information, prosodic boundary information, and pause information may be added to improve the quality of the reading specification.

For the utterance text 51, the text replacement processing unit 115 generates an utterance text in which each word extracted by the replacement word extraction unit 111 (the extracted word 103 in the figure) is replaced with the corresponding phonetic symbol string generated by the phonetic symbol string generation unit 113 (hereinafter referred to as the utterance text 102 containing phonetic symbol strings).

Returning to FIG. 3, the text substitution processing unit 115 combines the utterance text 102 including a phonetic symbol string, generated by the text substitution unit 110, with the corresponding speech data 52, and adds the pair to the speech corpus 50 as new learning data. As a result, learning data is added to the speech corpus 50 each time the text substitution unit 110 generates an utterance text 102 including a phonetic symbol string.

As a specific example, suppose the utterance text 51 is 「次の停車駅は新宿です。」 ("The next stop is Shinjuku.") and the replacement word extraction unit 111 extracts the word 「新宿」 (Shinjuku; the extracted word 103) from it. The text substitution processing unit 115 then generates the utterance text 102 including a phonetic symbol string, 「次の停車駅はシンシ゛ュクです。」, and adds it to the speech corpus 50 in combination with the corresponding speech data 52.

Note that the text substitution processing unit 115 may add the combination of the utterance text 102 including a phonetic symbol string, generated by the text substitution unit 110, and the corresponding speech data 52 to the speech corpus 50 as a new entry, as described above. Alternatively, it may take the existing combination in the speech corpus 50 of the pre-substitution utterance text 51 and the speech data 52, and replace the utterance text 51 in that combination with the utterance text 102 including a phonetic symbol string (that is, update the existing combination in the speech corpus 50).
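
The two ways of registering a substituted utterance text (adding a new pair, or updating the existing pair) can be illustrated with a minimal sketch. The `Pair` class, the `audio_id` field standing in for the speech data 52, and the function names are hypothetical and do not come from the specification. Plain `str.replace` is used only for brevity; the embodiment replaces words that were extracted after morphological analysis.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One learning-data entry: an utterance text and its speech data."""
    text: str      # utterance text 51, or utterance text 102 after substitution
    audio_id: str  # hypothetical handle standing in for speech data 52


def substitute(text: str, word: str, symbols: str) -> str:
    """Replace the extracted word with its phonetic symbol string."""
    return text.replace(word, symbols)


def add_pair(corpus: list, index: int, word: str, symbols: str) -> None:
    """Add mode: keep the original pair and append the substituted one."""
    src = corpus[index]
    corpus.append(Pair(substitute(src.text, word, symbols), src.audio_id))


def update_pair(corpus: list, index: int, word: str, symbols: str) -> None:
    """Update mode: overwrite the text of the existing pair in place."""
    corpus[index].text = substitute(corpus[index].text, word, symbols)


# Add mode: the corpus grows by one entry that shares the same speech data.
corpus = [Pair("次の停車駅は新宿です。", "utt-0001")]
add_pair(corpus, 0, "新宿", "シンシ゛ュク")

# Update mode: the corpus keeps its size, and only the text changes.
updated = [Pair("次の停車駅は新宿です。", "utt-0001")]
update_pair(updated, 0, "新宿", "シンシ゛ュク")
```

In add mode the model sees both the normal text and the phonetic-symbol variant paired with the same audio; in update mode only the variant remains.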

The model learning unit 120 shown in FIG. 3 performs learning (machine learning or the like) using, as learning data, all combinations of utterance text 51 and speech data 52 contained in the speech corpus 50 (including the combinations newly added by the text substitution processing unit 115), and generates the statistical model 60. Because the learning data thus contains utterance texts 102 including phonetic symbol strings, the generated statistical model 60 can be used to synthesize speech not only from normal text but also from text that includes phonetic symbol strings.

Next, the speech synthesis unit 200 shown in FIG. 3 will be described. As shown in the figure, the speech synthesis unit 200 has the functions of a user dictionary application unit 210 and a speech synthesis processing unit 220.

During speech synthesis based on the input text 700, when the user dictionary application unit 210 detects in the input text 700 a word registered in the user dictionary 212, it replaces the detected word with the phonetic symbol string specified for that word in the user dictionary 212, and generates a text 203 including a phonetic symbol string. The function of the user dictionary application unit 210 could be realized by a simple string replacement algorithm. However, when the input text 700 is written in a language such as Japanese, which has no explicit delimiter between words, a simple string replacement algorithm may fail to perform the replacement correctly. For example, if the user dictionary 212 contains the entry "notation = 人参 (ninjin, "carrot"), reading (phonetic symbol string) = ニンシ゛ン", a simple string replacement algorithm may turn 「外国人参政権」 ("suffrage for foreign residents") into 「外国ニンシ゛ン政権」. The user dictionary application unit 210 of this embodiment therefore generates the text 203 including a phonetic symbol string as follows.
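
The failure mode described above can be reproduced with a few lines of plain substring replacement. This is only an illustration of the problem, not of the embodiment; the function name `naive_apply` is hypothetical.

```python
# User dictionary entry from the example: notation 人参, reading ニンシ゛ン.
user_dictionary = {"人参": "ニンシ゛ン"}


def naive_apply(text: str, dictionary: dict) -> str:
    """Replace every substring that matches a dictionary notation.

    No word boundaries are considered, so a notation that merely
    straddles two adjacent words is replaced as well.
    """
    for notation, reading in dictionary.items():
        text = text.replace(notation, reading)
    return text


# 外国人参政権 is segmented as 外国 / 人 / 参政 / 権, but the characters
# 人 and 参 happen to be adjacent, so the substring 人参 matches.
print(naive_apply("外国人参政権", user_dictionary))  # prints 外国ニンシ゛ン政権
```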

FIG. 7 is a diagram illustrating the details of the user dictionary application unit 210. As shown in the figure, the user dictionary application unit 210 includes a morphological analysis unit 2101, a word extraction unit 2102, and a phonetic symbol string replacement unit 2103.

The user dictionary application unit 210 also stores word extraction data 211 and the user dictionary 212. The word extraction data 211 contains the information that the word extraction unit 2102 uses as the criterion for deciding whether to extract a word as a replacement target. The user dictionary 212 contains information that associates words (notations) with readings (phonetic symbol strings).

The morphological analysis unit 2101 performs morphological analysis to divide the input text 700 into the smallest units of the language (morphemes, in the case of Japanese). For example, when 「外国人参政権」 is given as the input text 700, the morphological analysis unit 2101 divides it into the four words 「外国」, 「人」, 「参政」, and 「権」. Consequently, even if an entry such as "notation = 人参, reading (phonetic symbol string) = ニンシ゛ン" is registered in the user dictionary 212, the text is not replaced by mistake.
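
A minimal sketch of why word-level matching avoids the erroneous replacement: dictionary lookup is done per token rather than per substring. The token lists below are hard-coded and merely stand in for the output of a morphological analyzer, which is not invoked here; the function name `apply_by_token` is hypothetical.

```python
user_dictionary = {"人参": "ニンシ゛ン"}


def apply_by_token(tokens: list, dictionary: dict) -> str:
    """Replace only tokens that exactly equal a dictionary notation."""
    return "".join(dictionary.get(token, token) for token in tokens)


# Assumed segmentation of 外国人参政権, as given in the description.
suffrage_tokens = ["外国", "人", "参政", "権"]
# Assumed segmentation of a sentence in which 人参 really is one word.
carrot_tokens = ["人参", "を", "食べる"]

print(apply_by_token(suffrage_tokens, user_dictionary))  # text is left unchanged
print(apply_by_token(carrot_tokens, user_dictionary))    # 人参 is replaced
```

Since no single token equals 人参 in the first case, the text passes through unchanged, while the genuine occurrence in the second case is still replaced.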

The word extraction unit 2102 extracts, from the words produced by the morphological analysis unit 2101, the words to be replaced with phonetic symbol strings (for example, proper nouns). The extraction method is not necessarily limited; for example, a rule-based method or a statistics-based method is used. This is the same as for the word extraction unit 1112 described above. For example, if the input text 700 is 「新宿さんが新宿で電車を降りました。」 ("Shinjuku-san got off the train at Shinjuku.") and the word extraction data 211 contains the entry "notation = 新宿, reading = しんじゅく, attribute = station name", the word extraction unit 2102 extracts the occurrence of 「新宿」 that is used as a station name.

The phonetic symbol string replacement unit 2103 replaces each word that was extracted by the word extraction unit 2102 and is registered in the user dictionary 212 with the reading (phonetic symbol string) registered for that word in the user dictionary 212. In the above example, if "notation = 新宿, reading (phonetic symbol string) = シンシ゛ュク" is registered in the user dictionary 212, the phonetic symbol string replacement unit 2103 replaces the input text 700 「新宿さんが新宿で電車を降りました。」 with the text 「新宿さんがシンシ゛ュクで電車を降りました。」. Only the station name is replaced; the personal name 「新宿」 is left as it is.
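
The extraction and replacement steps for this example can be sketched as follows. The attribute tags ("person", "station", and so on), the `Token` type, and the hard-coded token list are assumptions made for illustration; in the embodiment they would come from the morphological analysis unit 2101 and from a rule-based or statistics-based extractor.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str
    attribute: str  # hypothetical tag assigned by an upstream analyzer


# Word extraction data 211 (simplified): notation and attribute to extract.
EXTRACTION_DATA = {("新宿", "station")}
# User dictionary 212: notation -> phonetic symbol string.
USER_DICTIONARY = {"新宿": "シンシ゛ュク"}


def extract_targets(tokens: list) -> list:
    """Return indices of tokens that match the word extraction data."""
    return [i for i, t in enumerate(tokens)
            if (t.surface, t.attribute) in EXTRACTION_DATA]


def replace_targets(tokens: list, targets: list) -> str:
    """Replace extracted tokens that are registered in the user dictionary."""
    surfaces = [t.surface for t in tokens]
    for i in targets:
        surfaces[i] = USER_DICTIONARY.get(surfaces[i], surfaces[i])
    return "".join(surfaces)


# Assumed analysis of 新宿さんが新宿で電車を降りました。
tokens = [
    Token("新宿", "person"), Token("さん", "suffix"), Token("が", "particle"),
    Token("新宿", "station"), Token("で", "particle"), Token("電車", "noun"),
    Token("を", "particle"), Token("降りました", "verb"), Token("。", "symbol"),
]
result = replace_targets(tokens, extract_targets(tokens))
print(result)  # only the second 新宿 (the station name) is replaced
```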

It is preferable that the morphological analysis unit 2101 and the word extraction unit 2102 be the same as (that is, share algorithms with) the morphological analysis unit 1111 and the word extraction unit 1112 in the replacement word extraction unit 111 of the statistical model generation unit 100 described above. Likewise, it is preferable that the phonetic symbol string replacement unit 2103 be the same as (share an algorithm with) the text substitution processing unit 115 of the text substitution unit 110 described above. Using common algorithms in the statistical model generation unit 100 and the speech synthesis unit 200 in this way makes it possible to improve the quality of the synthesized speech.

Returning to FIG. 3, the speech synthesis processing unit 220 uses the statistical model 60 to generate the synthesized speech 800 from the input text 700. The speech synthesis processing unit 220 generates the synthesized speech 800 by, for example, a method that generates the speech waveform directly, as in Non-Patent Document 1; a method that generates speech parameters frame by frame and then generates speech, as in Non-Patent Document 2; or a method that synthesizes speech by concatenating speech units selected by a DNN, as in Non-Patent Document 3.

Note that, for example, when the speech corpus 50 does not contain an utterance text 51 corresponding to the input text 700, the model learning unit 120 may generate the statistical model 60 by inference using a statistical method (machine learning or the like). For example, when the speech corpus 50 contains no utterance text 102 including a phonetic symbol string that contains 「シンジュク」, the model learning unit 120 infers, by a statistical method, a statistical model 60 that includes the correspondence between 「シンジュク」 and speech data 52.

As described in detail above, the speech synthesis system 1 of this embodiment trains the statistical model 60 in advance on a speech corpus 50 containing learning data in which utterance texts 51 with some words replaced by phonetic symbol strings are associated with speech data 52. For the input text 700, it replaces the words in the input text 700 that are contained in the user dictionary 212 with the phonetic symbol strings associated with those words in the user dictionary 212, and generates synthesized speech by performing speech synthesis processing based on the statistical model 60 on the input text 700 after the replacement. Speech can therefore be synthesized with the reading (pronunciation) specified by the user without retraining the statistical model 60. Thus, according to this embodiment, a practical speech synthesis system based on the ETE speech synthesis approach that can synthesize speech with a user-specified reading (pronunciation) can be realized.

The present invention has been described above specifically on the basis of an embodiment, but the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist. For example, the above embodiment has been described in detail to explain the present invention in an easily understandable manner, and the invention is not necessarily limited to a configuration having all of the elements described. In addition, other elements can be added to, deleted from, or substituted for part of the configuration of the above embodiment.

Some or all of the configurations, functional units, processing units, processing means, and the like described above may be realized in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that realize each function can be placed in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

In each drawing, the control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of an actual implementation are necessarily shown. In practice, for example, almost all of the components may be considered to be interconnected.

The arrangement of the various functional units, processing units, and databases of the information processing apparatus described above is merely an example. The arrangement of the various functional units, processing units, and databases may be changed to an optimal arrangement in view of the hardware and software performance, processing efficiency, communication efficiency, and the like of each information processing apparatus.

Description of reference signs: 1 speech synthesis system, 10 information processing apparatus, 50 speech corpus, 51 utterance text, 52 speech data, 60 statistical model, 100 statistical model generation unit, 102 text including a phonetic symbol string, 103 extracted word, 104 speech feature amount, 105 phonetic symbol string, 106 word extraction data, 110 text substitution unit, 111 replacement word extraction unit, 1111 morphological analysis unit, 1112 word extraction unit, 112 speech feature amount extraction unit, 1121 pronunciation (phoneme) extraction unit, 1122 speech style extraction unit, 1123 prosodic feature extraction unit, 113 phonetic symbol string generation unit, 115 text substitution processing unit, 120 model learning unit, 200 speech synthesis unit, 203 text including a phonetic symbol string, 202 text including a phonetic symbol string, 210 user dictionary application unit, 211 word extraction data, 212 user dictionary, 2101 morphological analysis unit, 2102 word extraction unit, 2103 phonetic symbol string replacement unit, 700 input text, 800 synthesized speech

Claims

A speech synthesis system comprising:
a statistical model generation device having
a model learning unit that generates a statistical model used for speech synthesis based on learning data in which utterance text and speech data are associated with each other, and
a text substitution unit that generates, as the learning data, data in which speech data is associated with an utterance text in which some words of the utterance text have been replaced with phonetic symbol strings; and
a speech synthesis device having
a storage unit that stores a user dictionary, which is data including information in which words are associated with phonetic symbol strings for the words,
a user dictionary application unit that, for a target text that is the text to be subjected to speech synthesis, replaces the words in the target text that are included in the user dictionary with the phonetic symbol strings associated with those words in the user dictionary, and
a speech synthesis processing unit that generates synthesized speech by performing speech synthesis processing based on the statistical model on the target text after the replacement.

The speech synthesis system according to claim 1, wherein
the text substitution unit extracts a speech feature amount from the speech data, generates a phonetic symbol string based on the extracted speech feature amount, and generates, as the learning data, data in which speech data is associated with an utterance text in which some words of the utterance text have been replaced with the phonetic symbol string.

The speech synthesis system according to claim 2, wherein
the text substitution unit generates the phonetic symbol string by performing speech recognition on the speech data.

The speech synthesis system according to claim 2, wherein
the speech feature amount is at least one of pronunciation (phoneme), speech style, and prosody.

The speech synthesis system according to claim 1, wherein
the text substitution unit divides the utterance text into a plurality of words by a morphological analysis algorithm, and
the user dictionary application unit divides the target text into a plurality of words by an algorithm common to the morphological analysis algorithm.

The speech synthesis system according to claim 1, wherein
the text substitution unit extracts the words to be replaced with phonetic symbol strings from the utterance text by a word extraction algorithm, and
the user dictionary application unit extracts the words to be replaced with phonetic symbol strings from the target text by an algorithm common to the word extraction algorithm.

The speech synthesis system according to claim 1, wherein
the learning data is a speech corpus.

The speech synthesis system according to claim 7, wherein
the text substitution unit extracts the words to be replaced with phonetic symbol strings from the utterance texts of the speech corpus by a word extraction algorithm that takes into account the balance of the phonemes or prosody of the extracted words.

The speech synthesis system according to claim 1, wherein
the model learning unit generates the statistical model by a DNN (Deep Neural Network).

The speech synthesis system according to claim 1,
which is an ETE (End-To-End) speech synthesis system.

The statistical model generation device in the speech synthesis system according to claim 1, comprising:
the model learning unit that generates a statistical model used for speech synthesis based on learning data in which utterance text and speech data are associated with each other; and
the text substitution unit that generates, as the learning data, data in which speech data is associated with an utterance text in which some words of the utterance text have been replaced with phonetic symbol strings.

The statistical model generation device according to claim 11, wherein
the learning data is a speech corpus.

The statistical model generation device according to claim 11, wherein
the model learning unit generates the statistical model by a DNN (Deep Neural Network).

The speech synthesis device in the speech synthesis system according to claim 1, comprising:
the storage unit that stores a user dictionary, which is data including information in which words are associated with phonetic symbol strings for the words;
the user dictionary application unit that, for a target text that is the text to be subjected to speech synthesis, replaces the words in the target text that are included in the user dictionary with the phonetic symbol strings associated with those words in the user dictionary; and
the speech synthesis processing unit that generates synthesized speech by performing speech synthesis processing based on the statistical model on the target text after the replacement.

A speech synthesis method in which an information processing apparatus performs the steps of:
generating a statistical model used for speech synthesis based on learning data in which utterance text and speech data are associated with each other;
generating, as the learning data, data in which speech data is associated with an utterance text in which some words of the utterance text have been replaced with phonetic symbol strings;
storing a user dictionary, which is data including information in which words are associated with phonetic symbol strings for the words;
replacing, for a target text that is the text to be subjected to speech synthesis, the words in the target text that are included in the user dictionary with the phonetic symbol strings associated with those words in the user dictionary; and
generating synthesized speech by performing speech synthesis processing based on the statistical model on the target text after the replacement.