JP2709198B2 - Voice synthesis method - Google Patents
Voice synthesis method
Info
- Publication number
- JP2709198B2
- Authority
- JP
- Japan
- Prior art keywords
- waveform
- waveforms
- representative
- speech
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
Description
[0001]
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesis method for synthesizing speech from waveform information. It applies both to synthesizing arbitrary spoken words by rule and to compressing the information of a speech signal for transmission or storage and resynthesizing speech from the compressed signal.
[0002]
2. Description of the Related Art
A speech synthesis system that stores all of the waveform information of speech in a storage device and synthesizes speech from it requires a very large storage device. In systems that synthesize arbitrary spoken words, the conventional approach for reducing storage and producing a wide variety of synthesized sounds has mainly been the so-called parameter editing synthesis method, in which only feature parameters extracted from the speech waveform are stored and read out as needed to create the synthesized speech. A waveform editing synthesis method, which stores the waveform information as it is, has also been used for the purpose of obtaining high-quality synthesized speech.
[0003]
However, while the parameter editing synthesis method allows flexible parameter control, the quality of the synthesized speech it produces falls far short of natural speech. The waveform editing synthesis method, on the other hand, yields high-quality synthesized speech but requires a large-capacity storage device (for example, Japanese Patent Application No. 63-115721). As a method of using only representative waveforms of a speech segment instead of storing the entire speech waveform, there is an example applied to the residual signal obtained by linear prediction analysis (Japanese Patent Application No. 56-179915), but it does not actively use pitch-synchronous waveform interpolation. Pitch-synchronous waveform interpolation has been used for connecting CV (consonant-vowel) / VC (vowel-consonant) segments (三留 et al., May 1981, Proceedings of the Acoustical Society of Japan, p. 431), but there has been no synthesis method that applies it to the entire synthesis unit for the purpose of information compression and higher quality.
[0004]
An object of the present invention is to provide a speech synthesis method which, in synthesizing spoken words from waveforms, greatly reduces the required storage capacity while still producing high-quality synthesized speech, and which can also produce high-quality synthesized speech from speech that has been efficiently compressed and then transmitted or stored.
[0005]
Means for Solving the Problems
According to the present invention, from a plurality of representative waveforms extracted at different points in time from a speech waveform, the speech waveform between those representative waveforms is interpolated using the representative waveforms in accordance with a given pitch period and duration, and continuous speech is thereby resynthesized. As described above, in order to obtain high-quality synthesized speech with a waveform editing type synthesis method, the speech waveforms needed for synthesis must be stored in advance; compared with storing parameters, the storage device then becomes very large, so an efficient way of reducing the waveform information is needed to make the system economical and compact. Moreover, with conventional techniques it is very difficult to control the pitch period and duration, that is, the pitch and speaking rate of the synthesized speech, without degrading its quality. The present invention focuses on the pitch period of the speech: the speech waveform is compressed efficiently by using only selected representative waveforms, and at synthesis time it is processed and reproduced so that no distortion is audibly perceptible. Besides the reduction in storage capacity, speech whose pitch period or duration has been changed can be synthesized with high quality.
[0006]
Embodiment
FIG. 1 shows an embodiment in which the present invention is applied to a waveform editing synthesis method. This speech synthesis has two stages: an analysis and storage process and a synthesis process. In the analysis process, the original speech signal from a speech input terminal 12 and phoneme labeling information from a labeling information input terminal 13 are input, within an analysis unit 11, to an optimum waveform peak position search circuit 14. The input original speech signal is segmented into phonemes by inspection, and a phoneme symbol and position information are attached to each segment.
[0007]
Next, a peak-neighborhood waveform extraction circuit 15 searches, for each phoneme, for the peak positions near about three representative waveforms, cuts out the waveform centered on each of these positions, and stores it in a storage device 16 as a representative waveform together with its position information, phoneme information, and so on. For example, as shown in FIG. 2A, an original speech waveform sequence P1, P2, P3, ..., Pm, ... constituting one phoneme is input, and from it three representative waveforms P1, Pm and Pn at different points in time are extracted as shown in FIG. 2B.
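As an illustration only (this code is not part of the patent), the peak-neighborhood extraction performed by circuit 15 could be sketched in Python as follows. The function name, the choice of candidate positions, and the use of one pitch period as the cut-out length are assumptions made for the example.

```python
import numpy as np

def extract_representative_waveforms(phoneme_wave, candidate_positions,
                                     pitch_period, search_radius=None):
    """Sketch: cut out a representative waveform around the peak nearest
    each candidate position, roughly what circuit 15 in FIG. 1 might do.

    phoneme_wave        : 1-D array of samples belonging to one phoneme
    candidate_positions : sample indices, e.g. three evenly spaced points
    pitch_period        : cut-out length in samples (assumed one pitch period)
    """
    if search_radius is None:
        search_radius = pitch_period // 2
    half = pitch_period // 2
    representatives = []
    for pos in candidate_positions:
        lo = max(pos - search_radius, half)
        hi = max(min(pos + search_radius, len(phoneme_wave) - half), lo + 1)
        # take the largest absolute peak in the neighbourhood of the candidate
        peak = lo + int(np.argmax(np.abs(phoneme_wave[lo:hi])))
        frame = phoneme_wave[peak - half: peak + half].copy()
        representatives.append({"position": peak, "waveform": frame})
    return representatives
```

In the analysis stage, each entry would be stored in the storage device 16 together with the phoneme symbol, corresponding to the position and phoneme information mentioned above.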
[0008]
In the speech synthesis process, in a synthesis unit 21, the text to be synthesized is input from a text input terminal 22 to a text analysis unit 23, where it is analyzed and converted into a phoneme sequence. A waveform readout circuit 24 reads the speech waveforms required for that phoneme sequence from the storage device 16 and supplies them to a pitch-synchronous interpolation circuit 25. In accordance with the synthesis pitch period and connection time given by a prosody information generation circuit 26, the pitch-synchronous interpolation circuit 25 interpolates the speech waveform between the representative waveforms in synchronism with the synthesis pitch period and overlaps the resulting waveforms pitch by pitch. A trapezoidal window, a cosine function window, or the like can be used as the overlap window for each pitch.
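The patent names these window shapes but gives no formulas. Purely as a sketch, the two kinds of per-pitch overlap windows mentioned could be generated as below; the window length and ramp size are illustrative parameters, not values from the patent.

```python
import numpy as np

def trapezoid_window(length, ramp):
    """Trapezoidal window: linear fade-in/out of `ramp` samples (1 <= ramp <= length//2)."""
    w = np.ones(length)
    w[:ramp] = np.linspace(0.0, 1.0, ramp)
    w[-ramp:] = np.linspace(1.0, 0.0, ramp)
    return w

def cosine_window(length):
    """Raised-cosine (Hann-shaped) window, one possible 'cosine function window' (length >= 2)."""
    n = np.arange(length)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (length - 1)))
```

Each interpolated pitch waveform could be multiplied by such a window before being overlap-added at intervals of the synthesis pitch period.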
[0009]
For example, when the representative waveforms P1 and Pm shown in FIG. 2B are read out and a synthesis pitch period TP and a duration L are given, the time between the representative waveforms P1 and Pm is taken to be L, and for each period TP the two waveforms are interpolated as follows to obtain the synthesized waveforms P1m(0), P1m(1), P1m(2), ..., P1m(k):

P1m(i) = P1 × α(i) + Pm × β(i)    (1)

Here P1m(0) = P1 and P1m(k) = Pm, and α(i) and β(i) are weighting coefficients for the representative waveforms P1 and Pm, expressed by equations (2) and (3), respectively.
[0010]
α(i) = 0.5 × [1 + cos{π × (L − Ti)/L}]    (2)
β(i) = 1 − α(i)    (3)

Here i is the synthesized-waveform number, Ti is the time interval from P1m(0) to P1m(i), and L is the time interval between the representative waveforms P1 and Pm. The same interpolation is performed for the next pair of temporally adjacent representative waveforms, Pm and Pn. The synthesized waveforms obtained in this way by interpolating between each pair of temporally adjacent representative waveforms are connected phoneme by phoneme in a connection and synthesis circuit 27 and output as continuous speech.
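A minimal Python sketch of this pitch-synchronous interpolation between two representative waveforms is given below; it is not part of the patent. The cosine weight in the sketch is written so that the stated boundary conditions P1m(0) = P1 and P1m(k) = Pm hold, i.e. α falls from 1 to 0 as Ti goes from 0 to L, and the linear weight of equation (4), introduced below, is offered as an alternative. All function and parameter names are illustrative.

```python
import numpy as np

def interpolate_pitch_synchronously(p1, pm, duration, pitch_period,
                                    weight="cosine"):
    """Generate the interpolated pitch waveforms P1m(0), P1m(1), ..., P1m(k)
    between two representative waveforms, one waveform per pitch period TP.

    p1, pm       : equal-length 1-D arrays (representative waveforms)
    duration     : L, time between P1 and Pm, in samples
    pitch_period : TP, synthesis pitch period in samples
    weight       : "cosine" for a raised-cosine cross-fade, "linear" for eq. (4)
    """
    assert len(p1) == len(pm) and duration > 0
    frames = []
    ti = 0.0
    while ti <= duration:
        if weight == "cosine":
            alpha = 0.5 * (1.0 + np.cos(np.pi * ti / duration))  # 1 at Ti = 0, 0 at Ti = L
        else:
            alpha = (duration - ti) / duration                   # equation (4)
        beta = 1.0 - alpha                                       # equation (3)
        frames.append(alpha * p1 + beta * pm)                    # equation (1)
        ti += pitch_period
    return frames
```

Each returned waveform would then be windowed and overlap-added at intervals of the synthesis pitch period, so that changing pitch_period or duration changes the pitch and the speaking rate of the output.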
[0011]
In the above processing, the representative waveforms can be selected, for example, by dividing the phoneme segment into several sub-segments, finding the waveform peak in each sub-segment, and taking that waveform as a representative waveform; or by using the spectral information of the phoneme segment or the dynamics of the waveform information as a guide, that is, by finding the largest peak within the phoneme segment and then, on both the earlier and later sides of it, examining the differences between successive adjacent waveforms and taking the waveforms where they change greatly, together with the maximum-peak waveform, as the representative waveforms (the second approach is sketched below). A simple linear function such as equation (4) can also be used as the weighting coefficient α(i).
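The following Python sketch (not part of the patent) illustrates selection by waveform dynamics as just described; the per-pitch frame representation, the RMS difference measure, and the threshold are assumptions made for the example.

```python
import numpy as np

def select_by_waveform_dynamics(frames, change_threshold):
    """Keep the frame with the largest peak, then scan towards both ends of the
    phoneme and also keep frames that differ strongly from their neighbour.

    frames           : list of equal-length 1-D arrays, one per pitch period
    change_threshold : RMS difference above which a frame counts as a change point
    """
    peak_idx = int(np.argmax([np.max(np.abs(f)) for f in frames]))
    selected = {peak_idx}
    for step in (-1, 1):                 # scan backward, then forward
        i = peak_idx
        while 0 <= i + step < len(frames):
            diff = np.sqrt(np.mean((frames[i + step] - frames[i]) ** 2))
            if diff > change_threshold:
                selected.add(i + step)
            i += step
    return sorted(selected)              # indices of the representative frames
```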
[0012]
α(i) = (L − Ti)/L    (4)

In this embodiment a speech synthesis system using the phoneme as the synthesis unit has been described as an example, but it is clear that the method can also be used for speech synthesis based on other synthesis units such as syllables. The invention can further be applied to speech synthesis in high-efficiency speech coding: only the representative waveforms of FIG. 2B are encoded and transmitted or stored, so that the information on the transmission path or in the storage device is compressed, and analysis-synthesis speech is obtained on the receiving or readout side by synthesizing speech according to the method of the present invention described above. Furthermore, in the case of waveform editing speech synthesis, a clustering technique such as vector quantization can be applied to the whole set of representative waveforms obtained for all synthesis units, so that several similar representative waveforms are represented by a single waveform, which raises the information compression ratio further.
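The patent names vector quantization as one possible clustering technique but does not specify a procedure. The bare-bones k-means style sketch below is an assumption, not the patent's own method; the codebook size and iteration count are illustrative parameters.

```python
import numpy as np

def cluster_representative_waveforms(waveforms, n_codewords, n_iter=20, seed=0):
    """Merge similar representative waveforms: each cluster is replaced by its
    centroid, and only the codebook needs to be stored.

    Assumes all waveforms have been padded or resampled to a common length.
    """
    data = np.stack(waveforms).astype(float)      # shape: (N, frame_length)
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign every waveform to its nearest codeword (squared error)
        dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each codeword to the mean of the waveforms assigned to it
        for k in range(n_codewords):
            members = data[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, labels
```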
[0013]
In the above description three representative waveforms were used for one phoneme, but it is sufficient that there are two or more representative waveforms.
[0014]
Effects of the Invention
As described above, according to the present invention, speech waveform information can be compressed and reproduced efficiently by using representative waveforms, so the storage capacity of a waveform editing type speech synthesis system can be greatly reduced, high-efficiency coding for transmission and storage becomes possible, and high-quality synthesized speech can still be obtained. In addition, by performing the waveform interpolation in synchronism with the pitch period and by choosing the connection time, the speaking rate and the pitch of the voice can be controlled without degrading the quality of the original speech.
FIG. 1 is a block diagram showing one embodiment of the method of the present invention.
FIG. 2 is a diagram showing examples of the original waveform, representative waveforms, and synthesized waveforms.
Claims (1)
1. A speech synthesis method in which a speech waveform of each speech unit is stored in a storage device and, at synthesis time, speech waveforms corresponding to a given speech unit sequence are read from the storage device to synthesize continuous speech, characterized in that: in the storage device, at least two waveforms at different points in time, selected as representative waveforms from among the waveforms obtained from the speech waveform within the speech unit in correspondence with the pitch period, are stored; at synthesis time, the representative waveforms corresponding to the speech units of the given speech unit sequence are read from the storage device; for each pair of temporally adjacent representative waveforms, a sequence of waveforms synchronized with a given pitch period is generated from the two representative waveforms by weighted combination, such that the speech waveform between the two representative waveforms continues for a length corresponding to a given duration; and the speech waveform of each speech unit is synthesized by interpolating between the two adjacent representative waveforms with the generated sequence of waveforms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP3044928A JP2709198B2 (en) | 1991-03-11 | 1991-03-11 | Voice synthesis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP3044928A JP2709198B2 (en) | 1991-03-11 | 1991-03-11 | Voice synthesis method |
Publications (2)
Publication Number | Publication Date |
---|---|
JPH04281499A JPH04281499A (en) | 1992-10-07 |
JP2709198B2 (en) | 1998-02-04 |
Family
ID=12705138
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP3044928A Expired - Fee Related JP2709198B2 (en) | 1991-03-11 | 1991-03-11 | Voice synthesis method |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP2709198B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003108178A (en) | 2001-09-27 | 2003-04-11 | Nec Corp | Voice synthesizing device and element piece generating device for voice synthesis |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS56168700A (en) * | 1980-05-30 | 1981-12-24 | Nippon Electric Co | Waveform edition type voice synthesizer |
1991
- 1991-03-11: JP application JP3044928A patented as JP2709198B2 (not active: Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
JPH04281499A (en) | 1992-10-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | LAPS | Cancellation because of no payment of annual fees | |