CN104217713A - Tibetan-Chinese speech synthesis method and device
- Publication number: CN104217713A
- Application number: CN201410341827.9A
- Authority: CN (China)
- Prior art keywords: tibetan, model, speaker, adaptive
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention provides a Tibetan-Chinese speech synthesis method and device that synthesize input Chinese or Tibetan sentences from a pre-built Chinese-Tibetan mixed-language corpus, so that one system can synthesize both Chinese and Tibetan speech. Compared with a traditional HMM-based (hidden-Markov-model-based) speech synthesis system, the method and device add a speaker adaptive training process to the training stage to obtain a Chinese-Tibetan mixed-language average voice model; speaker adaptive training reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, a speaker adaptive transformation algorithm needs only a small amount of Tibetan or Chinese corpus data to synthesize Tibetan or Chinese speech with good naturalness and fluency. The research is significant for promoting communication with minority-language communities and the development of minority-language speech technology.
Description
Technical Field
The invention relates to the technical field of multilingual speech synthesis, and in particular provides a method and a device for cross-language Chinese-Tibetan bilingual speech synthesis.
Background
In recent years, multilingual speech synthesis has become a research hotspot in the field of human-computer speech interaction. The technology enables man-machine voice interaction in different languages within the same system and has important application value for countries or regions where several languages are spoken. China has numerous minority languages and dialects, so research on this technology is of great significance. In the Tibetan areas of China, for example, Mandarin, Tibetan, and local dialects are all spoken; a speech system capable of cross-language multilingual synthesis would do much to promote communication with minority groups and to advance minority-language speech technology.
Research on multilingual speech synthesis at home and abroad mainly follows two approaches: unit-selection concatenative synthesis and statistical parametric speech synthesis. The basic principle of concatenative (waveform-splicing) synthesis is to derive the basic unit information by analyzing the input text, select suitable units from a pre-recorded and annotated speech library, adjust them slightly, and finally splice them into the synthesized speech. Since the units of the final synthesized speech are extracted directly from the sound library, this approach preserves the timbre of the original speaker. However, a concatenative system generally needs a large-scale speech library, so corpus construction is laborious and time-consuming; the synthesis quality depends heavily on the speech library and is strongly affected by the recording environment, and the robustness is low. The basic idea of statistical parametric speech synthesis is to decompose the input speech signal into parameters and build a statistical model, predict the speech parameters of the text to be synthesized with the trained model, and feed the parameters into a parametric synthesizer to obtain the synthesized speech. This method needs less data to build a system, requires little manual intervention, and produces smooth, fluent synthesized speech with high robustness, but the synthesized speech has lower sound quality and relatively flat, inexpressive prosody.
The HMM-based statistical parametric speech synthesis method can synthesize the speech of different speakers through speaker adaptive transformation and has become a research hotspot in cross-language multilingual speech synthesis. HMM-based multilingual synthesis systems use mixed-language modeling, phoneme mapping, or state mapping to achieve multilingual synthesis. However, most existing research targets languages with large corpora and relatively mature synthesis technology; research on dialects, minority languages, and other languages whose speech resources are hard to obtain is lacking, and no Mandarin/minority-language or Mandarin/dialect multilingual speech synthesis system has yet been realized at home or abroad. Current research on multilingual synthesis focuses on mainstream languages and mainly adopts phoneme mapping or state mapping, but both methods need a large amount of bilingual speech data. For Tibetan, which lacks speech resources, the absence of a large-scale bilingual phonetic corpus makes it difficult to apply these methods to Mandarin-Tibetan multilingual speech synthesis.
Disclosure of Invention
The invention provides a method and a device for Chinese-Tibetan bilingual speech synthesis, aiming to solve the problems identified in the background art: existing multilingual speech synthesis systems lack research on dialects, minority languages, and languages whose speech resources are hard to obtain, such as Tibetan, and cannot realize Mandarin-Tibetan multilingual speech synthesis.
In order to solve the above technical problems, the invention adopts the following technical scheme: the Chinese-Tibetan bilingual speech synthesis method comprises the following steps:
A. taking the International Phonetic Alphabet as reference, obtaining the international phonetic transcription of the input Tibetan pinyin letters, comparing it with the international phonetic transcription of Chinese pinyin, labeling the identical parts directly with SAMPA-SC and the differing parts with unused keyboard symbols, and completing the automatic SAMPA-T labeling of the Tibetan text corpus with the SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
Further, the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
firstly reading in the Tibetan sentence text, then segmenting it into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list; for each syllable, separating the initial and the final by locating the base-character block and decomposing it, the decomposition being carried out according to the base-character-block split list; and finally obtaining the SAMPA-T string of the syllable by looking up the initial SAMPA-T table and the final SAMPA-T table.
Further, the design of the Chinese-Tibetan universal phonetic transcription system and question set in step B comprises the following steps:
firstly, labeling the Tibetan initials and finals whose pronunciation is consistent with Mandarin with Chinese pinyin, and labeling those inconsistent with Mandarin pronunciation with Tibetan pinyin;
then, selecting all the initials and finals of Mandarin and Tibetan, together with silence and pause, as context-dependent MSD-HSMM synthesis primitives, and designing a context labeling format to label the context-dependent features of each synthesis primitive at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer, and sentence layer;
finally, designing a question set common to the Chinese-Tibetan bilingual setting on the basis of the context-dependent question set of Mandarin; the question set is extended with questions on the synthesis primitives peculiar to Tibetan to reflect the special pronunciations of Tibetan, and it contains more than 3000 context-dependent questions covering all the features of the context-dependent labels.
Further, the training of the mixed-language average voice model through speaker adaptive training in step C comprises the following steps:
a. performing speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus, and extracting the acoustic parameters:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices,
(2) calculating their first-order and second-order differences;
b. performing HMM model training in combination with the context attribute set, and training statistical models of the acoustic parameters:
(1) training HMM models of the spectrum and fundamental frequency parameters,
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters;
c. performing speaker adaptive training with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library, thereby obtaining the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function,
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution,
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs;
d. performing speaker adaptive transformation with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution,
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters;
e. correcting and updating the adaptive model:
(1) calculating the MAP estimation parameters of the state output and duration distributions of the average voice model with the maximum a posteriori (MAP) algorithm,
(2) calculating the mean vectors of the state output and state duration after adaptive transformation,
(3) calculating the weighted-average MAP estimates of the adaptive mean vectors;
f. inputting a text to be synthesized, and performing text analysis on the text to obtain an HMM model of a sentence;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through the parameter synthesizer, where the formulas are as follows:

$$\bar{\mu}_i^{(s)} = W^{(s)}\xi_i = A^{(s)} o_i + b^{(s)}, \qquad \bar{m}_i^{(s)} = X^{(s)}\psi_i = \alpha^{(s)} d_i + \beta^{(s)}$$

wherein $\bar{\mu}_i^{(s)}$ is the state output mean vector of training speaker $s$ and $\bar{m}_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A^{(s)}, b^{(s)}]$ and $X^{(s)} = [\alpha^{(s)}, \beta^{(s)}]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\xi_i = [o_i^{\top}, 1]^{\top}$ and $\psi_i = [d_i, 1]^{\top}$, where $o_i$ and $d_i$ are the mean observation vector and the mean duration vector.
Further, the step D of obtaining the speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model comprises the following steps:
firstly, after speaker adaptive training, calculating the mean vectors and covariance matrices of the transformed speaker's state output probability distribution and duration probability distribution with the HSMM-based CMLLR adaptation algorithm, wherein the transformation equations of the feature vector $o$ and the state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}\!\left(o;\; A\mu_i - b,\; A\Sigma_i A^{\top}\right) = \left|A^{-1}\right|\,\mathcal{N}\!\left(W\xi;\; \mu_i,\; \Sigma_i\right)$$
$$p_i(d) = \mathcal{N}\!\left(d;\; \alpha m_i - \beta,\; \alpha^2\sigma_i^2\right) = \left|\alpha^{-1}\right|\,\mathcal{N}\!\left(X\psi;\; m_i,\; \sigma_i^2\right)$$

wherein $\xi = [o^{\top}, 1]^{\top}$ and $\psi = [d, 1]^{\top}$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix, and $\sigma_i^2$ the variance; $W = [A^{-1}, A^{-1}b]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \alpha^{-1}\beta]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, normalizing and transforming the spectrum, fundamental frequency, and duration parameters of the speech data; for adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ can be estimated by maximum likelihood:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda}\, p(O \mid \lambda, \Lambda)$$

wherein $\lambda$ is the parameter set of the HSMM;
finally, correcting and updating the adaptive model of the speech with the maximum a posteriori (MAP) algorithm; for a given HSMM $\lambda$, if its forward probability and backward probability are $\alpha_t(i)$ and $\beta_t(i)$, the probability $\gamma_t^{d}(i)$ of generating the continuous observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\gamma_t^{d}(i) = \frac{1}{P(O \mid \lambda)} \sum_{j \neq i} \alpha_{t-d}(j)\, a_{ji}\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
The MAP estimates are described as follows:

$$\hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)\, d}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)}$$

wherein $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after linear regression transformation, $\omega$ and $\tau$ are the MAP estimation parameters of the state output and duration distributions, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
Further, the step E of inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Chinese speech comprises the following steps:
firstly, converting the given text into a pronunciation labeling sequence containing context description information with a text analysis tool, predicting the context-dependent HMM of each pronunciation with the decision trees obtained in the training process, and concatenating the HMMs into the HMM of the sentence;
secondly, generating the parameter sequences of spectrum, duration, and fundamental frequency from the sentence HMM with a parameter generation algorithm;
finally, using a mel log spectrum approximation (MLSA) filter as the parameter synthesizer to synthesize the speech.
Further, the Chinese-Tibetan bilingual speech synthesis device comprises: an HMM model training unit for building HMM models of the speech data; a speaker adaptive unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
Further, the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the sound library, mainly the fundamental frequency, spectrum, and duration parameters; and a target HMM model determining subunit, which trains statistical models of the acoustic parameters in combination with the context labeling information of the sound library and determines the fundamental frequency, spectrum, and duration parameters according to the context attribute set; the speech analysis subunit is connected with the target HMM model determining subunit.
Further, the speaker adaptive unit comprises a speaker training subunit, an average voice model determining subunit, a speaker adaptive transformation subunit, and an adaptive model determining subunit which are connected in sequence, the target HMM model determining subunit being connected with the speaker training subunit,
the speaker training subunit is used for normalizing the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
the average voice model determining subunit determines the Chinese-Tibetan bilingual mixed-language average voice model with the maximum likelihood linear regression algorithm;
the speaker adaptive transformation subunit calculates the mean vectors and covariance matrices of the speaker's state output probability distribution and duration probability distribution from the adaptation data and transforms them into the target speaker model;
the adaptive model determining subunit establishes the MSD-HSMM adaptive model of the target speaker.
Further, the speech synthesis unit comprises an adaptive model correction subunit and a synthesis subunit connected in sequence, the adaptive model determining subunit being connected with the adaptive model correction subunit,
the adaptive model correction subunit corrects and updates the adaptive model of the speech with the MAP algorithm, reducing the model deviation and improving the synthesis quality;
the synthesis subunit predicts the speech parameters of the input text with the corrected adaptive model, extracts the parameters, and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
The advantages and positive effects of the invention are as follows: the Chinese-Tibetan bilingual speech synthesis method and device exploit the similarity in pronunciation between Chinese and Tibetan and use HMM-based adaptive training and adaptive transformation algorithms to synthesize, within a single system and device, Chinese and Tibetan speech with good naturalness and fluency. Compared with a traditional HMM-based speech synthesis system, the system adds a speaker adaptive training process to the training stage to obtain the Chinese-Tibetan mixed-language average voice model; this process reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, the speaker adaptive transformation algorithm can synthesize Tibetan or Chinese speech with good naturalness and fluency from only a small amount of Tibetan or Chinese corpus to be synthesized. This research is of great significance for promoting communication with minority-language communities and for the development of minority-language speech technology.
Drawings
FIG. 1 is a flow chart of the Chinese-Tibetan bilingual speech synthesis method;
FIG. 2 is a flow diagram of the Tibetan text to SAMPA-T conversion;
FIG. 3 is a block diagram of the Chinese-Tibetan bilingual speaker-adaptive speech synthesis process;
FIG. 4 is a schematic diagram of the structure of the bilingual speech synthesis device;
FIG. 5 is a flow chart of model training;
FIG. 6 is a flow chart of speech synthesis.
Detailed Description
The invention provides a Chinese-Tibetan bilingual speech synthesis method. It proposes a letter-to-sound conversion algorithm oriented to SAMPA-T, the machine-readable phonetic alphabet for Tibetan, and realizes automatic SAMPA-T labeling of the Tibetan text corpus; according to the similarity between Tibetan and Mandarin, it designs a labeling system, a labeling format, and a question set common to Mandarin and Tibetan; and using the corpora of multiple Mandarin and Tibetan speakers, it finally synthesizes Chinese or Tibetan speech through HMM-based speaker adaptive training and a speaker adaptive transformation algorithm. The flow chart of the Chinese-Tibetan bilingual speech synthesis method is shown in FIG. 1; the specific steps are as follows:
(1) A SAMPA-T labeling scheme for the Tibetan Lhasa dialect is designed, and automatic SAMPA-T labeling of the Tibetan text corpus is completed with the SAMPA-T-oriented letter-to-sound conversion algorithm.
The machine-readable phonetic alphabet SAMPA (Speech Assessment Methods Phonetic Alphabet) is a computer-readable phonetic notation system that can represent all the symbols of the International Phonetic Alphabet with ASCII characters. At present, SAMPA is widely applied to the main languages of Europe and to East Asian languages such as Japanese, and SAMPA schemes have also been proposed for Mandarin Chinese, Cantonese, and Taiwanese.
Since Tibetan and Chinese both belong to the Sino-Tibetan language family, the invention designs a computer-readable phonetic transcription system for Tibetan, SAMPA-T (T for Tibetan), on the basis of the machine-readable phonetic notation scheme for Mandarin Chinese, presents the design scheme with the Tibetan Lhasa dialect as an example, and realizes the conversion of Tibetan text to SAMPA-T.
A comparison of the international phonetic transcriptions of Chinese and Tibetan shows that some of them are identical. Therefore, taking the International Phonetic Alphabet as reference, the international phonetic transcription is obtained for the input Tibetan pinyin letters and compared with that of Chinese pinyin; the identical parts are labeled directly with SAMPA-SC, and the differing parts are labeled with unused keyboard symbols according to the principle of simplicity.
Tibetan is an alphabetic script whose characters are spelled from letters, with the syllable as the basic unit. According to the structural position of the letters within a syllable, traditional Tibetan grammar divides them into the prefixed letter, the base letter, the superscribed letter, the subscribed letter, the suffixed letter, and the post-suffixed letter, the base letter being the core of the whole syllable. A Tibetan final consists of the vowel plus the suffixed letters.
The conversion of Tibetan text to SAMPA-T mainly involves Tibetan sentence segmentation, single-syllable segmentation, locating the base-character block, separating and converting the initial and the final, and assembling the SAMPA-T strings. Locating the base-character block, i.e., recognizing the base letter, the vowel, and so on, is realized mainly by a dictionary-oriented statistics and lookup method. The letter-to-sound conversion is realized mainly by looking up the SAMPA-T transcription support tables of the initials and finals. First, the Tibetan text is read in and segmented into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list. For each syllable, the initial and the final are separated by locating and decomposing the base-character block, and the SAMPA-T string of the syllable is then obtained by looking up the initial and final SAMPA-T tables. The flow chart of the conversion of Tibetan text to SAMPA-T is shown in FIG. 2, and a sketch of this procedure follows.
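The following Python sketch illustrates the table-driven structure of this conversion. It is a minimal sketch under stated assumptions: the SHAD and TSHEG delimiters are the real Unicode punctuation marks, but the SAMPA-T mappings and the naive initial/final split are illustrative stand-ins for the patent's full transcription lists and base-character-block split list.

```python
# Minimal sketch of the Tibetan-to-SAMPA-T letter-to-sound conversion.
# The two tables are tiny illustrative stand-ins for the full initial/final
# SAMPA-T lists; real syllables need the base-character-block split list
# to separate stacked and prefixed letters correctly.
SHAD = "\u0F0D"   # ། sentence/clause delimiter (the "single vertical stroke")
TSHEG = "\u0F0B"  # ་ syllable delimiter

INITIAL_SAMPA_T = {"\u0F40": "k", "\u0F41": "kh", "\u0F42": "k_"}  # ཀ ཁ ག (assumed)
FINAL_SAMPA_T = {"": "a", "\u0F72": "i", "\u0F74": "u"}            # inherent a, ི, ུ (assumed)

def split_sentences(text):
    """Segment running Tibetan text into sentences at the shad."""
    return [s for s in text.split(SHAD) if s.strip()]

def split_syllables(sentence):
    """Segment a sentence into syllables at the tsheg."""
    return [syl for syl in sentence.split(TSHEG) if syl]

def syllable_to_sampa_t(syllable):
    """Split a syllable into initial and final (naively: first letter vs rest)
    and look both parts up in the SAMPA-T tables."""
    initial, final = syllable[:1], syllable[1:]
    return INITIAL_SAMPA_T.get(initial, "?") + FINAL_SAMPA_T.get(final, "?")

def tibetan_to_sampa_t(text):
    return [[syllable_to_sampa_t(syl) for syl in split_syllables(sent)]
            for sent in split_sentences(text)]

print(tibetan_to_sampa_t("\u0F40\u0F72\u0F0B\u0F41\u0F74\u0F0D"))  # [['ki', 'khu']]
```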
(2) According to the similarity between Tibetan language and Mandarin, a universal phonetic system and a question set for Chinese and Tibetan language are designed on the basis of a Mandarin phonetic system.
Tibetan and Chinese both belong to the Sino-Tibetan language family and have many commonalities, as well as differences, in pronunciation. Both Mandarin and the Tibetan Lhasa dialect are composed of syllables, each syllable consisting of an initial and a final. Mandarin has 22 initials and 39 finals, the Tibetan Lhasa dialect has 36 initials and 45 finals, and the two languages share 20 initials and 13 finals. First, the Tibetan initials and finals whose pronunciation is consistent with Mandarin are labeled with Chinese pinyin, and those inconsistent with Mandarin pronunciation are labeled with Tibetan pinyin.
Then, all initials and finals of Mandarin and Tibetan, silence and pause are selected as the synthesis primitives of the MSD-HSMM related to the context to design a context labeling format for labeling the context-related features of the initial-final layer, the syllable layer, the word layer, the prosodic word layer, the phrase layer and the sentence layer of each synthesis primitive.
Finally, a question set common to the Chinese-Tibetan bilingual setting is designed on the basis of the context-dependent question set of Mandarin. The question set focuses on adding questions about the synthesis primitives peculiar to Tibetan so as to reflect the special pronunciations of Tibetan. It contains more than 3000 context-dependent questions, covering all the features of the context-dependent labels.
The system adopts a hierarchical labeling method for the Tibetan text corpus; the labeled content comprises the syllable layer, boundary information, and the SAMPA-T transcription result. The labeling is done with the international Praat phonetics software, and the system can add further labeling information as needed. After labeling is finished, a script program writes the labeling information into a TextGrid file, which contains four layers of labeled information, mainly pronunciation and syllable boundary information. The question set contains classification information about basic features, such as the initial, the final type, and whether a syllable lies inside a prosodic phrase; such classification information is usually a set of basic information organized according to some set of context information. By reclassifying the question set, context classification information more complex than the basic features can be obtained. In the HTS system, the designed questions are listed in a hed file, one question per line; each question is a yes/no question and begins with a QS command.
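As a concrete illustration of that QS format, the short Python snippet below writes a few hed-style question lines; the question names and label patterns are invented examples in the spirit of HTS demo question sets, not taken from the patent's actual question file.

```python
# Write a miniature HTS-style question set (.hed) file.
# Each QS line defines one yes/no question: a name plus the set of
# context-label patterns for which the answer is "yes".
questions = {
    "C-Initial_Is_Nasal": ["*-m+*", "*-n+*", "*-ng+*"],
    "C-Final_Is_Tibetan_Specific": ["*-ee+*", "*-oo+*"],
    "C-Syllable_Is_Prosodic_Phrase_Initial": ["*|PP-pos=1|*"],
}

with open("mandarin_tibetan.hed", "w", encoding="utf-8") as f:
    for name, patterns in questions.items():
        # e.g.: QS "C-Initial_Is_Nasal" {*-m+*,*-n+*,*-ng+*}
        f.write('QS "%s" {%s}\n' % (name, ",".join(patterns)))
```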
(3) A mixed-language average voice model is trained through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers.
Compared with the traditional HMM-based speech synthesis method, this method adds a speaker adaptive training process to the training stage to obtain the Chinese-Tibetan mixed-language average voice model, which reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, the speaker adaptive transformation algorithm can synthesize Tibetan or Chinese speech with good naturalness and fluency from only a small amount of Tibetan or Chinese corpus to be synthesized. The block diagram of the Chinese-Tibetan bilingual speaker-adaptive speech synthesis process is shown in FIG. 3:
Step 1, speech analysis is performed on the multi-speaker Chinese corpus data and the single-speaker Tibetan corpus data, and their acoustic parameters are extracted:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices;
(2) calculating their first-order and second-order differences (a sketch of this computation follows).
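A minimal numpy sketch of sub-step (2): appending first- and second-order differences to a frame-by-coefficient parameter matrix. The simple (-0.5, 0, 0.5) and (1, -2, 1) regression windows are a common convention and an assumption here, not taken from the patent.

```python
import numpy as np

def append_deltas(features):
    """features: (num_frames, dim) static parameters, e.g. mel-cepstra or log F0.
    Returns (num_frames, 3*dim): statics plus first- and second-order differences."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])                 # first-order difference
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]   # second-order difference
    return np.hstack([features, delta, delta2])

# Example: 100 frames of 25-dimensional mel-cepstra -> 75-dimensional observations
mcep = np.random.randn(100, 25)
obs = append_deltas(mcep)
assert obs.shape == (100, 75)
```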
Step 2, HMM model training is performed in combination with the context attribute set, and statistical models of the acoustic parameters are trained:
(1) training HMM models of the spectrum and fundamental frequency parameters;
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters.
Step 3, speaker adaptive training is performed with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library to obtain the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function;
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution;
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs.
Step 4, speaker adaptive transformation is performed with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm;
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution;
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters.
Step 5, the adaptive model is corrected and updated:
(1) calculating the MAP estimation parameters of the state output and duration distributions of the average voice model with the maximum a posteriori (MAP) algorithm;
(2) calculating the mean vectors of the state output and state duration after adaptive transformation;
(3) calculating the weighted-average MAP estimates of the adaptive mean vectors.
Step 6, the text to be synthesized is input, and text analysis is performed on it to obtain the HMM model of the sentence.
Step 7, parameter prediction is performed on the sentence HMM, the speech parameters are generated, and the synthesized speech is obtained through the parameter synthesizer.
FIG. 3 shows the flow of the Chinese-Tibetan bilingual speech synthesis process. A mixed Mandarin-Tibetan corpus is used to train the Chinese-Tibetan bilingual mixed-language average voice model with constrained maximum likelihood linear regression (CMLLR), thereby obtaining context-dependent multi-space distribution hidden semi-Markov models (MSD-HSMMs). In speaker adaptive training, the difference between each training speaker's speech data and the average voice is expressed by linear regression functions of the mean vectors of the state output distribution and the state duration distribution, and the differences between training speakers are normalized by a set of linear regression equations of these distributions; the formulas are as follows:

$$\bar{\mu}_i^{(s)} = W^{(s)}\xi_i = A^{(s)} o_i + b^{(s)}, \qquad \bar{m}_i^{(s)} = X^{(s)}\psi_i = \alpha^{(s)} d_i + \beta^{(s)}$$

wherein $\bar{\mu}_i^{(s)}$ is the state output mean vector of training speaker $s$ and $\bar{m}_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A^{(s)}, b^{(s)}]$ and $X^{(s)} = [\alpha^{(s)}, \beta^{(s)}]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\xi_i = [o_i^{\top}, 1]^{\top}$ and $\psi_i = [d_i, 1]^{\top}$, where $o_i$ and $d_i$ are the mean observation vector and the mean duration vector. A numpy sketch of these transforms follows.
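The sketch below applies one such per-speaker regression transform to average-voice quantities; it is a toy illustration with arbitrary dimensions and values, not the patent's estimation procedure (which fits W and X to training data).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 75  # e.g. statics plus deltas of the spectral stream

# Average-voice quantities for one state i (illustrative values)
o_i = rng.standard_normal(dim)  # mean observation vector
d_i = 7.0                       # mean state duration in frames

# Per-speaker transforms W = [A, b] and X = [alpha, beta]
A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
b = 0.1 * rng.standard_normal(dim)
alpha, beta = 1.1, -0.5

xi = np.concatenate([o_i, [1.0]])   # augmented vector [o_i^T, 1]^T
W = np.hstack([A, b[:, None]])      # W = [A, b]
mu_s = W @ xi                       # speaker-s state output mean, A o_i + b
m_s = alpha * d_i + beta            # speaker-s state duration mean, X [d_i, 1]^T

assert np.allclose(mu_s, A @ o_i + b)
```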
(4) The speaker adaptive model is obtained through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and the adaptive model is corrected and updated.
After speaker adaptive training, the mean vectors and covariance matrices of the transformed speaker's state output probability distribution and duration probability distribution are calculated with the HSMM-based CMLLR adaptation algorithm. The transformation equations of the feature vector $o$ and the state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}\!\left(o;\; A\mu_i - b,\; A\Sigma_i A^{\top}\right) = \left|A^{-1}\right|\,\mathcal{N}\!\left(W\xi;\; \mu_i,\; \Sigma_i\right)$$
$$p_i(d) = \mathcal{N}\!\left(d;\; \alpha m_i - \beta,\; \alpha^2\sigma_i^2\right) = \left|\alpha^{-1}\right|\,\mathcal{N}\!\left(X\psi;\; m_i,\; \sigma_i^2\right)$$

wherein $\xi = [o^{\top}, 1]^{\top}$ and $\psi = [d, 1]^{\top}$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix, and $\sigma_i^2$ the variance. $W = [A^{-1}, A^{-1}b]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \alpha^{-1}\beta]$ is the transformation matrix of the state duration probability density distribution.
Through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency, and duration parameters of the speech data can be normalized and transformed. For adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ can be estimated by maximum likelihood:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda}\, p(O \mid \lambda, \Lambda)$$

where $\lambda$ is the parameter set of the HSMM.
Finally, the adaptive model of the speech is corrected and updated with the maximum a posteriori (MAP) algorithm. For a given HSMM $\lambda$, if its forward probability and backward probability are $\alpha_t(i)$ and $\beta_t(i)$, the probability $\gamma_t^{d}(i)$ of generating the continuous observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\gamma_t^{d}(i) = \frac{1}{P(O \mid \lambda)} \sum_{j \neq i} \alpha_{t-d}(j)\, a_{ji}\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
the MAP estimate is described as follows:
wherein,andand omega and tau are respectively the MAP estimation parameters of state output and time length distribution.Andas an adaptive mean vectorAndweighted average MAP estimate of (a).
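A small numpy sketch of this weighted-average MAP update, under the simplifying assumption that the occupancy statistics have already been collapsed into a per-state soft frame count and observation sum; the parameter values are illustrative.

```python
import numpy as np

def map_update_mean(mu_bar, data_sum, occupancy, omega):
    """Weighted-average MAP estimate of a state output mean.
    mu_bar:    mean vector after the linear regression (CMLLR) transformation
    data_sum:  sum of adaptation observations assigned to the state
    occupancy: total soft frame count assigned to the state
    omega:     MAP prior weight of the state output distribution"""
    return (omega * mu_bar + data_sum) / (omega + occupancy)

mu_bar = np.array([0.2, -0.1, 0.4])
obs = np.array([[0.25, -0.05, 0.35],   # two adaptation frames
                [0.31, -0.12, 0.45]])
mu_hat = map_update_mean(mu_bar, obs.sum(axis=0), occupancy=len(obs), omega=10.0)
# With little adaptation data mu_hat stays near mu_bar; with more data it
# moves toward the data mean, which is the intended MAP behavior.
print(mu_hat)
```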
The training stage mainly comprises preprocessing and HMM training. In the preprocessing stage, the speech data in the sound library are analyzed and the corresponding speech parameters (fundamental frequency and spectral parameters) are extracted. According to the extracted parameters, the observation vector of the HMM is divided into a spectrum part and a fundamental frequency part: the spectrum parameters are modeled with continuous-probability-distribution HMMs, the fundamental frequency with multi-space probability distribution HMMs (MSD-HMMs), and the system additionally builds a state duration model with a Gaussian or gamma distribution to describe the temporal structure of speech. In addition, the HMM synthesis system describes contexts with linguistic and prosodic features. Before model training, the context attribute set and the question set used for decision-tree clustering are designed; that is, context attributes that influence the acoustic parameters (spectrum, fundamental frequency, and duration) are selected according to prior knowledge, and a corresponding question set is designed for context-dependent model clustering. A sketch of the multi-space fundamental frequency stream follows.
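To make the multi-space F0 modeling concrete, the sketch below shows one common way to encode an F0 track as an MSD observation stream: voiced frames carry a continuous log F0 value, unvoiced frames carry only a discrete "unvoiced" space label. The encoding is an illustrative convention, not the internal format of any particular toolkit.

```python
import math

def to_msd_stream(f0_track):
    """Map an F0 track (Hz, 0.0 = unvoiced) to MSD observations (space, value):
    space 0 is the voiced one-dimensional space carrying log F0,
    space 1 is the unvoiced zero-dimensional space with no continuous value."""
    stream = []
    for f0 in f0_track:
        if f0 > 0.0:
            stream.append((0, math.log(f0)))  # voiced frame: continuous log F0
        else:
            stream.append((1, None))          # unvoiced frame: discrete symbol only
    return stream

print(to_msd_stream([0.0, 0.0, 181.0, 185.5, 0.0]))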
During model training, the HMMs of the acoustic parameter vector sequences are trained with the EM algorithm according to the ML criterion. Finally, the spectrum parameter models, fundamental frequency parameter models, and duration models are clustered with context decision trees to obtain the prediction models used for synthesis. The whole model training process is shown in FIG. 5.
(5) Inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
First, the given text is converted by a text analysis tool into a pronunciation labeling sequence containing context description information; the context-dependent HMM of each pronunciation is predicted with the decision trees obtained in training, and the HMMs are concatenated into the HMM of the sentence. Then, a parameter generation algorithm generates the parameter sequences of spectrum, duration, and fundamental frequency from the sentence HMM. Finally, a mel log spectrum approximation (MLSA) filter is used as the parameter synthesizer to synthesize the speech. The scheme of the whole synthesis stage is shown in FIG. 6, and a vocoding sketch follows.
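As an illustration of the final vocoding step, the sketch below drives an MLSA filter with a pulse/noise excitation through the pysptk package. The API usage and the parameter values (order, all-pass constant alpha, hop size) are assumptions to be checked against the pysptk documentation, and the zero-valued mel-cepstra stand in for parameters generated from the sentence HMM.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

fs, hop = 16000, 80        # 16 kHz speech, 5 ms frame shift (assumed)
order, alpha = 24, 0.41    # mel-cepstrum order and all-pass constant for 16 kHz

# Stand-ins for parameters predicted from the sentence HMM:
frames = 200
mc = np.zeros((frames, order + 1))        # mel-cepstra, one row per frame
f0 = np.full(frames, 160.0)               # F0 in Hz
f0[:40] = 0.0                             # an unvoiced stretch

# Pulse/noise excitation from the pitch period in samples (0 = unvoiced)
pitch = np.where(f0 > 0.0, fs / np.maximum(f0, 1e-8), 0.0)
excitation = pysptk.excite(pitch.astype(np.float64), hop)

b = pysptk.mc2b(mc, alpha)                # mel-cepstra -> MLSA filter coefficients
synthesizer = Synthesizer(MLSADF(order=order, alpha=alpha), hop)
waveform = synthesizer.synthesis(excitation, b)  # synthesized speech samples
```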
Corresponding to the method, the invention also provides a Chinese-Tibetan bilingual speech synthesis device, which performs speech synthesis on the input Chinese or Tibetan sentences to be synthesized using the pre-established Chinese-Tibetan bilingual corpus; the functions of the device can be realized by software, by hardware, or by a combination of the two. The internal structure of the device is shown schematically in FIG. 4.
The internal structure of the device comprises an HMM model training unit, a speaker adaptive unit, and a speech synthesis unit.
1. An HMM model training unit for building HMM models of the speech data:
(1) the speech analysis subunit extracts the acoustic parameters of the speech data in the sound library, mainly the fundamental frequency, spectrum, and duration parameters;
(2) the target HMM model determining subunit trains statistical models of the acoustic parameters in combination with the context labeling information of the sound library and determines the fundamental frequency, spectrum, and duration parameters according to the context attribute set.
2. A speaker adaptive unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model:
(1) the speaker training subunit normalizes the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
(2) the average voice model determining subunit determines the Chinese-Tibetan bilingual mixed-language average voice model with the maximum likelihood linear regression algorithm;
(3) the speaker adaptive transformation subunit calculates the mean vectors and covariance matrices of the speaker's state output probability distribution and duration probability distribution from the adaptation data and transforms them into the target speaker model;
(4) the adaptive model determining subunit establishes the MSD-HSMM adaptive model of the target speaker.
3. A speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized:
(1) the adaptive model correction subunit corrects and updates the adaptive model of the speech with the MAP algorithm, reducing the model deviation and improving the synthesis quality;
(2) the synthesis subunit predicts the speech parameters of the input text with the corrected adaptive model, extracts the parameters, and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
The processes of the above method may be implemented by hardware executing program instructions; the program may be stored in a readable storage medium, and when executed it performs the corresponding steps of the above method.
To compare the method adopted by the invention with other methods, the quality of the synthesized Tibetan and Chinese speech was evaluated. Three different MSD-HSMM models were trained, and the advantages of the disclosed method are illustrated by comparing the quality of the speech synthesized under the three models.
The speech library selected as training data comprises the Mandarin speech of 7 female speakers (169 sentences per speaker) and 800 recorded sentences of a female Tibetan speaker. The Tibetan sentences were selected from recent Tibetan newspapers. All recordings were saved in Microsoft WAV file format (single channel, 16-bit quantization, 16 kHz sampling).
In the experiment, 100 of the 800 Tibetan sentences were randomly selected as test sentences. From the remaining 700 Tibetan sentences, 10, 100, and 700 sentences were randomly picked to establish 3 Tibetan training sets. Together with the training speech of the 7 female Mandarin speakers, these training sets were used to train the Chinese-Tibetan bilingual mixed-language average voice models. The 3 Tibetan training sets and the training corpus of the first female Mandarin speaker were also used to obtain the speaker-dependent acoustic models of Tibetan and Mandarin.
1) SD model: speaker-dependent models of the Tibetan Lhasa dialect, trained with the 3 Tibetan training sets (10/100/700 Tibetan sentences) respectively.
2) SI model: a speaker-independent model trained using only the training sentences of the 7 female Mandarin speakers.
3) SAT model: first, 3 Chinese-Tibetan bilingual mixed-language average voice models were trained with the 3 Tibetan training sets and all the Mandarin training sentences of the 7 Mandarin speakers; then the speaker-dependent models were obtained with the 3 Tibetan training sets and the training sentences of the first Mandarin speaker respectively.
During the evaluation, the Tibetan Lhasa dialect test sentences synthesized by the SD and SAT models were played in random order to 8 Tibetan Lhasa dialect evaluators, 120 test speech files in total (20 Tibetan test sentences x 3 Tibetan training sets x 2 models). The evaluators were asked to listen to the 120 sentences carefully and to score the speech quality of each sentence on a 5-point scale. After the MOS evaluation, the evaluators were also asked to describe the overall intelligibility of the Tibetan speech synthesized from the different Tibetan Lhasa dialect training sets. In the Chinese MOS evaluation, the same method was adopted: 54 synthesized Mandarin sentences (18 Mandarin test sentences x 3 Tibetan training sets) were played in random order to Mandarin evaluators, and the speech quality of each Mandarin sentence was scored on a 5-point scale.
The MOS results of the speech synthesized under the different Tibetan training sets show the average MOS scores with their 95% confidence intervals. For Tibetan synthesized speech, the SAT model is superior to the SD model under every Tibetan training set. With 10 Tibetan training sentences, the MOS score of the speech synthesized by the SD model is only 1.99, while that of the SAT model is relatively high at 2.4. During the evaluation, the evaluators found the Tibetan synthesized by the SD model difficult to understand, while that synthesized by the SAT model was easy to understand. With 100 Tibetan training sentences, the MOS scores and intelligibility of both models improve, but the SAT model is still clearly superior to the SD model. When the training set reaches 700 sentences, the MOS scores of the two models are substantially the same, and the evaluators found the synthesized speech easy to understand. Therefore, in the small-corpus case the quality of the speech synthesized by the SAT model is better than that of the SD model, and as the Tibetan corpus grows the quality of the Tibetan speech synthesized by the two models tends to converge. The disclosed method is thus very suitable for synthesizing high-quality speech when the corpus is scarce.
For Mandarin synthesized speech, under every Tibetan training set the Tibetan sentences mixed into the training corpus have almost no influence on the Mandarin synthesis results; the MOS scores of the synthesized Mandarin are all around 4.0, a good synthesis effect.
The similarity of the synthesized speech was evaluated with the DMOS method. In the DMOS evaluation, all test sentences and their original recordings took part. There were 140 synthesized Tibetan speech files in total (20 Tibetan sentences x 3 Tibetan training sets x 2 models + 20 Tibetan sentences synthesized by the SI model). Each synthesized Tibetan sentence and its original recording form a group of speech files. The 140 groups of test files were played in random order to the Tibetan Lhasa dialect evaluators: first the original Tibetan recording, then the synthesized Tibetan speech. The evaluators were asked to compare the two speech files carefully and to rate the degree of similarity of the synthesized speech to the original on a 5-point scale, where 5 means the synthesized speech is essentially the same as the original and 1 means it differs greatly from the original.
In the Chinese DMOS evaluation, the same method was adopted: 54 groups of Mandarin speech (18 Mandarin test sentences x 3 Tibetan training sets) were played in random order to the Mandarin evaluators, and the degree of similarity of each group of Mandarin sentences was rated on a 5-point scale.
The results show the average DMOS scores with their 95% confidence intervals. For Tibetan synthesized speech, the DMOS score of the SI model is 2.41, better than the SD model trained with 10 Tibetan sentences and close to the SAT model trained with 10 Tibetan sentences. In the subjective evaluation, the Tibetan speech synthesized by the SI model sounded like Tibetan spoken by a non-Tibetan speaker. This is because Mandarin and Tibetan not only share 33 synthesis primitives but also have the same syllable structure and prosodic structure; hence speech resembling Tibetan can be synthesized using only the Mandarin model. As more Tibetan training sentences are added, the DMOS score of the Tibetan speech synthesized by the SAT model surpasses that of the SD model. When the Tibetan training set grows to 700 sentences, the DMOS score of the SD model is very close to that of the SAT model. This shows that, when few Tibetan sentences are available, the Tibetan Lhasa dialect speech synthesized by the disclosed method is superior to that synthesized by the SD-model-based method.
The embodiments of the invention have been described in detail above, but the description covers only preferred embodiments and should not be construed as limiting the scope of the invention. All equivalent changes and modifications made within the scope of the invention should be covered by this patent.
Claims (10)
1. A Chinese-Tibetan bilingual speech synthesis method, characterized in that it comprises the following steps:
A. taking the International Phonetic Alphabet as reference, obtaining the international phonetic transcription of the input Tibetan pinyin letters, comparing it with the international phonetic transcription of Chinese pinyin, labeling the identical parts directly with SAMPA-SC and the differing parts with unused keyboard symbols, and completing the automatic SAMPA-T labeling of the Tibetan text corpus with the SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
2. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
firstly reading in the Tibetan sentence text, then segmenting it into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list; for each syllable, separating the initial and the final by locating the base-character block and decomposing it, the decomposition being carried out according to the base-character-block split list; and finally obtaining the SAMPA-T string of the syllable by looking up the initial SAMPA-T table and the final SAMPA-T table.
3. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the design of the Chinese-Tibetan universal phonetic transcription system and question set in step B comprises the following steps:
firstly, labeling the Tibetan initials and finals whose pronunciation is consistent with Mandarin with Chinese pinyin, and labeling those inconsistent with Mandarin pronunciation with Tibetan pinyin;
then, selecting all the initials and finals of Mandarin and Tibetan, together with silence and pause, as context-dependent MSD-HSMM synthesis primitives, and designing a context labeling format to label the context-dependent features of each synthesis primitive at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer, and sentence layer;
finally, designing a question set common to the Chinese-Tibetan bilingual setting on the basis of the context-dependent question set of Mandarin; the question set is extended with questions on the synthesis primitives peculiar to Tibetan to reflect the special pronunciations of Tibetan, and it contains more than 3000 context-dependent questions covering all the features of the context-dependent labels.
4. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the training of the mixed-language average voice model through speaker adaptive training in step C comprises the following steps:
a. performing speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus, and extracting the acoustic parameters:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices,
(2) calculating their first-order and second-order differences;
b. performing HMM model training in combination with the context attribute set, and training statistical models of the acoustic parameters:
(1) training HMM models of the spectrum and fundamental frequency parameters,
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters;
c. performing speaker adaptive training with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library, thereby obtaining the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function,
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution,
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs;
d. performing speaker adaptive transformation with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution,
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters;
e. modifying and updating the adaptive model:
(1) computing the MAP estimation parameters of the state output and duration distributions of the average voice model using the maximum a posteriori (MAP) algorithm,
(2) computing the mean vectors of the state output and state duration distributions after adaptive transformation,
(3) computing the weighted-average MAP estimates of the adaptive mean vectors;
f. inputting the text to be synthesized and performing text analysis on it to obtain the sentence HMM;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through a parameter synthesizer, wherein the speaker-adaptive-training formulas are as follows:
$$\mu_i^{(s)} = W\xi_i = A o_i + b, \qquad \xi_i = [o_i^\top, 1]^\top$$

$$m_i^{(s)} = X\psi_i = \alpha d_i + \beta, \qquad \psi_i = [d_i, 1]^\top$$

wherein $\mu_i^{(s)}$ is the state output mean vector of training speaker $s$, $m_i^{(s)}$ is its state duration mean vector, $W = [A, b]$ and $X = [\alpha, \beta]$ are the transformation matrices expressing the differences of the state output distribution and the state duration distribution between speaker $s$ and the average voice model, and $o_i$ and $d_i$ are the average observation vector and the average duration vector.
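As referenced in step a above, here is a minimal numpy sketch of appending first- and second-order dynamic features to a static parameter track; the regression-window coefficients are common HTS defaults, assumed here rather than taken from the patent.

```python
import numpy as np

# Minimal sketch of step a(2): append first- and second-order dynamic
# features to a static parameter track (e.g., Mel-cepstra or log F0).
# The regression windows are conventional HTS defaults, assumed here.
DELTA_WINDOWS = [
    np.array([1.0]),               # static
    np.array([-0.5, 0.0, 0.5]),    # first-order difference
    np.array([0.25, -0.5, 0.25]),  # second-order difference
]

def add_deltas(static: np.ndarray) -> np.ndarray:
    """static: (T, D) frames -> (T, 3*D) [static, delta, delta-delta]."""
    T, _ = static.shape
    out = []
    for win in DELTA_WINDOWS:
        half = len(win) // 2
        padded = np.pad(static, ((half, half), (0, 0)), mode="edge")
        feat = np.zeros_like(static)
        for k, w in enumerate(win):
            feat += w * padded[k:k + T]   # weighted shifted copies
        out.append(feat)
    return np.concatenate(out, axis=1)

mcep = np.random.randn(100, 25)   # 100 frames of 25-dim Mel-cepstra (toy)
features = add_deltas(mcep)       # shape (100, 75)
```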
5. The method of synthesizing Mandarin-Tibetan bilingual speech according to claim 1, wherein obtaining the speaker adaptive model in step D, through speaker adaptive transformation using a small corpus of the target Tibetan or Mandarin speaker to be synthesized, and modifying and updating the adaptive model, comprises the following steps:
first, after speaker adaptive training, the mean vectors and covariance matrices of the speaker-transformed state output probability distributions and duration probability distributions are computed using the HSMM-based CMLLR adaptation algorithm, wherein the transformation equations of a feature vector $o$ and a state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}(o;\, A\mu_i - b,\, A\Sigma_i A^\top) = |A^{-1}|\, \mathcal{N}(W\xi;\, \mu_i, \Sigma_i)$$

$$p_i(d) = \mathcal{N}(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\, \mathcal{N}(X\psi;\, m_i, \sigma_i^2)$$

wherein $\xi = [o^\top, 1]^\top$, $\psi = [d, 1]^\top$, $\mu_i$ is the mean of the state output distribution, $m_i$ is the mean of the state duration distribution, $\Sigma_i$ is the diagonal covariance matrix, $\sigma_i^2$ is the variance of the duration distribution, $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency and duration parameters of the speech data are normalized and transformed; for adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ is estimated by maximum likelihood,

$$\hat{\Lambda} = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda)$$

wherein $\lambda$ is the parameter set of the HSMM;
finally, the maximum a posteriori (MAP) algorithm is used to modify and update the adaptive model of the speech; for a given HSMM $\lambda$ with forward probability $\alpha_t(i)$ and backward probability $\beta_t(i)$, the probability $\chi_t^d(i)$ of generating the consecutive observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\chi_t^d(i) = \frac{1}{P(O \mid \lambda)}\, \alpha_{t-d}(i)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
the MAP estimates are then:

$$\hat{\mu}_i = \frac{\omega\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}$$

$$\hat{m}_i = \frac{\tau\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)}$$

wherein $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after linear regression transformation, $\omega$ and $\tau$ are respectively the MAP estimation parameters of the state output and duration distributions, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
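To make the two updates concrete, this small numpy sketch applies a CMLLR-style linear transform W = [A, b] to a state output mean and then performs the weighted-average MAP update in the form given above; the matrices, weights and adaptation data are all illustrative, not values from the patent.

```python
import numpy as np

# Sketch of the two adaptation steps of claims 4-5, using the equation
# forms given above; all concrete numbers are illustrative only.

def cmllr_transform_mean(A: np.ndarray, b: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Apply W = [A, b] to a state-output mean: mu_bar = A @ mu + b."""
    return A @ mu + b

def map_update(mu_bar: np.ndarray, omega: float,
               occupancy: float, obs_sum: np.ndarray) -> np.ndarray:
    """Weighted-average MAP estimate:
    mu_hat = (omega * mu_bar + sum of observations) / (omega + occupancy)."""
    return (omega * mu_bar + obs_sum) / (omega + occupancy)

rng = np.random.default_rng(0)
mu = rng.normal(size=3)                    # average-voice state output mean
A = np.eye(3) * 1.1                        # toy regression matrix
b = np.full(3, 0.05)                       # toy bias
mu_bar = cmllr_transform_mean(A, b, mu)    # linearly transformed mean

obs = rng.normal(loc=mu_bar, size=(20, 3))  # 20 frames of adaptation data
mu_hat = map_update(mu_bar, omega=5.0,
                    occupancy=len(obs), obs_sum=obs.sum(axis=0))
# With little data mu_hat stays close to mu_bar; with more data it
# moves toward the sample mean of the adaptation frames.
```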
6. The method of synthesizing Mandarin-Tibetan bilingual speech according to claim 1, wherein inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Mandarin speech in step E comprises the following steps:
first, converting the given text into a pronunciation labeling sequence containing context description information using a text analysis tool, predicting the context-dependent HMM of each pronunciation using the decision trees obtained during training, and concatenating these models into a sentence HMM;
secondly, generating the parameter sequences of the spectrum, duration and fundamental frequency from the sentence HMM using a parameter generation algorithm;
finally, synthesizing the speech using a Mel log spectrum approximation (MLSA) filter as the parameter synthesizer.
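As an illustration of this last step, the sketch below builds a pulse/noise excitation from a frame-level F0 track and runs it through an MLSA filter. It assumes the third-party pysptk package and typical settings (16 kHz audio, 5 ms frames, order-24 Mel-cepstra, all-pass constant 0.41); none of these values come from the patent, and a deployed system would use the parameters generated from the sentence HMM rather than the toy tracks below.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

# Sketch of the MLSA-based parameter synthesizer of claim 6, assuming
# the pysptk package; frame shift, order and alpha are typical values.
FS, HOP, ORDER, ALPHA = 16000, 80, 24, 0.41   # 16 kHz, 5 ms frames

T = 200                                        # number of frames
f0 = np.full(T, 120.0)                         # flat 120 Hz contour (toy)
f0[150:] = 0.0                                 # trailing unvoiced frames
mc = np.zeros((T, ORDER + 1))                  # toy Mel-cepstral track

# Pitch period in samples (0 marks unvoiced), then pulse/noise excitation.
pitch = np.where(f0 > 0, FS / np.maximum(f0, 1e-8), 0.0)
excitation = pysptk.excite(pitch.astype(np.float64), HOP)

b = pysptk.mc2b(mc, alpha=ALPHA)               # Mel-cepstra -> MLSA filter coefs
synthesizer = Synthesizer(MLSADF(order=ORDER, alpha=ALPHA), HOP)
waveform = synthesizer.synthesis(excitation, b)  # synthesized samples
```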
7. A Mandarin-Tibetan bilingual speech synthesis device, characterized by comprising: an HMM model training unit for establishing HMM models of the speech data; a speaker adaptation unit for normalizing and transforming the characteristic parameters of the training speakers to obtain an adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Mandarin speech to be synthesized.
8. The device according to claim 7, wherein the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the fundamental frequency, spectrum and duration parameters; and a target HMM model determination subunit, which trains statistical models of the acoustic parameters in combination with the context labeling information of the corpus and determines the fundamental frequency, spectrum and duration parameters according to the context attribute set, the speech analysis subunit being connected to the target HMM model determination subunit.
9. The device according to claim 8, wherein the speaker adaptation unit comprises a speaker training subunit, an average voice model determination subunit, a speaker adaptive transformation subunit and an adaptive model determination subunit, connected in sequence, the target HMM model determination subunit being connected to the speaker training subunit, wherein:
the speaker training subunit normalizes the differences of the state output distributions and the state duration distributions between the training speakers and the average voice model;
the average voice model determination subunit determines the Mandarin-Tibetan bilingual mixed-language average voice model using the maximum likelihood linear regression algorithm;
the speaker adaptive transformation subunit computes the mean vectors and covariance matrices of the speaker's state output probability distributions and duration probability distributions from the adaptation data and transforms them into the target speaker model;
the adaptive model determination subunit establishes the MSD-HSMM adaptive model of the target speaker.
10. The device according to claim 9, wherein the speech synthesis unit comprises an adaptive model modification subunit and a synthesis subunit, connected in sequence, the adaptive model determination subunit being connected to the adaptive model modification subunit, wherein:
the adaptive model modification subunit modifies and updates the adaptive model of the speech using the MAP algorithm, reducing model bias and improving synthesis quality;
the synthesis subunit predicts the speech parameters of the input text using the modified adaptive model, generates the parameters, and finally synthesizes the Mandarin or Tibetan speech through the speech synthesizer.
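Read together, claims 7-10 describe a three-unit pipeline. The Python sketch below mirrors that structure; every class and method name is a hypothetical placeholder and the bodies are stubs, since the patent specifies the units' roles but not an implementation.

```python
# Structural sketch of the device of claims 7-10; every name here is a
# hypothetical placeholder, and the method bodies are stubs.

class HMMTrainingUnit:
    """Claim 8: speech analysis + target HMM model determination."""
    def train(self, corpus):
        acoustic_params = self.analyze(corpus)   # F0, spectrum, duration
        return self.determine_target_hmm(acoustic_params, corpus.context_labels)
    def analyze(self, corpus): ...
    def determine_target_hmm(self, params, labels): ...

class SpeakerAdaptationUnit:
    """Claim 9: SAT -> average voice model -> CMLLR transform -> adaptive model."""
    def adapt(self, target_hmm, adaptation_data):
        avg_voice = self.speaker_adaptive_training(target_hmm)
        transforms = self.estimate_cmllr(avg_voice, adaptation_data)
        return self.build_adaptive_msd_hsmm(avg_voice, transforms)
    def speaker_adaptive_training(self, hmm): ...
    def estimate_cmllr(self, model, data): ...
    def build_adaptive_msd_hsmm(self, model, transforms): ...

class SpeechSynthesisUnit:
    """Claim 10: MAP modification of the adaptive model + synthesis."""
    def synthesize(self, adaptive_model, text):
        model = self.map_modify(adaptive_model)
        params = self.generate_parameters(model, text)
        return self.vocode(params)               # e.g., MLSA filtering
    def map_modify(self, model): ...
    def generate_parameters(self, model, text): ...
    def vocode(self, params): ...
```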
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410341827.9A CN104217713A (en) | 2014-07-15 | 2014-07-15 | Tibetan-Chinese speech synthesis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104217713A (en) | 2014-12-17
Family
ID=52099126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410341827.9A | Tibetan-Chinese speech synthesis method and device | 2014-07-15 | 2014-07-15
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104217713A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6035271A (en) * | 1995-03-15 | 2000-03-07 | International Business Machines Corporation | Statistical methods and apparatus for pitch extraction in speech recognition, synthesis and regeneration |
CN101290766A (en) * | 2007-04-20 | 2008-10-22 | 西北民族大学 | Syllable splitting method of Tibetan language of Anduo |
US20090055162A1 (en) * | 2007-08-20 | 2009-02-26 | Microsoft Corporation | Hmm-based bilingual (mandarin-english) tts techniques |
CN202615783U (en) * | 2012-05-23 | 2012-12-19 | 西北师范大学 | Mel cepstrum analysis synthesizer based on FPGA |
CN203276836U (en) * | 2013-06-08 | 2013-11-06 | 西北民族大学 | Novel Tibetan language identification apparatus |
CN103440236A (en) * | 2013-09-16 | 2013-12-11 | 中央民族大学 | United labeling method for syntax of Tibet language and semantic roles |
Non-Patent Citations (14)
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104538025A (en) * | 2014-12-23 | 2015-04-22 | 西北师范大学 | Method and device for converting gestures to Chinese and Tibetan bilingual voices |
CN104882141A (en) * | 2015-03-03 | 2015-09-02 | 盐城工学院 | Serial port voice control projection system based on time delay neural network and hidden Markov model |
CN106297764A (en) * | 2015-05-27 | 2017-01-04 | 科大讯飞股份有限公司 | A kind of multilingual mixed Chinese language treatment method and system |
CN106294311B (en) * | 2015-06-12 | 2019-03-19 | 科大讯飞股份有限公司 | A kind of Tibetan language tone prediction technique and system |
CN106294311A (en) * | 2015-06-12 | 2017-01-04 | 科大讯飞股份有限公司 | A kind of Tibetan language tone Forecasting Methodology and system |
CN105390133A (en) * | 2015-10-09 | 2016-03-09 | 西北师范大学 | Tibetan TTVS system realization method |
CN105654939A (en) * | 2016-01-04 | 2016-06-08 | 北京时代瑞朗科技有限公司 | Voice synthesis method based on voice vector textual characteristics |
CN105654939B (en) * | 2016-01-04 | 2019-09-13 | 极限元(杭州)智能科技股份有限公司 | A kind of phoneme synthesizing method based on sound vector text feature |
CN106128450A (en) * | 2016-08-31 | 2016-11-16 | 西北师范大学 | The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese |
CN107886938B (en) * | 2016-09-29 | 2020-11-17 | 中国科学院深圳先进技术研究院 | Virtual reality guidance hypnosis voice processing method and device |
CN107886938A (en) * | 2016-09-29 | 2018-04-06 | 中国科学院深圳先进技术研究院 | Virtual reality guides hypnosis method of speech processing and device |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN108573694A (en) * | 2018-02-01 | 2018-09-25 | 北京百度网讯科技有限公司 | Language material expansion and speech synthesis system construction method based on artificial intelligence and device |
CN110232909A (en) * | 2018-03-02 | 2019-09-13 | 北京搜狗科技发展有限公司 | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing |
CN108492821B (en) * | 2018-03-27 | 2021-10-22 | 华南理工大学 | Method for weakening influence of speaker in voice recognition |
CN108492821A (en) * | 2018-03-27 | 2018-09-04 | 华南理工大学 | A kind of method that speaker influences in decrease speech recognition |
CN109036370A (en) * | 2018-06-06 | 2018-12-18 | 安徽继远软件有限公司 | A kind of speaker's voice adaptive training method |
CN109003601A (en) * | 2018-08-31 | 2018-12-14 | 北京工商大学 | A kind of across language end-to-end speech recognition methods for low-resource Tujia language |
CN109949796A (en) * | 2019-02-28 | 2019-06-28 | 天津大学 | A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN110534089A (en) * | 2019-07-10 | 2019-12-03 | 西安交通大学 | A kind of Chinese speech synthesis method based on phoneme and rhythm structure |
CN110349567A (en) * | 2019-08-12 | 2019-10-18 | 腾讯科技(深圳)有限公司 | The recognition methods and device of voice signal, storage medium and electronic device |
CN110349567B (en) * | 2019-08-12 | 2022-09-13 | 腾讯科技(深圳)有限公司 | Speech signal recognition method and device, storage medium and electronic device |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111833845A (en) * | 2020-07-31 | 2020-10-27 | 平安科技(深圳)有限公司 | Multi-language speech recognition model training method, device, equipment and storage medium |
CN111833845B (en) * | 2020-07-31 | 2023-11-24 | 平安科技(深圳)有限公司 | Multilingual speech recognition model training method, device, equipment and storage medium |
CN111986646A (en) * | 2020-08-17 | 2020-11-24 | 云知声智能科技股份有限公司 | Dialect synthesis method and system based on small corpus |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN111986646B (en) * | 2020-08-17 | 2023-12-15 | 云知声智能科技股份有限公司 | Dialect synthesis method and system based on small corpus |
CN115547292A (en) * | 2022-11-28 | 2022-12-30 | 成都启英泰伦科技有限公司 | Acoustic model training method for speech synthesis |
CN115547292B (en) * | 2022-11-28 | 2023-02-28 | 成都启英泰伦科技有限公司 | Acoustic model training method for speech synthesis |
CN117275458A (en) * | 2023-11-20 | 2023-12-22 | 深圳市加推科技有限公司 | Speech generation method, device and equipment for intelligent customer service and storage medium |
CN117275458B (en) * | 2023-11-20 | 2024-03-05 | 深圳市加推科技有限公司 | Speech generation method, device and equipment for intelligent customer service and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104217713A (en) | Tibetan-Chinese speech synthesis method and device | |
Ramani et al. | A common attribute based unified HTS framework for speech synthesis in Indian languages | |
Abushariah et al. | Arabic speaker-independent continuous automatic speech recognition based on a phonetically rich and balanced speech corpus. | |
US10235991B2 (en) | Hybrid phoneme, diphone, morpheme, and word-level deep neural networks | |
CN107103900A (en) | A kind of across language emotional speech synthesizing method and system | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
Maia et al. | Towards the development of a brazilian portuguese text-to-speech system based on HMM. | |
CN116092471A (en) | Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition | |
Labied et al. | Moroccan dialect “Darija” automatic speech recognition: a survey | |
Sakti et al. | Development of HMM-based Indonesian speech synthesis | |
Azim et al. | Large vocabulary Arabic continuous speech recognition using tied states acoustic models | |
Liu et al. | A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin | |
Sun et al. | A method for generation of Mandarin F0 contours based on tone nucleus model and superpositional model | |
Bonafonte et al. | The UPC TTS system description for the 2008 blizzard challenge | |
Chiang et al. | The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0 | |
Nursetyo | LatAksLate: Javanese script translator based on Indonesian speech recognition using sphinx-4 and google API | |
JP7406418B2 (en) | Voice quality conversion system and voice quality conversion method | |
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
Bouselmi et al. | Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling | |
Janyoi et al. | An Isarn dialect HMM-based text-to-speech system | |
Iyanda et al. | Development of a Yorúbà Text-to-Speech System Using Festival |
Biczysko | Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian | |
Hirose et al. | Superpositional modeling of fundamental frequency contours for HMM-based speech synthesis | |
Hosn et al. | New resources for brazilian portuguese: Results for grapheme-to-phoneme and phone classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20141217 |