
CN104217713A - Tibetan-Chinese speech synthesis method and device - Google Patents

Tibetan-Chinese speech synthesis method and device Download PDF

Info

Publication number
CN104217713A
CN104217713A (application CN201410341827.9A)
Authority
CN
China
Prior art keywords
tibetan
model
speaker
adaptive
Prior art date
Legal status
Pending
Application number
CN201410341827.9A
Other languages
Chinese (zh)
Inventor
杨鸿武
王海燕
徐世鹏
裴东
甘振业
Current Assignee
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN201410341827.9A
Publication of CN104217713A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a Tibetan-Chinese speech synthesis method and device that use a pre-established Chinese-Tibetan mixed-language corpus to synthesize input Chinese or Tibetan sentences, so that both Chinese and Tibetan speech can be synthesized by the same system. Compared with a traditional HMM-based (hidden Markov model based) speech synthesis system, a speaker-adaptive training step is added in the training phase to obtain a Chinese-Tibetan mixed-language average voice model; this step reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, Tibetan or Chinese speech with good naturalness and fluency can be synthesized from only a small amount of Tibetan or Chinese corpus data through a speaker-adaptive transformation algorithm. The research is significant for promoting communication with ethnic minorities and the development of minority-language speech technology.

Description

Method and device for synthesizing bilingual speech from Chinese and Tibetan
Technical Field
The invention relates to the technical field of multilingual speech synthesis, and particularly provides a method and a device for cross-language bilingual speech synthesis in Chinese and Tibetan.
Background
In recent years, multilingual speech synthesis has become a research hotspot in human-computer speech interaction. The technology enables voice interaction in different languages within the same system and has important application value for countries or regions where several languages are spoken. China has numerous minority languages and dialects, so this research is of great significance: in the Tibetan regions of China, Mandarin, Tibetan and their dialects are all spoken, and a speech system capable of cross-language multilingual synthesis would do much to promote communication with ethnic minorities and to advance minority-language speech technology.
Research on multilingual speech synthesis at home and abroad mainly follows two approaches: unit-selection concatenative synthesis and statistical parametric speech synthesis. The basic principle of waveform-concatenation synthesis is to derive the basic unit information by analyzing the input text, select suitable units from a pre-recorded and annotated speech library, apply a small amount of adjustment, and finally concatenate the units into the synthesized speech. Since the units of the final synthesized speech are extracted directly from the speech library, it preserves the voice quality of the original speaker. However, a waveform-concatenation system generally needs a large-scale speech library; corpus production is very laborious and time-consuming, the synthesis quality depends heavily on the speech library and the recording environment, and robustness is low. The basic idea of statistical parametric speech synthesis is to decompose the input speech signal into parameters and build a statistical model, predict the speech parameters of the text to be synthesized with the trained model, and feed these parameters to a parametric synthesizer to obtain the synthesized speech. This approach needs less data to build a system and little manual intervention, and the synthesized speech is smooth, fluent and robust; but the voice quality is lower and the prosody tends to be flat and lacking in expressiveness.
The HMM-based statistical parametric speech synthesis method can synthesize the voices of different speakers through speaker-adaptive transformation, and has become a research hotspot in cross-language multilingual speech synthesis. HMM-based multilingual synthesis systems use mixed-language modeling, phoneme mapping or state mapping to achieve multilingual synthesis. However, most existing research targets languages with large corpora and relatively mature synthesis technology, while dialects, minority languages and languages with scarce speech resources are under-studied; no multilingual synthesis system for Mandarin/minority-language or Mandarin/dialect pairs has yet been realized. Current research on multi-language speech synthesis mainly adopts phoneme mapping or state mapping, and both methods need a large amount of bilingual speech data. For Tibetan, which lacks speech resources, the absence of a large-scale bilingual phonetic corpus makes it difficult to apply these methods to Mandarin-Tibetan multilingual speech synthesis.
Disclosure of Invention
The invention provides a Chinese-Tibetan bilingual speech synthesis method and device, aiming to address the problems identified in the background art: the lack of research on dialects, minority languages and languages with scarce speech resources such as Tibetan, and the absence of a Mandarin-Tibetan multilingual speech synthesis system.
In order to solve the technical problems, the invention adopts the following technical scheme: the Chinese-Tibetan bilingual speech synthesis method comprises the following steps:
A. taking the International Phonetic Alphabet as a reference, obtaining the phonetic transcription of the input Tibetan pinyin letters and comparing it with the phonetic transcription of Chinese pinyin; identical parts are labeled directly with SAMPA-SC, differing parts are labeled with unused keyboard symbols, and SAMPA-T automatic labeling of the Tibetan text corpus is completed with a SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker-adaptive training, utilizing speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker self-adaptive model by utilizing the corpus of a speaker with a small amount of Tibetan language or Chinese voice to be synthesized through speaker self-adaptive transformation, and correcting and updating the self-adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
Further, the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
first reading in a Tibetan sentence text, then segmenting sentences and syllables at the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain Tibetan syllable sequences; for each syllable, separating the initial and the final by locating and decomposing the base-letter block, the decomposition being driven by the base-letter-block split list; and finally obtaining the SAMPA-T string of the syllable by looking it up in the initial SAMPA-T list and the final SAMPA-T list.
Further, the designing of the general phonetic transcription system and question set for Chinese and Tibetan in step B comprises the following steps:
firstly, the Tibetan initials and finals which are consistent with the pronunciation of the Mandarin are marked by the Pinyin of Chinese, and the Tibetan initials and finals which are inconsistent with the pronunciation of the Mandarin are marked by the Pinyin of Tibetan;
then, selecting all initial and final consonants and mutes and pauses of the Mandarin and the Tibetan as context-related MSD-HSMM synthesis primitives to design a context labeling format for labeling the context-related characteristics of an initial and final layer, a syllable layer, a word layer, a prosodic word layer, a phrase layer and a sentence layer of each synthesis primitive;
finally, a question set common to the Chinese-Tibetan bilingual pair is designed on the basis of the context-dependent question set of Mandarin; the question set is extended with questions about the synthesis primitives specific to Tibetan so as to reflect its particular pronunciation, and it comprises more than 3000 context-dependent questions covering all the features of the context-dependent labels.
Further, the obtaining of the mixed language average model through speaker adaptive training in the step C includes the following steps:
a. carrying out speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus data, and extracting acoustic parameters:
(1) extracting Mel-cepstral coefficients, logarithmic fundamental frequency and aperiodicity indices,
(2) calculating their first-order and second-order differences;
b. carrying out HMM model training in combination with the context attribute set, and training statistical models of the acoustic parameters:
(1) training HMM models of the spectral and fundamental frequency parameters,
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) of the state duration parameter;
c. using a small amount of single speaker Chinese speech library and a single speaker Tibetan speech library to perform speaker self-adaptive training, thereby obtaining an average sound model:
(1) using the constrained maximum likelihood linear regression (CMLLR) algorithm, expressing the difference between each training speaker's speech data and the average voice by a linear regression function,
(2) the differences between training speakers are normalized using a set of linear regression equations for the state output distribution and the state duration distribution,
(3) training to obtain a mixed language average sound model of the Chinese-Tibetan bilingual so as to obtain a context-dependent MSD-HSMM model;
d. the speaker self-adaptive transformation is carried out by utilizing the single speaker self-adaptive data of Chinese and Tibetan:
(1) adopting the CMLLR algorithm to calculate the mean vector and covariance matrix of the state output probability distribution and state duration probability distribution of the speaker,
(2) transforming the mean vector and covariance matrix of the mean tone model into a target speaker model of Tibetan or Chinese to be synthesized using a set of transformation matrices of state output distribution and state duration distribution,
(3) carrying out maximum likelihood estimation on the frequency spectrum, the fundamental frequency and the time length parameters after normalization and conversion;
e. modifying and updating the adaptive model:
(1) calculating MAP estimation parameters of average tone model state output and time length distribution by adopting a Maximum A Posteriori (MAP) algorithm,
(2) calculating the average vector of the state output and the state duration after the self-adaptive transformation,
(3) calculating a weighted average MAP estimation value of the adaptive mean vector;
f. inputting a text to be synthesized, and performing text analysis on the text to obtain an HMM model of a sentence;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through the parameter synthesizer, wherein the formula is as follows:

$$\mu_i^{(s)} = W^{(s)}\bar{\xi}_i = A\bar{o}_i + b, \qquad m_i^{(s)} = X^{(s)}\bar{\psi}_i = \alpha\bar{d}_i + \beta$$

wherein $\mu_i^{(s)}$ is the state output mean vector of training speaker $s$ and $m_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A, b]$ and $X^{(s)} = [\alpha, \beta]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\bar{o}_i$ and $\bar{d}_i$ are the average observation vector and the average duration vector, with $\bar{\xi}_i = [\bar{o}_i^T, 1]^T$ and $\bar{\psi}_i = [\bar{d}_i, 1]^T$.
Further, the step D of obtaining the speaker adaptive model by using the corpus of the speaker with a small amount of Tibetan language or Chinese speech to be synthesized through speaker adaptive transformation, and modifying and updating the adaptive model includes the following steps:
firstly, after adaptive training of a speaker, calculating to obtain a mean vector and a covariance matrix of state output probability distribution and duration probability distribution of speaker conversion by using a CMLLR adaptive algorithm based on HSMM, wherein a transformation equation of a feature vector o and a state duration d under a state i is as follows:
$$b_i(o) = \mathcal{N}(o;\, A\mu_i - b,\, A\Sigma_i A^T) = |A^{-1}|\,\mathcal{N}(W\xi;\, \mu_i, \Sigma_i)$$

$$p_i(d) = \mathcal{N}(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\,\mathcal{N}(X\psi;\, m_i, \sigma_i^2)$$

wherein $\xi = [o^T, 1]^T$ and $\psi = [d, 1]^T$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix and $\sigma_i^2$ the variance; $W = [A^{-1},\, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1},\, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency and duration parameters of the speech data can be normalized and transformed; for adaptation data $O$ of length $T$, maximum likelihood estimation is carried out on the transforms $\Lambda = (W, X)$:

$$\tilde{\Lambda} = (\tilde{W}, \tilde{X}) = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda)$$

wherein $\lambda$ is the parameter set of the HSMM;
finally, the maximum a posteriori (MAP) algorithm is used to modify and update the adaptive model of the speech; for a given HSMM $\lambda$ with forward probability $\alpha_t(i)$ and backward probability $\beta_t(i)$, the probability $\kappa_t^d(i)$ of generating the consecutive observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\kappa_t^d(i) = \frac{1}{P(O \mid \lambda)} \sum_{\substack{j=1 \\ j \neq i}}^{N} \alpha_{t-d}(j)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$

the MAP estimates are then the weighted averages

$$\hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} d\,\kappa_t^d(i)}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} d\,\kappa_t^d(i)}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i)}$$

wherein $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after the linear regression transformation, $\omega$ and $\tau$ are the MAP estimation parameters of the state output and duration distributions respectively, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
Further, the step E of inputting the text to be synthesized, generating the speech parameters and synthesizing the Tibetan or Chinese speech comprises the following steps:
firstly, converting a given text into a pronunciation labeling sequence containing context description information by using a text analysis tool, predicting a context-dependent HMM (hidden Markov model) model of each pronunciation by using a decision tree obtained in a training process, and connecting the HMM models into an HMM model of a sentence;
secondly, generating a parameter sequence of the frequency spectrum, the duration and the fundamental frequency from the sentence HMM by using a parameter generation algorithm;
finally, a Mel log-spectrum approximation (MLSA) filter is used as a parameter synthesizer to synthesize the voice.
Further, the Chinese-Tibetan bilingual speech synthesis apparatus comprises: an HMM model training unit for establishing an HMM model of the speech data; a speaker adaptation unit for normalizing and transforming the feature parameters of the training speakers to obtain an adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
Further, the HMM model training unit includes: the voice analysis subunit extracts acoustic parameters of voice data in a voice library, and mainly extracts fundamental frequency, frequency spectrum and duration parameters; and the target HMM model determining subunit is used for training a statistical model of the acoustic model by combining context labeling information of the sound library, determining fundamental frequency, frequency spectrum and duration parameters according to a context attribute set, and the voice analysis subunit is connected with the target HMM model determining subunit.
Further, the speaker self-adaptive unit comprises a speaker training subunit, an average tone model determining subunit, a speaker self-adaptive transformation subunit and a self-adaptive model determining subunit which are connected in sequence, the target HMM model determining subunit is connected with the speaker training subunit,
the speaker training subunit is used for normalizing the difference between the state output distribution and the state duration distribution between the speaker and the average voice model in the training;
the average sound model determining subunit determines a Chinese-Tibetan bilingual mixed speech average sound model by adopting a maximum likelihood linear regression algorithm;
the speaker self-adaptive transformation subunit calculates the mean vector and the covariance matrix of the state output probability distribution and the duration probability distribution of the speaker by using self-adaptive data and converts the mean vector and the covariance matrix into a target speaker model;
the adaptive model determining subunit establishes an adaptive model of the MSD-HSMM of the target speaker.
Further, the voice synthesis unit includes an adaptive model modification subunit and a synthesis subunit connected in sequence, the adaptive model determination subunit is connected to the adaptive model modification subunit,
the self-adaptive model correcting subunit corrects and updates the self-adaptive model of the voice by utilizing an MAP algorithm, reduces the model deviation and improves the synthesis quality;
and the synthesis subunit predicts the voice parameters of the input text by using the corrected self-adaptive model, extracts the parameters and finally synthesizes the Chinese or Tibetan voice through the voice synthesizer.
The invention has the following advantages and positive effects: the Chinese-Tibetan bilingual speech synthesis method and device exploit the similarity in pronunciation between Chinese and Tibetan and use HMM-based adaptive training and adaptive transformation algorithms to synthesize Chinese and Tibetan speech of good naturalness and fluency with the same system and device. Compared with a traditional HMM-based speech synthesis system, a speaker-adaptive training step is added in the training phase to obtain an average voice model of Chinese-Tibetan mixed speech; this reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, the speaker-adaptive transformation algorithm can synthesize Tibetan or Chinese speech with good naturalness and fluency from only a small amount of Tibetan or Chinese corpus data. This research is significant for promoting communication with ethnic minorities and the development of minority-language speech technology.
Drawings
FIG. 1 is a flow chart of the Chinese-Tibetan bilingual speech synthesis method;
FIG. 2 is a flow diagram of the Tibetan text to SAMPA-T conversion;
FIG. 3 is a block diagram of the speaker-adaptive Chinese-Tibetan bilingual speech synthesis process;
FIG. 4 is a schematic diagram of a structure of a bilingual speech synthesis apparatus;
FIG. 5 is a flow chart of model training;
fig. 6 is a flow chart of speech synthesis.
Detailed Description
The invention provides a Chinese-Tibetan bilingual speech synthesis method. It provides a letter-to-sound conversion algorithm oriented to the Tibetan machine-readable phonetic alphabet SAMPA-T and realizes automatic SAMPA-T labeling of the Tibetan text corpus; it designs a labeling system, labeling format and question set common to Mandarin and Tibetan according to the similarity between the two languages; and, using the corpora of multiple Mandarin and Tibetan speakers, it finally synthesizes Chinese or Tibetan speech through HMM-based speaker-adaptive training and a speaker-adaptive transformation algorithm. The flow chart of the Chinese-Tibetan bilingual speech synthesis method is shown in FIG. 1, and the specific steps are as follows:
(1) A SAMPA-T labeling scheme for the Tibetan Lhasa dialect is designed, and automatic SAMPA-T labeling of the Tibetan text corpus is completed with the SAMPA-T-oriented letter-to-sound conversion algorithm.
The machine-readable phonetic alphabet SAMPA (Speech Assessment Methods Phonetic Alphabet) is a computer-readable phonetic notation that can represent all symbols of the International Phonetic Alphabet with ASCII characters. At present, SAMPA is widely applied to the main languages of Europe and to East Asian languages such as Japanese, and SAMPA schemes have also been proposed for Mandarin Chinese, Cantonese and Taiwanese Mandarin.
Because Tibetan and Chinese both belong to the Sino-Tibetan language family, the invention designs a computer-readable phonetic transcription system for Tibetan, SAMPA-T (Tibetan), on the basis of the machine-readable phonetic notation scheme for Mandarin, presents the design taking the Lhasa dialect as an example, and realizes the transcription of Tibetan into SAMPA-T.
By comparing the international phonetic symbols of Chinese and Tibetan, it is found that some of the international phonetic symbols of Chinese and Tibetan are identical, so that the international phonetic symbols are obtained for the input Tibetan phonetic alphabet by taking the international phonetic symbols as reference, and then compared with the international phonetic symbols of Chinese phonetic alphabet, the same part is directly marked by SAMPA-SC, and the different parts are marked by unused keyboard symbols according to the simplification principle.
Tibetan is an alphabetic script whose words are spelled from letters, the basic unit being the syllable. According to the structural position of the letters within a syllable, traditional Tibetan grammar distinguishes the prefix letter, the base letter, the superscript letter, the subscript letter, the suffix letter and the second suffix letter, the base letter being the core of the whole syllable. A Tibetan final consists of a vowel plus any suffix letters.
The transcription of Tibetan text into SAMPA-T mainly involves Tibetan sentence segmentation, single-syllable segmentation, locating the base-letter block, separating and transcribing the initials and finals, and combining the SAMPA-T strings. Locating the base-letter block, i.e., recognizing the base letter, the vowel and so on, is realized mainly by dictionary-oriented statistics and search; the letter-to-sound conversion itself is realized by looking up the SAMPA-T transcription support libraries of initials and finals. First the Tibetan text is read in, then sentences and syllables are segmented at the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain a list of Tibetan syllables. For each syllable, the initial and the final are separated by locating and decomposing the base-letter block, and the SAMPA-T tables of initials and finals are then searched to obtain the SAMPA-T of the syllable. The flow chart of the conversion of Tibetan text to SAMPA-T is shown in FIG. 2, and a simplified sketch of the flow is given below.
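A minimal, illustrative Python sketch of this conversion flow follows; the lookup tables and the toy initial/final split are hypothetical stand-ins for the patent's SAMPA-T transcription support libraries and base-letter-block split lists.

```python
# Illustrative sketch of the Tibetan-text -> SAMPA-T flow of FIG. 2.
# INITIAL_SAMPA_T and FINAL_SAMPA_T are hypothetical toy tables, not the
# patent's actual transcription support libraries.

SHAD = "\u0F0D"   # Tibetan shad, the sentence/clause delimiter
TSHEG = "\u0F0B"  # Tibetan tsheg, the syllable delimiter

INITIAL_SAMPA_T = {"\u0F40": "k", "\u0F41": "kh"}       # ka, kha (toy values)
FINAL_SAMPA_T = {"": "a", "\u0F72": "i", "\u0F74": "u"}  # bare, -i, -u vowels

def split_sentences(text):
    """Segment running Tibetan text into sentences at the shad."""
    return [s for s in text.split(SHAD) if s.strip()]

def split_syllables(sentence):
    """Segment a sentence into syllables at the tsheg."""
    return [s for s in sentence.split(TSHEG) if s]

def syllable_to_sampa_t(syllable):
    """Split initial from final and look both up in the SAMPA-T tables."""
    # Toy split: real splitting locates and decomposes the base-letter
    # block with the split list described in the patent.
    initial, final = syllable[:1], syllable[1:]
    return INITIAL_SAMPA_T.get(initial, "?") + FINAL_SAMPA_T.get(final, "?")

def text_to_sampa_t(text):
    return [[syllable_to_sampa_t(syl) for syl in split_syllables(sent)]
            for sent in split_sentences(text)]

print(text_to_sampa_t("\u0F40\u0F72\u0F0B\u0F41\u0F74"))  # [['ki', 'khu']]
```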
(2) According to the similarity between Tibetan language and Mandarin, a universal phonetic system and a question set for Chinese and Tibetan language are designed on the basis of a Mandarin phonetic system.
Tibetan and Chinese belong to the same Sino-Tibetan language family and show many commonalities as well as differences in pronunciation. Both Mandarin and Lhasa Tibetan are syllabic languages in which every syllable is composed of an initial and a final. Mandarin has 22 initials and 39 finals, while the Tibetan Lhasa dialect has 36 initials and 45 finals; the two languages share 20 initials and 13 finals. First, the Tibetan initials and finals whose pronunciation is consistent with Mandarin are labeled with Chinese pinyin, and those whose pronunciation differs from Mandarin are labeled with Tibetan pinyin.
Then, all initials and finals of Mandarin and Tibetan, silence and pause are selected as the synthesis primitives of the MSD-HSMM related to the context to design a context labeling format for labeling the context-related features of the initial-final layer, the syllable layer, the word layer, the prosodic word layer, the phrase layer and the sentence layer of each synthesis primitive.
Finally, a question set common to the Chinese-Tibetan bilingual pair is designed on the basis of the context-dependent question set of Mandarin. The question set focuses on adding questions about the synthesis primitives specific to Tibetan so as to reflect its particular pronunciation; it contains more than 3000 context-dependent questions, covering all the features of the context-dependent labels.
The system labels the Tibetan text corpus hierarchically; the labeled content comprises the syllable layer, boundary information and the SAMPA-T transcription result. The Praat phonetics software is used for the labeling, and the system can add further labeling information as needed. After labeling, a script writes the labeling information into a TextGrid file containing four annotated tiers, mainly pronunciation and syllable boundary information. The question set contains classification information about basic features, such as the initial, the final type, or whether the syllable lies in a prosodic phrase; such a question usually groups basic facts about some piece of context information. By recombining the questions, contextual classification information more complex than the basic features can be obtained. In the HTS system, the designed questions are listed in a .hed file, one question per line; each is a true/false question beginning with the QS command, as the sketch below illustrates.
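As an illustration of that format, the sketch below emits a few question-set entries into a .hed file; the question names and label patterns are invented examples, not entries from the patent's actual 3000-question set.

```python
# A minimal sketch of writing HTS-style question-set entries to a .hed
# file. HTS matches the patterns against the full-context label of each
# synthesis primitive; names and patterns here are hypothetical.

questions = [
    # (question name, context-label patterns that answer "yes")
    ("C-Tibetan_Initial_lh", ["*-lh+*"]),        # a Tibetan-specific initial
    ("R-Silence", ["*+sil=*", "*+pau=*"]),       # right neighbor is a pause
    ("C-Syl_in_Prosodic_Phrase", ["*|1|*"]),     # toy positional feature
]

with open("questions_utt.hed", "w", encoding="utf-8") as hed:
    for name, patterns in questions:
        # Each line is a true/false question beginning with the QS command.
        hed.write('QS "%s" {%s}\n' % (name, ",".join(patterns)))
```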
(3) A mixed-language average voice model is trained through HMM-based speaker-adaptive training, using the speech data of multiple Chinese and Tibetan speakers.
Compared with the traditional HMM-based speech synthesis method, a speaker-adaptive training step is added in the training phase to obtain an average voice model of Chinese-Tibetan mixed speech, which reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech; on the basis of the average voice model, Tibetan or Chinese speech of good naturalness and fluency can be synthesized from a small amount of Tibetan or Chinese corpus data through the speaker-adaptive transformation algorithm. The block diagram of the speaker-adaptive Chinese-Tibetan bilingual speech synthesis process is shown in FIG. 3:
Step 1, carry out speech analysis on the multi-speaker Chinese corpus data and the single-speaker Tibetan corpus data, and extract their acoustic parameters:
(1) extract Mel-cepstral coefficients, logarithmic fundamental frequency and aperiodicity indices;
(2) calculate their first-order and second-order differences (see the sketch after this list).
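A minimal sketch of the delta computation in (2), assuming the common three-point regression window w = (-0.5, 0, 0.5); the actual window coefficients used by the system are not given in the patent.

```python
# First- and second-order dynamic (delta) features from a static track.
import numpy as np

def delta(track: np.ndarray) -> np.ndarray:
    """First-order delta of a (frames, dims) parameter sequence."""
    padded = np.pad(track, ((1, 1), (0, 0)), mode="edge")  # repeat edges
    return 0.5 * (padded[2:] - padded[:-2])                # 3-point window

# e.g. a Mel-cepstrum track: 100 frames x 25 coefficients (random stand-in)
mcep = np.random.randn(100, 25)
d1 = delta(mcep)                           # first-order difference
d2 = delta(d1)                             # second-order difference
observation = np.hstack([mcep, d1, d2])    # static + delta + delta-delta
print(observation.shape)                   # (100, 75)
```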
Step 2, carry out HMM model training in combination with the context attribute set, and train statistical models of the acoustic parameters:
(1) train HMM models of the spectrum and fundamental frequency parameters;
(2) train a multi-space distribution hidden semi-Markov model (MSD-HSMM) of the state duration parameter.
Step 3, carry out speaker-adaptive training with the small single-speaker Chinese speech library and the single-speaker Tibetan speech library to obtain the average voice model:
(1) using the constrained maximum likelihood linear regression (CMLLR) algorithm, express the difference between each training speaker's speech data and the average voice by a linear regression function;
(2) normalize the differences between the training speakers by a set of linear regression equations of the state output distribution and the state duration distribution;
(3) train the Chinese-Tibetan mixed-language average voice model, obtaining a context-dependent MSD-HSMM.
Step 4, carry out speaker-adaptive transformation with the single-speaker adaptation data of Chinese and Tibetan (a numeric sketch follows this list):
(1) calculate the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm;
(2) transform the mean vectors and covariance matrices of the average voice model into the target-speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and the state duration distribution;
(3) carry out maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency and duration parameters.
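The sketch below illustrates step 4's transform application with numpy, under assumed placeholder values for W = [A, b] and X = [alpha, beta]; it shows only how the average voice model's means and covariances map to a target speaker, not how the transforms are estimated.

```python
# Applying estimated CMLLR-style transforms to average voice model means.
# All matrix values are random placeholders, not trained transforms.
import numpy as np

dim = 75                           # static + delta + delta-delta
A = np.eye(dim) + 0.01 * np.random.randn(dim, dim)
b = 0.1 * np.random.randn(dim)
alpha, beta = 1.1, -0.3            # scalar duration transform

mu_avg = np.random.randn(dim)      # average voice state output mean
m_avg = 8.0                        # average voice state duration mean (frames)

mu_target = A @ mu_avg + b         # transformed state output mean
m_target = alpha * m_avg + beta    # transformed state duration mean

# Covariances transform correspondingly: Sigma -> A Sigma A^T and
# sigma^2 -> alpha^2 sigma^2 (diagonal case, as in the patent's equations).
Sigma_avg = np.diag(np.abs(np.random.randn(dim)))
Sigma_target = A @ Sigma_avg @ A.T
print(mu_target.shape, round(m_target, 2))
```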
Step 5, modify and update the adaptive model:
(1) calculate the MAP estimation parameters of the average voice model's state output and duration distributions with the maximum a posteriori (MAP) algorithm;
(2) calculate the mean vectors of the state output and state duration after the adaptive transformation;
(3) calculate the weighted-average MAP estimate of the adaptive mean vector.
Step 6, input the text to be synthesized and perform text analysis on it to obtain the HMM model of the sentence.
Step 7, perform parameter prediction on the sentence HMM, generate the speech parameters, and obtain the synthesized speech through the parameter synthesizer.
FIG. 3 gives the flow of the Chinese-Tibetan bilingual speech synthesis process. Using the mixed corpus of Mandarin and Tibetan, the Chinese-Tibetan mixed-language average voice model is obtained by constrained maximum likelihood linear regression (CMLLR) training, yielding a context-dependent multi-space distribution hidden semi-Markov model (MSD-HSMM). In speaker-adaptive training, the difference between each training speaker's speech data and the average voice is expressed by a linear regression function of the mean vectors of the state output distribution and the state duration distribution, and the differences between the training speakers are normalized by a set of linear regression equations of the state output distribution and the state duration distribution:

$$\mu_i^{(s)} = W^{(s)}\bar{\xi}_i = A\bar{o}_i + b, \qquad m_i^{(s)} = X^{(s)}\bar{\psi}_i = \alpha\bar{d}_i + \beta$$

wherein $\mu_i^{(s)}$ is the state output mean vector of training speaker $s$ and $m_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A, b]$ and $X^{(s)} = [\alpha, \beta]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\bar{o}_i$ and $\bar{d}_i$ are the average observation vector and the average duration vector, with $\bar{\xi}_i = [\bar{o}_i^T, 1]^T$ and $\bar{\psi}_i = [\bar{d}_i, 1]^T$.
(4) The speaker self-adaptive model is obtained by utilizing the corpus of a small number of speakers of Tibetan language or Chinese voice to be synthesized through speaker self-adaptive transformation, and the self-adaptive model is corrected and updated.
After the speaker self-adaptive training, the CMLLR self-adaptive algorithm based on the HSMM is utilized to calculate and obtain the mean vector and the covariance matrix of the state output probability distribution and the duration probability distribution of the speaker conversion. The transformation equation of the feature vector o and the state duration d under the state i is as follows:
$$b_i(o) = \mathcal{N}(o;\, A\mu_i - b,\, A\Sigma_i A^T) = |A^{-1}|\,\mathcal{N}(W\xi;\, \mu_i, \Sigma_i)$$

$$p_i(d) = \mathcal{N}(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\,\mathcal{N}(X\psi;\, m_i, \sigma_i^2)$$

wherein $\xi = [o^T, 1]^T$ and $\psi = [d, 1]^T$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix and $\sigma_i^2$ the variance. $W = [A^{-1},\, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1},\, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution. A numerical check of the first identity is sketched below.
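As a sanity check on the state output transform, the snippet below numerically verifies the change-of-variables identity N(o; Aμ − b, AΣAᵀ) = |A⁻¹| N(A⁻¹(o + b); μ, Σ), assuming the standard CMLLR reading Wξ = A⁻¹(o + b); all values are arbitrary.

```python
# Numerical check of the Gaussian change-of-variables identity above.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
dim = 3
A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
b = rng.standard_normal(dim)
mu = rng.standard_normal(dim)
Sigma = np.diag(rng.uniform(0.5, 1.5, dim))   # diagonal covariance
o = rng.standard_normal(dim)

lhs = multivariate_normal(A @ mu - b, A @ Sigma @ A.T).pdf(o)
xi = np.linalg.solve(A, o + b)                # A^{-1} (o + b)
rhs = (1.0 / abs(np.linalg.det(A))) * multivariate_normal(mu, Sigma).pdf(xi)
print(np.isclose(lhs, rhs))                   # True
```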
Through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency and duration parameters of the speech data can be normalized and transformed. For adaptation data $O$ of length $T$, maximum likelihood estimation can be performed on the transforms $\Lambda = (W, X)$:

$$\tilde{\Lambda} = (\tilde{W}, \tilde{X}) = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda)$$

where $\lambda$ is the parameter set of the HSMM.
Finally, the adaptive model of the speech is modified and updated using the maximum a posteriori (MAP) algorithm. For a given HSMM $\lambda$ with forward probability $\alpha_t(i)$ and backward probability $\beta_t(i)$, the probability $\kappa_t^d(i)$ of generating the consecutive observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\kappa_t^d(i) = \frac{1}{P(O \mid \lambda)} \sum_{\substack{j=1 \\ j \neq i}}^{N} \alpha_{t-d}(j)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$

The MAP estimates are then the weighted averages

$$\hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} d\,\kappa_t^d(i)}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} d\,\kappa_t^d(i)}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i)}$$

where $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after the linear regression transformation, $\omega$ and $\tau$ are the MAP estimation parameters of the state output and duration distributions respectively, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
The training phase mainly comprises preprocessing and HMM training. In the preprocessing stage, the speech data in the sound bank is firstly analyzed, and corresponding speech parameters (fundamental frequency and spectral parameters) are extracted. According to the extracted speech parameters, an observation vector of the HMM can be divided into two parts of a spectrum and a fundamental frequency, wherein the spectrum parameter part is modeled by using a continuous probability distribution HMM, the fundamental frequency part is modeled by using a multi-space probability distribution HMM (MSD-HMM), and meanwhile, the system uses a Gaussian distribution or a gamma distribution to establish a state duration model to describe the time structure of the speech. In addition, HMM synthesis systems also describe contexts using linguistic and prosodic features. Before model training, a context attribute set and a problem set used for decision tree clustering are designed, namely, some context attributes having certain influence on acoustic parameters (spectrum, fundamental frequency and duration) are selected according to prior knowledge, and a corresponding problem set is designed to be used for context-dependent model clustering.
In the training process of the model, an HMM model of the acoustic parameter vector sequence is trained by using an EM algorithm according to an ML criterion. And finally, clustering the spectrum parameter model, the fundamental frequency parameter model and the duration model by using a context decision tree respectively to obtain a prediction model for synthesis. The whole model training process is shown in fig. 5.
(5) Inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
First, a given text is converted by the text analysis tool into a pronunciation label sequence containing the context description information; the context-dependent HMM model of each pronunciation is predicted with the decision trees obtained during training, and the models are concatenated into the HMM model of the sentence. Then, a parameter generation algorithm generates the parameter sequences of the spectrum, duration and fundamental frequency from the sentence HMM. Finally, a Mel log-spectrum approximation (MLSA) filter is used as the parameter synthesizer to synthesize the speech. The scheme of the whole synthesis stage is shown in FIG. 6, and a highly simplified sketch follows.
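The sketch below is a highly simplified rendering of this synthesis stage: invented per-phone state sequences stand in for the decision-tree prediction, and a mean-hold trajectory stands in for full parameter generation (the dynamic-feature constraints and the MLSA filter are omitted, and all model values are placeholders).

```python
# Concatenate per-phone state models into a sentence model, then generate
# a parameter trajectory by holding each state's mean for its duration.
import numpy as np

class State:
    def __init__(self, mean: np.ndarray, dur: int):
        self.mean, self.dur = mean, dur   # output mean, duration in frames

def sentence_model(phone_labels):
    """Stand-in for decision-tree prediction of per-phone state sequences."""
    rng = np.random.default_rng(0)
    return [State(rng.standard_normal(25), int(rng.integers(3, 8)))
            for _ in phone_labels for _ in range(5)]   # 5 states per phone

def generate_parameters(states):
    """Mean-hold trajectory; real systems smooth with dynamic features."""
    return np.vstack([np.tile(s.mean, (s.dur, 1)) for s in states])

labels = ["k", "i", "kh", "u"]           # toy SAMPA-T pronunciation sequence
traj = generate_parameters(sentence_model(labels))
print(traj.shape)                        # (total frames, 25)
# traj would then drive a parameter synthesizer (e.g. an MLSA filter).
```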
Corresponding to the method, the invention also provides a Chinese-Tibetan bilingual speech synthesis device, which is used for carrying out speech synthesis on the input Chinese or Tibetan speech to be synthesized by utilizing the pre-established Chinese-Tibetan bilingual library, and the functions of the device can be realized by software, hardware or the combination of the software and the hardware. The internal structure of the device of the invention is schematically shown in fig. 4.
The internal structure of the device comprises an HMM model training unit, a speaker self-adaption unit and a speech synthesis unit.
1> HMM model training unit for building an HMM model of speech data:
(1) the voice analysis subunit extracts acoustic parameters of voice data in a voice library, and mainly extracts fundamental frequency, frequency spectrum and duration parameters;
(2) and the target HMM model determining subunit is used for training a statistical model of the acoustic model by combining context labeling information of the sound library, and determining fundamental frequency, frequency spectrum and duration parameters according to the context attribute set.
And 2, a speaker self-adaptive unit, which is used for normalizing and converting the characteristic parameters of the speaker in training to obtain a self-adaptive model:
(1) a speaker training subunit for normalizing the difference between the state output distribution and the state duration distribution between the speaker and the average tone model in the training;
(2) the average sound model determining subunit determines a Chinese-Tibetan bilingual mixed speech average sound model by adopting a maximum likelihood linear regression algorithm;
(3) the speaker self-adaptive transformation subunit calculates the mean vector and the covariance matrix of the state output probability distribution and the duration probability distribution of the speaker by using self-adaptive data and converts the mean vector and the covariance matrix into a target speaker model;
(4) and the self-adaptive model determining subunit establishes a self-adaptive model of the MSD-HSMM of the target speaker.
And 3, a speech synthesis unit for synthesizing Tibetan or Chinese speech to be synthesized:
(1) and the self-adaptive model correction subunit corrects and updates the self-adaptive model of the voice by utilizing an MAP algorithm, reduces the model deviation and improves the synthesis quality:
(2) and the synthesis subunit predicts the voice parameters of the input text by using the corrected self-adaptive model, extracts the parameters and finally synthesizes the Chinese or Tibetan voice through the voice synthesizer.
The above method processes may be implemented by hardware related to program instructions, and the program may be stored in a readable storage medium, and when executed, the program performs the corresponding steps in the above method.
In order to compare the method of the invention with other methods, the quality of the synthesized Tibetan and Chinese speech is evaluated: 3 different MSD-HSMM models are trained, and the advantages of the disclosed method are shown by comparing the quality of the speech synthesized under the three models.
The training data consisted of Mandarin speech from 7 female speakers (169 sentences per speaker) and 800 Tibetan sentences recorded by one female speaker. The Tibetan sentences were selected from recent Tibetan newspapers. All recordings were saved in Microsoft WAV format (mono, 16-bit quantization, 16 kHz sampling).
In the experiment, 100 of the 800 Tibetan sentences were randomly selected as test sentences. From the remaining 700 Tibetan sentences, 10, 100 and 700 sentences were randomly picked to establish 3 Tibetan training sets. These training sets, together with the training speech of the 7 female Mandarin speakers, were used to train the Chinese-Tibetan bilingual mixed-language average voice models. The 3 Tibetan training sets and the corpus of the first female Mandarin speaker were also used to obtain the speaker-dependent acoustic models of Tibetan and Mandarin.
1) SD model: the speaker-dependent models of the Tibetan Lhasa dialect were trained with the 3 Tibetan training sets (10/100/700 Tibetan sentences) respectively.
2) SI model: the speaker-independent model was trained using only the training sentences of the 7 female Mandarin speakers.
3) SAT model: first, 3 Chinese-Tibetan bilingual mixed-language average voice models were trained with the 3 Tibetan training sets together with all training sentences of the 7 Mandarin speakers; then the 3 Tibetan training sets and the training sentences of the first Mandarin speaker were used to obtain the speaker-adapted models.
During the evaluation, the Tibetan Lhasa dialect test sentences synthesized by the SD and SAT models were played in random order to 8 evaluators of the Tibetan Lhasa dialect, 120 test speech files in total (20 Tibetan test sentences × 3 Tibetan training sets × 2 models). The evaluators were asked to listen to the 120 sentences carefully and then score the speech quality of each sentence on a 5-point scale. After the MOS evaluation, the evaluators were also asked to describe the overall intelligibility of the Tibetan speech synthesized from the different Lhasa-dialect training sets. The Chinese MOS evaluation used the same method: 54 synthesized Mandarin sentences (18 Mandarin test sentences × 3 Tibetan training sets) were played in random order to Mandarin evaluators, who scored the speech quality of each sentence on the 5-point scale.
The MOS results of the synthesized speech under the different Tibetan training sets are reported as the average MOS score with its 95% confidence interval (a sketch of this computation follows). For Tibetan synthesized speech, the SAT model is superior to the SD model under every Tibetan training set. With 10 Tibetan training sentences, the MOS score of the speech synthesized by the SD model is only 1.99, while that of the SAT model is relatively high at 2.4; during the evaluation, the evaluators found the Tibetan synthesized by the SD model hard to understand, and that synthesized by the SAT model easy to understand. With 100 Tibetan training sentences, the MOS scores and intelligibility of both models improve, but the SAT model is still clearly superior to the SD model. With 700 training sentences, the MOS scores of the 2 models are substantially the same, and the evaluators found the synthesized speech easy to understand. Therefore, in the small-corpus case, the quality of the speech synthesized by the SAT model is better than that of the SD model, and as the Tibetan corpus grows, the quality of the Tibetan speech synthesized by the two models converges. The disclosed method is thus well suited to synthesizing high-quality speech when the corpus is scarce.
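A minimal sketch of the mean-and-confidence-interval computation, with made-up listener scores rather than the patent's evaluation data:

```python
# Mean opinion score (MOS) with a 95% confidence interval under the
# normal approximation; the scores are illustrative placeholders.
import math

scores = [2.0, 3.0, 2.5, 2.0, 3.5, 2.5, 3.0, 2.5]   # one rating per listener

n = len(scores)
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
half_width = 1.96 * math.sqrt(var / n)                # 95% CI half-width

print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```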
For Mandarin synthesized speech, under every Tibetan training set the Tibetan sentences mixed into the training corpus have almost no influence on the Mandarin synthesis result; the MOS scores of the synthesized Mandarin are all around 4.0, a good synthesis result.
The similarity of the synthesized speech is evaluated with a DMOS method, in which every test sentence and its original recording take part. There are 140 synthesized Tibetan speech files in total (20 Tibetan sentences × 3 Tibetan training sets × 2 models + 20 Tibetan sentences synthesized by the SI model). Each synthesized Tibetan sentence and its original recording form a group of speech files. The 140 groups of test files are played in random order to the evaluators of the Tibetan Lhasa dialect: first the original Tibetan recording, then the synthesized Tibetan speech. The evaluators are asked to compare the two speech files carefully and rate how similar the synthesized speech is to the original, on a 5-point scale where 5 means the synthesized speech is essentially the same as the original and 1 means it differs greatly.
The Chinese DMOS evaluation uses the same method: 54 groups of Mandarin speech (18 Mandarin test sentences × 3 Tibetan training sets) are played in random order to the Mandarin evaluators, and the similarity of each group of Mandarin sentences is scored on the 5-point scale.
The results are reported as the average DMOS score with its 95% confidence interval. For Tibetan synthesized speech, the DMOS score of the SI model is 2.41, better than the SD model trained with 10 Tibetan sentences and close to the SAT model trained with 10 Tibetan sentences. In the subjective evaluation, the Tibetan speech synthesized by the SI model resembled Tibetan spoken by a non-native speaker. This is because Mandarin and Tibetan not only share 33 synthesis primitives but also have the same syllable structure and prosodic structure; intelligible Tibetan-like speech can therefore be synthesized using only the Mandarin models. As more Tibetan training sentences are added, the DMOS score of the Tibetan speech synthesized by the SAT model surpasses that of the SD model; when the Tibetan training set grows to 700 sentences, the DMOS score of the SD model comes very close to that of the SAT model. This shows that with few Tibetan sentences, the Tibetan Lhasa dialect speech synthesized by the disclosed method is superior to that synthesized by the SD-model-based method.
The embodiments of the present invention have been described in detail, but the description is only for the preferred embodiments of the present invention and should not be construed as limiting the scope of the present invention. All equivalent changes and modifications made within the scope of the present invention should be covered by the present patent.

Claims (10)

1. A Chinese-Tibetan bilingual speech synthesis method, characterized in that it comprises the following steps:
A. taking the International Phonetic Alphabet as a reference, obtaining the phonetic transcription of the input Tibetan pinyin letters and comparing it with the phonetic transcription of Chinese pinyin; identical parts are labeled directly with SAMPA-SC, differing parts are labeled with unused keyboard symbols, and SAMPA-T automatic labeling of the Tibetan text corpus is completed with a SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker-adaptive training, utilizing speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker self-adaptive model by utilizing the corpus of a speaker with a small amount of Tibetan language or Chinese voice to be synthesized through speaker self-adaptive transformation, and correcting and updating the self-adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
2. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, wherein: the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
first reading in a Tibetan sentence text, then segmenting sentences and syllables at the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain Tibetan syllable sequences; for each syllable, separating the initial and the final by locating and decomposing the base-letter block, the decomposition being driven by the base-letter-block split list; and finally obtaining the SAMPA-T string of the syllable by looking it up in the initial SAMPA-T list and the final SAMPA-T list.
3. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, wherein: the designing of the general phonetic transcription system and question set for Chinese and Tibetan in step B comprises the following steps:
firstly, the Tibetan initials and finals which are consistent with the pronunciation of the Mandarin are marked by the Pinyin of Chinese, and the Tibetan initials and finals which are inconsistent with the pronunciation of the Mandarin are marked by the Pinyin of Tibetan;
then, selecting all initial and final consonants and mutes and pauses of the Mandarin and the Tibetan as context-related MSD-HSMM synthesis primitives to design a context labeling format for labeling the context-related characteristics of an initial and final layer, a syllable layer, a word layer, a prosodic word layer, a phrase layer and a sentence layer of each synthesis primitive;
finally, a question set common to the Chinese-Tibetan bilingual pair is designed on the basis of the context-dependent question set of Mandarin; the question set is extended with questions about the synthesis primitives specific to Tibetan so as to reflect its particular pronunciation, and it comprises more than 3000 context-dependent questions covering all the features of the context-dependent labels.
4. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, wherein: the obtaining of the mixed-language average voice model through speaker-adaptive training in step C comprises the following steps:
a. performing speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus, and extracting the acoustic parameters (a feature-extraction sketch follows this step):
(1) extracting Mel-cepstral coefficients, logarithmic fundamental frequency and aperiodicity indices,
(2) calculating their first-order and second-order differences;
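As a rough illustration of step a, the sketch below extracts the three parameter streams with the WORLD vocoder (pyworld) and SPTK bindings (pysptk) and appends dynamic features; the analysis order, all-pass constant, and the choice of these toolkits are assumptions, not values specified by the patent.

```python
# A sketch of the acoustic analysis in step a, under the assumptions above.
import numpy as np
import pyworld
import pysptk
import soundfile as sf

def analyze(wav_path, order=24, alpha=0.42):
    """Extract Mel-cepstra, log F0 and aperiodicity from a mono wav file."""
    x, fs = sf.read(wav_path)              # float64 mono signal assumed
    f0, t = pyworld.harvest(x, fs)         # fundamental frequency contour
    sp = pyworld.cheaptrick(x, f0, t, fs)  # smoothed spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)         # aperiodicity indices
    mcep = pysptk.sp2mc(sp, order, alpha)  # Mel-cepstral coefficients
    lf0 = np.where(f0 > 0.0, np.log(np.maximum(f0, 1e-10)), 0.0)
    return mcep, lf0[:, None], ap          # log F0 as a column vector

def with_deltas(feat):
    """Append first- and second-order differences (step a(2)); numerical
    gradients stand in for the usual regression-window delta coefficients."""
    d1 = np.gradient(feat, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([feat, d1, d2])
```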
b. performing HMM training in combination with the context attribute set, to train statistical models of the acoustic parameters:
(1) training HMMs of the spectral and fundamental-frequency parameters,
(2) training the state duration parameters of a multi-space probability distribution hidden semi-Markov model (MSD-HSMM);
c. performing speaker adaptive training with a small single-speaker Chinese corpus and a single-speaker Tibetan corpus, thereby obtaining the average voice model:
(1) using the constrained maximum likelihood linear regression (CMLLR) algorithm to express the difference between each training speaker's speech data and the average voice by linear regression functions,
(2) normalizing the differences between training speakers with a set of linear regression equations for the state output distributions and state duration distributions,
(3) training the Chinese-Tibetan mixed-language average voice model, thereby obtaining the context-dependent MSD-HSMM;
d. performing speaker adaptive transformation with the single-speaker Chinese and Tibetan adaptation data (a transform sketch follows this step):
(1) computing the mean vectors and covariance matrices of the target speaker's state output probability distributions and state duration probability distributions with the CMLLR algorithm,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target-speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices for the state output distributions and state duration distributions,
(3) performing maximum likelihood estimation of the normalized and transformed spectrum, fundamental-frequency and duration parameters;
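A minimal numpy sketch of step d(2), mapping one Gaussian of the average voice model to the target speaker with a CMLLR transform; the transform form follows the equations written out in claim 5 below, and the function names are placeholders.

```python
# Applying a CMLLR transform W = [A, b] to one Gaussian of the average
# voice model, per the transform equations given in claim 5.
import numpy as np

def apply_cmllr(mu, sigma, A, b):
    """Map a state output distribution N(mu, sigma) of the average voice
    model to the target-speaker distribution N(A*mu - b, A*sigma*A^T)."""
    return A @ mu - b, A @ sigma @ A.T

def apply_duration_cmllr(m, var, alpha, beta):
    """Scalar analogue for the state duration distribution."""
    return alpha * m - beta, alpha * var * alpha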
e. correcting and updating the adaptive model:
(1) computing the MAP estimation parameters of the average voice model's state output and duration distributions with the maximum a posteriori (MAP) algorithm,
(2) computing the mean vectors of the state outputs and state durations after the adaptive transformation,
(3) computing the weighted-average MAP estimates of the adaptive mean vectors;
f. inputting the text to be synthesized and performing text analysis on it to obtain the sentence HMM;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through a parameter synthesizer, where the training-speaker transforms are given by the following formulas:
$$\mu_i^{(s)} = W\xi_i = A\,o_i + b, \qquad m_i^{(s)} = X\phi_i = \alpha\,d_i + \beta, \qquad \xi_i = [o_i^{\mathsf T}, 1]^{\mathsf T},\ \phi_i = [d_i, 1]^{\mathsf T}$$

where $\mu_i^{(s)}$ is the state output mean vector of training speaker $s$, $m_i^{(s)}$ is its state duration mean vector, $W = [A, b]$ and $X = [\alpha, \beta]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model, and $o_i$ and $d_i$ are the average observation vector and the average duration vector.
5. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, wherein, in step D, obtaining the speaker-adaptive model through speaker adaptive transformation from a small amount of corpus of the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model, comprise the following steps:
first, after speaker adaptive training, computing the mean vectors and covariance matrices of the state output probability distributions and state duration probability distributions of the speaker transformation with the HSMM-based CMLLR adaptation algorithm, the transformation equations of a feature vector $o$ and a state duration $d$ in state $i$ being:
$$b_i(o) = N\!\left(o;\, A\mu_i - b,\; A\Sigma_i A^{\mathsf T}\right) = \left|A^{-1}\right| N\!\left(W\xi;\, \mu_i, \Sigma_i\right)$$

$$p_i(d) = N\!\left(d;\, \alpha m_i - \beta,\; \alpha\sigma_i^2\alpha\right) = \left|\alpha^{-1}\right| N\!\left(X\psi;\, m_i, \sigma_i^2\right)$$

where $\xi = [o^{\mathsf T}, 1]^{\mathsf T}$, $\psi = [d, 1]^{\mathsf T}$, $\mu_i$ is the mean of the state output distribution, $m_i$ is the mean of the state duration distribution, $\Sigma_i$ is the diagonal covariance matrix, $\sigma_i^2$ is the variance, $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental-frequency and duration parameters of the speech data can be normalized and transformed; for adaptation data $O$ of length $T$, the transforms $\Lambda = (W, X)$ can be estimated by maximum likelihood:
$$\tilde{\Lambda} = \left(\tilde{W}, \tilde{X}\right) = \underset{\Lambda}{\arg\max}\; P(O \mid \lambda, \Lambda)$$
wherein $\lambda$ is the parameter set of the HSMM;
finally, correcting and updating the adaptive model of the speech with the maximum a posteriori (MAP) algorithm; for a given HSMM $\lambda$ with forward probability $\alpha_t(i)$ and backward probability $\beta_t(i)$, the probability $\kappa_t^d(i)$ of generating the continuous observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:
$$\kappa_t^d(i) = \frac{1}{P(O \mid \lambda)} \sum_{\substack{j=1 \\ j \neq i}}^{N} \alpha_{t-d}(j)\, p(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
the MAP estimates are then:

$$\hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i)\, d}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \kappa_t^d(i)}$$

where $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after the linear regression transformation, $\omega$ and $\tau$ are respectively the MAP estimation parameters of the state output and duration distributions, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
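In code, the weighted-average MAP update above reduces to a simple weighted combination of the transformed mean and the adaptation-data statistics; the sketch below assumes the occupancy sums have already been accumulated from $\kappa_t^d(i)$, and the default values of $\omega$ and $\tau$ are arbitrary placeholders.

```python
# A sketch of the weighted-average MAP update of claim 5.
import numpy as np

def map_update_mean(mu_bar, weighted_obs_sum, occupancy, omega=16.0):
    """MAP estimate of a state output mean.
    mu_bar:           mean after the CMLLR transformation, shape (dim,)
    weighted_obs_sum: sum over t, d of kappa_t^d(i) * (o_{t-d+1} + ... + o_t)
    occupancy:        sum over t, d of kappa_t^d(i) * d
    """
    return (omega * mu_bar + weighted_obs_sum) / (omega + occupancy)

def map_update_duration(m_bar, weighted_dur_sum, occupancy, tau=16.0):
    """Scalar analogue for a state duration mean.
    weighted_dur_sum: sum over t, d of kappa_t^d(i) * d
    occupancy:        sum over t, d of kappa_t^d(i)
    """
    return (tau * m_bar + weighted_dur_sum) / (tau + occupancy)
```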
6. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, wherein inputting the text to be synthesized, generating the speech parameters and synthesizing the Tibetan or Chinese speech in step E comprise the following steps:
first, converting the given text into a pronunciation labeling sequence containing context description information with a text analysis tool, predicting the context-dependent HMM of each pronunciation unit with the decision trees obtained during training, and concatenating these models into the sentence HMM;
secondly, generating the spectrum, duration and fundamental-frequency parameter sequences from the sentence HMM with a parameter generation algorithm;
finally, synthesizing the speech with a Mel log-spectrum approximation (MLSA) filter as the parameter synthesizer.
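As a rough sketch of this final step, the following uses the MLSA filter implementation from pysptk to reconstruct a waveform from Mel-cepstra and an F0 contour; the sampling rate, frame shift and all-pass constant are illustrative assumptions, not values given by the patent.

```python
# MLSA-filter waveform generation with pysptk, under the assumptions above.
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

def mlsa_synthesis(mcep, f0, fs=16000, hop=80, alpha=0.42):
    """Reconstruct a waveform from Mel-cepstra and an F0 contour."""
    # Pulse/noise excitation from pitch expressed in samples (0 = unvoiced).
    pitch = np.where(f0 > 0.0, fs / np.maximum(f0, 1e-10), 0.0)
    excitation = pysptk.excite(pitch.astype(np.float64), hop)
    # Mel-cepstra -> MLSA filter coefficients, then frame-by-frame filtering.
    b = pysptk.mc2b(mcep.astype(np.float64), alpha)
    synthesizer = Synthesizer(MLSADF(order=mcep.shape[1] - 1, alpha=alpha), hop)
    return synthesizer.synthesis(excitation, b)
```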
7. A Chinese-Tibetan bilingual speech synthesis device, characterized by comprising: an HMM model training unit for establishing the HMM models of the speech data; a speaker adaptation unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
8. The device according to claim 7, wherein the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the fundamental-frequency, spectrum and duration parameters; and a target HMM model determination subunit, which trains the statistical models of the acoustic parameters in combination with the context labeling information of the corpus and determines the fundamental-frequency, spectrum and duration parameters according to the context attribute set; the speech analysis subunit is connected with the target HMM model determination subunit.
9. The device according to claim 8, wherein the speaker adaptation unit comprises a speaker training subunit, an average voice model determination subunit, a speaker adaptive transformation subunit and an adaptive model determination subunit connected in sequence, the target HMM model determination subunit being connected with the speaker training subunit;
the speaker training subunit normalizes the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
the average voice model determination subunit determines the Chinese-Tibetan bilingual mixed-speech average voice model with a maximum likelihood linear regression algorithm;
the speaker adaptive transformation subunit computes the mean vectors and covariance matrices of the target speaker's state output probability distributions and duration probability distributions from the adaptation data and transforms them into the target-speaker model;
the adaptive model determination subunit establishes the MSD-HSMM adaptive model of the target speaker.
10. The device according to claim 9, wherein the speech synthesis unit comprises an adaptive model correction subunit and a synthesis subunit connected in sequence, the adaptive model determination subunit being connected with the adaptive model correction subunit;
the adaptive model correction subunit corrects and updates the adaptive model of the speech with the MAP algorithm, reducing the model deviation and improving the synthesis quality;
the synthesis subunit predicts the speech parameters of the input text with the corrected adaptive model, extracts the parameters, and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
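Structurally, claims 7 to 10 describe three units wired in sequence; the skeleton below mirrors that wiring in Python, with every class and method name an illustrative placeholder rather than an identifier taken from the patent.

```python
# A structural skeleton of the device in claims 7-10.
class HMMTrainingUnit:
    """Speech analysis subunit + target HMM determination subunit (claim 8)."""
    def train(self, corpus, context_labels):
        return {"spectrum": None, "f0": None, "duration": None}  # target HMMs

class SpeakerAdaptationUnit:
    """Training, average voice model, adaptive transformation and adaptive
    model determination subunits, connected in sequence (claim 9)."""
    def adapt(self, target_hmms, adaptation_data):
        return {"msd_hsmm": None}  # target-speaker adaptive model

class SpeechSynthesisUnit:
    """MAP correction subunit + synthesis subunit (claim 10)."""
    def synthesize(self, adaptive_model, text):
        return b""  # placeholder waveform

def run_device(corpus, context_labels, adaptation_data, text):
    """Run the units in the order the claims connect them."""
    target_hmms = HMMTrainingUnit().train(corpus, context_labels)
    adaptive_model = SpeakerAdaptationUnit().adapt(target_hmms, adaptation_data)
    return SpeechSynthesisUnit().synthesize(adaptive_model, text)
```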
CN201410341827.9A 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device Pending CN104217713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410341827.9A CN104217713A (en) 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN104217713A true CN104217713A (en) 2014-12-17

Family

ID=52099126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410341827.9A Pending CN104217713A (en) 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN104217713A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6035271A (en) * 1995-03-15 2000-03-07 International Business Machines Corporation Statistical methods and apparatus for pitch extraction in speech recognition, synthesis and regeneration
CN101290766A (en) * 2007-04-20 2008-10-22 西北民族大学 Syllable splitting method of Tibetan language of Anduo
US20090055162A1 (en) * 2007-08-20 2009-02-26 Microsoft Corporation Hmm-based bilingual (mandarin-english) tts techniques
CN202615783U (en) * 2012-05-23 2012-12-19 西北师范大学 Mel cepstrum analysis synthesizer based on FPGA
CN203276836U (en) * 2013-06-08 2013-11-06 西北民族大学 Novel Tibetan language identification apparatus
CN103440236A (en) * 2013-09-16 2013-12-11 中央民族大学 United labeling method for syntax of Tibet language and semantic roles

Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
C.J. LEGGETTER AND P.C. WOODLAND: "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", COMPUTER SPEECH & LANGUAGE *
HONGWU YANG ET AL: "Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis", MULTIMEDIA TOOLS AND APPLICATIONS *
J. YAMAGISHI ET AL: "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING *
SIOHAN O ET AL: "Structural maximum a posteriori linear regression for fast HMM adaptation", COMPUTER SPEECH & LANGUAGE *
YAMAGISHI J ET AL: "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS *
YAMAGISHI J ET AL: "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING *
YAMAGISHI J ET AL: "Average-voice-based speech synthesis", TOKYO INSTITUTE OF TECHNOLOGY *
YU HONGZHI, ZHANG JINXI ET AL: "Research On Tibetan Language Synthesis System Front-end Text Processing Technology Based on HMM", APPLIED MECHANICS AND MATERIALS *
LIU BO: "Research on statistical parametric speech synthesis of the Lhasa dialect of Tibetan", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *
WU YIJIAN: "Research on HMM-based speech synthesis technology", CHINA DOCTORAL AND MASTER'S DISSERTATIONS FULL-TEXT DATABASE (DOCTORAL), INFORMATION SCIENCE AND TECHNOLOGY *
SONG WENLONG: "Research on statistical parametric speech synthesis based on speaker adaptive training", WANFANG DISSERTATIONS *
DU JIA: "Application of HMM in parameter-based speech synthesis systems", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *
WANG HAIYAN ET AL: "Chinese-Tibetan bilingual speech synthesis based on speaker adaptive training", PROCEEDINGS OF THE 12TH NATIONAL CONFERENCE ON MAN-MACHINE SPEECH COMMUNICATION (NCMMSC2013) *
ZHAO HUANHUAN ET AL: "Speaker adaptation for speech synthesis based on maximum a posteriori probability", JOURNAL OF DATA ACQUISITION AND PROCESSING *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538025A (en) * 2014-12-23 2015-04-22 西北师范大学 Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN104882141A (en) * 2015-03-03 2015-09-02 盐城工学院 Serial port voice control projection system based on time delay neural network and hidden Markov model
CN106297764A (en) * 2015-05-27 2017-01-04 科大讯飞股份有限公司 A kind of multilingual mixed Chinese language treatment method and system
CN106294311B (en) * 2015-06-12 2019-03-19 科大讯飞股份有限公司 A kind of Tibetan language tone prediction technique and system
CN106294311A (en) * 2015-06-12 2017-01-04 科大讯飞股份有限公司 A kind of Tibetan language tone Forecasting Methodology and system
CN105390133A (en) * 2015-10-09 2016-03-09 西北师范大学 Tibetan TTVS system realization method
CN105654939A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Voice synthesis method based on voice vector textual characteristics
CN105654939B (en) * 2016-01-04 2019-09-13 极限元(杭州)智能科技股份有限公司 A kind of phoneme synthesizing method based on sound vector text feature
CN106128450A (en) * 2016-08-31 2016-11-16 西北师范大学 The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN107886938B (en) * 2016-09-29 2020-11-17 中国科学院深圳先进技术研究院 Virtual reality guidance hypnosis voice processing method and device
CN107886938A (en) * 2016-09-29 2018-04-06 中国科学院深圳先进技术研究院 Virtual reality guides hypnosis method of speech processing and device
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108573694A (en) * 2018-02-01 2018-09-25 北京百度网讯科技有限公司 Language material expansion and speech synthesis system construction method based on artificial intelligence and device
CN110232909A (en) * 2018-03-02 2019-09-13 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108492821B (en) * 2018-03-27 2021-10-22 华南理工大学 Method for weakening influence of speaker in voice recognition
CN108492821A (en) * 2018-03-27 2018-09-04 华南理工大学 A kind of method that speaker influences in decrease speech recognition
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109949796A (en) * 2019-02-28 2019-06-28 天津大学 A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN110349567A (en) * 2019-08-12 2019-10-18 腾讯科技(深圳)有限公司 The recognition methods and device of voice signal, storage medium and electronic device
CN110349567B (en) * 2019-08-12 2022-09-13 腾讯科技(深圳)有限公司 Speech signal recognition method and device, storage medium and electronic device
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111833845A (en) * 2020-07-31 2020-10-27 平安科技(深圳)有限公司 Multi-language speech recognition model training method, device, equipment and storage medium
CN111833845B (en) * 2020-07-31 2023-11-24 平安科技(深圳)有限公司 Multilingual speech recognition model training method, device, equipment and storage medium
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112116903A (en) * 2020-08-17 2020-12-22 北京大米科技有限公司 Method and device for generating speech synthesis model, storage medium and electronic equipment
CN111986646B (en) * 2020-08-17 2023-12-15 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN115547292A (en) * 2022-11-28 2022-12-30 成都启英泰伦科技有限公司 Acoustic model training method for speech synthesis
CN115547292B (en) * 2022-11-28 2023-02-28 成都启英泰伦科技有限公司 Acoustic model training method for speech synthesis
CN117275458A (en) * 2023-11-20 2023-12-22 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium
CN117275458B (en) * 2023-11-20 2024-03-05 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20141217)