CN104217713A - Tibetan-Chinese speech synthesis method and device
- Publication number: CN104217713A
- Application number: CN201410341827.9A
- Authority: CN (China)
- Prior art keywords: tibetan, model, speaker, adaptive
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention provides a Tibetan-Chinese speech synthesis method and device that synthesize input Chinese or Tibetan sentences from a pre-built Chinese-Tibetan mixed-language corpus, so that one system can synthesize both Chinese and Tibetan speech. Compared with a traditional HMM-based (hidden-Markov-model-based) speech synthesis system, the method and device add a speaker adaptive training process to the training stage to obtain a Chinese-Tibetan mixed-language average voice model; speaker adaptive training reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, a speaker adaptive transformation algorithm needs only a small amount of Tibetan or Chinese corpus data to synthesize Tibetan or Chinese speech with good naturalness and fluency. The research is significant for promoting communication with minority-language communities and the development of minority-language speech technology.
Description
Technical Field
The invention relates to the technical field of multilingual speech synthesis, and in particular provides a method and a device for cross-language Chinese-Tibetan bilingual speech synthesis.
Background
In recent years, multilingual speech synthesis has become a research hotspot in the field of human-computer speech interaction. The technology enables man-machine voice interaction in different languages within the same system and has important application value for countries or regions where several languages are spoken. China has numerous minority languages and dialects, so research on this technology is of great significance. In the Tibetan areas of China, for example, Mandarin, Tibetan, and local dialects are all spoken; a speech system capable of cross-language multilingual synthesis would do much to promote communication with minority groups and to advance minority-language speech technology.
Research on multilingual speech synthesis at home and abroad mainly follows two approaches: unit-selection concatenative synthesis and statistical parametric speech synthesis. The basic principle of concatenative (waveform-splicing) synthesis is to derive the basic unit information by analyzing the input text, select suitable units from a pre-recorded and annotated speech library, adjust them slightly, and finally splice them into the synthesized speech. Since the units of the final synthesized speech are extracted directly from the sound library, this approach preserves the timbre of the original speaker. However, a concatenative system generally needs a large-scale speech library, so corpus construction is laborious and time-consuming; the synthesis quality depends heavily on the speech library and is strongly affected by the recording environment, and the robustness is low. The basic idea of statistical parametric speech synthesis is to decompose the input speech signal into parameters and build a statistical model, predict the speech parameters of the text to be synthesized with the trained model, and feed the parameters into a parametric synthesizer to obtain the synthesized speech. This method needs less data to build a system, requires little manual intervention, and produces smooth, fluent synthesized speech with high robustness, but the synthesized speech has lower sound quality and relatively flat, inexpressive prosody.
The HMM-based statistical parametric speech synthesis method can synthesize the speech of different speakers through speaker adaptive transformation and has become a research hotspot in cross-language multilingual speech synthesis. HMM-based multilingual synthesis systems use mixed-language modeling, phoneme mapping, or state mapping to achieve multilingual synthesis. However, most existing research targets languages with large corpora and relatively mature synthesis technology; research on dialects, minority languages, and other languages whose speech resources are hard to obtain is lacking, and no Mandarin/minority-language or Mandarin/dialect multilingual speech synthesis system has yet been realized at home or abroad. Current research on multilingual synthesis focuses on mainstream languages and mainly adopts phoneme mapping or state mapping, but both methods need a large amount of bilingual speech data. For Tibetan, which lacks speech resources, the absence of a large-scale bilingual phonetic corpus makes it difficult to apply these methods to Mandarin-Tibetan multilingual speech synthesis.
Disclosure of Invention
The invention provides a method and a device for Chinese-Tibetan bilingual speech synthesis, aiming to solve the problems identified in the background art: existing multilingual speech synthesis systems lack research on dialects, minority languages, and languages whose speech resources are hard to obtain, such as Tibetan, and cannot realize Mandarin-Tibetan multilingual speech synthesis.
In order to solve the above technical problems, the invention adopts the following technical scheme: the Chinese-Tibetan bilingual speech synthesis method comprises the following steps:
A. taking the International Phonetic Alphabet as reference, obtaining the international phonetic transcription of the input Tibetan pinyin letters, comparing it with the international phonetic transcription of Chinese pinyin, labeling the identical parts directly with SAMPA-SC and the differing parts with unused keyboard symbols, and completing the automatic SAMPA-T labeling of the Tibetan text corpus with the SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
Further, the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
firstly reading in the Tibetan sentence text, then segmenting it into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list; for each syllable, separating the initial and the final by locating the base-character block and decomposing it, the decomposition being carried out according to the base-character-block split list; and finally obtaining the SAMPA-T string of the syllable by looking up the initial SAMPA-T table and the final SAMPA-T table.
Further, the design of the Chinese-Tibetan universal phonetic transcription system and question set in step B comprises the following steps:
firstly, labeling the Tibetan initials and finals whose pronunciation is consistent with Mandarin with Chinese pinyin, and labeling those inconsistent with Mandarin pronunciation with Tibetan pinyin;
then, selecting all the initials and finals of Mandarin and Tibetan, together with silence and pause, as context-dependent MSD-HSMM synthesis primitives, and designing a context labeling format to label the context-dependent features of each synthesis primitive at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer, and sentence layer;
finally, designing a question set common to the Chinese-Tibetan bilingual setting on the basis of the context-dependent question set of Mandarin; the question set is extended with questions on the synthesis primitives peculiar to Tibetan to reflect the special pronunciations of Tibetan, and it contains more than 3000 context-dependent questions covering all the features of the context-dependent labels.
Further, the training of the mixed-language average voice model through speaker adaptive training in step C comprises the following steps:
a. performing speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus, and extracting the acoustic parameters:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices,
(2) calculating their first-order and second-order differences;
b. performing HMM model training in combination with the context attribute set, and training statistical models of the acoustic parameters:
(1) training HMM models of the spectrum and fundamental frequency parameters,
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters;
c. performing speaker adaptive training with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library, thereby obtaining the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function,
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution,
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs;
d. performing speaker adaptive transformation with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution,
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters;
e. correcting and updating the adaptive model:
(1) calculating the MAP estimation parameters of the state output and duration distributions of the average voice model with the maximum a posteriori (MAP) algorithm,
(2) calculating the mean vectors of the state output and state duration after adaptive transformation,
(3) calculating the weighted-average MAP estimates of the adaptive mean vectors;
f. inputting a text to be synthesized, and performing text analysis on the text to obtain an HMM model of a sentence;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through the parameter synthesizer, where the formulas are as follows:

$$\bar{\mu}_i^{(s)} = W^{(s)}\xi_i = A^{(s)} o_i + b^{(s)}, \qquad \bar{m}_i^{(s)} = X^{(s)}\psi_i = \alpha^{(s)} d_i + \beta^{(s)}$$

wherein $\bar{\mu}_i^{(s)}$ is the state output mean vector of training speaker $s$ and $\bar{m}_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A^{(s)}, b^{(s)}]$ and $X^{(s)} = [\alpha^{(s)}, \beta^{(s)}]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\xi_i = [o_i^{\top}, 1]^{\top}$ and $\psi_i = [d_i, 1]^{\top}$, where $o_i$ and $d_i$ are the mean observation vector and the mean duration vector.
Further, the step D of obtaining the speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model comprises the following steps:
firstly, after speaker adaptive training, calculating the mean vectors and covariance matrices of the transformed speaker's state output probability distribution and duration probability distribution with the HSMM-based CMLLR adaptation algorithm, wherein the transformation equations of the feature vector $o$ and the state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}\!\left(o;\; A\mu_i - b,\; A\Sigma_i A^{\top}\right) = \left|A^{-1}\right|\,\mathcal{N}\!\left(W\xi;\; \mu_i,\; \Sigma_i\right)$$
$$p_i(d) = \mathcal{N}\!\left(d;\; \alpha m_i - \beta,\; \alpha^2\sigma_i^2\right) = \left|\alpha^{-1}\right|\,\mathcal{N}\!\left(X\psi;\; m_i,\; \sigma_i^2\right)$$

wherein $\xi = [o^{\top}, 1]^{\top}$ and $\psi = [d, 1]^{\top}$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix, and $\sigma_i^2$ the variance; $W = [A^{-1}, A^{-1}b]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \alpha^{-1}\beta]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, normalizing and transforming the spectrum, fundamental frequency, and duration parameters of the speech data; for adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ can be estimated by maximum likelihood:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda}\, p(O \mid \lambda, \Lambda)$$

wherein $\lambda$ is the parameter set of the HSMM;
finally, correcting and updating the adaptive model of the speech with the maximum a posteriori (MAP) algorithm; for a given HSMM $\lambda$, if its forward probability and backward probability are $\alpha_t(i)$ and $\beta_t(i)$, the probability $\gamma_t^{d}(i)$ of generating the continuous observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\gamma_t^{d}(i) = \frac{1}{P(O \mid \lambda)} \sum_{j \neq i} \alpha_{t-d}(j)\, a_{ji}\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
The MAP estimates are described as follows:

$$\hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)\, d}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \gamma_t^{d}(i)}$$

wherein $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after linear regression transformation, $\omega$ and $\tau$ are the MAP estimation parameters of the state output and duration distributions, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
Further, the step E of inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Chinese speech comprises the following steps:
firstly, converting the given text into a pronunciation labeling sequence containing context description information with a text analysis tool, predicting the context-dependent HMM of each pronunciation with the decision trees obtained in the training process, and concatenating the HMMs into the HMM of the sentence;
secondly, generating the parameter sequences of spectrum, duration, and fundamental frequency from the sentence HMM with a parameter generation algorithm;
finally, using a mel log spectrum approximation (MLSA) filter as the parameter synthesizer to synthesize the speech.
Further, the Chinese-Tibetan bilingual speech synthesis device comprises: an HMM model training unit for building HMM models of the speech data; a speaker adaptive unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
Further, the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the sound library, mainly the fundamental frequency, spectrum, and duration parameters; and a target HMM model determining subunit, which trains statistical models of the acoustic parameters in combination with the context labeling information of the sound library and determines the fundamental frequency, spectrum, and duration parameters according to the context attribute set; the speech analysis subunit is connected with the target HMM model determining subunit.
Further, the speaker adaptive unit comprises a speaker training subunit, an average voice model determining subunit, a speaker adaptive transformation subunit, and an adaptive model determining subunit which are connected in sequence, the target HMM model determining subunit being connected with the speaker training subunit,
the speaker training subunit is used for normalizing the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
the average voice model determining subunit determines the Chinese-Tibetan bilingual mixed-language average voice model with the maximum likelihood linear regression algorithm;
the speaker adaptive transformation subunit calculates the mean vectors and covariance matrices of the speaker's state output probability distribution and duration probability distribution from the adaptation data and transforms them into the target speaker model;
the adaptive model determining subunit establishes the MSD-HSMM adaptive model of the target speaker.
Further, the speech synthesis unit comprises an adaptive model correction subunit and a synthesis subunit connected in sequence, the adaptive model determining subunit being connected with the adaptive model correction subunit,
the adaptive model correction subunit corrects and updates the adaptive model of the speech with the MAP algorithm, reducing the model deviation and improving the synthesis quality;
the synthesis subunit predicts the speech parameters of the input text with the corrected adaptive model, extracts the parameters, and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
The advantages and positive effects of the invention are as follows: the Chinese-Tibetan bilingual speech synthesis method and device exploit the similarity in pronunciation between Chinese and Tibetan and use HMM-based adaptive training and adaptive transformation algorithms to synthesize, within a single system and device, Chinese and Tibetan speech with good naturalness and fluency. Compared with a traditional HMM-based speech synthesis system, the system adds a speaker adaptive training process to the training stage to obtain the Chinese-Tibetan mixed-language average voice model; this process reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, the speaker adaptive transformation algorithm can synthesize Tibetan or Chinese speech with good naturalness and fluency from only a small amount of Tibetan or Chinese corpus to be synthesized. This research is of great significance for promoting communication with minority-language communities and for the development of minority-language speech technology.
Drawings
FIG. 1 is a flow chart of the Chinese-Tibetan bilingual speech synthesis method;
FIG. 2 is a flow diagram of the Tibetan text to SAMPA-T conversion;
FIG. 3 is a block diagram of the Chinese-Tibetan bilingual speaker-adaptive speech synthesis process;
FIG. 4 is a schematic diagram of the structure of the bilingual speech synthesis device;
FIG. 5 is a flow chart of model training;
FIG. 6 is a flow chart of speech synthesis.
Detailed Description
The invention provides a Chinese-Tibetan bilingual speech synthesis method. It proposes a letter-to-sound conversion algorithm oriented to SAMPA-T, the machine-readable phonetic alphabet for Tibetan, and realizes automatic SAMPA-T labeling of the Tibetan text corpus; according to the similarity between Tibetan and Mandarin, it designs a labeling system, a labeling format, and a question set common to Mandarin and Tibetan; and using the corpora of multiple Mandarin and Tibetan speakers, it finally synthesizes Chinese or Tibetan speech through HMM-based speaker adaptive training and a speaker adaptive transformation algorithm. The flow chart of the Chinese-Tibetan bilingual speech synthesis method is shown in FIG. 1; the specific steps are as follows:
(1) A SAMPA-T labeling scheme for the Tibetan Lhasa dialect is designed, and automatic SAMPA-T labeling of the Tibetan text corpus is completed with the SAMPA-T-oriented letter-to-sound conversion algorithm.
The machine-readable phonetic alphabet SAMPA (Speech Assessment Methods Phonetic Alphabet) is a computer-readable phonetic notation system that can represent all the symbols of the International Phonetic Alphabet with ASCII characters. At present, SAMPA is widely applied to the main languages of Europe and to East Asian languages such as Japanese, and SAMPA schemes have also been proposed for Mandarin Chinese, Cantonese, and Taiwanese.
Since Tibetan and Chinese both belong to the Sino-Tibetan language family, the invention designs a computer-readable phonetic transcription system for Tibetan, SAMPA-T (T for Tibetan), on the basis of the machine-readable phonetic notation scheme for Mandarin Chinese, presents the design scheme with the Tibetan Lhasa dialect as an example, and realizes the conversion of Tibetan text to SAMPA-T.
A comparison of the international phonetic transcriptions of Chinese and Tibetan shows that some of them are identical. Therefore, taking the International Phonetic Alphabet as reference, the international phonetic transcription is obtained for the input Tibetan pinyin letters and compared with that of Chinese pinyin; the identical parts are labeled directly with SAMPA-SC, and the differing parts are labeled with unused keyboard symbols according to the principle of simplicity.
Tibetan is an alphabetic script whose characters are spelled from letters, with the syllable as the basic unit. According to the structural position of the letters within a syllable, traditional Tibetan grammar divides them into the prefixed letter, the base letter, the superscribed letter, the subscribed letter, the suffixed letter, and the post-suffixed letter, the base letter being the core of the whole syllable. A Tibetan final consists of the vowel plus the suffixed letters.
The conversion of Tibetan text to SAMPA-T mainly involves Tibetan sentence segmentation, single-syllable segmentation, locating the base-character block, separating and converting the initial and the final, and assembling the SAMPA-T strings. Locating the base-character block, i.e., recognizing the base letter, the vowel, and so on, is realized mainly by a dictionary-oriented statistics and lookup method. The letter-to-sound conversion is realized mainly by looking up the SAMPA-T transcription support tables of the initials and finals. First, the Tibetan text is read in and segmented into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list. For each syllable, the initial and the final are separated by locating and decomposing the base-character block, and the SAMPA-T string of the syllable is then obtained by looking up the initial and final SAMPA-T tables. The flow chart of the conversion of Tibetan text to SAMPA-T is shown in FIG. 2, and a sketch of this procedure follows.
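The following Python sketch illustrates the table-driven structure of this conversion. It is a minimal sketch under stated assumptions: the SHAD and TSHEG delimiters are the real Unicode punctuation marks, but the SAMPA-T mappings and the naive initial/final split are illustrative stand-ins for the patent's full transcription lists and base-character-block split list.

```python
# Minimal sketch of the Tibetan-to-SAMPA-T letter-to-sound conversion.
# The two tables are tiny illustrative stand-ins for the full initial/final
# SAMPA-T lists; real syllables need the base-character-block split list
# to separate stacked and prefixed letters correctly.
SHAD = "\u0F0D"   # ། sentence/clause delimiter (the "single vertical stroke")
TSHEG = "\u0F0B"  # ་ syllable delimiter

INITIAL_SAMPA_T = {"\u0F40": "k", "\u0F41": "kh", "\u0F42": "k_"}  # ཀ ཁ ག (assumed)
FINAL_SAMPA_T = {"": "a", "\u0F72": "i", "\u0F74": "u"}            # inherent a, ི, ུ (assumed)

def split_sentences(text):
    """Segment running Tibetan text into sentences at the shad."""
    return [s for s in text.split(SHAD) if s.strip()]

def split_syllables(sentence):
    """Segment a sentence into syllables at the tsheg."""
    return [syl for syl in sentence.split(TSHEG) if syl]

def syllable_to_sampa_t(syllable):
    """Split a syllable into initial and final (naively: first letter vs rest)
    and look both parts up in the SAMPA-T tables."""
    initial, final = syllable[:1], syllable[1:]
    return INITIAL_SAMPA_T.get(initial, "?") + FINAL_SAMPA_T.get(final, "?")

def tibetan_to_sampa_t(text):
    return [[syllable_to_sampa_t(syl) for syl in split_syllables(sent)]
            for sent in split_sentences(text)]

print(tibetan_to_sampa_t("\u0F40\u0F72\u0F0B\u0F41\u0F74\u0F0D"))  # [['ki', 'khu']]
```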
(2) According to the similarity between Tibetan language and Mandarin, a universal phonetic system and a question set for Chinese and Tibetan language are designed on the basis of a Mandarin phonetic system.
Tibetan and Chinese both belong to the Sino-Tibetan language family and have many commonalities, as well as differences, in pronunciation. Both Mandarin and the Tibetan Lhasa dialect are composed of syllables, each syllable consisting of an initial and a final. Mandarin has 22 initials and 39 finals, the Tibetan Lhasa dialect has 36 initials and 45 finals, and the two languages share 20 initials and 13 finals. First, the Tibetan initials and finals whose pronunciation is consistent with Mandarin are labeled with Chinese pinyin, and those inconsistent with Mandarin pronunciation are labeled with Tibetan pinyin.
Then, all initials and finals of Mandarin and Tibetan, silence and pause are selected as the synthesis primitives of the MSD-HSMM related to the context to design a context labeling format for labeling the context-related features of the initial-final layer, the syllable layer, the word layer, the prosodic word layer, the phrase layer and the sentence layer of each synthesis primitive.
Finally, a question set common to the Chinese-Tibetan bilingual setting is designed on the basis of the context-dependent question set of Mandarin. The question set focuses on adding questions about the synthesis primitives peculiar to Tibetan so as to reflect the special pronunciations of Tibetan. It contains more than 3000 context-dependent questions, covering all the features of the context-dependent labels.
The system adopts a hierarchical labeling method for the Tibetan text corpus; the labeled content comprises the syllable layer, boundary information, and the SAMPA-T transcription result. The labeling is done with the international Praat phonetics software, and the system can add further labeling information as needed. After labeling is finished, a script program writes the labeling information into a TextGrid file, which contains four layers of labeled information, mainly pronunciation and syllable boundary information. The question set contains classification information about basic features, such as the initial, the final type, and whether a syllable lies inside a prosodic phrase; such classification information is usually a set of basic information organized according to some set of context information. By reclassifying the question set, context classification information more complex than the basic features can be obtained. In the HTS system, the designed questions are listed in a hed file, one question per line; each question is a yes/no question and begins with a QS command.
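As a concrete illustration of that QS format, the short Python snippet below writes a few hed-style question lines; the question names and label patterns are invented examples in the spirit of HTS demo question sets, not taken from the patent's actual question file.

```python
# Write a miniature HTS-style question set (.hed) file.
# Each QS line defines one yes/no question: a name plus the set of
# context-label patterns for which the answer is "yes".
questions = {
    "C-Initial_Is_Nasal": ["*-m+*", "*-n+*", "*-ng+*"],
    "C-Final_Is_Tibetan_Specific": ["*-ee+*", "*-oo+*"],
    "C-Syllable_Is_Prosodic_Phrase_Initial": ["*|PP-pos=1|*"],
}

with open("mandarin_tibetan.hed", "w", encoding="utf-8") as f:
    for name, patterns in questions.items():
        # e.g.: QS "C-Initial_Is_Nasal" {*-m+*,*-n+*,*-ng+*}
        f.write('QS "%s" {%s}\n' % (name, ",".join(patterns)))
```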
(3) A mixed-language average voice model is trained through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers.
Compared with the traditional HMM-based speech synthesis method, this method adds a speaker adaptive training process to the training stage to obtain the Chinese-Tibetan mixed-language average voice model, which reduces the influence of speaker differences in the speech library and improves the quality of the synthesized speech. On the basis of the average voice model, the speaker adaptive transformation algorithm can synthesize Tibetan or Chinese speech with good naturalness and fluency from only a small amount of Tibetan or Chinese corpus to be synthesized. The block diagram of the Chinese-Tibetan bilingual speaker-adaptive speech synthesis process is shown in FIG. 3:
Step 1, speech analysis is performed on the multi-speaker Chinese corpus data and the single-speaker Tibetan corpus data, and their acoustic parameters are extracted:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices;
(2) calculating their first-order and second-order differences (a sketch of this computation follows).
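A minimal numpy sketch of sub-step (2): appending first- and second-order differences to a frame-by-coefficient parameter matrix. The simple (-0.5, 0, 0.5) and (1, -2, 1) regression windows are a common convention and an assumption here, not taken from the patent.

```python
import numpy as np

def append_deltas(features):
    """features: (num_frames, dim) static parameters, e.g. mel-cepstra or log F0.
    Returns (num_frames, 3*dim): statics plus first- and second-order differences."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])                 # first-order difference
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]   # second-order difference
    return np.hstack([features, delta, delta2])

# Example: 100 frames of 25-dimensional mel-cepstra -> 75-dimensional observations
mcep = np.random.randn(100, 25)
obs = append_deltas(mcep)
assert obs.shape == (100, 75)
```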
Step 2, HMM model training is performed in combination with the context attribute set, and statistical models of the acoustic parameters are trained:
(1) training HMM models of the spectrum and fundamental frequency parameters;
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters.
Step 3, speaker adaptive training is performed with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library to obtain the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function;
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution;
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs.
Step 4, speaker adaptive transformation is performed with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm;
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution;
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters.
Step 5, the adaptive model is corrected and updated:
(1) calculating the MAP estimation parameters of the state output and duration distributions of the average voice model with the maximum a posteriori (MAP) algorithm;
(2) calculating the mean vectors of the state output and state duration after adaptive transformation;
(3) calculating the weighted-average MAP estimates of the adaptive mean vectors.
Step 6, the text to be synthesized is input, and text analysis is performed on it to obtain the HMM model of the sentence.
Step 7, parameter prediction is performed on the sentence HMM, the speech parameters are generated, and the synthesized speech is obtained through the parameter synthesizer.
FIG. 3 shows the flow of the Chinese-Tibetan bilingual speech synthesis process. A mixed Mandarin-Tibetan corpus is used to train the Chinese-Tibetan bilingual mixed-language average voice model with constrained maximum likelihood linear regression (CMLLR), thereby obtaining context-dependent multi-space distribution hidden semi-Markov models (MSD-HSMMs). In speaker adaptive training, the difference between each training speaker's speech data and the average voice is expressed by linear regression functions of the mean vectors of the state output distribution and the state duration distribution, and the differences between training speakers are normalized by a set of linear regression equations of these distributions; the formulas are as follows:

$$\bar{\mu}_i^{(s)} = W^{(s)}\xi_i = A^{(s)} o_i + b^{(s)}, \qquad \bar{m}_i^{(s)} = X^{(s)}\psi_i = \alpha^{(s)} d_i + \beta^{(s)}$$

wherein $\bar{\mu}_i^{(s)}$ is the state output mean vector of training speaker $s$ and $\bar{m}_i^{(s)}$ is its state duration mean vector; $W^{(s)} = [A^{(s)}, b^{(s)}]$ and $X^{(s)} = [\alpha^{(s)}, \beta^{(s)}]$ are the transformation matrices of the differences in state output distribution and state duration distribution between training speaker $s$ and the average voice model; $\xi_i = [o_i^{\top}, 1]^{\top}$ and $\psi_i = [d_i, 1]^{\top}$, where $o_i$ and $d_i$ are the mean observation vector and the mean duration vector. A numpy sketch of these transforms follows.
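The sketch below applies one such per-speaker regression transform to average-voice quantities; it is a toy illustration with arbitrary dimensions and values, not the patent's estimation procedure (which fits W and X to training data).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 75  # e.g. statics plus deltas of the spectral stream

# Average-voice quantities for one state i (illustrative values)
o_i = rng.standard_normal(dim)  # mean observation vector
d_i = 7.0                       # mean state duration in frames

# Per-speaker transforms W = [A, b] and X = [alpha, beta]
A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
b = 0.1 * rng.standard_normal(dim)
alpha, beta = 1.1, -0.5

xi = np.concatenate([o_i, [1.0]])   # augmented vector [o_i^T, 1]^T
W = np.hstack([A, b[:, None]])      # W = [A, b]
mu_s = W @ xi                       # speaker-s state output mean, A o_i + b
m_s = alpha * d_i + beta            # speaker-s state duration mean, X [d_i, 1]^T

assert np.allclose(mu_s, A @ o_i + b)
```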
(4) The speaker adaptive model is obtained through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and the adaptive model is corrected and updated.
After speaker adaptive training, the mean vectors and covariance matrices of the transformed speaker's state output probability distribution and duration probability distribution are calculated with the HSMM-based CMLLR adaptation algorithm. The transformation equations of the feature vector $o$ and the state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}\!\left(o;\; A\mu_i - b,\; A\Sigma_i A^{\top}\right) = \left|A^{-1}\right|\,\mathcal{N}\!\left(W\xi;\; \mu_i,\; \Sigma_i\right)$$
$$p_i(d) = \mathcal{N}\!\left(d;\; \alpha m_i - \beta,\; \alpha^2\sigma_i^2\right) = \left|\alpha^{-1}\right|\,\mathcal{N}\!\left(X\psi;\; m_i,\; \sigma_i^2\right)$$

wherein $\xi = [o^{\top}, 1]^{\top}$ and $\psi = [d, 1]^{\top}$; $\mu_i$ is the mean of the state output distribution, $m_i$ the mean of the duration distribution, $\Sigma_i$ the diagonal covariance matrix, and $\sigma_i^2$ the variance. $W = [A^{-1}, A^{-1}b]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \alpha^{-1}\beta]$ is the transformation matrix of the state duration probability density distribution.
Through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency, and duration parameters of the speech data can be normalized and transformed. For adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ can be estimated by maximum likelihood:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda}\, p(O \mid \lambda, \Lambda)$$

where $\lambda$ is the parameter set of the HSMM.
Finally, the adaptive model of the speech is corrected and updated with the maximum a posteriori (MAP) algorithm. For a given HSMM $\lambda$, if its forward probability and backward probability are $\alpha_t(i)$ and $\beta_t(i)$, the probability $\gamma_t^{d}(i)$ of generating the continuous observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\gamma_t^{d}(i) = \frac{1}{P(O \mid \lambda)} \sum_{j \neq i} \alpha_{t-d}(j)\, a_{ji}\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
the MAP estimate is described as follows:
wherein,andand omega and tau are respectively the MAP estimation parameters of state output and time length distribution.Andas an adaptive mean vectorAndweighted average MAP estimate of (a).
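A small numpy sketch of this weighted-average MAP update, under the simplifying assumption that the occupancy statistics have already been collapsed into a per-state soft frame count and observation sum; the parameter values are illustrative.

```python
import numpy as np

def map_update_mean(mu_bar, data_sum, occupancy, omega):
    """Weighted-average MAP estimate of a state output mean.
    mu_bar:    mean vector after the linear regression (CMLLR) transformation
    data_sum:  sum of adaptation observations assigned to the state
    occupancy: total soft frame count assigned to the state
    omega:     MAP prior weight of the state output distribution"""
    return (omega * mu_bar + data_sum) / (omega + occupancy)

mu_bar = np.array([0.2, -0.1, 0.4])
obs = np.array([[0.25, -0.05, 0.35],   # two adaptation frames
                [0.31, -0.12, 0.45]])
mu_hat = map_update_mean(mu_bar, obs.sum(axis=0), occupancy=len(obs), omega=10.0)
# With little adaptation data mu_hat stays near mu_bar; with more data it
# moves toward the data mean, which is the intended MAP behavior.
print(mu_hat)
```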
The training stage mainly comprises preprocessing and HMM training. In the preprocessing stage, the speech data in the sound library are analyzed and the corresponding speech parameters (fundamental frequency and spectral parameters) are extracted. According to the extracted parameters, the observation vector of the HMM is divided into a spectrum part and a fundamental frequency part: the spectrum parameters are modeled with continuous-probability-distribution HMMs, the fundamental frequency with multi-space probability distribution HMMs (MSD-HMMs), and the system additionally builds a state duration model with a Gaussian or gamma distribution to describe the temporal structure of speech. In addition, the HMM synthesis system describes contexts with linguistic and prosodic features. Before model training, the context attribute set and the question set used for decision-tree clustering are designed; that is, context attributes that influence the acoustic parameters (spectrum, fundamental frequency, and duration) are selected according to prior knowledge, and a corresponding question set is designed for context-dependent model clustering. A sketch of the multi-space fundamental frequency stream follows.
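To make the multi-space F0 modeling concrete, the sketch below shows one common way to encode an F0 track as an MSD observation stream: voiced frames carry a continuous log F0 value, unvoiced frames carry only a discrete "unvoiced" space label. The encoding is an illustrative convention, not the internal format of any particular toolkit.

```python
import math

def to_msd_stream(f0_track):
    """Map an F0 track (Hz, 0.0 = unvoiced) to MSD observations (space, value):
    space 0 is the voiced one-dimensional space carrying log F0,
    space 1 is the unvoiced zero-dimensional space with no continuous value."""
    stream = []
    for f0 in f0_track:
        if f0 > 0.0:
            stream.append((0, math.log(f0)))  # voiced frame: continuous log F0
        else:
            stream.append((1, None))          # unvoiced frame: discrete symbol only
    return stream

print(to_msd_stream([0.0, 0.0, 181.0, 185.5, 0.0]))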
During model training, the HMMs of the acoustic parameter vector sequences are trained with the EM algorithm according to the ML criterion. Finally, the spectrum parameter models, fundamental frequency parameter models, and duration models are clustered with context decision trees to obtain the prediction models used for synthesis. The whole model training process is shown in FIG. 5.
(5) Inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
First, the given text is converted by a text analysis tool into a pronunciation labeling sequence containing context description information; the context-dependent HMM of each pronunciation is predicted with the decision trees obtained in training, and the HMMs are concatenated into the HMM of the sentence. Then, a parameter generation algorithm generates the parameter sequences of spectrum, duration, and fundamental frequency from the sentence HMM. Finally, a mel log spectrum approximation (MLSA) filter is used as the parameter synthesizer to synthesize the speech. The scheme of the whole synthesis stage is shown in FIG. 6, and a vocoding sketch follows.
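As an illustration of the final vocoding step, the sketch below drives an MLSA filter with a pulse/noise excitation through the pysptk package. The API usage and the parameter values (order, all-pass constant alpha, hop size) are assumptions to be checked against the pysptk documentation, and the zero-valued mel-cepstra stand in for parameters generated from the sentence HMM.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

fs, hop = 16000, 80        # 16 kHz speech, 5 ms frame shift (assumed)
order, alpha = 24, 0.41    # mel-cepstrum order and all-pass constant for 16 kHz

# Stand-ins for parameters predicted from the sentence HMM:
frames = 200
mc = np.zeros((frames, order + 1))        # mel-cepstra, one row per frame
f0 = np.full(frames, 160.0)               # F0 in Hz
f0[:40] = 0.0                             # an unvoiced stretch

# Pulse/noise excitation from the pitch period in samples (0 = unvoiced)
pitch = np.where(f0 > 0.0, fs / np.maximum(f0, 1e-8), 0.0)
excitation = pysptk.excite(pitch.astype(np.float64), hop)

b = pysptk.mc2b(mc, alpha)                # mel-cepstra -> MLSA filter coefficients
synthesizer = Synthesizer(MLSADF(order=order, alpha=alpha), hop)
waveform = synthesizer.synthesis(excitation, b)  # synthesized speech samples
```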
Corresponding to the method, the invention also provides a Chinese-Tibetan bilingual speech synthesis device, which performs speech synthesis on the input Chinese or Tibetan sentences to be synthesized using the pre-established Chinese-Tibetan bilingual corpus; the functions of the device can be realized by software, by hardware, or by a combination of the two. The internal structure of the device is shown schematically in FIG. 4.
The internal structure of the device comprises an HMM model training unit, a speaker adaptive unit, and a speech synthesis unit.
1. An HMM model training unit for building HMM models of the speech data:
(1) the speech analysis subunit extracts the acoustic parameters of the speech data in the sound library, mainly the fundamental frequency, spectrum, and duration parameters;
(2) the target HMM model determining subunit trains statistical models of the acoustic parameters in combination with the context labeling information of the sound library and determines the fundamental frequency, spectrum, and duration parameters according to the context attribute set.
2. A speaker adaptive unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model:
(1) the speaker training subunit normalizes the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
(2) the average voice model determining subunit determines the Chinese-Tibetan bilingual mixed-language average voice model with the maximum likelihood linear regression algorithm;
(3) the speaker adaptive transformation subunit calculates the mean vectors and covariance matrices of the speaker's state output probability distribution and duration probability distribution from the adaptation data and transforms them into the target speaker model;
(4) the adaptive model determining subunit establishes the MSD-HSMM adaptive model of the target speaker.
3. A speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized:
(1) the adaptive model correction subunit corrects and updates the adaptive model of the speech with the MAP algorithm, reducing the model deviation and improving the synthesis quality;
(2) the synthesis subunit predicts the speech parameters of the input text with the corrected adaptive model, extracts the parameters, and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
The processes of the above method may be implemented by hardware executing program instructions; the program may be stored in a readable storage medium, and when executed it performs the corresponding steps of the above method.
To compare the method adopted by the invention with other methods, the quality of the synthesized Tibetan and Chinese speech was evaluated. Three different MSD-HSMM models were trained, and the advantages of the disclosed method are illustrated by comparing the quality of the speech synthesized under the three models.
The speech library selected as training data comprises the Mandarin speech of 7 female speakers (169 sentences per speaker) and 800 recorded sentences of a female Tibetan speaker. The Tibetan sentences were selected from recent Tibetan newspapers. All recordings were saved in Microsoft WAV file format (single channel, 16-bit quantization, 16 kHz sampling).
In the experiment, 100 of the 800 Tibetan sentences were randomly selected as test sentences. From the remaining 700 Tibetan sentences, 10, 100, and 700 sentences were randomly picked to establish 3 Tibetan training sets. Together with the training speech of the 7 female Mandarin speakers, these training sets were used to train the Chinese-Tibetan bilingual mixed-language average voice models. The 3 Tibetan training sets and the training corpus of the first female Mandarin speaker were also used to obtain the speaker-dependent acoustic models of Tibetan and Mandarin.
1) SD model: speaker-dependent models of the Tibetan Lhasa dialect, trained with the 3 Tibetan training sets (10/100/700 Tibetan sentences) respectively.
2) SI model: a speaker-independent model trained using only the training sentences of the 7 female Mandarin speakers.
3) SAT model: first, 3 Chinese-Tibetan bilingual mixed-language average voice models were trained with the 3 Tibetan training sets and all the Mandarin training sentences of the 7 Mandarin speakers; then the speaker-dependent models were obtained with the 3 Tibetan training sets and the training sentences of the first Mandarin speaker respectively.
During the evaluation, the Tibetan Lhasa dialect test sentences synthesized by the SD and SAT models were played in random order to 8 Tibetan Lhasa dialect evaluators, 120 test speech files in total (20 Tibetan test sentences x 3 Tibetan training sets x 2 models). The evaluators were asked to listen to the 120 sentences carefully and to score the speech quality of each sentence on a 5-point scale. After the MOS evaluation, the evaluators were also asked to describe the overall intelligibility of the Tibetan speech synthesized from the different Tibetan Lhasa dialect training sets. In the Chinese MOS evaluation, the same method was adopted: 54 synthesized Mandarin sentences (18 Mandarin test sentences x 3 Tibetan training sets) were played in random order to Mandarin evaluators, and the speech quality of each Mandarin sentence was scored on a 5-point scale.
The MOS results of the speech synthesized under the different Tibetan training sets show the average MOS scores with their 95% confidence intervals. For Tibetan synthesized speech, the SAT model is superior to the SD model under every Tibetan training set. With 10 Tibetan training sentences, the MOS score of the speech synthesized by the SD model is only 1.99, while that of the SAT model is relatively high at 2.4. During the evaluation, the evaluators found the Tibetan synthesized by the SD model difficult to understand, while that synthesized by the SAT model was easy to understand. With 100 Tibetan training sentences, the MOS scores and intelligibility of both models improve, but the SAT model is still clearly superior to the SD model. When the training set reaches 700 sentences, the MOS scores of the two models are substantially the same, and the evaluators found the synthesized speech easy to understand. Therefore, in the small-corpus case the quality of the speech synthesized by the SAT model is better than that of the SD model, and as the Tibetan corpus grows the quality of the Tibetan speech synthesized by the two models tends to converge. The disclosed method is thus very suitable for synthesizing high-quality speech when the corpus is scarce.
For Mandarin synthesized speech, under every Tibetan training set the Tibetan sentences mixed into the training corpus have almost no influence on the Mandarin synthesis results; the MOS scores of the synthesized Mandarin are all around 4.0, a good synthesis effect.
The similarity of the synthesized speech was evaluated with the DMOS method. In the DMOS evaluation, all test sentences and their original recordings took part. There were 140 synthesized Tibetan speech files in total (20 Tibetan sentences x 3 Tibetan training sets x 2 models + 20 Tibetan sentences synthesized by the SI model). Each synthesized Tibetan sentence and its original recording form a group of speech files. The 140 groups of test files were played in random order to the Tibetan Lhasa dialect evaluators: first the original Tibetan recording, then the synthesized Tibetan speech. The evaluators were asked to compare the two speech files carefully and to rate the degree of similarity of the synthesized speech to the original on a 5-point scale, where 5 means the synthesized speech is essentially the same as the original and 1 means it differs greatly from the original.
In the Chinese DMOS evaluation, the same method was adopted: 54 groups of Mandarin speech (18 Mandarin test sentences x 3 Tibetan training sets) were played in random order to the Mandarin evaluators, and the degree of similarity of each group of Mandarin sentences was rated on a 5-point scale.
The results show the average DMOS scores with their 95% confidence intervals. For Tibetan synthesized speech, the DMOS score of the SI model is 2.41, better than the SD model trained with 10 Tibetan sentences and close to the SAT model trained with 10 Tibetan sentences. In the subjective evaluation, the Tibetan speech synthesized by the SI model sounded like Tibetan spoken by a non-Tibetan speaker. This is because Mandarin and Tibetan not only share 33 synthesis primitives but also have the same syllable structure and prosodic structure; hence speech resembling Tibetan can be synthesized using only the Mandarin model. As more Tibetan training sentences are added, the DMOS score of the Tibetan speech synthesized by the SAT model surpasses that of the SD model. When the Tibetan training set grows to 700 sentences, the DMOS score of the SD model is very close to that of the SAT model. This shows that, when few Tibetan sentences are available, the Tibetan Lhasa dialect speech synthesized by the disclosed method is superior to that synthesized by the SD-model-based method.
The embodiments of the invention have been described in detail above, but the description covers only preferred embodiments and should not be construed as limiting the scope of the invention. All equivalent changes and modifications made within the scope of the invention should be covered by this patent.
Claims (10)
1. A Chinese-Tibetan bilingual speech synthesis method, characterized in that it comprises the following steps:
A. taking the International Phonetic Alphabet as reference, obtaining the international phonetic transcription of the input Tibetan pinyin letters, comparing it with the international phonetic transcription of Chinese pinyin, labeling the identical parts directly with SAMPA-SC and the differing parts with unused keyboard symbols, and completing the automatic SAMPA-T labeling of the Tibetan text corpus with the SAMPA-T-oriented letter-to-sound conversion algorithm;
B. designing a Chinese and Tibetan universal phonetic system and a question set on the basis of a Mandarin phonetic system according to the similarity of Tibetan and Mandarin;
C. training a mixed-language average voice model through HMM-based speaker adaptive training, using the speech data of multiple Chinese and Tibetan speakers;
D. obtaining a speaker adaptive model through speaker adaptive transformation, using a small amount of corpus from the target speaker of the Tibetan or Chinese speech to be synthesized, and correcting and updating the adaptive model;
E. inputting a text to be synthesized, generating voice parameters, and synthesizing Tibetan or Chinese voice.
2. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the SAMPA-T letter-to-sound conversion algorithm in step A comprises the following steps:
firstly reading in the Tibetan sentence text, then segmenting it into sentences and syllables according to the single vertical stroke (shad) and the syllable delimiter (tsheg) to obtain the Tibetan syllable list; for each syllable, separating the initial and the final by locating the base-character block and decomposing it, the decomposition being carried out according to the base-character-block split list; and finally obtaining the SAMPA-T string of the syllable by looking up the initial SAMPA-T table and the final SAMPA-T table.
3. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the design of the Chinese-Tibetan universal phonetic transcription system and question set in step B comprises the following steps:
firstly, labeling the Tibetan initials and finals whose pronunciation is consistent with Mandarin with Chinese pinyin, and labeling those inconsistent with Mandarin pronunciation with Tibetan pinyin;
then, selecting all the initials and finals of Mandarin and Tibetan, together with silence and pause, as context-dependent MSD-HSMM synthesis primitives, and designing a context labeling format to label the context-dependent features of each synthesis primitive at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer, and sentence layer;
finally, designing a question set common to the Chinese-Tibetan bilingual setting on the basis of the context-dependent question set of Mandarin; the question set is extended with questions on the synthesis primitives peculiar to Tibetan to reflect the special pronunciations of Tibetan, and it contains more than 3000 context-dependent questions covering all the features of the context-dependent labels.
4. The Chinese-Tibetan bilingual speech synthesis method according to claim 1, characterized in that the training of the mixed-language average voice model through speaker adaptive training in step C comprises the following steps:
a. performing speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus, and extracting the acoustic parameters:
(1) extracting the mel-cepstral coefficients, the logarithmic fundamental frequency, and the aperiodicity indices,
(2) calculating their first-order and second-order differences;
b. performing HMM model training in combination with the context attribute set, and training statistical models of the acoustic parameters:
(1) training HMM models of the spectrum and fundamental frequency parameters,
(2) training a multi-space distribution hidden semi-Markov model (MSD-HSMM) with state duration parameters;
c. performing speaker adaptive training with the multi-speaker Chinese speech library and the single-speaker Tibetan speech library, thereby obtaining the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm and expressing the difference between each training speaker's speech data and the average voice with a linear regression function,
(2) normalizing the differences between training speakers with a set of linear regression equations of the state output distribution and the state duration distribution,
(3) training the Chinese-Tibetan bilingual mixed-language average voice model, thereby obtaining context-dependent MSD-HSMMs;
d. performing speaker adaptive transformation with the single-speaker adaptation data of Chinese and Tibetan:
(1) calculating the mean vectors and covariance matrices of the speaker's state output probability distribution and state duration probability distribution with the CMLLR algorithm,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese speech to be synthesized, using a set of transformation matrices of the state output distribution and state duration distribution,
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental frequency, and duration parameters;
e. modifying and updating the adaptive model:
(1) computing the MAP estimation parameters of the state output and duration distributions of the average voice model using the maximum a posteriori (MAP) algorithm,
(2) computing the mean vectors of the state output and state duration distributions after adaptive transformation,
(3) computing the weighted-average MAP estimates of the adaptive mean vectors;
f. inputting the text to be synthesized and performing text analysis on it to obtain the sentence HMM;
g. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through a parameter synthesizer, wherein the speaker-adaptive-training formulas are as follows:
$$\mu_i^{(s)} = W\xi_i = A o_i + b, \qquad \xi_i = [o_i^\top, 1]^\top$$

$$m_i^{(s)} = X\psi_i = \alpha d_i + \beta, \qquad \psi_i = [d_i, 1]^\top$$

wherein $\mu_i^{(s)}$ is the state output mean vector of training speaker $s$, $m_i^{(s)}$ is its state duration mean vector, $W = [A, b]$ and $X = [\alpha, \beta]$ are the transformation matrices expressing the differences of the state output distribution and the state duration distribution between speaker $s$ and the average voice model, and $o_i$ and $d_i$ are the average observation vector and the average duration vector.
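As referenced in step a above, here is a minimal numpy sketch of appending first- and second-order dynamic features to a static parameter track; the regression-window coefficients are common HTS defaults, assumed here rather than taken from the patent.

```python
import numpy as np

# Minimal sketch of step a(2): append first- and second-order dynamic
# features to a static parameter track (e.g., Mel-cepstra or log F0).
# The regression windows are conventional HTS defaults, assumed here.
DELTA_WINDOWS = [
    np.array([1.0]),               # static
    np.array([-0.5, 0.0, 0.5]),    # first-order difference
    np.array([0.25, -0.5, 0.25]),  # second-order difference
]

def add_deltas(static: np.ndarray) -> np.ndarray:
    """static: (T, D) frames -> (T, 3*D) [static, delta, delta-delta]."""
    T, _ = static.shape
    out = []
    for win in DELTA_WINDOWS:
        half = len(win) // 2
        padded = np.pad(static, ((half, half), (0, 0)), mode="edge")
        feat = np.zeros_like(static)
        for k, w in enumerate(win):
            feat += w * padded[k:k + T]   # weighted shifted copies
        out.append(feat)
    return np.concatenate(out, axis=1)

mcep = np.random.randn(100, 25)   # 100 frames of 25-dim Mel-cepstra (toy)
features = add_deltas(mcep)       # shape (100, 75)
```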
5. The method of synthesizing Mandarin-Tibetan bilingual speech according to claim 1, wherein obtaining the speaker adaptive model in step D, through speaker adaptive transformation using a small corpus of the target Tibetan or Mandarin speaker to be synthesized, and modifying and updating the adaptive model, comprises the following steps:
first, after speaker adaptive training, the mean vectors and covariance matrices of the speaker-transformed state output probability distributions and duration probability distributions are computed using the HSMM-based CMLLR adaptation algorithm, wherein the transformation equations of a feature vector $o$ and a state duration $d$ in state $i$ are:
$$b_i(o) = \mathcal{N}(o;\, A\mu_i - b,\, A\Sigma_i A^\top) = |A^{-1}|\, \mathcal{N}(W\xi;\, \mu_i, \Sigma_i)$$

$$p_i(d) = \mathcal{N}(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\, \mathcal{N}(X\psi;\, m_i, \sigma_i^2)$$

wherein $\xi = [o^\top, 1]^\top$, $\psi = [d, 1]^\top$, $\mu_i$ is the mean of the state output distribution, $m_i$ is the mean of the state duration distribution, $\Sigma_i$ is the diagonal covariance matrix, $\sigma_i^2$ is the variance of the duration distribution, $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;
then, through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency and duration parameters of the speech data are normalized and transformed; for adaptation data $O$ of length $T$, the transform $\Lambda = (W, X)$ is estimated by maximum likelihood,

$$\hat{\Lambda} = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda)$$

wherein $\lambda$ is the parameter set of the HSMM;
finally, the maximum a posteriori (MAP) algorithm is used to modify and update the adaptive model of the speech; for a given HSMM $\lambda$ with forward probability $\alpha_t(i)$ and backward probability $\beta_t(i)$, the probability $\chi_t^d(i)$ of generating the consecutive observation sequence $o_{t-d+1}, \ldots, o_t$ in state $i$ is:

$$\chi_t^d(i) = \frac{1}{P(O \mid \lambda)}\, \alpha_{t-d}(i)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i)$$
the MAP estimates are then:

$$\hat{\mu}_i = \frac{\omega\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}$$

$$\hat{m}_i = \frac{\tau\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)}$$

wherein $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after linear regression transformation, $\omega$ and $\tau$ are respectively the MAP estimation parameters of the state output and duration distributions, and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.
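To make the two updates concrete, this small numpy sketch applies a CMLLR-style linear transform W = [A, b] to a state output mean and then performs the weighted-average MAP update in the form given above; the matrices, weights and adaptation data are all illustrative, not values from the patent.

```python
import numpy as np

# Sketch of the two adaptation steps of claims 4-5, using the equation
# forms given above; all concrete numbers are illustrative only.

def cmllr_transform_mean(A: np.ndarray, b: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Apply W = [A, b] to a state-output mean: mu_bar = A @ mu + b."""
    return A @ mu + b

def map_update(mu_bar: np.ndarray, omega: float,
               occupancy: float, obs_sum: np.ndarray) -> np.ndarray:
    """Weighted-average MAP estimate:
    mu_hat = (omega * mu_bar + sum of observations) / (omega + occupancy)."""
    return (omega * mu_bar + obs_sum) / (omega + occupancy)

rng = np.random.default_rng(0)
mu = rng.normal(size=3)                    # average-voice state output mean
A = np.eye(3) * 1.1                        # toy regression matrix
b = np.full(3, 0.05)                       # toy bias
mu_bar = cmllr_transform_mean(A, b, mu)    # linearly transformed mean

obs = rng.normal(loc=mu_bar, size=(20, 3))  # 20 frames of adaptation data
mu_hat = map_update(mu_bar, omega=5.0,
                    occupancy=len(obs), obs_sum=obs.sum(axis=0))
# With little data mu_hat stays close to mu_bar; with more data it
# moves toward the sample mean of the adaptation frames.
```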
6. The method of synthesizing Mandarin-Tibetan bilingual speech according to claim 1, wherein inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Mandarin speech in step E comprises the following steps:
first, converting the given text into a pronunciation labeling sequence containing context description information using a text analysis tool, predicting the context-dependent HMM of each pronunciation using the decision trees obtained during training, and concatenating these models into a sentence HMM;
secondly, generating the parameter sequences of the spectrum, duration and fundamental frequency from the sentence HMM using a parameter generation algorithm;
finally, synthesizing the speech using a Mel log spectrum approximation (MLSA) filter as the parameter synthesizer.
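As an illustration of this last step, the sketch below builds a pulse/noise excitation from a frame-level F0 track and runs it through an MLSA filter. It assumes the third-party pysptk package and typical settings (16 kHz audio, 5 ms frames, order-24 Mel-cepstra, all-pass constant 0.41); none of these values come from the patent, and a deployed system would use the parameters generated from the sentence HMM rather than the toy tracks below.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

# Sketch of the MLSA-based parameter synthesizer of claim 6, assuming
# the pysptk package; frame shift, order and alpha are typical values.
FS, HOP, ORDER, ALPHA = 16000, 80, 24, 0.41   # 16 kHz, 5 ms frames

T = 200                                        # number of frames
f0 = np.full(T, 120.0)                         # flat 120 Hz contour (toy)
f0[150:] = 0.0                                 # trailing unvoiced frames
mc = np.zeros((T, ORDER + 1))                  # toy Mel-cepstral track

# Pitch period in samples (0 marks unvoiced), then pulse/noise excitation.
pitch = np.where(f0 > 0, FS / np.maximum(f0, 1e-8), 0.0)
excitation = pysptk.excite(pitch.astype(np.float64), HOP)

b = pysptk.mc2b(mc, alpha=ALPHA)               # Mel-cepstra -> MLSA filter coefs
synthesizer = Synthesizer(MLSADF(order=ORDER, alpha=ALPHA), HOP)
waveform = synthesizer.synthesis(excitation, b)  # synthesized samples
```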
7. A Mandarin-Tibetan bilingual speech synthesis device, characterized by comprising: an HMM model training unit for establishing HMM models of the speech data; a speaker adaptation unit for normalizing and transforming the characteristic parameters of the training speakers to obtain an adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Mandarin speech to be synthesized.
8. The device according to claim 7, wherein the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the fundamental frequency, spectrum and duration parameters; and a target HMM model determination subunit, which trains statistical models of the acoustic parameters in combination with the context labeling information of the corpus and determines the fundamental frequency, spectrum and duration parameters according to the context attribute set, the speech analysis subunit being connected to the target HMM model determination subunit.
9. The device according to claim 8, wherein the speaker adaptation unit comprises a speaker training subunit, an average voice model determination subunit, a speaker adaptive transformation subunit and an adaptive model determination subunit, connected in sequence, the target HMM model determination subunit being connected to the speaker training subunit, wherein:
the speaker training subunit normalizes the differences of the state output distributions and the state duration distributions between the training speakers and the average voice model;
the average voice model determination subunit determines the Mandarin-Tibetan bilingual mixed-language average voice model using the maximum likelihood linear regression algorithm;
the speaker adaptive transformation subunit computes the mean vectors and covariance matrices of the speaker's state output probability distributions and duration probability distributions from the adaptation data and transforms them into the target speaker model;
the adaptive model determination subunit establishes the MSD-HSMM adaptive model of the target speaker.
10. The device according to claim 9, wherein the speech synthesis unit comprises an adaptive model modification subunit and a synthesis subunit, connected in sequence, the adaptive model determination subunit being connected to the adaptive model modification subunit, wherein:
the adaptive model modification subunit modifies and updates the adaptive model of the speech using the MAP algorithm, reducing model bias and improving synthesis quality;
the synthesis subunit predicts the speech parameters of the input text using the modified adaptive model, generates the parameters, and finally synthesizes the Mandarin or Tibetan speech through the speech synthesizer.
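Read together, claims 7-10 describe a three-unit pipeline. The Python sketch below mirrors that structure; every class and method name is a hypothetical placeholder and the bodies are stubs, since the patent specifies the units' roles but not an implementation.

```python
# Structural sketch of the device of claims 7-10; every name here is a
# hypothetical placeholder, and the method bodies are stubs.

class HMMTrainingUnit:
    """Claim 8: speech analysis + target HMM model determination."""
    def train(self, corpus):
        acoustic_params = self.analyze(corpus)   # F0, spectrum, duration
        return self.determine_target_hmm(acoustic_params, corpus.context_labels)
    def analyze(self, corpus): ...
    def determine_target_hmm(self, params, labels): ...

class SpeakerAdaptationUnit:
    """Claim 9: SAT -> average voice model -> CMLLR transform -> adaptive model."""
    def adapt(self, target_hmm, adaptation_data):
        avg_voice = self.speaker_adaptive_training(target_hmm)
        transforms = self.estimate_cmllr(avg_voice, adaptation_data)
        return self.build_adaptive_msd_hsmm(avg_voice, transforms)
    def speaker_adaptive_training(self, hmm): ...
    def estimate_cmllr(self, model, data): ...
    def build_adaptive_msd_hsmm(self, model, transforms): ...

class SpeechSynthesisUnit:
    """Claim 10: MAP modification of the adaptive model + synthesis."""
    def synthesize(self, adaptive_model, text):
        model = self.map_modify(adaptive_model)
        params = self.generate_parameters(model, text)
        return self.vocode(params)               # e.g., MLSA filtering
    def map_modify(self, model): ...
    def generate_parameters(self, model, text): ...
    def vocode(self, params): ...
```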
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410341827.9A CN104217713A (en) | 2014-07-15 | 2014-07-15 | Tibetan-Chinese speech synthesis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104217713A (en) | 2014-12-17
Family
ID=52099126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410341827.9A | Tibetan-Chinese speech synthesis method and device | 2014-07-15 | 2014-07-15
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104217713A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6035271A (en) * | 1995-03-15 | 2000-03-07 | International Business Machines Corporation | Statistical methods and apparatus for pitch extraction in speech recognition, synthesis and regeneration |
CN101290766A (en) * | 2007-04-20 | 2008-10-22 | 西北民族大学 | Syllable splitting method of Tibetan language of Anduo |
US20090055162A1 (en) * | 2007-08-20 | 2009-02-26 | Microsoft Corporation | Hmm-based bilingual (mandarin-english) tts techniques |
CN202615783U (en) * | 2012-05-23 | 2012-12-19 | 西北师范大学 | Mel cepstrum analysis synthesizer based on FPGA |
CN203276836U (en) * | 2013-06-08 | 2013-11-06 | 西北民族大学 | Novel Tibetan language identification apparatus |
CN103440236A (en) * | 2013-09-16 | 2013-12-11 | 中央民族大学 | United labeling method for syntax of Tibet language and semantic roles |
Non-Patent Citations (14)
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104538025A (en) * | 2014-12-23 | 2015-04-22 | 西北师范大学 | Method and device for converting gestures to Chinese and Tibetan bilingual voices |
CN104882141A (en) * | 2015-03-03 | 2015-09-02 | 盐城工学院 | Serial port voice control projection system based on time delay neural network and hidden Markov model |
CN106297764A (en) * | 2015-05-27 | 2017-01-04 | 科大讯飞股份有限公司 | A kind of multilingual mixed Chinese language treatment method and system |
CN106294311B (en) * | 2015-06-12 | 2019-03-19 | 科大讯飞股份有限公司 | A kind of Tibetan language tone prediction technique and system |
CN106294311A (en) * | 2015-06-12 | 2017-01-04 | 科大讯飞股份有限公司 | A kind of Tibetan language tone Forecasting Methodology and system |
CN105390133A (en) * | 2015-10-09 | 2016-03-09 | 西北师范大学 | Tibetan TTVS system realization method |
CN105654939A (en) * | 2016-01-04 | 2016-06-08 | 北京时代瑞朗科技有限公司 | Voice synthesis method based on voice vector textual characteristics |
CN105654939B (en) * | 2016-01-04 | 2019-09-13 | 极限元(杭州)智能科技股份有限公司 | A kind of phoneme synthesizing method based on sound vector text feature |
CN106128450A (en) * | 2016-08-31 | 2016-11-16 | 西北师范大学 | The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese |
CN107886938B (en) * | 2016-09-29 | 2020-11-17 | 中国科学院深圳先进技术研究院 | Virtual reality guidance hypnosis voice processing method and device |
CN107886938A (en) * | 2016-09-29 | 2018-04-06 | 中国科学院深圳先进技术研究院 | Virtual reality guides hypnosis method of speech processing and device |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN108573694A (en) * | 2018-02-01 | 2018-09-25 | 北京百度网讯科技有限公司 | Language material expansion and speech synthesis system construction method based on artificial intelligence and device |
CN110232909A (en) * | 2018-03-02 | 2019-09-13 | 北京搜狗科技发展有限公司 | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing |
CN108492821B (en) * | 2018-03-27 | 2021-10-22 | 华南理工大学 | Method for weakening influence of speaker in voice recognition |
CN108492821A (en) * | 2018-03-27 | 2018-09-04 | 华南理工大学 | A kind of method that speaker influences in decrease speech recognition |
CN109036370A (en) * | 2018-06-06 | 2018-12-18 | 安徽继远软件有限公司 | A kind of speaker's voice adaptive training method |
CN109003601A (en) * | 2018-08-31 | 2018-12-14 | 北京工商大学 | A kind of across language end-to-end speech recognition methods for low-resource Tujia language |
CN109949796A (en) * | 2019-02-28 | 2019-06-28 | 天津大学 | A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN110534089A (en) * | 2019-07-10 | 2019-12-03 | 西安交通大学 | A kind of Chinese speech synthesis method based on phoneme and rhythm structure |
CN110349567A (en) * | 2019-08-12 | 2019-10-18 | 腾讯科技(深圳)有限公司 | The recognition methods and device of voice signal, storage medium and electronic device |
CN110349567B (en) * | 2019-08-12 | 2022-09-13 | 腾讯科技(深圳)有限公司 | Speech signal recognition method and device, storage medium and electronic device |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111833845A (en) * | 2020-07-31 | 2020-10-27 | 平安科技(深圳)有限公司 | Multi-language speech recognition model training method, device, equipment and storage medium |
CN111833845B (en) * | 2020-07-31 | 2023-11-24 | 平安科技(深圳)有限公司 | Multilingual speech recognition model training method, device, equipment and storage medium |
CN111986646A (en) * | 2020-08-17 | 2020-11-24 | 云知声智能科技股份有限公司 | Dialect synthesis method and system based on small corpus |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN111986646B (en) * | 2020-08-17 | 2023-12-15 | 云知声智能科技股份有限公司 | Dialect synthesis method and system based on small corpus |
CN115547292A (en) * | 2022-11-28 | 2022-12-30 | 成都启英泰伦科技有限公司 | Acoustic model training method for speech synthesis |
CN115547292B (en) * | 2022-11-28 | 2023-02-28 | 成都启英泰伦科技有限公司 | Acoustic model training method for speech synthesis |
CN117275458A (en) * | 2023-11-20 | 2023-12-22 | 深圳市加推科技有限公司 | Speech generation method, device and equipment for intelligent customer service and storage medium |
CN117275458B (en) * | 2023-11-20 | 2024-03-05 | 深圳市加推科技有限公司 | Speech generation method, device and equipment for intelligent customer service and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104217713A (en) | Tibetan-Chinese speech synthesis method and device | |
Ramani et al. | A common attribute based unified HTS framework for speech synthesis in Indian languages | |
Abushariah et al. | Arabic speaker-independent continuous automatic speech recognition based on a phonetically rich and balanced speech corpus. | |
US10235991B2 (en) | Hybrid phoneme, diphone, morpheme, and word-level deep neural networks | |
CN107103900A (en) | A kind of across language emotional speech synthesizing method and system | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
Maia et al. | Towards the development of a brazilian portuguese text-to-speech system based on HMM. | |
CN116092471A (en) | Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition | |
Labied et al. | Moroccan dialect “Darija” automatic speech recognition: a survey | |
Sakti et al. | Development of HMM-based Indonesian speech synthesis | |
Azim et al. | Large vocabulary Arabic continuous speech recognition using tied states acoustic models | |
Liu et al. | A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin | |
Sun et al. | A method for generation of Mandarin F0 contours based on tone nucleus model and superpositional model | |
Bonafonte et al. | The UPC TTS system description for the 2008 blizzard challenge | |
Chiang et al. | The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0 | |
Nursetyo | LatAksLate: Javanese script translator based on Indonesian speech recognition using sphinx-4 and google API | |
JP7406418B2 (en) | Voice quality conversion system and voice quality conversion method | |
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
Bouselmi et al. | Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling | |
Janyoi et al. | An Isarn dialect HMM-based text-to-speech system | |
Iyanda et al. | Development of a Yorúbà Text-to-Speech System Using Festival |
Biczysko | Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian | |
Hirose et al. | Superpositional modeling of fundamental frequency contours for HMM-based speech synthesis | |
Hosn et al. | New resources for brazilian portuguese: Results for grapheme-to-phoneme and phone classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20141217 |