CN1534595A - Speech sound change over synthesis device and its method - Google Patents
- Publication number
- CN1534595A (application number CNA031160506A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- unspecified person
- unit sequence
- specific people
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
A speech conversion synthesizer comprises a speech analysis module, a speech recognition module, and a speech synthesis module that outputs a specific person's speech. In the method, based on the analysis and recognition results, unspecified-person speech is converted into the speech of a specific person designated by the user.
Description
Technical field
The present invention relates to a speech conversion synthesizer and a method thereof, and in particular to a speech conversion synthesizer and method that convert unspecified-person speech (speech from an arbitrary speaker) into the speech of a specific person.
Background technology
Voice conversion techniques are widely used in text-to-speech (TTS) system design, voice disguise, toy design, and similar areas. Research on voice conversion focuses mainly on how to establish a conversion relationship between a source speaker and a target speaker from the speech data of both.
The conversion methods used by known voice conversion devices include vector quantization with codebook mapping, linear transformation, neural networks, and Gaussian mixture models. All of these methods can establish a conversion relationship between the speakers' characteristic parameters, such as frequency-domain feature parameters. However, they can only establish a one-to-one conversion relationship, that is, a relationship between the voice of one specific source speaker and the voice of one specific target speaker. A speech conversion system built with these methods therefore serves only a specific user, and the system must be rebuilt for each new user. Known voice conversion methods are consequently unsuitable for voice disguise, toys, and other applications that need to convert unspecified-person speech into a specific person's speech.
Summary of the invention
The present invention therefore provides a speech conversion synthesizer that uses unspecified-person speech recognition to recognize the unspecified-person speech, and then synthesizes from the recognition result and the corresponding speech data in a specific-person speech database to obtain the specific person's speech.
The present invention also proposes a speech conversion synthesis method that recognizes the acquired unspecified-person speech and then synthesizes from the corresponding speech data to obtain the specific person's speech.
To achieve the above and other objects, the present invention proposes a speech conversion synthesizer comprising a speech analysis module, a speech recognition module, and a speech synthesis module.
The speech analysis module receives the unspecified-person speech acquired by the speech conversion synthesizer, divides it into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to produce spectral features and prosodic information, which are then output.
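The text does not say how a frame is judged voiced or unvoiced. The sketch below is one common heuristic and is an assumption, not the patent's method: a frame counts as voiced when its short-time energy is high and its zero-crossing rate is low. The function names and both thresholds are hypothetical and would need tuning on real recordings.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)

def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.25):
    # Hypothetical thresholds: voiced speech tends to be loud and low-frequency.
    # Silence falls below the energy threshold and is therefore labeled unvoiced.
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)

def group_segments(frames):
    """Merge runs of equally labeled frames into (label, start, end) segments.

    start is inclusive and end is exclusive, both counted in frames.
    """
    segments = []
    for i, frame in enumerate(frames):
        label = "voiced" if is_voiced(frame) else "unvoiced"
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], i + 1)
        else:
            segments.append((label, i, i + 1))
    return segments

# Synthetic test signals: a 100 Hz tone stands in for voiced speech,
# a rapidly alternating signal stands in for unvoiced hiss.
tone = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(160)]
hiss = [0.2 if n % 2 == 0 else -0.2 for n in range(160)]
segments = group_segments([tone, tone, hiss, tone])
```

In this arrangement the unvoiced segments would bypass recognition and go straight to the output, while only the voiced segments are passed on for spectral and prosodic analysis.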
The speech recognition module is coupled to the speech analysis module and receives the spectral features that the speech analysis module transmits. It identifies the speech unit sequence contained in the speech segment corresponding to the spectral features, determines the time length (duration) of each speech unit, and outputs the result. The speech recognition module comprises an unspecified-person speech database and a speech recognition unit. The unspecified-person speech database stores all speech unit model parameters used for unspecified-person speech recognition. The speech recognition unit is coupled to the unspecified-person speech database and, on receiving the spectral features, consults the database to identify the speech unit sequence contained in the corresponding speech segment.
The speech synthesis module is coupled to the speech recognition module and the speech analysis module. It receives the duration, the speech unit sequence, and the prosodic information, synthesizes from the speech unit data corresponding to the speech unit sequence to produce the specific person's speech, and finally outputs the specific person's speech through the output terminal. The speech synthesis module comprises a specific-person speech database and a speech synthesis unit. The specific-person speech database stores the specific-person speech unit data corresponding to the speech unit model parameters. The speech synthesis unit is coupled to the specific-person speech database and, on receiving the speech unit sequence, looks up the corresponding specific-person speech unit data in the database.
According to a preferred embodiment of the present invention, the unspecified-person speech database is built with hidden Markov models (HMMs), and the hidden Markov model for each speech unit can be obtained by training on a large amount of continuous speech from unspecified persons.
According to a preferred embodiment of the present invention, there may be one or more specific-person speech databases, each of which corresponds to its own specific person.
According to a preferred embodiment of the present invention, the prosodic information comprises the pitch period and the short-time energy.
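The two prosodic quantities named here can be illustrated with a short sketch. The text does not state how they are computed, so the following is an assumption: short-time energy is taken as the mean squared amplitude of a frame, and the pitch period as the lag that maximizes the frame's autocorrelation within a plausible speech range. The function names, the 60 to 400 Hz search range, and the 8 kHz sampling rate are all hypothetical choices.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def pitch_period(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the pitch period in samples as the best autocorrelation lag."""
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A 100 Hz tone sampled at 8 kHz repeats every 80 samples.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(160)]
period = pitch_period(frame, sample_rate)
energy = short_time_energy(frame)
```

Dividing the period in samples by the sampling rate gives the pitch period in seconds (here 80 / 8000 = 10 ms).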
According to a preferred embodiment of the present invention, dividing the unspecified-person speech into frames means cutting the continuous unspecified-person speech at a preset time interval.
According to a preferred embodiment of the present invention, the speech recognition module performs recognition only at the phonetic level and does not recognize semantic units (such as words).
To achieve the above and other objects, the present invention also proposes a speech conversion synthesis method for converting acquired unspecified-person speech into synthesized specific-person speech. In this method, the speech analysis module acquires the unspecified-person speech, divides it into frames, and separates it into unvoiced segments and voiced segments; the speech analysis module then analyzes the voiced segments to obtain spectral features and prosodic information. Next, the speech recognition module uses the spectral features to identify the speech unit sequence contained in the corresponding speech segment and determines the duration of the speech unit sequence. Finally, according to the speech unit sequence, the duration, and the prosodic information, the speech synthesis module synthesizes the specific person's speech from the unvoiced segments and the speech unit data corresponding to the speech unit sequence, and outputs it through the output terminal.
To make the above and other objects, features, and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:
Description of drawings
Fig. 1 is a functional block diagram of a speech conversion synthesizer according to a preferred embodiment of the present invention;
Fig. 2 is a circuit block diagram of a preferred embodiment of the present invention implemented with a digital signal processor (DSP); and
Fig. 3 is a flowchart of a speech conversion synthesis method according to a preferred embodiment of the present invention.
Embodiment
Please refer to Fig. 1, which shows a functional block diagram of a speech conversion synthesizer according to a preferred embodiment of the present invention. The speech conversion synthesizer 100 can be used in text-to-speech system design, voice disguise, toy design, and similar applications. It comprises a speech analysis module 110, a speech recognition module 120, and a speech synthesis module 130.
In addition, dividing the unspecified-person speech into frames means cutting the continuous unspecified-person speech at a preset time interval; that is, the speech is cut every 20 milliseconds and each piece is defined as one frame. The preset time can be set when the speech conversion synthesizer 100 leaves the factory.
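The framing step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the frames do not overlap, a trailing partial frame is dropped, and the 8 kHz sampling rate in the example is hypothetical (the text gives only the 20 ms frame length). The function name is also hypothetical.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the trailing partial frame so every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

# One second of audio at 8 kHz, cut into 20 ms frames.
frames = split_into_frames([0.0] * 8000, sample_rate=8000)
```

Exposing `frame_ms` as a parameter mirrors the idea of a preset time that is fixed once, for instance at the factory, rather than adjusted during use.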
The speech recognition module 120 comprises an unspecified-person speech database 124 and a speech recognition unit 122. The unspecified-person speech database 124 stores all speech unit sequences used for unspecified-person speech recognition. The speech recognition unit 122 is coupled to the unspecified-person speech database 124 and, on receiving the spectral features, consults the unspecified-person speech database 124 to identify the speech unit sequence contained in the speech segment corresponding to the spectral features.
The speech synthesis module 130 comprises a plurality of specific-person speech databases D1~DN and a speech synthesis unit 132. The specific-person speech databases D1~DN store the specific-person speech unit data corresponding to the speech unit model parameters. The speech synthesis unit 132 is coupled to the specific-person speech databases D1~DN and, on receiving the speech unit sequence, looks up in the specific-person speech databases D1~DN the speech unit data corresponding to the speech unit sequence.
In the preferred embodiment of the present invention, there may be one or more specific-person speech databases D1~DN, each of which corresponds to its own specific person.
In the preferred embodiment of the present invention, the unspecified-person speech database is built with hidden Markov models (HMMs), and the hidden Markov model for each speech unit can be obtained by training on a large amount of continuous speech from unspecified persons.
In the preferred embodiment of the present invention, the speech recognition module 120 performs recognition only at the phonetic level and does not recognize semantic units (such as words).
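To illustrate how an HMM-based recognizer can yield both a speech unit sequence and unit durations without any word-level decoding, the sketch below runs Viterbi decoding over frames and then collapses repeated labels. It is a toy under loud assumptions: each unit is a single HMM state, observations are discrete symbols rather than spectral feature vectors, and all probabilities, unit labels, and function names are made up for the example. A practical recognizer would use multi-state, continuous-density models.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a sequence of discrete observations."""
    # best[t][s] is the best log-probability of any path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def units_with_durations(path, frame_ms=20):
    """Collapse a per-frame label path into (unit, duration_ms) pairs."""
    units = []
    for state in path:
        if units and units[-1][0] == state:
            units[-1] = (state, units[-1][1] + frame_ms)
        else:
            units.append((state, frame_ms))
    return units

def logs(table):
    """Convert a nested probability table to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical two-unit model with two observation symbols.
states = ["a", "n"]
log_start = {"a": math.log(0.5), "n": math.log(0.5)}
log_trans = logs({"a": {"a": 0.8, "n": 0.2}, "n": {"a": 0.2, "n": 0.8}})
log_emit = logs({"a": {"x": 0.9, "y": 0.1}, "n": {"x": 0.2, "y": 0.8}})

path = viterbi(["x", "x", "x", "y", "y"], states, log_start, log_trans, log_emit)
units = units_with_durations(path)
```

Because one label is decoded per frame, the duration of each unit falls out as the number of consecutive frames carrying that label multiplied by the frame length.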
The speech conversion synthesizer 100 operates as follows. The speech analysis module 110 receives the unspecified-person speech acquired by the speech conversion synthesizer 100, divides it into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to obtain spectral features and prosodic information, which are then output. Next, the speech recognition module 120 receives the spectral features transmitted by the speech analysis module 110, identifies the speech unit sequence contained in the speech segment corresponding to the spectral features, determines the duration of the speech unit sequence, and outputs both. Finally, the speech synthesis module 130 receives the duration and the speech unit sequence transmitted by the speech recognition module 120 and the prosodic information transmitted by the speech analysis module 110, synthesizes from the speech unit data corresponding to the speech unit sequence to produce the specific person's speech, and outputs the specific person's speech through the output terminal.
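The final synthesis step can be sketched as follows. The text does not describe the synthesis algorithm itself, so this is only an assumed, simplified reading: each recognized unit is looked up in one specific-person database, repeated or truncated to the recognized duration, and rescaled to the short-time energy measured by the analysis module. Pitch modification is deliberately left out, and the database contents, sampling rate, and function names are hypothetical.

```python
import math

def fit_duration(unit_samples, target_len):
    """Repeat or truncate a stored unit so that it lasts target_len samples."""
    if not unit_samples:
        return [0.0] * target_len
    repeats = -(-target_len // len(unit_samples))  # ceiling division
    return (unit_samples * repeats)[:target_len]

def match_energy(samples, target_energy):
    """Scale samples so their mean squared amplitude equals target_energy."""
    energy = sum(x * x for x in samples) / len(samples)
    if energy == 0.0:
        return list(samples)
    gain = math.sqrt(target_energy / energy)
    return [x * gain for x in samples]

def synthesize(units, speaker_db, sample_rate=8000):
    """Build output samples from (label, duration_ms, target_energy) triples."""
    output = []
    for label, duration_ms, target_energy in units:
        target_len = int(sample_rate * duration_ms / 1000)
        if target_len == 0:
            continue  # nothing to emit for a zero-length unit
        segment = fit_duration(speaker_db[label], target_len)
        output.extend(match_energy(segment, target_energy))
    return output

# Hypothetical stored waveforms for two units of one specific person.
speaker_db = {"a": [1.0, -1.0], "n": [0.5, 0.5, -0.5, -0.5]}
speech = synthesize([("a", 60, 0.25), ("n", 40, 0.04)], speaker_db)
```

The triples passed to `synthesize` combine the unit labels and durations from recognition with the energy values from analysis, which is the division of labor between the three modules described above.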
Please refer next to Fig. 2, which shows a circuit block diagram of a preferred embodiment of the present invention implemented with a digital signal processor (DSP). In Fig. 2, the speech conversion synthesizer 100 comprises an analog/digital converter 200, a digital signal processor 210, a digital/analog converter 220, an unspecified-person speech database 230, and a plurality of specific-person speech databases D1~DN.
The analog/digital converter 200 is the voice input port; it converts the received analog signal of the unspecified-person speech into a digital signal and outputs it. The digital signal processor 210 performs the computation involved in speech conversion, which includes analysis and recognition of the unspecified-person speech and synthesis of the specific person's speech. The digital/analog converter 220 is the voice output port; it converts the digital signal of the specific person's speech into an analog signal and outputs it. The unspecified-person speech database 230 stores the speech conversion formula and the hidden Markov model (HMM) parameters, and is a read-only memory (ROM). The specific-person speech databases D1~DN store the speech data of a plurality of specific persons and are memories.
In the preferred embodiment of the present invention, the digital signal processor 210 comprises an input buffer 212, a digital signal processing center 214, and an output buffer 216. The input buffer 212 stores the spectral parameters and prosodic parameters of the input speech segments; the digital signal processing center 214 performs the speech conversion computation; and the output buffer 216 stores the output speech.
Please continue with Fig. 3, which shows a flowchart of a speech conversion synthesis method according to a preferred embodiment of the present invention. For ease of understanding, please refer to Fig. 1 and Fig. 3 together. In this method, the speech analysis module 110 acquires the unspecified-person speech (s302), divides it into frames and separates it into unvoiced segments and voiced segments (s304), and then analyzes the voiced segments to obtain spectral features and prosodic information (s306). The speech recognition module 120 then uses the spectral features to consult the unspecified-person speech database 124, identifies the speech unit sequence contained in the corresponding speech segment, and determines the duration of the speech unit sequence. Finally, the speech synthesis module 130 receives the speech unit sequence, the duration, and the prosodic information, looks up the speech unit data corresponding to the speech unit sequence in the specific-person speech databases D1~DN, synthesizes the specific person's speech from the unvoiced segments and that speech unit data according to the speech unit sequence, the duration, and the prosodic information, and outputs it through the output terminal.
In summary, the speech conversion synthesizer and method of the present invention have the following advantages:
(1) The speech conversion synthesizer and method of the present invention can convert any acquired speech into a specific person's speech without adjustment during use, and are therefore highly adaptable.
(2) Without changing the structure or parameters of the speech conversion synthesizer, simply adding a new specific-person speech database gives the synthesizer the ability to convert to the new specific person's speech.
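Advantage (2) amounts to keeping the target voices in data rather than in the conversion logic. The sketch below shows one way to picture that: a registry keyed by person, where adding a database is a single call and the lookup code never changes. The class, method names, and sample data are hypothetical illustrations, not structures taken from the text.

```python
class SpeakerRegistry:
    """Holds one speech unit database per specific person."""

    def __init__(self):
        self._speaker_dbs = {}

    def add_speaker(self, name, unit_db):
        """Register a new specific-person database; nothing else is modified."""
        self._speaker_dbs[name] = dict(unit_db)

    def speakers(self):
        return sorted(self._speaker_dbs)

    def lookup(self, speaker, unit_sequence):
        """Return the stored unit data for each unit label, in order."""
        db = self._speaker_dbs[speaker]
        missing = [u for u in unit_sequence if u not in db]
        if missing:
            raise KeyError(f"speaker {speaker!r} has no data for units {missing}")
        return [db[u] for u in unit_sequence]

registry = SpeakerRegistry()
registry.add_speaker("D1", {"a": [0.1, 0.2], "n": [0.3]})
# Supporting a second target voice only requires a second database.
registry.add_speaker("D2", {"a": [0.9, 0.8], "n": [0.7]})
data = registry.lookup("D2", ["n", "a"])
```

The recognition side stays the same for every target voice because it outputs only unit labels, which is why the same label sequence can be resolved against any registered database.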
Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.
Claims (11)
1. A speech conversion synthesizer, adapted to convert and synthesize acquired unspecified-person speech, the speech conversion synthesizer comprising:
A speech analysis module, which receives the unspecified-person speech, divides it into frames, and separates it into an unvoiced segment and a voiced segment, wherein the unvoiced segment is output to an output terminal, and the voiced segment is analyzed into a spectral feature and prosodic information, which are then output;
A speech recognition module, coupled to the speech analysis module, which receives the spectral feature transmitted by the speech analysis module, identifies a speech unit sequence contained in a speech segment corresponding to the spectral feature, determines a duration of the speech unit sequence, and outputs the result; and
A speech synthesis module, coupled to the speech recognition module and the speech analysis module, which receives the prosodic information, the duration, and the speech unit sequence, synthesizes a specific person's speech from the specific-person speech unit data corresponding to the speech unit sequence according to the speech unit sequence, the duration, and the prosodic information, and outputs the specific person's speech through the output terminal.
2. The speech conversion synthesizer as claimed in claim 1, characterized in that the speech recognition module comprises:
An unspecified-person speech database, which stores the speech unit sequences used for unspecified-person speech recognition; and
A speech recognition unit, coupled to the unspecified-person speech database, which, on receiving the spectral feature, consults the unspecified-person speech database to identify the speech unit sequence contained in the speech segment corresponding to the spectral feature.
3. The speech conversion synthesizer as claimed in claim 2, characterized in that the unspecified-person speech database is built with a hidden Markov model, and the hidden Markov model is obtained by training on a large amount of continuous speech from specific persons.
4. The speech conversion synthesizer as claimed in claim 1, characterized in that the speech synthesis module comprises:
A specific-person speech database, which stores the specific-person speech unit data corresponding to the speech unit sequence; and
A speech synthesis unit, coupled to the specific-person speech database, which, on receiving the speech unit sequence, looks up in the specific-person speech database the specific-person speech unit data corresponding to the speech unit sequence.
5. The speech conversion synthesizer as claimed in claim 4, characterized in that the specific-person speech database stores the speech data of at least one specific person.
6. The speech conversion synthesizer as claimed in claim 1, characterized in that the prosodic information comprises the pitch period and the short-time energy.
7. The speech conversion synthesizer as claimed in claim 1, characterized in that dividing the unspecified-person speech into frames means cutting the continuous unspecified-person speech at a preset time interval.
8. The speech conversion synthesizer as claimed in claim 1, characterized in that the speech recognition module performs recognition only at the phonetic level and does not recognize semantic units.
9. A speech conversion synthesis method, comprising the following steps:
Acquiring unspecified-person speech;
Dividing the unspecified-person speech into frames and separating it into an unvoiced segment and a voiced segment;
Analyzing the voiced segment to obtain a spectral feature and prosodic information;
Identifying, according to the spectral feature, a speech unit sequence contained in a corresponding speech segment, and determining a duration of the speech unit sequence; and
Synthesizing, according to the speech unit sequence, the duration, and the prosodic information, a specific person's speech from the unvoiced segment and the speech unit data corresponding to the speech unit sequence, and outputting the specific person's speech.
10. The speech conversion synthesis method as claimed in claim 9, characterized in that the prosodic information comprises the pitch period and the short-time energy.
11. The speech conversion synthesis method as claimed in claim 9, characterized in that dividing the unspecified-person speech into frames means cutting the continuous unspecified-person speech at a preset time interval.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA031160506A CN1534595A (en) | 2003-03-28 | 2003-03-28 | Speech sound change over synthesis device and its method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA031160506A CN1534595A (en) | 2003-03-28 | 2003-03-28 | Speech sound change over synthesis device and its method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1534595A true CN1534595A (en) | 2004-10-06 |
Family
ID=34284550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA031160506A Pending CN1534595A (en) | 2003-03-28 | 2003-03-28 | Speech sound change over synthesis device and its method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1534595A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100349206C (en) * | 2005-09-12 | 2007-11-14 | 周运南 | Text-to-speech interchanging device |
CN102737628A (en) * | 2012-07-04 | 2012-10-17 | 哈尔滨工业大学深圳研究生院 | Method for converting voice based on linear predictive coding and radial basis function neural network |
CN103794206A (en) * | 2014-02-24 | 2014-05-14 | 联想(北京)有限公司 | Method for converting text data into voice data and terminal equipment |
CN104123932A (en) * | 2014-07-29 | 2014-10-29 | 科大讯飞股份有限公司 | Voice conversion system and method |
CN105206257A (en) * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device |
CN105227966A (en) * | 2015-09-29 | 2016-01-06 | 深圳Tcl新技术有限公司 | To televise control method, server and control system of televising |
CN105654941A (en) * | 2016-01-20 | 2016-06-08 | 华南理工大学 | Voice change method and device based on specific target person voice change ratio parameter |
WO2017067206A1 (en) * | 2015-10-20 | 2017-04-27 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and device |
CN109935225A (en) * | 2017-12-15 | 2019-06-25 | 富泰华工业(深圳)有限公司 | Character information processor and method, computer storage medium and mobile terminal |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
- 2003-03-28: CN application CNA031160506A (publication CN1534595A), status: Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100349206C (en) * | 2005-09-12 | 2007-11-14 | 周运南 | Text-to-speech interchanging device |
CN102737628A (en) * | 2012-07-04 | 2012-10-17 | 哈尔滨工业大学深圳研究生院 | Method for converting voice based on linear predictive coding and radial basis function neural network |
CN103794206B (en) * | 2014-02-24 | 2017-04-19 | 联想(北京)有限公司 | Method for converting text data into voice data and terminal equipment |
CN103794206A (en) * | 2014-02-24 | 2014-05-14 | 联想(北京)有限公司 | Method for converting text data into voice data and terminal equipment |
CN104123932A (en) * | 2014-07-29 | 2014-10-29 | 科大讯飞股份有限公司 | Voice conversion system and method |
CN105227966A (en) * | 2015-09-29 | 2016-01-06 | 深圳Tcl新技术有限公司 | To televise control method, server and control system of televising |
CN105206257B (en) * | 2015-10-14 | 2019-01-18 | 科大讯飞股份有限公司 | A kind of sound converting method and device |
CN105206257A (en) * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device |
WO2017067206A1 (en) * | 2015-10-20 | 2017-04-27 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and device |
US10410621B2 (en) | 2015-10-20 | 2019-09-10 | Baidu Online Network Technology (Beijing) Co., Ltd. | Training method for multiple personalized acoustic models, and voice synthesis method and device |
CN105654941A (en) * | 2016-01-20 | 2016-06-08 | 华南理工大学 | Voice change method and device based on specific target person voice change ratio parameter |
CN109935225A (en) * | 2017-12-15 | 2019-06-25 | 富泰华工业(深圳)有限公司 | Character information processor and method, computer storage medium and mobile terminal |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1169115C (en) | Prosodic databases holding fundamental frequency templates for use in speech synthesis | |
Purwins et al. | Deep learning for audio signal processing | |
Schuller et al. | The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals | |
CN102231278B (en) | Method and system for realizing automatic addition of punctuation marks in speech recognition | |
CN1156819C (en) | Method of producing individual characteristic speech sound from text | |
Lu et al. | Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis | |
CN102496363B (en) | Correction method for Chinese speech synthesis tone | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN1300049A (en) | Method and apparatus for identifying speech sound of chinese language common speech | |
CN113035228A (en) | Acoustic feature extraction method, device, equipment and storage medium | |
CN109979441A (en) | A kind of birds recognition methods based on deep learning | |
CN1534595A (en) | Speech sound change over synthesis device and its method | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
KR20200088263A (en) | Method and system of text to multiple speech | |
CN113297383A (en) | Knowledge distillation-based speech emotion classification method | |
Gaol et al. | Match to win: Analysing sequences lengths for efficient self-supervised learning in speech and audio | |
KR20190135853A (en) | Method and system of text to multiple speech | |
EP3363015A1 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system | |
Schmid et al. | Low-complexity audio embedding extractors | |
CN112242134A (en) | Speech synthesis method and device | |
CN1113330C (en) | Phoneme regulating method for phoneme synthesis | |
CN1924994A (en) | Embedded language synthetic method and system | |
CN116580698A (en) | Speech synthesis method, device, computer equipment and medium based on artificial intelligence | |
CN116913244A (en) | Speech synthesis method, equipment and medium | |
Razak et al. | Towards automatic recognition of emotion in speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |