
CN112216293A - Tone conversion method and device - Google Patents

Tone conversion method and device

Info

Publication number
CN112216293A
Authority
CN
China
Prior art keywords
target
parameter
tone conversion
vector
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010889099.0A
Other languages
Chinese (zh)
Other versions
CN112216293B (en)
Inventor
王愈
李健
陈明
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202010889099.0A
Publication of CN112216293A
Application granted
Publication of CN112216293B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a tone conversion method and a tone conversion device, wherein the method comprises the following steps: acquiring a voice to be converted; extracting a plurality of characteristic parameters of the voice to be converted; combining the plurality of characteristic parameters to obtain a feature vector; performing tone conversion on the feature vector to obtain target characteristic parameters; and performing sound production processing using the target characteristic parameters to obtain a target voice. The method and the device can perform tone conversion on the plurality of characteristic parameters of the voice to be converted and thoroughly convert them into the characteristic parameters of a target person, which improves the naturalness and stability of the conversion result and allows the converted voice to retain characteristics of the original speaker such as tone of voice and intonation.

Description

Tone conversion method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a tone conversion method and a tone conversion apparatus.
Background
VC (Voice Conversion) converts the vocal timbre of one person's voice into that of another person while the content of the speech remains unchanged. The distinction between tone conversion and speech synthesis is that speech synthesis goes from text to speech: it needs to perform NLP (Natural Language Processing) analysis on the text and then generate speech that conveys the sound and the meaning, with the emphasis on generating speech; tone conversion goes from speech to speech, does not involve NLP, and changes the signal directly at the acoustic level, with the emphasis on the mapping between voices.
Tone conversion has a wide range of applications, from everyday entertainment to pronunciation correction, identity attack and defense, and so on. Its development history is not short: initially, two people had to read speech with the same content (i.e., a parallel corpus) to train a one-to-one conversion model between them, which required a large total amount of data and gave poor conversion stability.
At present, tone conversion mainly converts the voice of an arbitrary person into the vocal timbre of a specific target person while the content remains unchanged, and the converted speech is close to the target person in all aspects, including intonation. However, in some scenarios it is more desirable that the converted speech retain the original speaker's tone of voice; for example, speech that was angry before conversion should still sound angry afterwards.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a tone color conversion method and a corresponding tone color conversion apparatus that overcome or at least partially solve the above problems.
The embodiment of the invention discloses a tone conversion method, which comprises the following steps:
acquiring a voice to be converted;
extracting a plurality of characteristic parameters of the voice to be converted;
combining the multiple characteristic parameters to obtain a characteristic vector;
performing tone conversion on the feature vector to obtain a target feature parameter;
and performing sound production processing by adopting the target characteristic parameters to obtain target voice.
Optionally, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the combining the plurality of feature parameters to obtain a feature vector comprises:
extracting acoustic features of the first spectrum parameters to obtain second spectrum parameters, wherein the second spectrum parameters correspond to the sounding contents of the voice to be converted;
and combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a feature vector.
Optionally, the performing the tone conversion on the feature vector to obtain the target feature parameter includes:
and performing tone conversion on the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
Optionally, the performing the sound processing by using the target feature parameter to obtain the target voice includes:
and inputting the target spectrum parameters, the target fundamental frequency parameters and the target aperiodic component parameters into a preset vocoder for performing sound production processing to obtain target voice.
Optionally, the performing the tone conversion on the feature vector to obtain the target feature parameter includes:
and performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
Optionally, the preset U-shaped structure tone conversion model includes a pooling layer and a deconvolution layer, where an operation core of the pooling layer includes a binary context prediction model, and an operation core of the deconvolution layer includes a binary context prediction model.
Optionally, performing tone conversion on the feature vector by using a tone conversion model with a preset U-shaped structure to obtain a target feature parameter, including:
in the pooling layer of the preset U-shaped structure tone conversion model, performing down-sampling processing on the feature vector by adopting a binary context prediction model to obtain a first intermediate vector;
in the deconvolution layer of the preset tone conversion model with the U-shaped structure, the binary context prediction model is adopted to carry out upsampling processing on the first intermediate vector to obtain a second intermediate vector;
and converting the second intermediate vector to obtain a target characteristic parameter.
Optionally, the performing, in the pooling layer of the preset U-shaped structure tone conversion model, downsampling the feature vector by using a binary context prediction model to obtain a first intermediate vector includes:
and in the pooling layer of the preset tone conversion model with the U-shaped structure, predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model to obtain a first intermediate vector.
Optionally, the performing, in the deconvolution layer of the preset U-shaped tone conversion model, upsampling the first intermediate vector by using the binary context prediction model to obtain a second intermediate vector includes:
and in the deconvolution layer of the preset tone conversion model with the U-shaped structure, predicting a vector at one moment according to the first intermediate vector at two adjacent moments by adopting the binary context prediction model to obtain a second intermediate vector.
Optionally, weights of the pooling layer and the deconvolution layer are shared.
The embodiment of the invention also discloses a tone conversion device, which comprises:
the speech acquisition module is used for acquiring the voice to be converted;
the characteristic parameter extraction module is used for extracting various characteristic parameters of the voice to be converted;
the characteristic parameter combination module is used for combining the various characteristic parameters to obtain a characteristic vector;
the tone conversion module is used for carrying out tone conversion on the feature vector to obtain a target feature parameter;
and the sound processing module is used for performing sound processing by adopting the target characteristic parameters to obtain target voice.
Optionally, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the characteristic parameter combination module comprises:
the spectrum parameter extraction submodule is used for extracting the acoustic characteristics of the first spectrum parameter to obtain a second spectrum parameter, and the second spectrum parameter corresponds to the sounding content of the voice to be converted;
and the characteristic parameter combination submodule is used for combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a characteristic vector.
Optionally, the tone conversion module includes:
and the first tone conversion submodule is used for carrying out tone conversion on the characteristic vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
Optionally, the utterance processing module includes:
and the sound production processing submodule is used for inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder for sound production processing to obtain target voice.
Optionally, the tone conversion module includes:
and the second tone conversion submodule is used for performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
Optionally, the preset U-shaped structure tone conversion model includes a pooling layer and a deconvolution layer, where an operation core of the pooling layer includes a binary context prediction model, and an operation core of the deconvolution layer includes a binary context prediction model.
Optionally, the second tone conversion sub-module includes:
the down-sampling processing unit is used for performing down-sampling processing on the feature vector by adopting a binary context prediction model in a pooling layer of the preset U-shaped structure tone conversion model to obtain a first intermediate vector;
the up-sampling processing unit is used for performing up-sampling processing on the first intermediate vector by adopting the binary context prediction model in the deconvolution layer of the preset tone conversion model with the U-shaped structure to obtain a second intermediate vector;
and the conversion unit is used for converting the second intermediate vector to obtain the target characteristic parameter.
Optionally, the down-sampling processing unit includes:
and the down-sampling processing subunit is used for predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model in a pooling layer of the preset U-shaped structure tone conversion model to obtain a first intermediate vector.
Optionally, the upsampling processing unit includes:
and the up-sampling processing subunit is used for predicting a vector at one moment according to the first intermediate vector at two adjacent moments by adopting the binary context prediction model in the deconvolution layer of the preset tone conversion model with the U-shaped structure to obtain a second intermediate vector.
Optionally, weights of the pooling layer and the deconvolution layer are shared.
The embodiment of the invention also discloses an electronic device, which comprises:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform a method according to any one of the embodiments of the invention.
Embodiments of the present invention also disclose a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform the method according to any one of the embodiments of the present invention.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, the voice to be converted is acquired, a plurality of characteristic parameters of the voice to be converted are extracted, the plurality of characteristic parameters are combined to obtain a feature vector, the feature vector is subjected to tone conversion to obtain target characteristic parameters, and the target characteristic parameters are used for sound production processing to obtain the target voice. In this way, tone conversion can be performed on the plurality of characteristic parameters of the voice to be converted, so that they are thoroughly converted into the characteristic parameters of the target person, which improves the naturalness and stability of the conversion result and allows the converted voice to retain characteristics of the original speaker such as tone of voice and intonation.
Drawings
FIG. 1 is a block diagram of a tone conversion system according to the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a method of tone conversion of the present invention;
FIG. 3 is a block diagram of a binary context prediction model according to the present invention;
fig. 4 is a block diagram of a tone conversion apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Tone conversion method based on PPGs (Phonetic PosteriorGrams): this method introduces speech recognition. It first extracts, by means of speech recognition, basic pronunciation features that carry no personal characteristics, and then converts those basic pronunciation features to the specific target person. As shown in fig. 1, the system as a whole comprises three parts, separated by dotted lines in the figure: an ASR (Automatic Speech Recognition) part, a conversion model and a Vocoder. The first two parts are the model training steps, which respectively train an acoustic model for speech recognition and a tone conversion model for spectral parameters; the third part is the actual tone conversion process after model training. The ASR is responsible for extracting from the speech an acoustic feature that is irrelevant to the speaker and only reflects the pronunciation content, called PPGs; the conversion model is responsible for converting the PPGs into the spectral parameters of the specific person, and the generated spectral parameters are then fed into the vocoder for sound production together with other parameters of the input speech such as Log F0 and AP. The tone conversion process comprises the following steps:
1) Input the voice to be converted into a speech-signal parameter extraction algorithm and extract two sets of parameters. The first set uses the feature pre-extraction module of the speech recognition system to extract the spectral parameter MFCC (Mel Frequency Cepstrum Coefficient) for the next speech recognition step. The second set uses the parameter extraction algorithm of a vocoder from the speech synthesis field to extract the spectral parameters MCEPs, the fundamental frequency F0 (whose logarithm gives Log F0) and the aperiodic component AP. (After the tone conversion in the subsequent steps, the MCEPs are sent back to the vocoder's reconstruction and synthesis algorithm together with Log F0 and AP to obtain speech, and that speech sounds like the converted timbre.)
2) Send the MFCC into the acoustic model of the speech recognition system to obtain the PPGs.
3) Send the PPGs into the tone conversion model to obtain the MCEPs of the target person.
4) Apply a simple linear transformation to the Log F0 obtained in 1); for example, compute in advance the difference between the global mean Log F0 values of the two speakers before and after conversion, and add this difference uniformly to the Log F0 obtained in 1).
5) Feed the MCEPs of the target person into the vocoder together with the Log F0 obtained in 4) and the AP obtained in 1) to obtain the final converted speech.
Here the ASR part trains a DNN recognition model with the Kaldi toolkit; the vocoder uses the traditional signal-processing STRAIGHT toolkit to extract the MCEPs, Log F0 and AP; and the conversion model uses a simple bidirectional LSTM structure to model the conversion relationship from PPGs to MCEPs.
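For illustration, the five steps above can be sketched as a single conversion function. This is a minimal sketch, not the patented implementation: the patent uses Kaldi for the ASR acoustic model, a bidirectional LSTM for the conversion model and STRAIGHT as the vocoder, so every callable passed in below (extract_mfcc, extract_vocoder_params, asr_acoustic_model, conversion_model, vocoder_synthesize) and the precomputed logf0_shift are assumed stand-ins.

```python
import numpy as np
from typing import Callable, Tuple

def convert_timbre(
    wav: np.ndarray,
    sr: int,
    extract_mfcc: Callable[[np.ndarray, int], np.ndarray],                  # step 1, first set
    extract_vocoder_params: Callable[[np.ndarray, int],
                                     Tuple[np.ndarray, np.ndarray, np.ndarray]],  # step 1, second set: (MCEPs, Log F0, AP)
    asr_acoustic_model: Callable[[np.ndarray], np.ndarray],                 # step 2: MFCC -> PPGs
    conversion_model: Callable[[np.ndarray], np.ndarray],                   # step 3: PPGs -> target MCEPs
    vocoder_synthesize: Callable[..., np.ndarray],                          # step 5: parameters -> waveform
    logf0_shift: float,                                                     # step 4: target mean Log F0 minus source mean Log F0
) -> np.ndarray:
    mfcc = extract_mfcc(wav, sr)
    mceps, log_f0, ap = extract_vocoder_params(wav, sr)
    ppgs = asr_acoustic_model(mfcc)
    tgt_mceps = conversion_model(ppgs)
    # Step 4: add the precomputed global-mean difference uniformly to Log F0.
    tgt_log_f0 = log_f0 + logf0_shift
    # Step 5: target MCEPs, shifted Log F0 and the original AP go into the vocoder.
    return vocoder_synthesize(tgt_mceps, tgt_log_f0, ap, sr)
```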
The above scheme can convert the voice of any person into the vocal timbre of a specific target person with the content kept unchanged, and the converted speech is close to the target person in all aspects, including intonation. However, in some scenarios it is more desirable that the converted speech retain the original speaker's tone of voice; for example, speech that was angry before conversion should still sound angry afterwards. In phonetics, the primary factor affecting the tone of voice is intonation, including the overall pitch and its variation pattern; extending this further, if intonation could be controlled at a fine granularity, the result would be singing.
Therefore, in the embodiment of the invention, the three sets of parameters MCEPs, AP and Log F0 can be sent to the tone conversion model together to obtain the three corresponding parameters of the target person, so that the parameters are converted to the target person integrally and thoroughly.
UFANS (U-shaped Fully-parallel Acoustic Neural Structure) is a deep neural network structure oriented to one-dimensional sequence modeling tasks (such as speech and natural language processing), and it has two major features. First, the U-shaped structure: it borrows the U-Net that has been very popular in the image field in recent years; inside the structure, the input size is recursively halved by downsampling round after round and then doubled again by deconvolution, with the result that comes back in each round added as a residual to the input of that round. For each round, two paths of information are therefore summed: one path is the basic convolution, and the other is the information returned after one round of size reduction, which covers a wider receptive field. Second, a fully convolutional network is adopted: the model contains only convolution, deconvolution and Pooling operations, without any RNN (Recurrent Neural Network) infrastructure, so the computation can be fully parallelized and the computation speed is greatly improved.
The forward calculation of UFANS is illustrated below. Assume the input to the model is a matrix of size [T0, D], where T0 is the length in time (e.g., the number of speech frames) and D is the feature dimension of each frame. The calculation flow is as follows:
1) Pad the end of the input with zeros along the time axis to obtain a matrix IN of size [T, D], such that the padded length T is exactly an integer power of 2 (e.g., 4, 8, 16, 32, 64, 128, ...).
2) IN passes through convolutional layer A1 (kernel size 3, output feature dimension F) and its matching activation function to obtain matrix O_A1 of size [T, F].
3) O_A1 passes through average Pooling layer B1 (kernel size 2, stride 2) to obtain matrix O_B1 of size [T/2, F]. The average Pooling layer B1 computes: the average of the first and second frames as the first output moment, the average of the third and fourth frames as the second output moment, the average of the fifth and sixth frames as the third output moment, and so on from front to back, so that every two input moments yield one output moment and the final time length is halved.
4) O_B1 passes through convolutional layer A2 (kernel size 3, output feature dimension F) and its matching activation function to obtain matrix O_A2 of size [T/2, F].
5) O_A2 passes through average Pooling layer B2 (kernel size 2, stride 2) to obtain matrix O_B2 of size [T/4, F].
6) O_B2 passes through deconvolution layer C2 (kernel size 2, stride 2, output feature dimension F) to obtain matrix O_C2 of size [T/2, F]. The deconvolution layer C2 computes: first insert an all-zero vector between every two input moments to obtain a temporary matrix of doubled size, and then perform an ordinary convolution, so the result is twice the input size.
7) Add O_A2 obtained in 4) and O_C2 obtained in 6) (both of length T/2), and pass the sum through convolutional layer D2 (kernel size 3, output feature dimension F) and its matching activation function to obtain matrix O_D2 of size [T/2, F].
8) O_D2 passes through deconvolution layer C1 (kernel size 2, stride 2, output feature dimension F) to obtain matrix O_C1 of size [T, F].
9) Add O_A1 obtained in 2) and O_C1 obtained in 8) (both of length T), and pass the sum through convolutional layer D1 (kernel size 3, output feature dimension F) and its matching activation function to obtain matrix O_D1 of size [T, F].
10) O_D1 passes through the final convolutional layer E (kernel size 3, output feature dimension 2F) and its matching activation function (e.g., tanh) to obtain matrix OUT of size [T, 2F].
It should be noted that the above process describes a structure with only 2 layers inside; in practice the structure is generally designed with more layers, in which case the single-layer process from 4) to 7) is simply iterated for more rounds.
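The two-round forward pass in steps 1) to 10) can be written down directly. The following PyTorch sketch keeps the layer names A1, B1, A2, B2, C2, D2, C1, D1 and E used above; the batch dimension, the use of ReLU as the matching activation for the intermediate layers and the zero padding up to a power of two are the only assumptions added.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UFANSTwoRounds(nn.Module):
    def __init__(self, d_in: int, f: int):
        super().__init__()
        self.a1 = nn.Conv1d(d_in, f, kernel_size=3, padding=1)
        self.b1 = nn.AvgPool1d(kernel_size=2, stride=2)
        self.a2 = nn.Conv1d(f, f, kernel_size=3, padding=1)
        self.b2 = nn.AvgPool1d(kernel_size=2, stride=2)
        self.c2 = nn.ConvTranspose1d(f, f, kernel_size=2, stride=2)
        self.d2 = nn.Conv1d(f, f, kernel_size=3, padding=1)
        self.c1 = nn.ConvTranspose1d(f, f, kernel_size=2, stride=2)
        self.d1 = nn.Conv1d(f, f, kernel_size=3, padding=1)
        self.e = nn.Conv1d(f, 2 * f, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, T0, D]; step 1) pad T0 up to the next power of two, then go channels-first.
        t0 = x.size(1)
        t = 1 << (t0 - 1).bit_length()
        x = F.pad(x, (0, 0, 0, t - t0)).transpose(1, 2)    # IN: [batch, D, T]
        o_a1 = F.relu(self.a1(x))                           # 2)  [batch, F, T]
        o_b1 = self.b1(o_a1)                                # 3)  [batch, F, T/2]
        o_a2 = F.relu(self.a2(o_b1))                        # 4)  [batch, F, T/2]
        o_b2 = self.b2(o_a2)                                # 5)  [batch, F, T/4]
        o_c2 = self.c2(o_b2)                                # 6)  [batch, F, T/2]
        o_d2 = F.relu(self.d2(o_a2 + o_c2))                 # 7)  residual sum, [batch, F, T/2]
        o_c1 = self.c1(o_d2)                                # 8)  [batch, F, T]
        o_d1 = F.relu(self.d1(o_a1 + o_c1))                 # 9)  residual sum, [batch, F, T]
        out = torch.tanh(self.e(o_d1))                      # 10) [batch, 2F, T]
        return out.transpose(1, 2)[:, :t0]                  # drop the padding: [batch, T0, 2F]
```

A deeper UFANS simply repeats the pattern of steps 4) to 7) for more rounds between the outermost down-sampling and up-sampling.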
In the embodiment of the present invention, the original DBLSTM structure used as the tone conversion model in the PPGs-based tone conversion method may be replaced with the UFANS structure, which has better effect and performance. LSTM (Long Short-Term Memory network) is an advanced RNN structure: an internal state unit slowly memorizes the state of a preceding period of time, and as time advances it receives new information and gradually forgets the oldest information. How wide a context the model covers depends on the memory of this state unit, so at any moment only a limited stretch of preceding context can be seen. DBLSTM simply adds together two LSTMs running in opposite directions, front-to-back and back-to-front, with each direction essentially memorizing independently. The principle of UFANS is superior in several respects. First, all basic operations inside the U-shaped structure are essentially convolutions, and a convolution inherently merges forward and backward information, which is better than the two directions of DBLSTM being memorized independently of each other. Second, the U-shaped structure lets each moment simultaneously see context information of different spans, namely [front 1, back 1], [front 2, back 2], [front 4, back 4], [front 8, back 8], [front 16, back 16] and so on, and the fusion weights of the different groups of information can be learned automatically; the deeper the U-shaped structure, the wider the span of covered context, without limit. Finally, in terms of performance, a neural network with an RNN structure has to operate recursively moment by moment from front to back, whereas a network with a convolutional structure can merge the per-position convolution-kernel operations into one matrix operation completed at once, i.e., a parallel operation, and is therefore very fast.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a tone conversion method according to the present invention is shown, which may specifically include the following steps:
step 201, acquiring a voice to be converted;
the voice to be converted may be audio data that needs to be subjected to tone conversion. In the embodiment of the invention, the voice to be converted can be obtained, so that the voice to be converted is input into a pre-trained voice recognition system, and the voice recognition system is adopted to recognize the voice to be converted and convert the tone.
Step 202, extracting a plurality of characteristic parameters of the voice to be converted;
the feature parameter may be a parameter of a key feature of the speech to be converted, for example, the feature parameter may be MFCC (Mel Frequency Cepstrum Coefficient), a person generates sound through a sound channel, a shape of the sound channel determines what sound is generated, the shape of the sound channel is displayed in an envelope of a speech short-time power spectrum, and MFCC is a feature that accurately describes the envelope; the characteristic parameter can also be a fundamental frequency F0, which is used for representing the vibration frequency of the fundamental tone, and the fundamental frequency F0 determines the height of the voice tone; the characteristic parameter may also be an aperiodic component AP.
Specifically, the speech to be converted may be input into a speech signal parameter extraction algorithm, and the speech signal parameter extraction algorithm is adopted to extract a plurality of feature parameters of the speech to be converted. As an example, the extracted feature parameters may include two sets: the first set can adopt a feature pre-extraction module of a voice recognition system to extract a first spectrum parameter MFCC for subsequent voice recognition; the second set may employ a vocoder's parameter extraction algorithm to extract the spectral parameters MCEPs (Mel-cepstral Coefficients), the fundamental frequency parameter F0 (after extraction of F0, the logarithm of which may be taken to obtain Log F0), and the aperiodic component parameter AP.
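As an illustration of this two-set extraction, the sketch below uses open-source stand-ins (librosa for the MFCC, pyworld and pysptk for the vocoder parameters). The patent itself relies on the ASR front end and the STRAIGHT toolkit, so the library choice, frame settings, cepstral order and all-pass constant here are assumptions, not the patented configuration.

```python
import numpy as np
import librosa
import pyworld as pw
import pysptk

def extract_two_parameter_sets(wav: np.ndarray, sr: int):
    # First set: spectral parameter MFCC for the subsequent speech recognition.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T        # [frames, 13]

    # Second set: vocoder-style parameters MCEPs, Log F0 and aperiodic component AP.
    x = wav.astype(np.float64)
    f0, t = pw.harvest(x, sr)                  # fundamental frequency per frame
    sp = pw.cheaptrick(x, f0, t, sr)           # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, sr)                  # aperiodic component AP
    mceps = pysptk.sp2mc(sp, order=39, alpha=0.58)                # envelope -> mel-cepstra (MCEPs)
    log_f0 = np.log(f0, where=f0 > 0, out=np.zeros_like(f0))      # Log F0 (0 kept for unvoiced frames)
    return mfcc, mceps, log_f0, ap
```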
Step 203, combining the multiple characteristic parameters to obtain a characteristic vector;
the feature parameters extracted for each frame of speech to be converted can be various, and the feature parameters corresponding to each frame of speech to be converted can be spliced together to obtain a long feature vector, so that the feature vector can be used for tone conversion subsequently.
Step 204, performing tone conversion on the feature vector to obtain a target feature parameter;
the target characteristic parameters refer to characteristic parameters corresponding to the voice of the target person, and in the embodiment of the invention, the characteristic vectors can be converted into the target characteristic parameters through tone conversion.
Specifically, the voice recognition system may include a tone conversion model, and the tone conversion model may be used to perform tone conversion on the feature vector to obtain the target feature parameters. Since the feature vector input to the tone conversion model contains a plurality of feature parameters, the plurality of feature parameters can be integrally and thoroughly converted into the feature parameters of the target person, thereby improving the naturalness and stability of the conversion result.
And step 205, performing sound production processing by using the target characteristic parameters to obtain target voice.
The speech recognition system may be coupled to a vocoder that synthesizes the received feature parameters to generate speech. In the embodiment of the present invention, the target feature parameter may be input into a vocoder, and the vocoder may be used to perform a sound production process on the target feature parameter to obtain a target voice, where the target voice may be a voice that matches the timbre of a target person.
In the embodiment of the invention, the voice to be converted is acquired, a plurality of characteristic parameters of the voice to be converted are extracted, the plurality of characteristic parameters are combined to obtain a feature vector, the feature vector is subjected to tone conversion to obtain target characteristic parameters, and the target characteristic parameters are used for sound production processing to obtain the target voice. In this way, tone conversion can be performed on the plurality of characteristic parameters of the voice to be converted, so that they are thoroughly converted into the characteristic parameters of the target person, which improves the naturalness and stability of the conversion result and allows the converted voice to retain characteristics of the original speaker such as tone of voice and intonation.
In a preferred embodiment of the invention, the characteristic parameters comprise a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; said step 203 may comprise the sub-steps of:
extracting acoustic features of the first spectrum parameters to obtain second spectrum parameters, wherein the second spectrum parameters correspond to the sounding contents of the voice to be converted; and combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a feature vector.
Wherein the first spectral parameter may be a MFCC mel-frequency cepstrum parameter.
In the embodiment of the present invention, acoustic features may be further extracted from the first spectral parameters MFCC to obtain second spectral parameters, where the second spectral parameters may be PPGs (Phonetic PosteriorGrams); the PPGs are acoustic features that are irrelevant to the speaker and only represent the pronunciation content, so the second spectral parameters correspond to the pronunciation content of the voice to be converted.
After the second spectral parameters PPGs are extracted, the second spectral parameters, the fundamental frequency parameter and the aperiodic component parameter may be spliced into one long feature vector. In a specific implementation, the logarithm of the fundamental frequency parameter F0 may be taken to obtain Log F0, and then the PPGs, Log F0 and AP of each frame are spliced into one long feature vector.
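A minimal sketch of this splicing step is shown below; the function name and the assumption that the ASR-side PPGs and the vocoder-side Log F0 and AP are already frame-aligned are illustrative, not part of the patent.

```python
import numpy as np

def splice_feature_vectors(ppgs: np.ndarray, log_f0: np.ndarray, ap: np.ndarray) -> np.ndarray:
    # ppgs: [frames, P], log_f0: [frames], ap: [frames, A]
    n = min(len(ppgs), len(log_f0), len(ap))   # trim to a common frame count
    return np.concatenate([ppgs[:n], log_f0[:n, None], ap[:n]], axis=1)   # [frames, P + 1 + A]
```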
In a preferred embodiment of the present invention, the step 204 may comprise the following sub-steps:
and performing tone conversion on the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
The feature vector comprises PPGs, Log F0 and AP of the voice to be converted, and a tone conversion module in the voice recognition system can be adopted to perform tone conversion on the PPGs, Log F0 and AP in the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter. The target spectrum parameters are spectrum parameters MCEPs of the target person, the target fundamental frequency parameters are fundamental frequency parameters Log F0 of the target person, and the target aperiodic component parameters are aperiodic component parameters AP of the target person.
The second spectrum parameter PPGs of the voice to be converted can be converted into the spectrum parameter MCEPs of the target person through the tone conversion module, the Log F0 of the voice to be converted is converted into the Log F0 of the target person, and the AP of the voice to be converted is converted into the AP of the target person.
In a preferred embodiment of the present invention, the step 205 may include the following sub-steps:
and inputting the target spectrum parameters, the target fundamental frequency parameters and the target aperiodic component parameters into a preset vocoder for performing sound production processing to obtain target voice.
The preset vocoder may be a preset module for synthesizing voice. In the embodiment of the present invention, the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter may be input to a preset vocoder, and the vocoder may synthesize the received target spectrum parameter, target fundamental frequency parameter and target aperiodic component parameter to generate a target voice, where the target voice may be a voice conforming to the timbre of a target person.
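As an illustrative sketch of this synthesis step, the code below uses pyworld and pysptk as stand-ins for the preset vocoder (the patent uses STRAIGHT). The mel-cepstrum order and all-pass constant must match the analysis side, and the FFT length and frame period here are assumptions.

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize_target_voice(tgt_mceps, tgt_log_f0, tgt_ap, sr, fft_len=1024):
    # Mel-cepstra back to a spectral envelope (the domain must match the analysis settings).
    sp = pysptk.mc2sp(tgt_mceps, alpha=0.58, fftlen=fft_len)
    # Log F0 back to F0, keeping 0 for unvoiced frames.
    f0 = np.where(tgt_log_f0 > 0, np.exp(tgt_log_f0), 0.0)
    # Sound production: target spectrum, target F0 and target AP give the target voice.
    return pw.synthesize(f0, sp, tgt_ap, sr, frame_period=5.0)
```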
In a preferred embodiment of the present invention, the step 204 may comprise the following sub-steps:
and performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
The preset U-shaped tone conversion model may be a preset tone conversion model with the UFANS structure, and the preset U-shaped tone conversion model is used for performing tone conversion on input data.
In the embodiment of the invention, the preset U-shaped tone conversion model can be used to perform tone conversion on the feature vector to obtain the target feature parameters. Compared with performing tone conversion with a DBLSTM-structured tone conversion model, the tone conversion model with the UFANS structure has better effect and performance: the advantage in effect comes from its wide context view, and the advantage in performance comes from its fully parallel network structure, which further improves the naturalness and stability of the conversion result.
In a preferred embodiment of the present invention, the preset U-shaped structure tone conversion model includes a pooling layer and a deconvolution layer, wherein the operation core of the pooling layer includes a binary context prediction model, and the operation core of the deconvolution layer includes a binary context prediction model.
The binary context prediction model may be a 2-gram prediction model. In the pooling layer, the 2-gram prediction model takes the feature vectors of two moments as input and outputs the feature vector of one moment. Fig. 3 shows a schematic structural diagram of the binary context prediction model according to the embodiment of the present invention, where Input1 and Input2 are the feature vectors at the two moments and Output is the output feature vector; the binary context prediction model can thus be regarded as downsampling the feature vectors of two moments into the feature vector of one moment.
In the embodiment of the invention, the operation core of the pooling layer includes a binary context prediction model, and the operation core of the deconvolution layer includes a binary context prediction model. In the original UFANS structure, the pooling layer downsamples by taking averages and the deconvolution layer upsamples by zero filling, so the processed vectors are biased.
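A minimal sketch of such a binary context prediction model is given below, matching the structure of fig. 3 (Input1 and Input2 in, one Output vector out). The single linear layer with a tanh activation is an assumption; the patent does not fix the internal layout of the operation core.

```python
import torch
import torch.nn as nn

class BinaryContextPredictor(nn.Module):
    """2-gram prediction model: predicts the vector of one moment from two adjacent moments."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, input1: torch.Tensor, input2: torch.Tensor) -> torch.Tensor:
        # input1, input2: [batch, frames, dim] -> predicted vectors: [batch, frames, dim]
        return torch.tanh(self.proj(torch.cat([input1, input2], dim=-1)))
```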
In a preferred embodiment of the invention, the weights of the pooling layer and the deconvolution layer are shared.
Because the pooling layer and the deconvolution layer pay attention to the same context information, the pooling layer and the deconvolution layer can be jointly learned by the same weight, so that the requirement on training data can be reduced, and information with commonality can be jointly mined in the process of down-sampling and up-sampling.
In a preferred embodiment of the present invention, the obtaining the target feature parameter by performing tone conversion on the feature vector using a preset tone conversion model with a U-shaped structure includes:
in the pooling layer of the preset U-shaped structure tone conversion model, performing down-sampling processing on the feature vector by adopting a binary context prediction model to obtain a first intermediate vector; in the deconvolution layer of the preset U-shaped structure tone conversion model, performing up-sampling processing on the first intermediate vector by adopting the binary context prediction model to obtain a second intermediate vector; and converting the second intermediate vector to obtain the target characteristic parameters.
In the embodiment of the invention, a binary context prediction model can be adopted to perform down-sampling processing on the feature vector in a pooling layer of a preset tone conversion model with a U-shaped structure to obtain a first intermediate vector; in an deconvolution layer of a preset tone conversion model with a U-shaped structure, a binary context prediction model is adopted to carry out up-sampling processing on a first intermediate vector to obtain a second intermediate vector; and converting the second intermediate vector to obtain the target characteristic parameters.
In a preferred embodiment of the present invention, the down-sampling the feature vector by using a binary context prediction model in the pooling layer of the preset U-shaped structure tone color conversion model to obtain a first intermediate vector includes:
and in the pooling layer of the preset tone conversion model with the U-shaped structure, predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model to obtain a first intermediate vector.
Specifically, in the pooling layer of the preset U-shaped tone conversion model, a binary context prediction model can be used to predict the vector at one moment from the feature vectors at two adjacent moments, obtaining the first intermediate vector. In the original UFANS structure, the operation core of the Average Pooling layer contains no binary context prediction model and simply computes the average of the feature vectors at two adjacent moments. After the binary context prediction model is added to the operation core of the Average Pooling layer, the downsampled result is instead predicted by the binary context prediction model from the feature vectors at the two adjacent moments. The original purpose of the Average Pooling layer is to halve the length by downsampling; predicting the result flexibly with the binary context prediction model can learn the relationship between adjacent moments more flexibly and accurately than simply and crudely averaging the two values.
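A sketch of this predictive pooling follows: each pair of adjacent frames is fed to the binary context prediction model and yields one downsampled frame, halving the time length. The even input length and the use of a generic predictor module are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class PredictivePooling(nn.Module):
    def __init__(self, predictor: nn.Module):
        super().__init__()
        self.predictor = predictor   # e.g. a binary context prediction model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, T, dim] with T even -> [batch, T/2, dim]
        left, right = x[:, 0::2], x[:, 1::2]     # frames 1,3,5,... and 2,4,6,...
        return self.predictor(left, right)       # one predicted frame per adjacent pair
```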
In a preferred embodiment of the present invention, the upsampling the first intermediate vector by using the binary context prediction model in the deconvolution layer of the preset U-shaped tone conversion model to obtain a second intermediate vector includes:
and in the deconvolution layer of the preset tone conversion model with the U-shaped structure, predicting a vector at one moment according to the first intermediate vector at two adjacent moments by adopting the binary context prediction model to obtain a second intermediate vector.
Specifically, in the deconvolution layer of the preset U-shaped tone conversion model, the binary context prediction model is used to predict the vector at one moment from the first intermediate vectors at two adjacent moments, obtaining the second intermediate vector. The Deconvolution layer was originally designed to double the length (upsampling), but the information for the extra positions comes from nowhere; it could only be filled in by the crude method of zero padding and then smoothed somewhat by the subsequent convolution. After the binary context prediction model is added to the operation core of the Deconvolution layer, each such position is filled with a result genuinely predicted from its context; this result has real meaning, and its information content can be traced back to the real statistical distribution seen by the Average Pooling layer during downsampling. Because the downsampling and the upsampling share one set of binary context prediction model weights and are trained jointly, the two reinforce each other with an effect similar to adversarial generation, making the convergence of the preset U-shaped structure tone conversion model more accurate.
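The upsampling counterpart can be sketched in the same way: instead of inserting zero vectors, the frame between each pair of adjacent frames is predicted by the binary context prediction model, and passing the same predictor instance used for pooling realizes the shared weights described above. How the boundary frame is handled when interleaving is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PredictiveUpsampling(nn.Module):
    def __init__(self, predictor: nn.Module):
        super().__init__()
        self.predictor = predictor   # pass the same instance as in PredictivePooling to share weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, T, dim] -> [batch, 2T, dim]
        between = self.predictor(x[:, :-1], x[:, 1:])     # predict the frame between each adjacent pair
        between = torch.cat([between, x[:, -1:]], dim=1)  # repeat the last frame to reach length T
        out = torch.stack([x, between], dim=2)            # interleave original and predicted frames
        return out.flatten(1, 2)                          # [batch, 2T, dim]
```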
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 4, a block diagram of a tone conversion apparatus according to an embodiment of the present invention is shown, and may specifically include the following modules:
a speech acquisition module 401, configured to acquire the voice to be converted;
a feature parameter extraction module 402, configured to extract a plurality of feature parameters of the speech to be converted;
a feature parameter combination module 403, configured to combine the multiple feature parameters to obtain a feature vector;
a tone conversion module 404, configured to perform tone conversion on the feature vector to obtain a target feature parameter;
and the sound production processing module 405 is configured to perform sound production processing using the target characteristic parameters to obtain the target voice.
In a preferred embodiment of the invention, the characteristic parameters comprise a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the feature parameter combination module 403 includes:
the spectrum parameter extraction submodule is used for extracting the acoustic characteristics of the first spectrum parameter to obtain a second spectrum parameter, and the second spectrum parameter corresponds to the sounding content of the voice to be converted;
and the characteristic parameter combination submodule is used for combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a characteristic vector.
In a preferred embodiment of the present invention, the tone conversion module 404 includes:
and the first tone conversion submodule is used for carrying out tone conversion on the characteristic vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
In a preferred embodiment of the present invention, the sound production processing module 405 includes:
and the sound production processing submodule is used for inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder for sound production processing to obtain target voice.
In a preferred embodiment of the present invention, the tone conversion module 404 includes:
and the second tone conversion submodule is used for performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
In a preferred embodiment of the present invention, the preset U-shaped structure tone conversion model includes a pooling layer and a deconvolution layer, wherein the operation core of the pooling layer includes a binary context prediction model, and the operation core of the deconvolution layer includes a binary context prediction model.
In a preferred embodiment of the present invention, the second tone conversion sub-module includes:
the down-sampling processing unit is used for performing down-sampling processing on the feature vector by adopting a binary context prediction model in a pooling layer of the preset U-shaped structure tone conversion model to obtain a first intermediate vector;
the up-sampling processing unit is used for performing up-sampling processing on the first intermediate vector by adopting the binary context prediction model in the deconvolution layer of the preset tone conversion model with the U-shaped structure to obtain a second intermediate vector;
and the conversion unit is used for converting the second intermediate vector to obtain the target characteristic parameter.
In a preferred embodiment of the present invention, the down-sampling processing unit includes:
and the down-sampling processing subunit is used for predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model in a pooling layer of the preset U-shaped structure tone conversion model to obtain a first intermediate vector.
In a preferred embodiment of the present invention, the up-sampling processing unit includes:
and the up-sampling processing subunit is used for predicting a vector at one moment according to the first intermediate vector at two adjacent moments by adopting the binary context prediction model in the deconvolution layer of the preset tone conversion model with the U-shaped structure to obtain a second intermediate vector.
In a preferred embodiment of the invention, the weight of the pooling layer and the deconvolution layer are shared.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention provides an electronic device, including:
one or more processors; and one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform the method of any of the embodiments of the invention.
Embodiments of the present invention disclose a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform a method according to any one of the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above detailed description is provided for a tone conversion method and a tone conversion apparatus provided by the present invention, and the principle and the implementation of the present invention are explained by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (13)

1. A method of tone color conversion, comprising:
acquiring a voice to be converted;
extracting a plurality of characteristic parameters of the voice to be converted;
combining the multiple characteristic parameters to obtain a characteristic vector;
performing tone conversion on the feature vector to obtain a target feature parameter;
and performing sound production processing by adopting the target characteristic parameters to obtain target voice.
2. The method of claim 1, wherein the characteristic parameters include a first spectral parameter, a fundamental frequency parameter, and an aperiodic component parameter; the combining the plurality of feature parameters to obtain a feature vector comprises:
extracting acoustic features of the first spectrum parameters to obtain second spectrum parameters, wherein the second spectrum parameters correspond to the sounding contents of the voice to be converted;
and combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a feature vector.
3. The method of claim 2, wherein the performing the timbre conversion on the feature vector to obtain a target feature parameter comprises:
and performing tone conversion on the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
4. The method of claim 3, wherein performing voicing processing using the target feature parameters to obtain a target voice comprises:
inputting the target spectral parameter, the target fundamental frequency parameter, and the target aperiodic component parameter into a preset vocoder for voicing processing to obtain the target voice.
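For the voicing step of claim 4, a minimal sketch assuming WORLD as the preset vocoder: the converted (coded) parameters are decoded back to full-resolution envelopes and fed to the synthesizer; array shapes follow the combination sketch above and are illustrative.

import numpy as np
import pyworld as pw
import soundfile as sf

def vocode(tgt_coded_sp, tgt_f0, tgt_coded_ap, fs, out_wav, frame_period=5.0):
    fft_size = pw.get_cheaptrick_fft_size(fs)
    # Decode the compact target parameters back to full spectral and aperiodicity envelopes.
    sp = pw.decode_spectral_envelope(np.ascontiguousarray(tgt_coded_sp, np.float64), fs, fft_size)
    ap = pw.decode_aperiodicity(np.ascontiguousarray(tgt_coded_ap, np.float64), fs, fft_size)
    # The preset vocoder turns the target parameters into the target voice waveform.
    y = pw.synthesize(np.ascontiguousarray(tgt_f0, np.float64), sp, ap, fs, frame_period)
    sf.write(out_wav, y, fs)
    return y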
5. The method of claim 1, wherein performing tone conversion on the feature vector to obtain target feature parameters comprises:
performing tone conversion on the feature vector using a preset tone conversion model with a U-shaped structure to obtain the target feature parameters.
6. The method of claim 5, wherein the preset tone conversion model with the U-shaped structure comprises a pooling layer and a deconvolution layer, the kernel of the pooling layer comprising a binary context prediction model and the kernel of the deconvolution layer comprising a binary context prediction model.
7. The method of claim 6, wherein performing tone conversion on the feature vector using the preset tone conversion model with the U-shaped structure to obtain the target feature parameters comprises:
in the pooling layer of the preset tone conversion model with the U-shaped structure, down-sampling the feature vector using the binary context prediction model to obtain a first intermediate vector;
in the deconvolution layer of the preset tone conversion model with the U-shaped structure, up-sampling the first intermediate vector using the binary context prediction model to obtain a second intermediate vector; and
converting the second intermediate vector to obtain the target feature parameters.
8. The method of claim 7, wherein down-sampling the feature vector using the binary context prediction model in the pooling layer of the preset tone conversion model with the U-shaped structure to obtain a first intermediate vector comprises:
in the pooling layer of the preset tone conversion model with the U-shaped structure, using the binary context prediction model to predict a vector at one moment from the feature vectors at two adjacent moments, so as to obtain the first intermediate vector.
9. The method of claim 7, wherein up-sampling the first intermediate vector using the binary context prediction model in the deconvolution layer of the preset tone conversion model with the U-shaped structure to obtain a second intermediate vector comprises:
in the deconvolution layer of the preset tone conversion model with the U-shaped structure, using the binary context prediction model to predict a vector at one moment from the first intermediate vectors at two adjacent moments, so as to obtain the second intermediate vector.
10. The method of claim 6, wherein weights of the pooling layer and the deconvolution layer are shared.
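Claims 5-10 describe a U-shaped conversion model in which both the pooling (down-sampling) stage and the deconvolution (up-sampling) stage are driven by a binary context prediction model, i.e. a learned mapping from two adjacent frames to one predicted frame, with the weights shared between the two stages. The PyTorch sketch below is only one possible reading of that structure, offered as an illustration; the linear two-frame predictor, the GRU bottleneck, the layer sizes, and the single down/up level are all assumptions.

import torch
import torch.nn as nn

class BinaryContextPredictor(nn.Module):
    # Predicts one vector from two adjacent frames: out[t] = f(x[t], x[t+1]).
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, pairs):                  # pairs: (batch, n_pairs, 2*dim)
        return torch.tanh(self.proj(pairs))

class UShapedConverter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.predictor = BinaryContextPredictor(dim)  # shared by pooling and deconvolution
        self.bottleneck = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def pool(self, x):                         # x: (batch, frames, dim)
        x = x[:, : x.size(1) // 2 * 2]         # keep an even number of frames
        pairs = x.reshape(x.size(0), -1, 2 * x.size(2))  # non-overlapping adjacent-frame pairs
        return self.predictor(pairs)           # first intermediate vector, half the frames

    def unpool(self, h):                       # h: (batch, frames/2, dim)
        h_pad = torch.cat([h, h[:, -1:]], dim=1)                   # repeat last frame for pairing
        pairs = torch.cat([h_pad[:, :-1], h_pad[:, 1:]], dim=-1)   # adjacent-moment pairs
        mids = self.predictor(pairs)            # predicted in-between frames
        out = torch.stack([h, mids], dim=2)     # interleave original and predicted frames
        return out.reshape(h.size(0), -1, h.size(2))  # second intermediate vector, original rate

    def forward(self, feats):                  # feats: (batch, frames, dim)
        h = self.pool(feats)
        h, _ = self.bottleneck(h)
        h = self.unpool(h)
        return self.out(h)                     # target feature parameters per frame

With an even number of input frames, UShapedConverter(dim)(feats) returns a tensor of the same shape as feats, which could then be split back into target spectral, fundamental frequency, and aperiodic streams; because the same BinaryContextPredictor instance serves both stages, the weight sharing of claim 10 falls out directly.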
11. A tone color conversion apparatus, comprising:
a voice acquisition module, configured to acquire a voice to be converted;
a feature parameter extraction module, configured to extract a plurality of feature parameters of the voice to be converted;
a feature parameter combination module, configured to combine the plurality of feature parameters to obtain a feature vector;
a tone conversion module, configured to perform tone conversion on the feature vector to obtain target feature parameters; and
a voicing processing module, configured to perform voicing processing using the target feature parameters to obtain a target voice.
12. An electronic device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-10.
13. A computer-readable storage medium having stored thereon instructions, which when executed by one or more processors, cause the processors to perform the method of any one of claims 1-10.
CN202010889099.0A 2020-08-28 2020-08-28 Tone color conversion method and device Active CN112216293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010889099.0A CN112216293B (en) 2020-08-28 2020-08-28 Tone color conversion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010889099.0A CN112216293B (en) 2020-08-28 2020-08-28 Tone color conversion method and device

Publications (2)

Publication Number Publication Date
CN112216293A true CN112216293A (en) 2021-01-12
CN112216293B CN112216293B (en) 2024-08-02

Family

ID=74058954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010889099.0A Active CN112216293B (en) 2020-08-28 2020-08-28 Tone color conversion method and device

Country Status (1)

Country Link
CN (1) CN112216293B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180012613A1 (en) * 2016-07-11 2018-01-11 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-to-One Speech Conversion Method Based on Speech Posterior Probability
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
US20190279361A1 (en) * 2018-03-07 2019-09-12 University Of Virginia Patent Foundation Automatic quantification of cardiac mri for hypertrophic cardiomyopathy
CN108876792A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
KR20200084443A (en) * 2018-12-26 2020-07-13 충남대학교산학협력단 System and method for voice conversion
CN110246488A (en) * 2019-06-14 2019-09-17 苏州思必驰信息科技有限公司 Half optimizes the phonetics transfer method and device of CycleGAN model
CN110910413A (en) * 2019-11-28 2020-03-24 中国人民解放军战略支援部队航天工程大学 A U-Net-based ISAR Image Segmentation Method
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE HUI; CHEN SHENG: "Automatic PET tumor segmentation with an improved pre-trained encoder U-Net model", Journal of Image and Graphics (中国图象图形学报), no. 01, 16 January 2020 (2020-01-16) *
HU WENJUN; MA XIULI: "Context-based multi-path spatial encoding method for image semantic segmentation", Industrial Control Computer (工业控制计算机), no. 08, 25 August 2020 (2020-08-25) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093387A (en) * 2021-11-19 2022-02-25 北京跳悦智能科技有限公司 Sound conversion method and system for modeling tone and computer equipment
CN114093387B (en) * 2021-11-19 2024-07-26 北京跳悦智能科技有限公司 Sound conversion method and system for modeling tone and computer equipment
CN114220456A (en) * 2021-11-29 2022-03-22 北京捷通华声科技股份有限公司 Method, device and electronic device for generating speech synthesis model
CN114283825A (en) * 2021-12-24 2022-04-05 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112216293B (en) 2024-08-02

Similar Documents

Publication Publication Date Title
CN109147758B (en) Speaker voice conversion method and device
EP3895159B1 (en) Multi-speaker neural text-to-speech synthesis
US10535336B1 (en) Voice conversion using deep neural network with intermediate voice training
Takamichi et al. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
US7580839B2 (en) Apparatus and method for voice conversion using attribute information
CN112489629B (en) Voice transcription model, method, medium and electronic equipment
WO2018151125A1 (en) Word vectorization model learning device, word vectorization device, speech synthesis device, method for said devices, and program
US12046226B2 (en) Text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score
JP2008203543A (en) Voice quality conversion apparatus and voice synthesizer
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
CN112216293B (en) Tone color conversion method and device
Laskar et al. Comparing ANN and GMM in a voice conversion framework
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
CN111508470A (en) Training method and device of speech synthesis model
CN116783647A (en) Generating diverse and natural text-to-speech samples
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN112002302A (en) Speech synthesis method and device
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN117711371A (en) Speech synthesis method, device, electronic equipment and storage medium
Lee MLP-based phone boundary refining for a TTS database
CN116168678A (en) Speech synthesis method, device, computer equipment and storage medium
Neekhara et al. SelfVC: Voice conversion with iterative refinement using self transformations
Zhao et al. Research on voice cloning with a few samples

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant