
CN114283782A - Speech synthesis method and apparatus, electronic device, and storage medium - Google Patents

Speech synthesis method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
CN114283782A
CN114283782A (application number CN202111665515.XA)
Authority
CN
China
Prior art keywords
attribute
voice
attributes
target
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111665515.XA
Other languages
Chinese (zh)
Inventor
刘利娟
胡亚军
江源
潘嘉
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111665515.XA priority Critical patent/CN114283782A/en
Publication of CN114283782A publication Critical patent/CN114283782A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis method and apparatus, an electronic device, and a storage medium. The speech synthesis method includes: extracting pronunciation attribute features of a text to be synthesized; acquiring target attribute features of each of several voice attributes based on the target categories of the text to be synthesized on those voice attributes; and synthesizing the speech of the text to be synthesized based on the pronunciation attribute features and the target attribute features. This scheme improves the degree of freedom of speech synthesis while reducing its cost.

Description

Speech synthesis method and apparatus, electronic device, and storage medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Speech synthesis is one of the core technologies for realizing human-computer interaction. With its continuous development and improvement, speech synthesis is widely applied in many aspects of social life, including public services (information broadcasting, intelligent customer service, and the like), intelligent hardware (smart speakers, intelligent robots, and the like), smart transportation (voice navigation, intelligent in-vehicle devices, and the like), education (smart classrooms, foreign-language learning, and the like), and entertainment (audiobooks, film and television dubbing, virtual avatars, and the like).
At present, speech synthesis can approach the level of real human vocalization, and improving the degree of freedom of synthesized speech while reducing the construction cost of the synthesis system has become one of the research hotspots in the field. One approach is to have the same speaker record voice data with different emotions and different styles and to model on that data, but this places high demands on the speaker's ability, and finding such a speaker and recording the audio is difficult. In view of the above, improving the degree of freedom of speech synthesis while reducing its cost is an urgent need.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a speech synthesis method and apparatus, an electronic device, and a storage medium, which can improve the degree of freedom of speech synthesis and reduce its cost.
In order to solve the above technical problem, a first aspect of the present application provides a speech synthesis method, including: extracting pronunciation attribute characteristics of a text to be synthesized; acquiring target attribute characteristics of various voice attributes based on target categories of the text to be synthesized on the plurality of voice attributes respectively; and synthesizing to obtain the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
In order to solve the above technical problem, a second aspect of the present application provides a speech synthesis method, including: extracting pronunciation attribute characteristics of the voice to be processed; acquiring target attribute characteristics of various voice attributes based on target categories of voices to be processed on a plurality of voice attributes respectively; and synthesizing to obtain the synthesized voice of the voice to be processed based on the pronunciation attribute characteristics and the target attribute characteristics.
In order to solve the above technical problem, a third aspect of the present application provides a speech synthesis apparatus, including an extraction module, an acquisition module, and a synthesis module, where the extraction module is configured to extract pronunciation attribute features of a text to be synthesized; the acquisition module is used for acquiring target attribute characteristics of various voice attributes based on target categories of the text to be synthesized on a plurality of voice attributes respectively; and the synthesis module is used for synthesizing and obtaining the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
In order to solve the above technical problem, a fourth aspect of the present application provides a speech synthesis apparatus, including an extraction module, an acquisition module, and a synthesis module, where the extraction module is configured to extract pronunciation attribute features of a text to be synthesized; the acquisition module is used for acquiring target attribute characteristics of various voice attributes based on target categories of the text to be synthesized on a plurality of voice attributes respectively; and the synthesis module is used for synthesizing and obtaining the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
In order to solve the above technical problem, a fifth aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech synthesis method in the first aspect or the second aspect.
In order to solve the above technical problem, a sixth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor for implementing the speech synthesis method according to the first or second aspect.
According to the above scheme, the pronunciation attribute features of the text to be synthesized are extracted, and the target attribute features of the various voice attributes are obtained based on the target categories of the text to be synthesized on several voice attributes; finally, the synthesized speech of the text to be synthesized is obtained based on the pronunciation attribute features and the target attribute features. On one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories that the text to be synthesized takes on those voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
Drawings
FIG. 1 is a schematic flow chart diagram of an embodiment of a speech synthesis method of the present application;
FIG. 2 is a schematic diagram of attribute feature extraction in step S12 in FIG. 1;
FIG. 3 is a schematic diagram illustrating attribute feature synthesis in step S13 of FIG. 1;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of the speech synthesis method of the present application;
FIG. 5 is a block diagram of an embodiment of a speech synthesis apparatus according to the present application;
FIG. 6 is a block diagram of another embodiment of the speech synthesis apparatus of the present application;
FIG. 7 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between related objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a speech synthesis method according to an embodiment of the present application.
Specifically, the method may include the steps of:
step S11: and extracting pronunciation attribute characteristics of the text to be synthesized.
In one implementation scenario, the pronunciation attribute may include, but is not limited to, pronunciation content information, and is not limited herein. Specifically, a pronunciation sequence of the text to be synthesized may be extracted, the pronunciation sequence including a plurality of pronunciation marks, and each pronunciation mark may be feature-coded to obtain the pronunciation attribute features of the text to be synthesized. Illustratively, the pronunciation content of the text to be synthesized may be labeled with the International Phonetic Alphabet (IPA) to obtain the pronunciation sequence, and each pronunciation mark may be feature-coded with one-hot coding. For the specific process, reference may be made to the technical details of IPA, one-hot coding, and the like, which are not repeated here.
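As a rough illustration of this step, the following sketch (Python; the miniature IPA inventory and helper names are hypothetical, not taken from the disclosure) encodes a pronunciation sequence into one-hot pronunciation attribute features:

```python
import numpy as np

# Hypothetical miniature IPA symbol inventory; a real system would cover the full alphabet.
IPA_SYMBOLS = ["p", "t", "k", "a", "i", "u", "n", "s"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(IPA_SYMBOLS)}

def pronunciation_features(ipa_sequence):
    """Encode each IPA pronunciation mark of the text to be synthesized as a one-hot vector."""
    feats = np.zeros((len(ipa_sequence), len(IPA_SYMBOLS)), dtype=np.float32)
    for pos, symbol in enumerate(ipa_sequence):
        feats[pos, SYMBOL_TO_ID[symbol]] = 1.0
    return feats  # shape: (sequence length, symbol inventory size)

# Example: a short pronunciation sequence produced by an assumed IPA front end.
print(pronunciation_features(["t", "a", "n", "i"]).shape)  # (4, 8)
```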
Step S12: and acquiring target attribute characteristics of various voice attributes based on target categories of the text to be synthesized on the plurality of voice attributes respectively.
In one implementation scenario, the several voice attributes include at least one of a channel attribute, a dialect attribute, a timbre attribute, an emotion attribute, and a style attribute. In this way, the speech information that is normally coupled together is decoupled into distinct voice attributes, which improves both the diversity and the efficiency of speech synthesis.
In one implementation scenario, the difficulty of controlling voice attributes is that many kinds of information are coupled together in speech; to freely combine and control the attribute information contained in speech, the voice attributes must be distinguished and the coupled information decoupled. Speech contains rich attribute information, and to control it the categories of voice attributes need to be determined first; the specific categories may be set according to the application scenario and are not limited here. The present application describes the speech synthesis process by taking the channel attribute, dialect attribute, timbre attribute, emotion attribute, and style attribute as examples.
In a specific implementation scenario, the channel attribute refers to the environment in which the speech is produced, such as a recording studio, a conference room, or an in-vehicle device; different recording environments give the speech different channel attributes and different audiences, and controlling the channel attribute during generation makes the synthesized speech better match the scene and sound more real. The dialect attribute refers to different dialects of a language, such as Mandarin and the Northeastern dialect in Chinese, or British and American English in English; the dialect category is determined by the language scene in use. The timbre attribute refers to the vocal timbre information of the person contained in the speech and is used to distinguish speaker identity. The emotion attribute refers to the emotion category contained in the speech, for example neutral, happy, sad, or angry. The style attribute refers to the speaking style of the speech, for example the styles of different scenes such as news, interaction, customer service, or novel reading. After the voice attributes are determined, each attribute may be further divided into categories; for example, for the timbre attribute, each speaker may be a category, or each gender, or each age interval, and the specific division may be set according to the actual situation and is not limited here. Attributes with mature category divisions, such as emotion, can be determined according to application requirements.
In one implementation scenario, each voice attribute has multiple categories, and voice data needs to be collected to ensure the accuracy of speech synthesis; the collection is organized by the categories of the voice attributes so that all categories are covered. To ensure that the voice attributes can be decoupled accurately across categories, voice data from at least two speakers should be collected for each attribute category of each voice attribute. The voice data may be recorded by speakers, gathered from existing sources, or obtained by processing recordings from on-site recording devices. The specific collection method may be chosen according to the actual application and is not limited here.
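For illustration only, a small helper (with a hypothetical data layout, not defined by the disclosure) could check that each attribute category in a collection plan is covered by voice data from at least two speakers, as suggested above:

```python
from collections import defaultdict

def check_attribute_coverage(recordings, min_speakers=2):
    """Verify that every attribute category is covered by speech from at least `min_speakers` speakers."""
    speakers = defaultdict(set)
    for rec in recordings:  # rec: {"speaker": ..., "attributes": {"emotion": "happy", ...}}
        for attr, category in rec["attributes"].items():
            speakers[(attr, category)].add(rec["speaker"])
    return {key: len(s) >= min_speakers for key, s in speakers.items()}

# Hypothetical check over a small collection plan.
plan = [
    {"speaker": "spk1", "attributes": {"emotion": "happy", "dialect": "mandarin"}},
    {"speaker": "spk2", "attributes": {"emotion": "happy", "dialect": "mandarin"}},
]
print(check_attribute_coverage(plan))
```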
In one implementation scenario, acquiring the target attribute features of the various voice attributes based on the target categories of the text to be synthesized on the several voice attributes at least includes: performing feature modeling on voice data related to the attribute categories of each voice attribute to obtain an attribute-space probability distribution model for each voice attribute; and sampling from the attribute-space probability distribution model corresponding to a voice attribute, based on the target category of the text to be synthesized on that attribute, to obtain the target attribute feature corresponding to the attribute. In this way, feature modeling of the voice data related to the attribute categories yields an attribute-space probability distribution model in which different attribute categories occupy different regions, giving distribution regions for the different categories under the same voice attribute; this facilitates sampling during speech synthesis and improves its efficiency.
In one implementation scenario, each voice attribute has at least one attribute category. Performing feature modeling on the voice data related to the attribute categories of each voice attribute to obtain the attribute-space probability distribution model then includes: for each attribute category under a voice attribute, extracting from the voice data related to that category the sample attribute features associated with the voice attribute; and constructing the attribute-space probability distribution model of the voice attribute from the sample attribute features of the various categories under that attribute. With the attribute-space probability distribution model constructed, the corresponding attribute features can be sampled directly from the model during speech synthesis, improving synthesis efficiency.
In one implementation scenario, the target attribute features of the various voice attributes may be obtained either by manual labeling or by performing feature modeling on voice data related to the categories of each voice attribute to obtain attribute-space probability distribution models and then sampling the distribution of the target voice attribute. The specific way of obtaining the target attribute features may be chosen according to the actual application scenario and is not limited here.
In a specific implementation scenario, before feature modeling is performed on the voice data related to the attribute categories of the voice attributes, the features of each attribute category need to be extracted from the voice data; these attribute features are then modeled to obtain the attribute-space probability distribution models. The attribute features may be extracted with a dedicated feature extraction model or by manual labeling; the specific extraction method may be set according to the actual situation and is not limited here.
In a specific implementation scenario, the attribute-category features are extracted by manual labeling: a person listens to the speech and labels the corresponding attribute, the labels are encoded in a chosen coding scheme, and the encoded result serves as the representation of that attribute information. For example, if the voice attributes include a pronunciation attribute, the pronunciation content of the speech may be labeled with the International Phonetic Alphabet (IPA) to obtain a pronunciation-sequence label, and each pronunciation in the sequence may then be one-hot encoded to obtain a coded representation sequence. Likewise, emotion attributes covering categories such as neutral, happy, sad, and angry may be one-hot encoded: neutral as 1000, happy as 0100, sad as 0010, and angry as 0001. Emotion attributes may also be encoded with Arabic numerals: neutral as 1, happy as 2, sad as 3, and angry as 4. Different encoding schemes may be chosen for manual labeling according to the actual situation, and no specific limitation is imposed here.
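For instance, the manually assigned emotion labels described above might be encoded as follows (a minimal sketch; the category set simply mirrors the example in the text):

```python
# One-hot encoding of the four example emotion categories.
EMOTION_ONE_HOT = {
    "neutral": [1, 0, 0, 0],
    "happy":   [0, 1, 0, 0],
    "sad":     [0, 0, 1, 0],
    "angry":   [0, 0, 0, 1],
}

# Alternatively, integer codes as described above: neutral=1, happy=2, sad=3, angry=4.
EMOTION_ID = {"neutral": 1, "happy": 2, "sad": 3, "angry": 4}
```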
In a specific implementation scenario, the attribute-category features are extracted by training a feature extraction model; the training method may include, but is not limited to, supervised learning, self-supervised learning, and other modeling methods. With supervised learning, an emotion classification model is trained using existing attribute-labeled data as training data. For example, for the emotion attribute, the training data may be recorded voice data or additionally collected voice data with emotion labels. The emotion classification model predicts the emotion category of the voice data, and its network parameters are adjusted based on the difference between the predicted category and the labeled sample category, thereby training the model. The emotion classification model may use a deep neural network, a convolutional neural network, and the like, without limitation. The model can be iteratively optimized by minimizing a cross-entropy loss function until it converges. The features of the last hidden layer of the emotion classification model can then be used as the emotion attribute features of the current speech. In addition, so that the extracted emotion attribute features do not contain information of other attributes, an information constraint criterion may be added during training, for example minimizing the mutual information with the timbre features, which constrains the extracted representation from being influenced by other attributes. The other voice attributes can be handled analogously and are not enumerated here.
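A minimal sketch of this supervised route, assuming PyTorch and hypothetical feature dimensions (none of the names below come from the disclosure): an emotion classifier is trained with cross-entropy, and its last hidden layer is reused as the emotion attribute feature.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, input_dim=80, hidden_dim=128, num_emotions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, utterance_feats):
        hidden = self.backbone(utterance_feats)  # last hidden layer -> emotion attribute feature
        return self.head(hidden), hidden

model = EmotionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # cross-entropy criterion, as mentioned above

# Hypothetical batch: utterance-level acoustic features and emotion labels.
feats = torch.randn(16, 80)
labels = torch.randint(0, 4, (16,))

logits, hidden = model(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
# At inference time, `hidden` serves as the emotion attribute feature of the speech.
```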
Referring to fig. 2, fig. 2 is a schematic diagram of the attribute feature extraction in step S12 of fig. 1. As shown in fig. 2, the voice data 20 is audio information collected in various ways, and the attribute modules may include a channel attribute module 21, a dialect attribute module 22, a timbre attribute module 23, an emotion attribute module 24, and a style attribute module 25; each attribute module may be obtained by manual labeling or by model learning. Of course, the attribute modules may further include a pronunciation attribute module (not shown), which extracts the pronunciation attribute features; for its extraction process, refer to the foregoing description of extracting pronunciation attribute features, which is not repeated here. Corresponding to the respective attribute modules, the attribute features may include a channel attribute feature 210, a dialect attribute feature 220, a timbre attribute feature 230, an emotion attribute feature 240, and a style attribute feature 250, each obtained from its attribute module. By extracting attribute features from a large amount of voice data, the features of the different attribute categories under each attribute can be obtained; an attribute-space probability distribution model of each voice attribute is then constructed from the sample attribute features of the various categories under that attribute, and from this model the attribute features corresponding to different attribute categories can be obtained.
In one implementation scenario, feature modeling is performed on the voice data associated with the attribute categories of each voice attribute to obtain the attribute-space probability distribution models. Modeling methods include, but are not limited to, the maximum-likelihood criterion, and the model structures employed include, but are not limited to, Gaussian mixture models and flow models. Once the attribute-space probability distribution model of a voice attribute has been established, the different attribute categories can be sampled from the distribution during speech synthesis to obtain the target attribute-category features. In addition, to allow controlled sampling of the attribute space, other fine-grained control information can be provided during modeling so that the conditional probability distribution of the attribute feature space is modeled; for example, for the emotion space, the emotion category information is fed into the model as a condition. At generation time, features of a specified emotion attribute category can then be obtained by sampling, enabling more precise sampling and prediction.
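One possible realization, sketched with scikit-learn's GaussianMixture and hypothetical feature shapes (fitting one mixture per attribute category is just one way to make sampling category-controllable, not necessarily the approach of the disclosure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_attribute_space(features_by_category, n_components=2):
    """Fit a probability distribution over attribute features, one mixture per attribute category."""
    return {
        category: GaussianMixture(n_components=n_components).fit(feats)
        for category, feats in features_by_category.items()
    }

def sample_target_attribute(attribute_space, target_category):
    """Sample a target attribute feature for the requested category."""
    samples, _ = attribute_space[target_category].sample(1)
    return samples[0]

# Hypothetical emotion attribute features extracted from collected speech data.
emotion_space = fit_attribute_space({
    "happy": np.random.randn(200, 64),
    "sad":   np.random.randn(200, 64),
})
target_emotion_feature = sample_target_attribute(emotion_space, "happy")
```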
In a specific implementation scenario, when the attribute features for the attribute-space probability distribution models are obtained from speech by model learning, the models may be built selectively for particular voice attributes, and the specific selection may be made according to the situation.
Step S13: and synthesizing to obtain the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
In an implementation scene, the pronunciation attribute features are pronunciation attribute features of a text to be synthesized, each target attribute feature is a target attribute feature obtained by sampling in an attribute space probability distribution model of various voice attributes, and all the obtained attribute features are synthesized to obtain synthesized voice.
In one implementation scenario, as described above, the voice attributes include a timbre attribute. Before the synthesized speech of the text to be synthesized is produced from the pronunciation attribute features and the target attribute features, a reference attribute feature of the timbre attribute may be obtained based on the target category of the text to be synthesized on the timbre attribute, the reference attribute feature being extracted from voice data related to that target category. The reference attribute feature is then adjusted to obtain the target attribute feature of the timbre attribute, so that timbres different from those of the speakers in the training set can be generated; on this basis, the synthesized speech of the text to be synthesized is produced from the pronunciation attribute features and the target attribute features. In this way, the reference attribute can be adjusted during synthesis to generate timbres different from those of the training-set speakers, which helps quickly create synthesized speech with rich timbres.
In a specific implementation scenario, when adjusting based on the reference attribute features of the timbre attribute, several reference attribute features may be weighted to obtain the target attribute feature of the timbre attribute. Illustratively, by sampling the spatial probability distribution of timbre, synthesized speech different from the timbre of any speaker in the training set can be generated, i.e., speech with a new timbre. For two different timbre attribute features S_A and S_B, a new timbre attribute feature S_new can be generated by interpolation with a weight λ, where λ is the interpolation weight between the two features and 0 < λ < 1. The interpolated feature, which can be written as S_new = λ·S_A + (1 − λ)·S_B, is then used to synthesize new speech data. Constructing synthesized speech in this way removes the work of finding a speaker and recording audio, and can generate synthesized speech whose timbre differs from that of any training-set speaker, thereby producing a new timbre and avoiding copyright risk.
In a specific implementation scenario, when adjusting based on the reference attribute feature of the timbre attribute, the reference attribute feature may also be rescaled to obtain the target attribute feature of the timbre attribute. Illustratively, from a timbre attribute feature S_C, a new timbre attribute feature S_new can be generated by stretching with a weight λ, i.e., S_new = λ·S_C. Synthesized speech constructed in this way likewise has a timbre different from that of the training-set speakers, producing a new timbre and reducing the cost of speech synthesis.
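The two adjustments above reduce to simple vector arithmetic; a sketch with hypothetical feature dimensions:

```python
import numpy as np

def interpolate_timbre(s_a, s_b, lam):
    """S_new = lam * S_A + (1 - lam) * S_B, with 0 < lam < 1."""
    assert 0.0 < lam < 1.0
    return lam * s_a + (1.0 - lam) * s_b

def stretch_timbre(s_c, lam):
    """S_new = lam * S_C: rescale a reference timbre feature to obtain a new timbre."""
    return lam * s_c

# Hypothetical timbre attribute features of two reference speakers.
s_a, s_b = np.random.randn(64), np.random.randn(64)
new_timbre_1 = interpolate_timbre(s_a, s_b, lam=0.3)  # timbre unlike either training speaker
new_timbre_2 = stretch_timbre(s_a, lam=1.2)
```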
Referring to fig. 3, fig. 3 is a schematic diagram of the synthesis of attribute features in step S13 of fig. 1. As shown in fig. 3, the attribute features include a pronunciation attribute feature 30, a channel attribute feature 31, a dialect attribute feature 32, a timbre attribute feature 33, an emotion attribute feature 34, and a style attribute feature 35; the pronunciation attribute feature and the voice attribute features are input into the speech synthesis module 36 for synthesis processing to obtain the synthesized speech. Specifically, the speech synthesis module 36 may be trained on sample speech. During training, the pronunciation attribute feature and the above voice attribute features of a sample speech are obtained and synthesized by the speech synthesis module 36 into the synthesized speech corresponding to the sample; the network parameters of the speech synthesis module 36 are then adjusted based on the difference between the sample speech and its corresponding synthesized speech (for example, the difference between their Mel spectrograms), so that training the module realizes control of speech generation by each attribute feature. The input of the speech synthesis module 36 is the extracted attribute features and its output is the synthesized speech 37; its hidden layers are composed of deep neural network modules and may be one of, or a combination of, network types such as deep neural networks, recurrent neural networks, and convolutional neural networks. The network is then trained under chosen training criteria, which include, but are not limited to, the minimum mean-square-error criterion and the maximum-likelihood criterion. With the speech synthesis module 36, speech synthesis can be controlled, which in turn allows better control over the characteristics of the synthesized speech 37.
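A toy sketch of such a synthesis module, assuming PyTorch and hypothetical dimensions: utterance-level attribute features condition a frame-level network that predicts a Mel spectrogram, trained under a minimum-mean-square-error criterion. This is an illustrative stand-in, not the architecture of the disclosure.

```python
import torch
import torch.nn as nn

class SpeechSynthesisModule(nn.Module):
    """Attribute features condition a frame-level decoder that predicts Mel-spectrogram frames."""
    def __init__(self, phone_dim=64, attr_dim=5 * 64, mel_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(phone_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, phone_feats, attr_feats):
        # Broadcast the utterance-level attribute features to every pronunciation frame.
        attr = attr_feats.unsqueeze(1).expand(-1, phone_feats.size(1), -1)
        return self.net(torch.cat([phone_feats, attr], dim=-1))

model = SpeechSynthesisModule()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical training batch: pronunciation features, concatenated attribute features, target Mels.
phones = torch.randn(8, 120, 64)
attrs = torch.randn(8, 5 * 64)      # channel, dialect, timbre, emotion, style features concatenated
target_mel = torch.randn(8, 120, 80)

pred_mel = model(phones, attrs)
loss = nn.functional.mse_loss(pred_mel, target_mel)  # minimum-mean-square-error criterion
loss.backward()
optimizer.step()
```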
According to the above scheme, the pronunciation attribute features of the text to be synthesized are extracted, and the target attribute features of the various voice attributes are obtained based on the target categories of the text to be synthesized on several voice attributes; finally, the synthesized speech of the text to be synthesized is obtained based on the pronunciation attribute features and the target attribute features. On one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories that the text to be synthesized takes on those voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
Referring to fig. 4, fig. 4 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application. It should be noted that this embodiment can migrate arbitrary attributes starting from a speech to be processed, so as to synthesize speech that shares some attribute features with that speech. In addition, this embodiment only describes its differences from the foregoing embodiment; for the same or similar parts, refer to the foregoing embodiment, which is not repeated here. Specifically, this embodiment may include the following steps:
step S41: and extracting pronunciation attribute characteristics of the voice to be processed.
In one implementation scenario, as previously described, the pronunciation attributes may include the pronunciation content information in the speech. Similarly to extracting the pronunciation attribute features of the text to be synthesized, for the speech to be processed the pronunciation content may be labeled with the International Phonetic Alphabet (IPA) to obtain a pronunciation-sequence label, and each pronunciation in the sequence may then be one-hot encoded to obtain the pronunciation attribute features.
Step S42: and acquiring target attribute characteristics of various voice attributes based on target categories of the voice to be processed on the plurality of voice attributes respectively.
In one implementation scenario, the target attribute features of the various voice attributes are obtained based on the target categories of the speech to be processed on the several voice attributes; for details, refer to the description in the foregoing embodiment of obtaining the target attribute features based on the target categories of the text to be synthesized, which is not repeated here.
In one implementation scenario, before the target attribute features of the various voice attributes are obtained based on the target categories of the speech to be processed on the several voice attributes, at least one voice attribute may be selected as a first voice attribute, with the unselected voice attributes serving as second voice attributes. Obtaining the target attribute features then includes: obtaining the target attribute feature of each first voice attribute based on the target category of the speech to be processed on that first voice attribute; and extracting from the speech to be processed the attribute features related to each second voice attribute as the target attribute features corresponding to the second voice attributes. In this way, attribute features are migrated between speech, so that the synthesis of many kinds of speech with different combinations of attribute-feature categories can be controlled.
In a specific implementation scenario, taking the emotion attribute as the first voice attribute, the channel attribute, dialect attribute, timbre attribute, and style attribute serve as second voice attributes. Each attribute of the speech to be processed has its own category; if the target category of the first voice attribute is "happy", a "happy" emotion attribute feature may be extracted from existing voice data and synthesized together with the attribute features of the remaining attributes to obtain the final synthesized speech. That is, for the speech to be processed, only the emotion attribute feature is changed while the remaining attribute features are kept, yielding the final synthesized speech.
In another specific implementation scenario, taking the style attribute as the first voice attribute, the channel attribute, dialect attribute, timbre attribute, and emotion attribute serve as second voice attributes. If the target category of the first voice attribute is "interaction", an "interaction" style attribute feature may be extracted from existing voice data and synthesized together with the attribute features of the remaining attributes to obtain the final synthesized speech.
In the above description, the specific process of attribute feature migration is described by taking emotional attributes and style attributes as examples, and in the actual application process, the migration may be performed by selecting any one of the plurality of voice attributes or a combination of at least two of the voice attributes, which is not limited herein.
In addition, as another example, the pronunciation attribute features of the speech to be processed may be extracted, and the target attribute features of the several voice attributes may be obtained based on the target categories of the speech to be processed on the channel, timbre, dialect, emotion, and style attributes, where the target category on the timbre attribute is its original timbre category; that is, the timbre attribute feature may be extracted from the speech to be processed itself and used as the target attribute feature of the timbre attribute. Speech synthesis can then be performed based on the pronunciation attribute feature, the timbre attribute feature, and the target attribute features corresponding to the categories specified on the channel, dialect, emotion, style, and other attributes, so that, while keeping the original pronunciation and timbre of the speech to be processed, synthesized speech with different channels, different dialects, different emotions, and different styles can be freely produced.
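A sketch of the migration logic described in this embodiment (hypothetical data structures; the attribute-space models are assumed to follow the earlier GaussianMixture sketch): features of the second voice attributes are kept from the speech to be processed, while features of the first voice attributes are sampled at their target categories.

```python
def migrate_attributes(speech_feats, attribute_spaces, first_attr_targets):
    """
    Keep the second (unselected) voice attributes from the speech to be processed and
    sample the first (selected) attributes at their target categories.

    speech_feats:       dict attribute -> feature extracted from the speech to be processed
    attribute_spaces:   dict attribute -> {category: fitted mixture model} (see earlier sketch)
    first_attr_targets: dict attribute -> target category, e.g. {"emotion": "happy"}
    """
    target_feats = dict(speech_feats)  # second voice attributes keep their original features
    for attr, category in first_attr_targets.items():
        samples, _ = attribute_spaces[attr][category].sample(1)
        target_feats[attr] = samples[0]
    return target_feats

# Example: change only the emotion to "happy"; channel, dialect, timbre, and style are kept.
# target = migrate_attributes(extracted_feats, {"emotion": emotion_space}, {"emotion": "happy"})
```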
Step S43: and synthesizing to obtain the synthesized voice of the voice to be processed based on the pronunciation attribute characteristics and the target attribute characteristics.
In an implementation scene, the synthesized voice of the voice to be processed can be synthesized according to the pronunciation attribute features and the target attribute features, and the target attribute features can be set according to the actual application scene, so that on one hand, various synthesized voices can be generated, and on the other hand, the voice attribute features of the synthesized voices can be effectively controlled. For a specific synthesizing process, reference may be made to the related description of the synthesized speech for synthesizing the text to be synthesized based on the pronunciation attribute feature and each target attribute feature in the foregoing disclosed embodiment, which is not described herein again.
According to the above scheme, the pronunciation attribute features of the speech to be processed are extracted, and the target attribute features of the various voice attributes are obtained based on the target categories of the speech to be processed on several voice attributes; finally, the synthesized speech of the speech to be processed is obtained based on the pronunciation attribute features and the target attribute features. On one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories that the speech to be processed takes on those voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
Referring to fig. 5, fig. 5 is a schematic diagram of a frame of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus 50 includes an extraction module 51, an acquisition module 52, and a synthesis module 53. The extraction module 51 is configured to extract pronunciation attribute features of a text to be synthesized; the obtaining module 52 is configured to obtain target attribute features of various voice attributes based on target categories of the text to be synthesized on the plurality of voice attributes respectively; the synthesis module 53 is configured to synthesize a synthesized speech of the text to be synthesized based on the pronunciation attribute features and the target attribute features.
According to the above scheme, the pronunciation attribute features of the text to be synthesized are extracted, and the target attribute features of the various voice attributes are obtained based on the target categories of the text to be synthesized on several voice attributes; finally, the synthesized speech of the text to be synthesized is obtained based on the pronunciation attribute features and the target attribute features. On one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories that the text to be synthesized takes on those voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
In some disclosed embodiments, the several voice attributes include at least one of a channel attribute, a dialect attribute, a timbre attribute, an emotion attribute, and a style attribute.
Therefore, by distinguishing the voice attributes, namely distinguishing the voice information coupled together into various voice attributes in a decoupling mode, the diversity of voice synthesis in the voice synthesis process is improved, and the efficiency of voice synthesis is improved.
In some disclosed embodiments, the obtaining module 52 includes a feature modeling sub-module, configured to perform feature modeling based on voice data related to attribute categories of various voice attributes to obtain attribute spatial probability distribution models of various voice attributes; the obtaining module 52 includes a model sampling sub-module, configured to sample an attribute space probability distribution model corresponding to the voice attribute based on a target category of the text to be synthesized on the voice attribute, so as to obtain a target attribute feature corresponding to the voice attribute.
Therefore, the voice data related to the attribute type of the voice attribute is subjected to feature modeling, so that an attribute space probability distribution model of the voice attribute can be obtained, different attribute types can be subjected to region division in the space probability distribution model, distribution spaces of different attribute types under the same voice attribute are obtained, the sampling of the data in the voice synthesis process is facilitated, and the voice synthesis efficiency is improved.
In some disclosed embodiments, each voice attribute has at least one attribute class; the feature modeling submodule comprises an extraction unit, a feature extraction unit and a feature modeling unit, wherein the extraction unit is used for extracting sample attribute features of voice data related to attribute categories for various attribute categories under voice attributes; the feature modeling submodule comprises a construction unit, and is used for constructing an attribute space probability distribution model of the voice attribute based on sample attribute features of various attribute categories under the voice attribute.
Therefore, by constructing the attribute space probability distribution model of the voice attributes, corresponding attribute features can be directly obtained by sampling in the distribution model in the voice synthesis process, and the voice synthesis efficiency is improved.
In some disclosed embodiments, the synthesis module 53 includes an obtaining sub-module, configured to obtain a reference attribute feature of the tone attribute based on a target category of the text to be synthesized on the tone attribute; and the reference attribute feature of the tone attribute is extracted based on the voice data related to the target category of the tone attribute; the synthesis module 53 includes an adjustment submodule, configured to perform adjustment processing based on the reference attribute feature of the tone attribute, so as to obtain a target attribute feature of the tone attribute.
Therefore, in the process of voice synthesis, the reference attribute can be adjusted to obtain the target attribute characteristic, and the synthesized voice different from the tone of the speaker in the training set can be generated, so that the synthesized voice with new tone is generated, and the method has the advantage of avoiding copyright risk.
In some disclosed embodiments, the adjustment submodule includes any one of a weighting processing unit and a scaling unit: the weighting processing unit is used for weighting the multiple reference attribute characteristics of the tone attributes to obtain target attribute characteristics of the tone attributes; the scale adjustment unit is used for carrying out scale adjustment on the reference attribute characteristics of the tone attributes to obtain target attribute characteristics of the tone attributes.
Therefore, the tone attribute can be adjusted through interpolation, stretching and other modes, the target attribute characteristic of the tone attribute can be obtained, and further, the voice data of a new tone different from the tone of a speaker in a training set can be obtained in the voice synthesis process, and the method has the advantage of avoiding copyright risk.
Referring to fig. 6, fig. 6 is a schematic diagram of a speech synthesis apparatus according to another embodiment of the present application. The speech synthesis apparatus 60 comprises an extraction module 61, an acquisition module 62 and a synthesis module 63. The extraction module 61 is configured to extract pronunciation attribute features of the speech to be processed; the obtaining module 62 is configured to obtain target attribute features of various voice attributes based on target categories of voices to be processed on a plurality of voice attributes respectively; the synthesis module 63 is configured to synthesize a synthesized voice of the to-be-processed voice based on the pronunciation attribute features and the target attribute features.
According to the above scheme, the pronunciation attribute features of the speech to be processed are extracted, and the target attribute features of the various voice attributes are obtained based on the target categories of the speech to be processed on several voice attributes; finally, the synthesized speech of the speech to be processed is obtained based on the pronunciation attribute features and the target attribute features. On one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories that the speech to be processed takes on those voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
In some disclosed embodiments, the speech synthesis apparatus 60 further comprises a selection module for selecting at least one speech attribute as a first speech attribute and unselected speech attributes as a second speech attribute; the obtaining module 62 includes an obtaining submodule and an extracting submodule, and the obtaining submodule is configured to obtain a target attribute feature of each first voice attribute based on a target category of the voice to be processed on each first voice attribute; the extraction submodule is used for extracting the voice attribute features of the voice to be processed respectively related to the second voice attributes as the target attribute features corresponding to the second voice attributes.
Therefore, by migrating voice attribute features, the synthesis of many kinds of speech with different combinations of attribute-feature categories can be controlled.
Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an electronic device according to an embodiment of the present application. The electronic device 70 comprises a memory 71 and a processor 72 coupled to each other, wherein the memory 71 stores program instructions, and the processor 72 is configured to execute the program instructions to implement the steps in any of the above-mentioned embodiments of the speech synthesis method. Specifically, the electronic device 70 may include, but is not limited to: desktop computers, notebook computers, servers, mobile phones, tablet computers, and the like, without limitation.
In particular, the processor 72 is configured to control itself and the memory 71 to implement the steps in any of the above-described embodiments of the speech synthesis method. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip having signal processing capabilities. The Processor 72 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. Additionally, the processor 72 may be collectively implemented by an integrated circuit chip.
According to the above scheme, on one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories of the text to be synthesized on several voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
Referring to fig. 8, fig. 8 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 80 stores program instructions 81 executable by the processor, the program instructions 81 for implementing the steps in any of the speech synthesis method embodiments described above.
According to the above scheme, on one hand, because a large amount of voice recording is not needed, time is saved and the cost of speech synthesis is reduced; on the other hand, during synthesis, the target attribute features of the various voice attributes are obtained from the target categories of the text to be synthesized on several voice attributes, and synthesis is performed on this basis, so speech data with any category of voice attributes can be synthesized, which improves the degree of freedom of speech synthesis. Therefore, the scheme can improve the degree of freedom of speech synthesis while reducing its cost.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (12)

1. A method of speech synthesis, comprising:
extracting pronunciation attribute characteristics of a text to be synthesized;
acquiring target attribute characteristics of each of a plurality of voice attributes based on target categories of the text to be synthesized on the plurality of voice attributes respectively;
and synthesizing to obtain the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
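For illustration only, and not as part of the claims, the following is a minimal sketch of the flow in claim 1. The helper function, the attribute lookup table, and the feature dimensions are assumptions made for this example rather than details taken from the patent; in the claimed method the target attribute characteristics would come from attribute models rather than a fixed table.

```python
import numpy as np

# Illustrative sketch only; names, feature shapes, and the lookup table are assumptions.

def extract_pronunciation_features(text: str) -> np.ndarray:
    # Toy front end: one 8-dimensional feature row per character of the text.
    return np.random.default_rng(0).normal(size=(len(text), 8))

# One fixed feature vector per (voice attribute, target category).
ATTRIBUTE_TABLE = {
    ("emotion", "happy"): np.ones(4),
    ("style", "news"): np.full(4, 0.5),
}

def synthesize(text: str, target_categories: dict) -> np.ndarray:
    pron = extract_pronunciation_features(text)                      # pronunciation attribute features
    attrs = [ATTRIBUTE_TABLE[(a, c)] for a, c in target_categories.items()]
    cond = np.concatenate(attrs)                                     # target attribute features
    # Toy "synthesis": condition every frame on the concatenated attribute features;
    # a real system would feed these into an acoustic model and a vocoder.
    return np.hstack([pron, np.tile(cond, (len(pron), 1))])

frames = synthesize("hello", {"emotion": "happy", "style": "news"})
print(frames.shape)   # (5, 16)
```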
2. The method of claim 1, wherein the plurality of voice attributes comprises: at least one of channel attribute, dialect attribute, timbre attribute, emotion attribute and style attribute.
3. The method according to claim 1 or 2, wherein the acquiring target attribute characteristics of each of the plurality of voice attributes based on the target categories of the text to be synthesized on the plurality of voice attributes respectively at least comprises:
performing feature modeling based on voice data related to the attribute categories of each voice attribute to obtain an attribute space probability distribution model of each voice attribute;
and sampling the attribute space probability distribution model corresponding to each voice attribute based on the target category of the text to be synthesized on that voice attribute, to obtain the target attribute characteristic corresponding to that voice attribute.
4. The method according to claim 3, wherein each of the voice attributes has at least one attribute category; and the performing feature modeling based on the voice data related to the attribute categories of each voice attribute to obtain the attribute space probability distribution model of each voice attribute comprises:
for each attribute category under the voice attribute, extracting sample attribute characteristics related to the voice attribute from the voice data related to the attribute category;
and constructing the attribute space probability distribution model of the voice attribute based on the sample attribute characteristics of the attribute categories under the voice attribute.
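For illustration only, and not as part of the claims, the following sketch shows one possible realization of the modeling and sampling in claims 3 and 4: a single Gaussian fitted per (voice attribute, attribute category) over sample attribute features, then sampled for the target category. The Gaussian model family and the feature dimensionality are assumptions; the claims do not fix them.

```python
import numpy as np

def fit_attribute_space(sample_features: np.ndarray):
    # Fit mean and covariance of the sample attribute characteristics for one category.
    mean = sample_features.mean(axis=0)
    cov = np.cov(sample_features, rowvar=False) + 1e-6 * np.eye(sample_features.shape[1])
    return mean, cov

def sample_target_feature(model, rng):
    # Draw one target attribute characteristic from the category's distribution.
    mean, cov = model
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(1)
# e.g. sample attribute features extracted from "happy" speech under the emotion attribute
happy_samples = rng.normal(loc=1.0, size=(200, 4))
happy_model = fit_attribute_space(happy_samples)
target_feature = sample_target_feature(happy_model, rng)
print(target_feature.shape)   # (4,)
```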
5. The method according to claim 1, wherein the plurality of voice attributes comprises a timbre attribute, and the acquiring of the target attribute characteristic of the timbre attribute comprises:
acquiring a reference attribute characteristic of the timbre attribute based on the target category of the text to be synthesized on the timbre attribute, wherein the reference attribute characteristic of the timbre attribute is extracted based on the voice data related to the target category of the timbre attribute;
and performing adjustment based on the reference attribute characteristic of the timbre attribute to obtain the target attribute characteristic of the timbre attribute.
6. The method according to claim 5, wherein the adjustment based on the reference attribute characteristic of the timbre attribute to obtain the target attribute characteristic of the timbre attribute comprises any one of:
weighting a plurality of reference attribute characteristics of the timbre attribute to obtain the target attribute characteristic of the timbre attribute;
and performing scale adjustment on the reference attribute characteristic of the timbre attribute to obtain the target attribute characteristic of the timbre attribute.
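For illustration only, and not as part of the claims, the following sketch shows the two adjustment modes of claim 6, assuming that timbre reference attribute characteristics are plain vectors (for example, speaker embeddings); that representation is an assumption and is not stated in the claims.

```python
import numpy as np

def weighted_timbre(reference_features, weights):
    """Blend several reference timbre characteristics into one target characteristic."""
    refs = np.stack(reference_features)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize so the blend stays in range
    return (w[:, None] * refs).sum(axis=0)

def scaled_timbre(reference_feature, scale):
    """Scale a single reference timbre characteristic to shift the voice character."""
    return scale * np.asarray(reference_feature, dtype=float)

spk_a = np.array([0.2, 0.8, -0.1])
spk_b = np.array([0.5, -0.3, 0.4])
blended = weighted_timbre([spk_a, spk_b], weights=[0.7, 0.3])   # a timbre "between" A and B
brighter = scaled_timbre(spk_a, 1.2)                            # a scaled variant of A's timbre
```

Either output would then serve as the target attribute characteristic of the timbre attribute during synthesis.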
7. A method of speech synthesis, comprising:
extracting pronunciation attribute characteristics of the voice to be processed;
acquiring target attribute characteristics of each of a plurality of voice attributes based on target categories of the voice to be processed on the plurality of voice attributes respectively;
and synthesizing to obtain the synthesized voice of the voice to be processed based on the pronunciation attribute characteristics and the target attribute characteristics.
8. The method according to claim 7, wherein before the acquiring target attribute characteristics of each of the plurality of voice attributes based on the target categories of the voice to be processed on the plurality of voice attributes respectively, the method comprises:
selecting at least one of the voice attributes as a first voice attribute, and taking the unselected voice attributes as second voice attributes;
and the acquiring target attribute characteristics of each of the plurality of voice attributes based on the target categories of the voice to be processed on the plurality of voice attributes respectively comprises:
acquiring the target attribute characteristic of each first voice attribute based on the target category of the voice to be processed on that first voice attribute;
and extracting, from the voice to be processed, voice attribute characteristics related to each second voice attribute as the target attribute characteristic corresponding to that second voice attribute.
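For illustration only, and not as part of the claims, the following sketch shows the attribute split of claim 8: attributes listed in `override` (the "first" voice attributes) take their target attribute characteristics from the chosen category, while the remaining ("second") attributes keep characteristics extracted from the source speech. The toy category table and the mean-frame extractor are illustrative stand-ins only.

```python
import numpy as np

CATEGORY_FEATURES = {
    ("emotion", "calm"): np.zeros(4),
    ("emotion", "happy"): np.ones(4),
}

def extract_attribute_feature(speech: np.ndarray, attr: str) -> np.ndarray:
    # Toy extractor: use the mean acoustic frame as the attribute characteristic.
    return speech.mean(axis=0)

def build_attribute_features(speech, all_attributes, override):
    feats = {}
    for attr in all_attributes:
        if attr in override:                                      # first voice attributes
            feats[attr] = CATEGORY_FEATURES[(attr, override[attr])]
        else:                                                     # second voice attributes
            feats[attr] = extract_attribute_feature(speech, attr)
    return feats

source_speech = np.random.default_rng(2).normal(size=(100, 4))    # stand-in acoustic frames
feats = build_attribute_features(source_speech, ["emotion", "timbre"], {"emotion": "happy"})
print(sorted(feats))   # ['emotion', 'timbre']
```

With this split, for example, the emotion can be replaced with a target category while the timbre of the original speaker is preserved from the voice to be processed.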
9. A speech synthesis apparatus, comprising:
the extraction module is used for extracting pronunciation attribute characteristics of the text to be synthesized;
the acquisition module is used for acquiring target attribute characteristics of each of a plurality of voice attributes based on target categories of the text to be synthesized on the plurality of voice attributes respectively;
and the synthesis module is used for synthesizing the synthesized voice of the text to be synthesized based on the pronunciation attribute characteristics and the target attribute characteristics.
10. A speech synthesis apparatus, comprising:
the extraction module is used for extracting pronunciation attribute characteristics of the voice to be processed;
the acquisition module is used for acquiring target attribute characteristics of each of a plurality of voice attributes based on target categories of the voice to be processed on the plurality of voice attributes respectively;
and the synthesis module is used for synthesizing the synthesized voice of the voice to be processed based on the pronunciation attribute characteristics and the target attribute characteristics.
11. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the speech synthesis method of any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that program instructions executable by a processor for implementing the speech synthesis method of any one of claims 1 to 8 are stored.
CN202111665515.XA 2021-12-31 2021-12-31 Speech synthesis method and apparatus, electronic device, and storage medium Pending CN114283782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665515.XA CN114283782A (en) 2021-12-31 2021-12-31 Speech synthesis method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111665515.XA CN114283782A (en) 2021-12-31 2021-12-31 Speech synthesis method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114283782A true CN114283782A (en) 2022-04-05

Family

ID=80879535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111665515.XA Pending CN114283782A (en) 2021-12-31 2021-12-31 Speech synthesis method and apparatus, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN114283782A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262119A1 (en) * 2012-03-30 2013-10-03 Kabushiki Kaisha Toshiba Text to speech system
CN109523986A (en) * 2018-12-20 2019-03-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and storage medium
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN Yonghong: "Progress in speech acoustics and its applications", Applied Acoustics (应用声学), no. 02, 15 March 2009 (2009-03-15) *

Similar Documents

Publication Publication Date Title
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
WO2021227707A1 (en) Audio synthesis method and apparatus, computer readable medium, and electronic device
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN111079423A (en) Method for generating dictation, reading and reporting audio, electronic equipment and storage medium
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN114283783A (en) Speech synthesis method, model training method, device and storage medium
CN116737883A (en) Man-machine interaction method, device, equipment and storage medium
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN107122393A (en) Electron album generation method and device
CN117496944B (en) Multi-emotion multi-speaker voice synthesis method and system
EP4345814A1 (en) Video-generation system
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN114283782A (en) Speech synthesis method and apparatus, electronic device, and storage medium
CN117528143A (en) Short video generation system, method and storage medium
CN117609548A (en) Video multi-mode target element extraction and video abstract synthesis method and system based on pre-training model
CN113299270B (en) Method, device, equipment and storage medium for generating voice synthesis system
CN116469369A (en) Virtual sound synthesis method and device and related equipment
CN116957669A (en) Advertisement generation method, advertisement generation device, computer readable medium and electronic equipment
CN114283781A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN112562733A (en) Media data processing method and device, storage medium and computer equipment
CN113505612B (en) Multi-user dialogue voice real-time translation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230505

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

TA01 Transfer of patent application right