
CN118173082A - Speech generation method, device, computer equipment and storage medium - Google Patents

Speech generation method, device, computer equipment and storage medium

Info

Publication number
CN118173082A
CN118173082A (application number CN202410580421.XA)
Authority
CN
China
Prior art keywords
text
feature
sample
acoustic
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410580421.XA
Other languages
Chinese (zh)
Other versions
CN118173082B (en)
Inventor
林诗伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410580421.XA priority Critical patent/CN118173082B/en
Publication of CN118173082A publication Critical patent/CN118173082A/en
Application granted granted Critical
Publication of CN118173082B publication Critical patent/CN118173082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a speech generation method, apparatus, computer device, storage medium and computer program product. The method comprises: in response to a speech generation request, acquiring the target text and reference audio contained in the request; extracting text embedding features of the target text and performing feature encoding processing on them to obtain text hidden-layer features; extracting, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and performing feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features; and generating, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as its speech content. The method improves the accuracy of speech generation.

Description

Speech generation method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, and a storage medium for generating speech.
Background
With the rapid development of artificial intelligence, AI techniques have been applied in many fields. Natural language processing (NLP) and speech processing are important directions within artificial intelligence; for example, speech can be generated from text by a speech generation model to obtain synthesized speech, which can then be played to a user.
Current speech generation first maps text to speech features and then synthesizes the required speech from those features, and realizing this mapping requires training a vocoder for speech generation. In actual text-to-speech synthesis, however, the speech features must be predicted from the input text alone, so the predicted speech features may not match the real speech features extracted from real speech, which reduces the accuracy of the synthesized speech.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, computer device, and storage medium for speech generation that can improve the accuracy of speech generation.
In a first aspect, the present application provides a method of speech generation. The method comprises the following steps:
in response to a speech generation request, acquiring the target text and reference audio contained in the speech generation request;
extracting text embedding features of the target text, and performing feature encoding processing on the text embedding features to obtain text hidden-layer features;
extracting, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and performing feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features;
and generating, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content.
In a second aspect, the present application further provides a speech generation apparatus. The apparatus comprises:
an acquisition module, configured to acquire, in response to a speech generation request, the target text and reference audio contained in the speech generation request;
a feature encoding module, configured to extract text embedding features of the target text, and to perform feature encoding processing on the text embedding features to obtain text hidden-layer features;
a feature fusion module, configured to extract, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and to perform feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features;
and a speech generation module, configured to generate, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
in response to a speech generation request, acquiring the target text and reference audio contained in the speech generation request;
extracting text embedding features of the target text, and performing feature encoding processing on the text embedding features to obtain text hidden-layer features;
extracting, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and performing feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features;
and generating, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
in response to a speech generation request, acquiring the target text and reference audio contained in the speech generation request;
extracting text embedding features of the target text, and performing feature encoding processing on the text embedding features to obtain text hidden-layer features;
extracting, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and performing feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features;
and generating, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
in response to a speech generation request, acquiring the target text and reference audio contained in the speech generation request;
extracting text embedding features of the target text, and performing feature encoding processing on the text embedding features to obtain text hidden-layer features;
extracting, based on the reference audio, acoustic features of the target object to which the reference audio belongs, and performing feature fusion processing on the acoustic features and the text hidden-layer features to obtain fusion features;
and generating, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content.
With the speech generation method, apparatus, computer device, storage medium and computer program product above, the text embedding features of the target text contained in the speech generation request are extracted and feature-encoded to obtain the text hidden-layer features, and the acoustic features of the target object to which the reference audio belongs are fused with the text hidden-layer features to obtain the fusion features; target speech that imitates the target object uttering the target text is then generated from the text embedding features, the text hidden-layer features and the fusion features, which improves the accuracy of speech generation.
Drawings
FIG. 1 is an application environment diagram of a method of speech generation in one embodiment;
FIG. 2 is a system diagram of a speech generation system in one embodiment;
FIG. 3 is a flow diagram of a method of speech generation in one embodiment;
FIG. 4 is a schematic flow chart of feature encoding processing performed on text embedded features of a target text to obtain text hidden features in one embodiment;
FIG. 5 is a schematic flow chart of feature encoding processing on acoustic features to obtain acoustic hidden layer features in an embodiment;
FIG. 6 is a flow chart of feature fusion processing of acoustic hidden features and text hidden features to obtain fused features in one embodiment;
FIG. 7 is a schematic flow chart diagram in one embodiment;
FIG. 8 is a flow diagram of generating target speech through text prompt features and acoustic hidden features in one embodiment;
FIG. 9 is a flow diagram of generating target speech based on text-embedded features, text hidden features, and fusion features in one embodiment;
FIG. 10 is a flow diagram of sequentially generating acoustic tokens in one embodiment;
FIG. 11 is a flow diagram of a manner in which a speech generation model is obtained in one embodiment;
FIG. 12 is a complete flow diagram of a method of speech generation in one embodiment;
FIG. 13 is a block diagram of a speech generating device in one embodiment;
Fig. 14 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a voice generation method capable of generating voice efficiently. The voice generation method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on the cloud or other servers.
Specifically, taking application to the server 104 as an example: the server 104 responds to a speech generation request by acquiring the target text and reference audio contained in the request; extracts text embedding features of the target text and feature-encodes them to obtain text hidden-layer features; extracts, based on the reference audio, acoustic features of the target object to which the reference audio belongs and fuses them with the text hidden-layer features to obtain fusion features; and finally generates, based on the text embedding features, the text hidden-layer features and the fusion features, target speech that imitates the target object uttering the target text as speech content. Because the target text undergoes multi-dimensional, deep feature extraction and the feature fusion mines the acoustic characteristics of the target object that fit the text, the generated target speech is both accurate in content and faithful to the target object's voice, which improves the accuracy of speech generation.
The terminal 102 may be, but is not limited to, a desktop computer, notebook computer, smartphone, tablet, Internet-of-Things device or portable wearable device; Internet-of-Things devices include smart speakers, smart televisions, smart air conditioners, smart in-vehicle devices and the like, while portable wearable devices include smart watches, smart bracelets, headsets and so on. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers. The speech generation method provided by the embodiments of the application can be applied to a wide range of scenarios, including but not limited to cloud technology and artificial intelligence.
The application is suited to any scenario that requires speech synthesis, such as voice interaction, spoken reading and education, and navigation guidance; voice-interaction scenarios include intelligent assistants and intelligent customer service, while spoken reading covers audiobook playback, news broadcasting and the like. On this basis, as shown in fig. 2, a speech synthesis system capable of executing the speech generation method provided by the application may be deployed as a basic capability of a cloud service for its users. The cloud service runs the speech generation method through the speech synthesis system to produce target speech that imitates the target object uttering the target text as speech content, and returns the generated target speech to the terminal device either as a stream or sentence by sentence.
The following presents some definitions of terms that are relevant to the present application:
Spectrogram: a frequency-domain representation of a time-domain signal, obtained by applying a Fourier transform to the signal. The result is a pair of plots with amplitude and phase on the vertical axis and frequency on the horizontal axis; in speech synthesis the phase information is usually discarded and only the amplitude at each frequency is kept.
Fundamental frequency: in sound, the frequency of the fundamental tone of a complex tone, denoted F0. Among the partials that make up a complex tone, the fundamental has the lowest frequency and the greatest intensity, and its level determines the pitch of the tone. The fundamental frequency of speech is usually called the pitch frequency.
Vocoder: short for voice encoder, also known as a speech signal analysis-synthesis system; its function is to convert acoustic features into sound.
Hidden Markov Model (HMM): a statistical model describing a Markov process with hidden, unknown parameters. In an HMM the states themselves are not directly visible, but some variables (observations) influenced by the states are.
Deep Neural Network (DNN): a discriminative model; essentially a multilayer perceptron (MLP) with two or more hidden layers, in which every node except the input nodes is a neuron with a nonlinear activation function. Like an MLP, a DNN can be trained with the back-propagation algorithm.
Convolutional Neural Network (CNN): a feed-forward neural network whose neurons respond to elements within their receptive field. A CNN typically consists of several convolutional layers topped by fully connected layers; parameter sharing reduces the number of model parameters, which has made CNNs widely used in image and speech recognition.
Recurrent Neural Network (RNN): a class of recursive neural networks that takes sequence data as input, recurses along the direction in which the sequence evolves, and links all its nodes (recurrent units) in a chain.
Long Short-Term Memory (LSTM): a recurrent neural network that adds a cell which decides whether information is useful. Each cell contains an input gate, a forget gate and an output gate. When information enters the LSTM it is judged against these rules: information that passes is kept, and information that does not is discarded through the forget gate. The network is well suited to processing and predicting events with relatively long intervals and delays in a time series.
Gated Recurrent Unit (GRU): another kind of recurrent neural network. Like the LSTM, it was proposed to address long-term memory and the gradient problems of back-propagation. Compared with the LSTM, the GRU has one fewer gate and therefore fewer parameters, yet in most cases achieves comparable results while effectively reducing computation time.
The speech generation method provided by the embodiments of the application also involves artificial intelligence (AI) technology, which is briefly introduced below. AI is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the ability to perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, pre-training model technology, operation/interaction systems, mechatronics and so on. Pre-trained models, also called large models or foundation models, can, after fine-tuning, be widely applied to downstream tasks in all major AI directions. AI software technology mainly covers computer vision, speech processing, natural language processing and machine learning/deep learning.
The scheme provided by the embodiments of the application involves speech technology and machine learning (ML) under artificial intelligence. Key speech technologies include automatic speech recognition (ASR), text-to-speech (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and speech is expected to become one of its best modalities. Large-model technology has reshaped speech technology: pre-trained models such as WavLM and UniSpeech, built on the Transformer architecture, have strong generalization and universality and can handle speech processing tasks in every direction.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and more. It studies how computers can simulate or implement human learning behaviour to acquire new knowledge or skills, and reorganize existing knowledge structures to keep improving their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all areas of AI. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. Pre-trained models are the latest achievement of deep learning and integrate these techniques.
The following examples are provided to illustrate the invention: in one embodiment, as shown in fig. 3, a method of generating speech is provided, which is illustrated by using the method applied to the server 104 in fig. 1 as an example, it is understood that the method may also be applied to the terminal 102, and may also be applied to a system including the terminal 102 and the server 104, and implemented through interaction between the terminal 102 and the server 104. In this embodiment, the method includes the steps of:
in step 302, in response to the speech generation request, target text and reference audio included in the speech generation request are acquired.
Here the speech generation request contains the target text and the reference audio. The target text is the text corresponding to the content of the speech to be generated, i.e. it contains the text content of the target speech. The reference audio is audio of the target object, i.e. a clip containing sound information of the target object such as timbre, emotion and prosody. For example, if the target text is "good morning" and the reference audio is audio spoken by the target object, the generated target speech is the target object saying "good morning".
Second, the target object may be any object capable of producing audio with speech content, such as a real person or a virtual character; for example, the target object may be a father, a cartoon character, a game character and so on. The target object may further carry a target emotion of the target person, a target scene the target person is in, target information related to the target person and the like; for example, the target object may be the happy emotion (target emotion) of Crayon Shin-chan (target person). It can be understood that the target text, the reference audio and the target speech to be generated may belong to the same scene; for example, the target text is a complete news script, the reference audio is a 10-second news broadcast uttered by a news anchor (the target object), and the resulting target speech is the news anchor reading the news script as speech content.
Specifically, in a scenario requiring speech generation, the user inputs the target text and the reference audio. The user may enter both through a terminal, or enter the target text through the terminal and then pick the desired reference audio from a selectable list; how the user supplies the target text and reference audio is not specifically limited here. The terminal then generates a speech generation request from the target text and the reference audio and sends it to the server performing speech generation, so that the server receives the request and, in response, acquires the target text and reference audio it contains.
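As a minimal sketch of this request-handling step, the Python snippet below simply models a request carrying the two inputs; the class and field names are illustrative assumptions, not names taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class SpeechGenerationRequest:
    target_text: str        # text whose content the generated speech should carry
    reference_audio: bytes  # short clip of the target object's voice, e.g. WAV bytes

def handle_request(request: SpeechGenerationRequest):
    # The server unpacks the two inputs; everything downstream
    # (encoding, fusion, generation) operates on these.
    return request.target_text, request.reference_audio
```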
Step 304, extracting text embedding features of the target text, and performing feature encoding processing on the text embedding features to obtain text hidden-layer features.
Here the text embedding features are the features obtained by vectorizing the text. The feature encoding processing is specifically text encoding, so the resulting text hidden-layer features (text hidden representation) are the features obtained by text-encoding the text embedding features.
Specifically, the server performs vectorization processing on the target text to obtain text embedding characteristics. Namely, the server specifically performs text embedding (Text embedding) on the target text, so as to obtain the text embedding characteristics of the target text. Further, the server inputs the Text embedded feature into a Text encoder (Text encoder), and the Text embedded feature is subjected to Text encoding by the Text encoder, so that the Text hidden layer feature output by the Text encoder can be obtained. For ease of understanding, as shown in fig. 4, the target text 401 is first text-embedded to obtain the text-embedded feature 402 of the target text 401, the text-embedded feature 402 is input into the text encoder 403, the text-embedded feature 402 is text-encoded by the text encoder 403, and then the text hidden feature 404 is output by the text encoder 403.
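A minimal sketch of this text path is given below, assuming a Transformer encoder; the patent does not fix the encoder architecture or any dimensions, so the layer sizes here are illustrative only.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Text embedding followed by text encoding into hidden-layer features."""
    def __init__(self, vocab_size: int = 512, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # text embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor):
        text_embedded = self.embedding(token_ids)   # text embedded features
        text_hidden = self.encoder(text_embedded)   # text hidden-layer features
        return text_embedded, text_hidden

token_ids = torch.randint(0, 512, (1, 10))           # (batch, text_len) token ids
text_embedded, text_hidden = TextEncoder()(token_ids)
```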
Step 306, extracting the acoustic features of the target object to which the reference audio belongs based on the reference audio, and performing feature fusion processing on the acoustic features and the text hidden layer features to obtain fusion features.
The acoustic features indicate audio characteristics of the sound of the target object to which the reference audio belongs, such as timbre, emotion and prosody. The feature fusion processing is specifically cross-attention processing, which fuses the acoustic features with the text hidden-layer features so that the acoustic features are analysed at the semantic level of the text; the resulting fusion features are therefore the features obtained by applying cross attention to the acoustic features and the text hidden-layer features.
Specifically, the server firstly extracts the acoustic features of the target object to which the reference audio belongs based on the reference audio, then inputs the acoustic features and the text hidden layer features into a cross attention layer (Cross attention), and carries out cross attention processing on the acoustic features and the text hidden layer features through the cross attention layer so that the cross attention layer outputs fusion features between the acoustic features and the text hidden layer features.
In practice, however, because audio carries a great deal of sound information, that information is usually described with acoustic tokens. How acoustic features are extracted by means of reference acoustic tokens is described in detail below. In a specific embodiment, extracting, based on the reference audio, the acoustic features of the target object to which the reference audio belongs includes: performing audio encoding and decoding on the reference audio to obtain reference acoustic tokens of the reference audio, where the reference acoustic tokens characterize the acoustic features of the target object to which the reference audio belongs.
The audio encoding and decoding is performed by an audio codec, a component that encodes analog audio into digital audio and decodes digital audio back into analog audio. Audio-encoding the reference audio (analog audio) yields the corresponding reference acoustic tokens (digital audio). The reference acoustic tokens are therefore digital audio used to characterize the acoustic features of the target object to which the reference audio belongs.
Specifically, the server performs audio encoding and decoding on the reference audio through the audio codec: the reference audio is input into the audio codec, which outputs the reference acoustic tokens of the reference audio. In this embodiment, each reference acoustic token may be the index of a sound unit corresponding to the sound information contained in the reference audio, where a sound unit is the smallest sound object in a sound codebook. For example, reference audio with a duration of 5 seconds may be converted into 5×56 reference acoustic tokens by the audio codec, and these 5×56 reference acoustic tokens collectively characterize the acoustic features of the target object to which the reference audio belongs.
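The sketch below illustrates only the call pattern and token shape of this step; the AudioCodec class is a hypothetical stand-in (its encode method returns random ids rather than content-derived ones), since the patent does not name a specific codec.

```python
import torch
import torch.nn as nn

class AudioCodec(nn.Module):
    """Hypothetical audio codec stand-in: real codecs derive token ids from
    learned codebooks; this placeholder only mimics the output shape."""
    def __init__(self, tokens_per_second: int = 56, codebook_size: int = 1024):
        super().__init__()
        self.tokens_per_second = tokens_per_second
        self.codebook_size = codebook_size

    def encode(self, waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
        seconds = waveform.shape[-1] / sample_rate
        n_tokens = int(seconds * self.tokens_per_second)
        return torch.randint(0, self.codebook_size, (n_tokens,))  # placeholder ids

codec = AudioCodec()
reference_audio = torch.randn(1, 5 * 16000)              # 5 s of 16 kHz dummy audio
reference_tokens = codec.encode(reference_audio, 16000)  # roughly 5 * 56 reference acoustic tokens
```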
Further, considering that the acoustic features describe the sound information only from the dimension of the acoustic characteristics of the object they belong to, and that more hidden acoustic characteristics should also be taken into account when generating speech, a further feature encoding processing can be applied to the acoustic features while they are fused with the text hidden-layer features, so as to obtain deeper acoustic characteristics. A specific embodiment for obtaining the fusion features is described in detail below:
In a specific embodiment, performing feature fusion processing on the acoustic feature and the text hidden layer feature to obtain a fusion feature, including: performing feature coding processing on the acoustic features to obtain acoustic hidden layer features; and carrying out feature fusion processing on the acoustic hidden layer features and the text hidden layer features to obtain fusion features.
The feature encoding process is specifically an acoustic token encoding (Acoustic Token Encode) process, and as can be seen from the foregoing description, the acoustic feature of the target object to which the reference audio belongs is specifically characterized by the reference acoustic token, so the obtained acoustic hidden layer feature (Prompt hidden representation) is a feature obtained by performing acoustic token encoding on the reference acoustic token for characterizing the acoustic feature.
Specifically, the server performs feature encoding processing on the acoustic features to obtain acoustic hidden layer features, and since the reference acoustic tokens are used for representing the acoustic features of the target objects to which the reference audio belongs, the server specifically inputs the reference acoustic tokens into an acoustic token encoder (Acoustic Token Encoder), and the acoustic token encoder encodes the reference acoustic tokens, namely the acoustic hidden layer features output by the acoustic token encoder. For ease of understanding, as shown in fig. 5, the reference audio 501 is first audio-encoded to obtain a reference acoustic token 502 of the reference audio 501, the reference acoustic token 502 is input into an acoustic token encoder 503, the reference acoustic token 502 is acoustic token encoded by the acoustic token encoder 503, and then the acoustic hidden layer feature 504 is output by the acoustic token encoder 503.
Further, the server specifically inputs the acoustic hidden layer features and the text hidden layer features to a cross attention layer, and cross attention processing is carried out on the acoustic hidden layer features and the text hidden layer features through the cross attention layer, so that the cross attention layer outputs fusion features between the acoustic hidden layer features and the text hidden layer features. The server uses the text hidden layer feature as a Query statement (Query), and uses the acoustic hidden layer feature as a data item (Value) and a Key Value (Key) of the data item to carry out cross attention processing, so that a fusion feature between the acoustic hidden layer feature and the text hidden layer feature is output. To facilitate understanding, as shown in FIG. 6, the text hidden layer feature 602 is obtained by way of example as shown in FIG. 4, and the acoustic hidden layer feature 604 is obtained by way of example as shown in FIG. 5, the text hidden layer feature 602 and the acoustic hidden layer feature 604 are input to the cross-attention layer 606 for cross-attention processing, and the fusion feature 608 is output by the cross-attention layer 606.
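The cross-attention step above (text hidden-layer features as query, acoustic hidden-layer features as key and value) can be sketched with a standard multi-head attention layer; the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_model = 256
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

text_hidden = torch.randn(1, 10, d_model)       # (batch, text_len, d_model)
acoustic_hidden = torch.randn(1, 280, d_model)  # (batch, acoustic_frames, d_model)

# Query = text hidden-layer features, Key = Value = acoustic hidden-layer features
fusion_features, _ = cross_attention(query=text_hidden,
                                     key=acoustic_hidden,
                                     value=acoustic_hidden)
# fusion_features keeps the text-side shape: (1, 10, d_model)
```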
In step 308, target speech that imitates the target object uttering the target text as speech content is generated based on the text embedding features, the text hidden-layer features and the fusion features.
The target speech is speech generated by imitating the target object uttering the target text as speech content. For example, if the target text is "晚安" ("good night") and the reference audio is audio uttered by a cartoon character, the generated target speech is the cartoon character saying "晚安". Or, if the target text is an article paragraph and the reference audio is audio uttered by a game character, the target speech is the game character reading that paragraph as speech content. Specifically, the server generates the target speech based on the text embedding features, the text hidden-layer features and the fusion features. Because the target text may comprise a plurality of text units arranged in sequence, the server may sequentially generate the acoustic token of each text unit in the target text from these features, and then, based on those acoustic tokens and the arrangement order of the text units, generate the target speech that imitates the target object uttering the target text as speech content.
However, since the text embedding feature, the text hidden layer feature and the fusion feature are feature extraction and fusion in the text dimension, and the acoustic feature fitting the text should be learned and mined in the process of generating the voice, the acoustic feature should be considered in the process of generating the voice, and the following detailed description will be given based on this: in a specific embodiment, generating a target voice of a simulated target object using a target text as voice content based on a text embedding feature, a text hidden layer feature and a fusion feature includes: generating text prompt features based on the text embedding features, the text hidden features and the fusion features; and generating target voice which simulates the target object to send by taking the target text as voice content through the text prompt feature and the acoustic hidden layer feature.
The text prompt feature is obtained by adding the text embedding feature, the text hidden-layer feature and the fusion feature. Specifically, the server adds the three features to obtain the text prompt feature; the addition may be direct, or corresponding feature weights may be assigned to the different features so that they are weighted before being added, and the exact scheme is not limited here. For ease of understanding, as shown in fig. 7, after obtaining the text embedding feature 702, the text hidden-layer feature 704 and the fusion feature 706 as illustrated earlier, the three are added to obtain the text prompt feature 708, and the desired target speech 712 is then generated from the text prompt feature 708 and the obtained acoustic hidden-layer feature 710.
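A short sketch of forming the text prompt feature follows; whether the three features are added directly or with weights is left open by the text, so the weighting argument here is an assumption.

```python
import torch

def text_prompt_feature(text_embedded, text_hidden, fusion, weights=None):
    # All three inputs share the shape (batch, text_len, d_model).
    if weights is None:
        return text_embedded + text_hidden + fusion              # direct addition
    w1, w2, w3 = weights
    return w1 * text_embedded + w2 * text_hidden + w3 * fusion   # weighted addition
```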
Further, the server generates target voice which simulates the target object to send by taking the target text as voice content through the text prompt feature and the acoustic hidden layer feature. The voice generation process is specifically performed through a voice generation model, that is, the server takes the text prompt feature as a part of input data and takes the acoustic hidden layer feature as another part of the input data, so that the text prompt feature and the acoustic hidden layer feature together form the input data to be input into the voice generation model, and the voice generation model generates target voice which is generated by simulating a target object by taking the target text as voice content.
In another specific embodiment, the application can also mine, in a self-supervised manner, global acoustic characteristics of the target object that are difficult to annotate, such as global timbre, global emotion and global prosody. That is, generating the target speech that imitates the target object uttering the target text from the text prompt feature and the acoustic hidden-layer feature may specifically be: generating a self-supervision prompt feature based on the acoustic hidden-layer feature and the learned center features; and generating the target speech through the speech generation model based on the text prompt feature and the self-supervision prompt feature.
The learned center features are features that are adjusted jointly during training of the speech generation model. Specifically, during training, a number of learnable center features are randomly initialized and used as the query, while the acoustic hidden-layer features serve as key and value to generate the self-supervision prompt feature. While the parameters of the initial speech generation model are adjusted, these learnable center features are adjusted as well, so that they mine the global acoustic characteristics of the speaking object that are hard to annotate. Once the model parameters are fixed and the speech generation model is obtained, the adjustment of the learnable center features is complete and they become the learned center features used in practice.
On this basis, the server uses the learned center features as the query and the acoustic hidden-layer features as key and value for cross-attention processing, thereby outputting the self-supervision prompt feature between the learned center features and the acoustic hidden-layer features. Then, as described above, the text prompt feature forms one part of the input data and the self-supervision prompt feature the other, and together they are fed into the speech generation model, which outputs the target speech that imitates the target object uttering the target text as speech content. For ease of understanding, as shown in fig. 8, the text embedding feature, the text hidden-layer feature and the fusion feature obtained in the earlier example are added to obtain the text prompt feature 802; the acoustic hidden-layer feature 804 and the learned center features 806 are fed to a cross-attention layer, which outputs the self-supervision prompt feature 808; the text prompt feature 802 and the self-supervision prompt feature 808 are then input to the trained speech generation model, which outputs the target speech 810.
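The self-supervision prompt can be sketched as below: randomly initialised, learnable center features act as the query and the acoustic hidden-layer features as key and value. The number of centers and the attention configuration are assumptions.

```python
import torch
import torch.nn as nn

class SelfSupervisedPrompt(nn.Module):
    def __init__(self, n_centers: int = 32, d_model: int = 256):
        super().__init__()
        # Learnable center features, adjusted jointly while the model is trained
        self.centers = nn.Parameter(torch.randn(1, n_centers, d_model))
        self.cross_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, acoustic_hidden: torch.Tensor) -> torch.Tensor:
        batch = acoustic_hidden.shape[0]
        query = self.centers.expand(batch, -1, -1)      # Query = learned center features
        prompt, _ = self.cross_attention(query, acoustic_hidden, acoustic_hidden)
        return prompt  # self-supervision prompt features: (batch, n_centers, d_model)
```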
It will be appreciated that the corresponding examples in the embodiments of the present application are for understanding the present solution, but should not be construed as a specific limitation on the present solution.
With the speech generation method above, the text embedding features of the target text contained in the speech generation request are extracted and feature-encoded to obtain the text hidden-layer features, and the acoustic features of the target object to which the reference audio belongs are fused with the text hidden-layer features to obtain the fusion features; target speech that imitates the target object uttering the target text is then generated from the text embedding features, the text hidden-layer features and the fusion features, which improves the accuracy of speech generation.
The manner in which the target speech is generated from the aforementioned features is described in detail below. In one embodiment, the target text comprises a plurality of text units arranged in sequence, a text unit being the smallest text object; if the target text is Chinese, the text units are Chinese characters. For example, if the target text is "晚安" ("good night"), it consists of the text units "晚" and "安" arranged in sequence, and the sequential arrangement of the text units is simply their order in the target text. Based on this, as shown in fig. 9, generating, based on the text embedding feature, the text hidden-layer feature and the fusion feature, target speech that imitates the target object uttering the target text as speech content includes:
step 902, based on the text embedded feature, the text hidden feature and the fusion feature, generating respective acoustic tokens of each text unit in the target text in turn.
The acoustic token of the text unit is used for describing sound information corresponding to the text unit in the audio, for example, the target text includes a text unit A1, a text unit A2, a text unit A3 and a text unit A4 which are sequentially arranged, and the generated acoustic token specifically includes: the acoustic token B1 of the text unit A1, the acoustic token B2 of the text unit A2, the acoustic token B3 of the text unit A3, and the acoustic token B4 of the text unit A4.
Specifically, the server sequentially generates the acoustic token of each text unit in the target text based on the text embedding feature, the text hidden-layer feature and the fusion feature. That is, it first generates the acoustic token of the first of the sequentially arranged text units from those three features; it then generates the acoustic token of the next text unit from the three features together with the acoustic tokens already obtained, and so on, until the acoustic token of the last text unit has been generated, at which point the acoustic tokens of all text units in the target text are complete. To decide when the acoustic token of the last text unit has been reached, an END marker can be set so that generation terminates when it is produced.
Further, according to the foregoing embodiment, it is specifically possible to generate a text prompt feature based on the text embedding feature, the text hidden layer feature and the fusion feature, and generate a self-supervision prompt feature based on the acoustic hidden layer feature and the learned center feature, so that the target voice is generated through the voice generation model based on the text prompt feature and the self-supervision prompt feature. Therefore, since the target text comprises a plurality of text units which are arranged in sequence, the acoustic tokens of each text unit in the target text can be sequentially generated through the voice generation model based on the text prompt characteristics and the self-supervision prompt characteristics. To facilitate understanding, further description is given with the foregoing example, as shown in fig. 10, text prompt feature 1001 and self-supervising prompt feature 1002 are input to a speech generation model, which outputs acoustic tokens 1003.
Similarly, the text prompt feature 1001, the self-supervising prompt feature 1002, and the acoustic token 1003 are again input to a speech generation model, which outputs the acoustic token 1004. By analogy, the text prompt feature 1001, the self-supervising prompt feature 1002, and the acoustic token 1004 are input to a speech generation model, the speech generation model outputs an acoustic token 1005, and the output acoustic token 1005 is specifically the acoustic token B3 of the text unit A3. And inputting the text prompt feature 1001, the self-supervising prompt feature 1002, and the acoustic token 1005 into a speech generation model, which outputs the acoustic token 1006. This can sequentially obtain an acoustic token 1003, an acoustic token 1004, an acoustic token 1005, and an acoustic token 1006. Wherein the output acoustic token 1003 is specifically an acoustic token B1 of the text unit A1, the output acoustic token 1004 is specifically an acoustic token B2 of the text unit A2, the output acoustic token 1005 is specifically an acoustic token B3 of the text unit A3, and the output acoustic token 1006 is specifically an acoustic token B4 of the text unit A4.
It will be appreciated that in practical applications, it is also possible to generate the acoustic token of the first text unit first, then generate the acoustic token of the first text unit and the subsequent text unit in a similar manner as described above, and thus, finally generate the acoustic token of the target text. That is, as shown in fig. 10, the acoustic token 1003 output at this time is specifically an acoustic token B1 of the text unit A1, the acoustic token 1004 output is specifically an acoustic token B2 of the text unit A1 and the text unit A2 which correspond in common, the acoustic token 1005 output is specifically an acoustic token B3 of the text unit A1, the text unit A2 and the text unit A3 which correspond in common, and the acoustic token 1006 output is specifically an acoustic token B4 of the text unit A1, the text unit A2, the text unit A3 and the text unit A4 which correspond in common.
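The first scheme (one acoustic token per step, terminated by an END marker) can be sketched as a simple autoregressive loop. The model interface, the greedy decoding and the END id below are assumptions; the patent only specifies that previously generated tokens are fed back in until an END marker is produced.

```python
import torch

END_TOKEN = 1024  # assumed id of the END (termination) marker

def generate_acoustic_tokens(model, text_prompt, self_sup_prompt, max_len=1000):
    generated = []
    for _ in range(max_len):
        prev = torch.tensor(generated, dtype=torch.long).unsqueeze(0)  # tokens so far
        logits = model(text_prompt, self_sup_prompt, prev)  # assumed next-token logits, shape (batch, vocab)
        next_token = int(logits.argmax(dim=-1))             # greedy choice, batch size 1
        if next_token == END_TOKEN:
            break
        generated.append(next_token)
    return generated  # acoustic tokens of the text units, in order
```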
Step 904, based on the respective acoustic tokens of each text unit, generating target voice which simulates the target object to take the target text as voice content according to the arrangement sequence among each text unit.
The arrangement order between the text units is simply the order of the text units in the target text. For example, if the target text is "晚安", the text units are "晚" and "安", and their order is "晚" followed by "安". Or, if the target text comprises text unit A1, text unit A2, text unit A3 and text unit A4 arranged in sequence, the order among the text units is: text unit A1, text unit A2, text unit A3, text unit A4.
Specifically, after the acoustic tokens of every text unit in the target text have been generated in turn through the preceding steps, the server splices them in the arrangement order of the text units to obtain the target speech. The splicing may first concatenate the acoustic tokens of the text units and then audio-decode the resulting sequence into the target speech; alternatively, the acoustic tokens of each text unit may be audio-decoded into per-unit speech first, and the unit speeches then spliced into the target speech. As noted earlier, in practice the acoustic tokens corresponding to the whole target text may also be generated directly, in which case audio-decoding those tokens yields the target speech; the way the target speech is obtained is therefore not limited here.
Both splicing-based ways of generating speech are described below, starting with the one that splices the acoustic tokens of the text units and then audio-decodes the spliced sequence into the target speech. In an alternative embodiment, generating, based on the acoustic tokens of each text unit and in the arrangement order of the text units, target speech that imitates the target object uttering the target text as speech content includes: generating an acoustic token sequence from the acoustic tokens of the text units according to their arrangement order; and performing audio decoding processing on the acoustic token sequence to generate the target speech.
The acoustic token sequence comprises acoustic tokens of text units with sequentially ordered target texts. Specifically, the server firstly performs sorting processing on the acoustic tokens of each text unit according to the arrangement sequence among each text unit, and then obtains an acoustic token sequence through the obtained sorting result. For example, the target text includes text unit A1, text unit A2, text unit A3, and text unit A4, which are sequentially arranged, and the arrangement order between the foregoing text units is: text element A1, text element A2, text element A3, text element A4, then the acoustic token B1 of text element A1, acoustic token B2 of text element A2, acoustic token B3 of text element A3, and acoustic token B4 of text element A4 are ordered sequentially based on the foregoing order, i.e., an acoustic token sequence: [ Acoustic token B1], [ Acoustic token B2], [ Acoustic token B3], [ Acoustic token B4] ".
Further, the server performs audio decoding processing on the acoustic token sequence: it inputs the sequence to a sound decoder, which outputs the target speech that imitates the target object uttering the target text as speech content. The sound decoder may be a machine learning model trained in an unsupervised manner from a trained acoustic token extractor and unlabeled audio samples, and is used to decode input acoustic tokens into the corresponding speech.
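A sketch of this first splicing strategy is shown below; decoder.decode is assumed to be the decode side of the hypothetical codec sketched earlier.

```python
import torch

def tokens_to_speech(decoder, acoustic_tokens_per_unit):
    # acoustic_tokens_per_unit: list of 1-D LongTensors, one per text unit, in text order
    token_sequence = torch.cat(acoustic_tokens_per_unit, dim=0)  # acoustic token sequence
    return decoder.decode(token_sequence.unsqueeze(0))           # target speech waveform
```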
The following describes the manner in which the respective acoustic tokens of the text units are audio decoded to obtain respective unit voices of the text units, and then the unit voices are spliced to obtain target voices: in an alternative embodiment, based on the respective acoustic tokens of each text unit, generating a target voice of the simulated target object with the target text as voice content according to the arrangement sequence between each text unit includes: performing audio decoding processing on the acoustic tokens of each text unit, and respectively generating unit voices which are sent by the simulation target object by taking the text units as voice contents; and generating target voice which simulates the target object to take the target text as voice content according to the arrangement sequence among the text units based on the unit voice corresponding to each text unit.
A unit speech is speech uttered in the voice of the target object with a single text unit as its content; that is, one unit speech corresponds to one text unit. For example, suppose the target text is "晚安" and the reference audio is audio uttered by a cartoon character. The text units "晚" and "安" are obtained from the target text, and then a unit speech of the cartoon character saying "晚" and a unit speech of the cartoon character saying "安" can be obtained.
Specifically, the server performs audio decoding processing on the acoustic tokens of each text unit, and generates unit voices which are generated by the simulation target object by taking the text unit as voice content respectively, namely, the server inputs the acoustic tokens of each text unit to the voice decoder respectively, and outputs the unit voices of the text unit respectively through the voice decoder. Illustratively, further to the foregoing example, the acoustic token B1 of the text unit A1, the acoustic token B2 of the text unit A2, the acoustic token B3 of the text unit A3, and the acoustic token B4 of the text unit A4 are subjected to audio decoding processing on the acoustic token B1, the acoustic token B2, the acoustic token B3, and the acoustic token B4, respectively, so that the unit speech C1 of the text unit A1, the unit speech C2 of the text unit A2, the unit speech C3 of the text unit A3, and the unit speech C4 of the text unit A4 can be obtained.
Further, based on the unit speech corresponding to each text unit, the server generates, in the arrangement order of the text units, target speech that imitates the target object uttering the target text as speech content: it sorts the unit speeches according to that order and performs speech fusion processing to obtain the target speech. Continuing the earlier example, given the unit speech of the cartoon character for "晚" and the unit speech for "安", with the order "晚" followed by "安", the two unit speeches are fused to obtain the target speech of the cartoon character saying "晚安". Likewise, given unit speech C1 of text unit A1, unit speech C2 of text unit A2, unit speech C3 of text unit A3 and unit speech C4 of text unit A4, with the order A1, A2, A3, A4, similar processing yields the target speech uttered by the target object with text units A1 through A4 as its content; whether those text units form continuous or discontinuous content is determined by the actual situation and is not to be understood as the speech content being discontinuous.
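The second splicing strategy can be sketched as below: each text unit's tokens are decoded into a unit speech waveform, and the waveforms are concatenated in text order. Any smoothing at the joins is an implementation choice the text does not fix; the decoder interface is the same assumed one as above.

```python
import torch

def units_to_speech(decoder, acoustic_tokens_per_unit):
    unit_waveforms = [decoder.decode(tokens.unsqueeze(0))   # unit speech per text unit
                      for tokens in acoustic_tokens_per_unit]
    return torch.cat(unit_waveforms, dim=-1)                # target speech in text-unit order
```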
It will be appreciated that the corresponding examples in the embodiments of the present application are for understanding the present solution, but should not be construed as a specific limitation on the present solution.
In this embodiment, an acoustic token is obtained for each text unit in the target text, so that the timbre, prosody and emotion information in the reference audio are deeply mined for each text unit at a finer granularity, which ensures the reliability of the obtained acoustic tokens. The acoustic token of each text unit is then converted through the speech decoder to obtain the target voice, so that fast conversion from acoustic tokens to audio is realized, and zero-shot synthesis of the target voice from the reference audio is achieved.
In one embodiment, the feature encoding process, the feature fusion process and the speech generation process are all performed through a speech generation model. That is, the text embedded feature of the target text is extracted through the speech generation model, feature encoding processing is performed on the text embedded feature through the speech generation model to obtain the text hidden layer feature, the acoustic feature of the target object to which the reference audio belongs is extracted through the speech generation model based on the reference audio, and feature fusion processing is performed on the acoustic feature and the text hidden layer feature through the speech generation model to obtain the fusion feature, so that the target speech which simulates the target object taking the target text as speech content is generated through the speech generation model based on the text embedded feature, the text hidden layer feature and the fusion feature. The following describes how the speech generation model is acquired; as shown in fig. 11, the acquisition of the speech generation model includes:
In step 1102, a sample text and a reference sample audio are obtained, where the reference sample audio is a voice uttered by a sample object with the sample text as a voice content.
The sample object is similar to the target object described above, and may be any object such as a person or a virtual character; the sample object information may further include the emotion of the sample object, the scene in which the sample object speaks, other information related to the sample object, and the like. The reference sample audio is a voice sent by the sample object with the sample text as the voice content, and is similar to the reference audio, and will not be described herein again.
Specifically, when the server needs to train the speech generation model, the server first acquires the sample text and the reference sample audio. The server may obtain the sample text and the reference sample audio input by the user from the terminal through a communication connection with the terminal, or the server may obtain the pre-stored sample text and reference sample audio from the data storage system. It will be appreciated that the specific manner in which the sample text and the reference sample audio are acquired is not limited herein.
In step 1104, based on the sample text and the reference sample audio, the sample acoustic hidden features, the sample text embedded features, the sample text hidden features, and the sample fusion features are obtained through the initial speech generation model.
The sample acoustic hidden layer feature, the sample text embedded feature, the sample text hidden layer feature and the sample fusion feature are similar to the acoustic hidden layer feature, the text embedded feature, the text hidden layer feature and the fusion feature in the foregoing embodiments, respectively, and are not described herein again. Specifically, the server performs text embedding on the sample text to obtain the sample text embedded feature, then inputs the sample text embedded feature into a text encoder, and performs text encoding on the sample text embedded feature through the text encoder to output the sample text hidden layer feature.
Further, the server inputs the reference sample audio into the audio coder to obtain a reference sample acoustic token of the reference sample audio through audio coding processing of the audio coder, the reference sample acoustic token is used for representing acoustic characteristics of a sample object to which the reference sample audio belongs, then the server inputs the reference sample acoustic token into the acoustic token coder, and the acoustic token coder carries out acoustic token coding on the reference sample acoustic token, so that the acoustic hidden layer characteristics of the sample are output through the acoustic token coder.
On the basis of the foregoing process, the server further inputs the sample acoustic hidden layer features and the sample text hidden layer features to a cross attention layer, and cross attention processing is performed on the sample acoustic hidden layer features and the sample text hidden layer features through the cross attention layer, so that the cross attention layer outputs sample fusion features between the sample acoustic hidden layer features and the sample text hidden layer features. The specific implementation is similar to the previous embodiment, and will not be repeated here.
The manner in which the foregoing features are generated by the initial speech generation model will be described in detail below: in a specific embodiment, obtaining the sample acoustic hidden layer feature, the sample text embedded feature, the sample text hidden layer feature and the sample fusion feature through the initial speech generation model based on the sample text and the reference sample audio comprises: extracting sample acoustic features of the sample object to which the reference sample audio belongs through the initial speech generation model, and performing feature encoding processing on the sample acoustic features to obtain sample acoustic hidden layer features; extracting sample text embedded features of the sample text through the initial speech generation model, and performing feature encoding processing on the sample text embedded features to obtain sample text hidden layer features; and performing feature fusion processing on the sample text hidden layer features and the sample acoustic hidden layer features through the initial speech generation model to obtain sample fusion features.
Specifically, the server performs audio coding processing on the reference sample audio through an audio coder in the initial speech generation model to obtain a reference sample acoustic token of the reference sample audio, and then performs acoustic token coding on the reference sample acoustic token through an acoustic token coder in the initial speech generation model, so that the sample acoustic hidden layer characteristics are output through the acoustic token coder in the initial speech generation model.
Based on the above, the server also performs text embedding on the sample text through the initial speech generation model to obtain sample text embedded features, and performs text encoding on the sample text embedded features through a text encoder in the initial speech generation model, so as to output sample text hidden layer features through the text encoder in the initial speech generation model. After the sample acoustic hidden layer features are obtained, the server performs cross attention processing on the sample acoustic hidden layer features and the sample text hidden layer features through the cross attention layer in the initial speech generation model, so that the sample fusion features are output by the cross attention layer in the initial speech generation model.
In step 1106, based on the sample acoustic hidden features, the sample text embedded features, the sample text hidden features, and the sample fusion features, a predicted speech generated by simulating the sample object with the sample text as speech content is generated by the initial speech generation model.
Wherein, the predicted voice is the voice generated by simulating the sample object to take the sample text as voice content. Specifically, the server uses the sample acoustic hidden layer feature, the sample text embedded feature, the sample text hidden layer feature and the sample fusion feature as input data of an initial speech generation model, and generates predicted speech which simulates a sample object and is sent by taking the sample text as speech content through the initial speech generation model. The manner of obtaining the predicted speech is similar to that of obtaining the target speech in the foregoing embodiment, and will not be described here again.
As can be seen from the foregoing description, the text may include a plurality of sequentially arranged text units; thus, in an alternative embodiment, the sample text includes a plurality of sequentially arranged sample text units. On this basis, generating, through the initial speech generation model, the predicted voice which simulates the sample object sending the sample text as voice content based on the sample acoustic hidden layer feature, the sample text embedded feature, the sample text hidden layer feature and the sample fusion feature includes the following steps: performing feature fusion on the learnable center features and the sample acoustic hidden layer features through the initial speech generation model to obtain self-supervision sample features; based on the self-supervision sample features, the sample text embedded features, the sample text hidden layer features and the sample fusion features, sequentially generating respective sample acoustic tokens of each sample text unit in the sample text through the initial speech generation model; and based on the respective sample acoustic tokens of each sample text unit, generating, through the initial speech generation model, predicted voice which simulates the sample object sending the sample text as voice content according to the arrangement sequence among the sample text units.
In this method, during the training process of the speech generation model, a plurality of learnable center features are randomly initialized and used as the Query, and the acoustic hidden layer features are used as the Key and the Value to generate self-supervision prompt features. During the model parameter adjustment of the initial speech generation model, the plurality of learnable center features can be adjusted so as to mine global acoustic features of the sounding target object that are difficult to label; therefore, when the model parameters are determined to obtain the speech generation model, the adjustment of the learnable center features is also completed, and the learnable center features used in practical application are obtained. Specifically, the server performs cross attention processing on the learnable center features and the sample acoustic hidden layer features through a cross attention layer in the initial speech generation model, so as to obtain the self-supervision prompt features after the cross attention processing of the learnable center features and the sample acoustic hidden layer features.
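As an illustrative, non-limiting sketch of this mechanism (the number of center features, the feature dimension, the number of attention heads and the module name below are assumptions for illustration), randomly initialized learnable center features can be registered as parameters and used as the Query of a cross attention layer whose Key and Value are the sample acoustic hidden layer features:

```python
import torch
from torch import nn


class SelfSupervisedPrompt(nn.Module):
    """Learnable center features attend over acoustic hidden-layer features via cross attention."""

    def __init__(self, num_centers: int = 32, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Randomly initialized learnable center features, adjusted together with the model parameters.
        self.centers = nn.Parameter(torch.randn(num_centers, dim))
        self.cross_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, acoustic_hidden: torch.Tensor) -> torch.Tensor:
        # acoustic_hidden: (batch, frames, dim) sample acoustic hidden-layer features.
        batch_size = acoustic_hidden.size(0)
        query = self.centers.unsqueeze(0).expand(batch_size, -1, -1)   # center features as Query
        prompt, _ = self.cross_attention(query, acoustic_hidden, acoustic_hidden)  # Key = Value = acoustic
        return prompt                                                  # self-supervision prompt features


prompt = SelfSupervisedPrompt()(torch.randn(2, 120, 256))              # -> shape (2, 32, 256)
```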
Further, the server may generate sample text prompt features through the initial speech generation model based on the sample text embedded features, the sample text hidden layer features and the sample fusion features. The server may add the sample text embedded features, the sample text hidden layer features and the sample fusion features through the initial speech generation model to obtain the sample text prompt features, where the feature addition may be direct addition of the features, or corresponding feature weights may be assigned to the different features so that the features are weighted based on the feature weights and then added, which is not specifically limited herein.
On this basis, the server takes the sample text prompt features as one part of the input data and the self-supervision sample features as another part of the input data, so that the sample text prompt features and the self-supervision sample features together form the input data of the initial speech generation model; the respective sample acoustic tokens of each sample text unit in the sample text are thereby generated in sequence through the initial speech generation model, and then, based on the respective sample acoustic tokens of each sample text unit, the predicted voice which simulates the sample object sending the sample text as voice content is generated through the initial speech generation model according to the arrangement sequence among the sample text units.
Further, after generating respective sample acoustic tokens of each sample unit in the sample text in turn, the server specifically needs to splice the respective sample acoustic tokens of the obtained sample text units in turn according to the arrangement sequence between each sample unit, so as to obtain predicted speech. The splicing manner is similar to that described in the foregoing embodiment, and may be that the respective sample acoustic tokens of the sample text units are spliced first, and then the sequence obtained after the splicing is subjected to audio decoding, so as to obtain the predicted speech. Or the respective acoustic tokens of the sample text units can be subjected to audio decoding to obtain respective sample unit voices of the sample text units, and then the sample unit voices are spliced to obtain predicted voices. It can be understood that in practical application, the sample acoustic token corresponding to the sample text can also be directly generated, and at this time, the sample acoustic token corresponding to the sample text is subjected to audio decoding processing, so that the predicted speech can be obtained. The manner in which the predicted speech is obtained is therefore not limited herein.
In step 1108, model parameters of the initial speech generation model are adjusted through the reference sample audio and the predicted speech to generate the speech generation model.
The model parameters are adjusted based on the loss value between the reference sample audio and the predicted speech, and the loss value between the reference sample audio and the predicted speech needs to be computed based on a loss function (cost function). A loss function is a function for evaluating the degree of difference between the predicted value and the actual value of a neural network model; the smaller the value of the loss function, the better the performance of the neural network model. Therefore, the training process of the model is a process of minimizing the loss function value by adjusting the model parameters. Different neural network models use different loss functions; a cross entropy loss function or the like may be used in this embodiment.
Specifically, the server adjusts the model parameters of the initial speech generation model through the reference sample audio and the predicted speech to generate the speech generation model. The server determines a loss value between the reference sample audio and the predicted voice based on the loss function, then judges through the loss value whether the loss function of the speech generation model reaches a convergence condition, and if the loss function does not reach the convergence condition, adjusts the model parameters of the initial speech generation model by using the loss value. This continues until the loss function of the speech generation model reaches the convergence condition, and the speech generation model is obtained from the model parameters obtained after the last model parameter adjustment, so that the trained speech generation model can perform speech generation on the target text and the reference audio in practical application.
The convergence condition of the foregoing loss function may be that the loss value is less than or equal to a first preset threshold; for example, the first preset threshold may be 0.005, 0.01, 0.02 or another value approaching 0. Alternatively, the difference between the loss values obtained in two adjacent iterations may be less than or equal to a second preset threshold, where the second preset threshold may be the same as or different from the first preset threshold; for example, the second preset threshold may be 0.005, 0.01, 0.02 or another value approaching 0. Alternatively, the model parameter updating of the initial speech generation model may reach an update iteration threshold. In practical application, other convergence conditions may also be adopted, which is not limited herein.
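As an illustrative, non-limiting sketch (the threshold values and the function name below are assumptions for illustration; the embodiment only requires thresholds approaching 0 or an update-iteration limit), these convergence conditions can be checked against the recorded loss values as follows:

```python
from typing import List


def converged(
    losses: List[float],
    loss_threshold: float = 0.01,     # first preset threshold on the loss value itself
    delta_threshold: float = 0.01,    # second preset threshold on adjacent loss values
    max_updates: int = 100_000,       # update-iteration threshold
) -> bool:
    """Return True when any of the convergence conditions described above is met."""
    if not losses:
        return False
    if losses[-1] <= loss_threshold:
        return True
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) <= delta_threshold:
        return True
    return len(losses) >= max_updates
```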
As can be seen from the foregoing examples, in the training process of the speech generation model, a plurality of learnable center features are randomly initialized to serve as the Query, and the acoustic hidden layer features are used as the Key and the Value to generate the self-supervision prompt features; in the model parameter adjustment process of the initial speech generation model, the plurality of learnable center features can be adjusted. A training mode that takes the learnable center features into account is described below: in an alternative embodiment, adjusting the model parameters of the initial speech generation model through the reference sample audio and the predicted speech to generate the speech generation model comprises: constructing a sample acoustic token sequence based on the respective sample acoustic tokens of each sample text unit; and adjusting the model parameters and the learnable center features of the initial speech generation model through the reference sample acoustic token of the reference sample audio and the sample acoustic token sequence to generate the speech generation model.
The sample acoustic token sequence comprises the sample acoustic tokens of the sample text units sequentially ordered in the sample text. Specifically, the server first sorts the respective sample acoustic tokens of each sample text unit according to the arrangement sequence among the sample text units, and then obtains the sample acoustic token sequence from the sorting result; the construction manner is similar to that of the acoustic token sequence, and is not repeated here.
The sample acoustic token sequence is in effect a digital (token) representation of audio, while the reference sample audio is the original audio itself; that is, the loss function cannot directly calculate the difference between the two. At this time, the reference sample audio needs to be encoded to obtain the reference sample acoustic token of the reference sample audio, where the reference sample acoustic token is used for characterizing the acoustic features of the sample object to which the reference sample audio belongs. On this basis, the server determines a loss value between the reference sample acoustic token and the sample acoustic token sequence based on the loss function, then judges through the loss value whether the loss function of the speech generation model reaches the convergence condition, and if the loss function does not reach the convergence condition, adjusts the model parameters and the learnable center features of the initial speech generation model by using the loss value. This continues until the loss function of the speech generation model reaches the convergence condition, and the speech generation model is obtained from the model parameters and the learnable center features obtained after the last adjustment. The learnable center features are randomly initialized features that are learned along with the training of the speech generation model, that is, they can be learned when the model parameters of the initial speech generation model are adjusted, so that the learnable center features learn how to better perform cross attention processing with the acoustic hidden layer features, and the required self-supervision features can be obtained to mine global acoustic features of the sounding object that are difficult to label.
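As an illustrative, non-limiting sketch (the codebook size, sequence length and tensor shapes below are assumptions for illustration, and the stand-in logits replace the actual model output), the loss between the reference sample acoustic tokens and the generated sample acoustic token sequence can be computed with a cross entropy loss over the token codebook; when the learnable center features are registered as model parameters, a single optimizer step adjusts them together with the other model parameters:

```python
import torch
from torch import nn

codebook_size, seq_len, batch_size = 1024, 75, 2

# Stand-in for the initial model's output: logits over the acoustic token codebook
# at every position of the sample acoustic token sequence.
logits = torch.randn(batch_size, seq_len, codebook_size, requires_grad=True)
reference_tokens = torch.randint(0, codebook_size, (batch_size, seq_len))   # reference sample acoustic tokens

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.reshape(-1, codebook_size), reference_tokens.reshape(-1))
loss.backward()

# In a full training step, registering the learnable center features as nn.Parameter makes
# them part of model.parameters(), so e.g. torch.optim.Adam(model.parameters(), lr=1e-4)
# adjusts the model parameters and the learnable center features together.
```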
It will be appreciated that the corresponding examples in the embodiments of the present application are for understanding the present solution, but should not be construed as a specific limitation on the present solution.
In this embodiment, in the feature extraction stage, multi-dimensional and deep feature extraction is performed on the target text, and the acoustic features of the target object that fit the text can be mined more effectively through the feature fusion processing. In the model training stage, the difference between the reference sample audio and the predicted speech is used to adjust the model parameters, and in the model parameter adjustment process the learnable center features are further introduced to mine global acoustic features of the sounding target object that are difficult to label. Therefore, in practical application, more accurate target speech can be obtained through the model, and the accuracy of speech generation is further improved.
Based on the foregoing detailed description of the embodiments, a complete flow of the method for generating speech in the embodiments of the present application will be described, and in one embodiment, as shown in fig. 12, a method for generating speech is provided, where the method is applied to the server 104 in fig. 1, and is illustrated by way of example, it will be understood that the method may also be applied to the terminal 102, and may also be applied to a system including the terminal 102 and the server 104, and implemented through interaction between the terminal 102 and the server 104. In this embodiment, the method includes the steps of:
In step 1201, in response to the speech generation request, a target text and a reference audio included in the speech generation request are acquired, where the target text includes a plurality of text units arranged in sequence.
Specifically, in a scenario where speech generation is required, the user who requires speech generation inputs the target text and the reference audio. The user may input both the target text and the reference audio through the terminal, or may input the target text through the terminal and then select the desired reference audio from a selectable reference audio list; the manner in which the user inputs the target text and the reference audio is not specifically limited herein. The terminal then generates a speech generation request based on the target text and the reference audio, and sends the speech generation request including the target text and the reference audio to the server performing speech generation, so that the server receives the speech generation request and responds to it by acquiring the target text and the reference audio included in the speech generation request.
Step 1202, extracting text embedded features of the target text, and performing feature encoding processing on the text embedded features to obtain text hidden layer features.
Specifically, the server performs vectorization processing on the target text to obtain text embedding characteristics. Namely, the server specifically performs text embedding on the target text, so that the text embedding characteristics of the target text are obtained. Further, the server inputs the text embedded feature into a text encoder, and the text encoder encodes the text embedded feature to obtain the text hidden layer feature output by the text encoder.
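As an illustrative, non-limiting sketch of this step (the vocabulary size, feature dimension and encoder depth below are assumptions for illustration; the embodiment does not prescribe a particular text encoder architecture), the text embedded features and text hidden layer features can be obtained with an embedding layer followed by an encoder:

```python
import torch
from torch import nn


class TextEncoder(nn.Module):
    """Vectorize the target text into embedded features, then encode them into hidden-layer features."""

    def __init__(self, vocab_size: int = 8000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                     # text embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)      # feature encoding processing

    def forward(self, token_ids: torch.Tensor):
        text_embedded = self.embed(token_ids)       # (batch, text_units, dim) text embedded features
        text_hidden = self.encoder(text_embedded)   # (batch, text_units, dim) text hidden layer features
        return text_embedded, text_hidden


ids = torch.randint(0, 8000, (1, 4))                # e.g. a target text with four text units
text_embedded, text_hidden = TextEncoder()(ids)
```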
In step 1203, the reference audio is subjected to audio encoding and decoding to obtain a reference acoustic token of the reference audio, where the reference acoustic token is used for characterizing acoustic characteristics of a target object to which the reference audio belongs.
Specifically, the server performs audio encoding and decoding on the reference audio through the audio encoder and decoder, that is, inputs the reference audio into the audio encoder and decoder, and then obtains the reference acoustic token of the reference audio output by the audio encoder and decoder. In this embodiment, the obtained reference acoustic token may be composed of the sequence numbers that encode the sound units corresponding to the sound information contained in the reference audio, where a sound unit is a minimum sound object in a sound codebook.
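As an illustrative, non-limiting sketch of what such codebook sequence numbers look like (the codebook size, feature dimension and frame features below are assumptions for illustration; the embodiment does not specify the internal structure of the audio encoder and decoder), each frame-level feature of the reference audio can be mapped to the index of its nearest sound unit in a sound codebook:

```python
import torch


def quantize_to_acoustic_tokens(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each frame feature to the sequence number of its nearest sound unit in the codebook."""
    # frames: (num_frames, dim); codebook: (codebook_size, dim)
    distances = torch.cdist(frames, codebook)   # pairwise distances to every sound unit
    return distances.argmin(dim=-1)             # one codebook index (acoustic token) per frame


codebook = torch.randn(1024, 128)               # assumed sound codebook of 1024 minimal sound units
frames = torch.randn(200, 128)                  # assumed frame-level features of the reference audio
reference_acoustic_tokens = quantize_to_acoustic_tokens(frames, codebook)   # shape (200,)
```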
And 1204, performing feature encoding processing on the acoustic features to obtain acoustic hidden layer features.
Specifically, the server performs feature encoding processing on the acoustic features to obtain the acoustic hidden layer features. Since the reference acoustic token is used for characterizing the acoustic features of the target object to which the reference audio belongs, the server specifically inputs the reference acoustic token into an acoustic token encoder, and the acoustic token encoder performs acoustic token encoding on the reference acoustic token, thereby obtaining the acoustic hidden layer features output by the acoustic token encoder.
And 1205, performing feature fusion processing on the acoustic hidden layer features and the text hidden layer features to obtain fusion features.
The server inputs the acoustic hidden layer features and the text hidden layer features to a cross attention layer, and cross attention processing is carried out on the acoustic hidden layer features and the text hidden layer features through the cross attention layer, so that the cross attention layer outputs fusion features between the acoustic hidden layer features and the text hidden layer features. The server uses the text hidden layer feature as a Query, and uses the acoustic hidden layer feature as Value and Key to perform cross attention processing, so that fusion features between the acoustic hidden layer feature and the text hidden layer feature are output.
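As an illustrative, non-limiting sketch of this cross attention processing (the feature dimension, number of heads and sequence lengths below are assumptions for illustration), the text hidden layer features serve as the Query and the acoustic hidden layer features serve as the Key and the Value:

```python
import torch
from torch import nn

dim = 256
cross_attention = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

text_hidden = torch.randn(1, 4, dim)         # Query: text hidden layer features (one per text unit)
acoustic_hidden = torch.randn(1, 120, dim)   # Key and Value: acoustic hidden layer features

fusion, _ = cross_attention(query=text_hidden, key=acoustic_hidden, value=acoustic_hidden)
# fusion: (1, 4, dim) fusion features between the acoustic and text hidden layer features,
# aligned with the text hidden layer features.
```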
In step 1206, text prompt features are generated based on the text embedding features, the text hidden features, and the fusion features.
Specifically, the server adds the text embedded feature, the text hidden layer feature and the fusion feature to obtain the text prompt feature, where the feature addition may be direct addition of the features, or corresponding feature weights may be assigned to the different features so that the features are weighted based on the feature weights and then added, which is not specifically limited herein.
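As an illustrative, non-limiting sketch (the tensor shapes and the specific weight values below are assumptions for illustration), the two addition options mentioned above look as follows:

```python
import torch

text_embedded = torch.randn(1, 4, 256)   # text embedded features
text_hidden = torch.randn(1, 4, 256)     # text hidden layer features
fusion = torch.randn(1, 4, 256)          # fusion features

text_prompt = text_embedded + text_hidden + fusion                        # direct addition
weighted_prompt = 0.2 * text_embedded + 0.3 * text_hidden + 0.5 * fusion  # weighted addition
```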
In step 1207, the acoustic tokens of each text unit in the target text are sequentially generated through the text prompt feature and the acoustic hidden layer feature.
Specifically, the respective acoustic tokens of each text unit in the target text are generated in sequence through the text prompt feature and the acoustic hidden layer feature. The server may first generate the acoustic token of the first text unit among the sequentially arranged text units through the text prompt feature and the acoustic hidden layer feature, and then generate the acoustic token of the next text unit through the text prompt feature, the acoustic hidden layer feature and the acoustic tokens already obtained, until the acoustic token of the last text unit among the sequentially arranged text units is generated, that is, the generation of the acoustic tokens of all text units in the target text is completed. As for determining that the last text unit has been reached, an END identifier may be set, so that generation terminates once the acoustic token corresponding to the END identifier is produced. It will be appreciated that, in practical application, it is also possible to generate the acoustic token of the first text unit first and then generate the acoustic tokens of the subsequent text units in a similar manner to that described above, thereby finally generating the acoustic token of the target text.
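As an illustrative, non-limiting sketch of this unit-by-unit generation (the step function, END identifier value and maximum number of units below are assumptions for illustration; a real system would call the trained speech generation model at each step), the loop feeds the already generated tokens back in and stops when the END identifier is produced:

```python
from typing import Callable, List, Optional


def generate_acoustic_tokens(
    step: Callable[[object, object, List[int]], int],   # stand-in for one model prediction step
    text_prompt: Optional[object],
    acoustic_hidden: Optional[object],
    end_token: int,
    max_units: int = 64,
) -> List[int]:
    """Generate acoustic tokens text unit by text unit until the END identifier is produced."""
    generated: List[int] = []
    for _ in range(max_units):
        next_token = step(text_prompt, acoustic_hidden, generated)   # next text unit's acoustic token
        if next_token == end_token:                                  # END identifier terminates generation
            break
        generated.append(next_token)
    return generated


# Usage with a dummy step function that emits four tokens and then the END identifier.
dummy_step = lambda p, a, g: 2 if len(g) >= 4 else len(g) + 10
tokens = generate_acoustic_tokens(dummy_step, None, None, end_token=2)   # -> [10, 11, 12, 13]
```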
Step 1208, generating target voice of the simulated target object with the target text as voice content according to the arrangement sequence between each text unit based on the respective acoustic tokens of each text unit.
Specifically, through the foregoing steps, respective acoustic tokens of each text unit in the target text may be sequentially generated, and then the server needs to sequentially splice the respective acoustic tokens of the obtained text units according to the arrangement sequence between each text unit, so as to obtain the target voice. The splicing mode may be to splice the acoustic tokens of the text units, and then to perform audio decoding on the sequence obtained after splicing to obtain the target voice. Or the respective acoustic tokens of the text units can be subjected to audio decoding to obtain respective unit voices of the text units, and then the unit voices are spliced to obtain target voices. It can be understood that, as can be seen from the foregoing description, in practical application, the acoustic token corresponding to the target text may also be directly generated, and at this time, the audio decoding process is performed on the acoustic token corresponding to the target text, so that the target voice can be obtained, and therefore, the manner how to obtain the target voice is not limited herein.
It should be understood that the specific implementation of steps 1201 to 1208 is similar to the previous embodiment, and will not be repeated here.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a voice generating device for realizing the above related voice generating method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the speech generating device provided below may refer to the limitation of the method for generating speech described above, which is not repeated here.
In one embodiment, as shown in fig. 13, there is provided a voice generating apparatus including: an acquisition module 1302, a feature encoding module 1304, a feature fusion module 1306, and a speech generation module 1308, wherein:
An obtaining module 1302, configured to obtain, in response to the speech generation request, a target text and a reference audio included in the speech generation request;
The feature encoding module 1304 is configured to extract a text embedded feature of the target text, and perform feature encoding processing on the text embedded feature to obtain a text hidden layer feature;
The feature fusion module 1306 is configured to extract an acoustic feature of a target object to which the reference audio belongs based on the reference audio, and perform feature fusion processing on the acoustic feature and the text hidden layer feature to obtain a fusion feature;
The speech generating module 1308 is configured to generate a target speech that simulates a target object to send out a target text as speech content, based on the text embedding feature, the text hidden feature and the fusion feature.
In one embodiment, the feature fusion module 1306 is specifically configured to perform feature encoding processing on the acoustic feature to obtain an acoustic hidden layer feature; and carrying out feature fusion processing on the acoustic hidden layer features and the text hidden layer features to obtain fusion features.
In one embodiment, the speech generation module 1308 is specifically configured to generate text prompt features based on the text embedded features, the text hidden features, and the fusion features; and generating target voice which simulates the target object to send by taking the target text as voice content through the text prompt feature and the acoustic hidden layer feature.
In one embodiment, the feature fusion module 1306 is specifically configured to perform audio encoding and decoding on the reference audio to obtain a reference acoustic token of the reference audio, where the reference acoustic token is used to characterize acoustic features of a target object to which the reference audio belongs.
In one embodiment, the target text includes a plurality of text units arranged in sequence;
The speech generation module 1308 is specifically configured to sequentially generate an acoustic token for each text unit in the target text based on the text embedding feature, the text hidden feature and the fusion feature; and generating target voice which simulates the target object to send by taking the target text as voice content according to the arrangement sequence among the text units based on the respective acoustic tokens of the text units.
In one embodiment, the speech generation module 1308 is specifically configured to generate, based on the respective acoustic tokens of each text unit, a sequence of acoustic tokens according to an arrangement order between each text unit; and performing audio decoding processing on the acoustic token sequence to generate target voice which is sent by the simulated target object by taking the target text as voice content.
In one embodiment, the voice generating module 1308 is specifically configured to perform audio decoding processing on the acoustic token of each text unit, and generate unit voices that are generated by the simulation target object with the text unit as voice content respectively; and generating target voice which simulates the target object to take the target text as voice content according to the arrangement sequence among the text units based on the unit voice corresponding to each text unit.
In one embodiment, the feature encoding process, the feature fusion process, and the speech generation process are all performed by a speech generation model; the voice generating device also comprises a model training module;
The model training module is used for acquiring a sample text and a reference sample audio, wherein the reference sample audio is a voice sent by a sample object by taking the sample text as voice content; based on the sample text and the reference sample audio, obtaining sample acoustic hidden layer features, sample text embedded features, sample text hidden layer features and sample fusion features through an initial speech generation model; based on the sample acoustic hidden layer features, the sample text embedded features, the sample text hidden layer features and the sample fusion features, generating a predicted voice which simulates the sample object taking the sample text as voice content through the initial speech generation model; and adjusting the model parameters of the initial speech generation model through the reference sample audio and the predicted speech to generate the speech generation model.
In one embodiment, the model training module is specifically configured to extract, through an initial speech generation model, a sample acoustic feature of a sample object to which a reference sample audio belongs, and perform feature encoding processing on the sample acoustic feature to obtain a sample acoustic hidden layer feature; extracting sample text embedded features of a sample text through an initial speech generation model, and carrying out feature coding processing on the sample text embedded features to obtain sample text hidden layer features; and carrying out feature fusion processing on the sample text hidden layer features and the sample acoustic hidden layer features through an initial speech generation model to obtain sample fusion features.
In one embodiment, the sample text includes a plurality of sample text units arranged in sequence;
The model training module is specifically used for performing feature fusion on the learnable center features and the sample acoustic hidden layer features through the initial speech generation model to obtain self-supervision sample features; based on the self-supervision sample features, the sample text embedded features, the sample text hidden layer features and the sample fusion features, sequentially generating respective sample acoustic tokens of each sample text unit in the sample text through the initial speech generation model; and based on the respective sample acoustic tokens of each sample text unit, generating, through the initial speech generation model, predicted voice which simulates the sample object taking the sample text as voice content according to the arrangement sequence among the sample text units.
In one embodiment, the model training module is specifically configured to construct a sample acoustic token sequence based on the respective sample acoustic tokens of each sample text unit; and to adjust the model parameters and the learnable center features of the initial speech generation model through the reference sample acoustic token of the reference sample audio and the sample acoustic token sequence to generate the speech generation model.
In one embodiment, a computer device is provided, which may be a server or a terminal, and in this embodiment, the computer device is taken as a server to be described as an example, and the internal structure thereof may be as shown in fig. 14. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as target text, reference audio and the like which are relevant to the embodiment of the application. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of speech generation.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the object information (including, but not limited to, object device information, object personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the object or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magneto-resistive random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (PHASE CHANGE Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (15)

1. A method of speech generation, comprising:
responding to a voice generation request, and acquiring target text and reference audio contained in the voice generation request;
extracting text embedded features of the target text, and performing feature coding processing on the text embedded features to obtain text hidden layer features;
extracting acoustic features of a target object to which the reference audio belongs based on the reference audio, and carrying out feature fusion processing on the acoustic features and the text hidden layer features to obtain fusion features;
and generating target voice which simulates the target object to send by taking the target text as voice content based on the text embedded feature, the text hidden layer feature and the fusion feature.
2. The method according to claim 1, wherein the feature fusion processing of the acoustic feature and the text hidden feature to obtain a fused feature includes:
Performing feature coding processing on the acoustic features to obtain acoustic hidden layer features;
And carrying out feature fusion processing on the acoustic hidden layer features and the text hidden layer features to obtain fusion features.
3. The method of claim 2, wherein generating a target voice simulating the target object uttered by the target text as voice content based on the text-embedded feature, the text-hidden feature, and the fusion feature comprises:
Generating a text prompt feature based on the text embedding feature, the text hidden layer feature and the fusion feature;
And generating target voice which simulates the target object to send by taking the target text as voice content through the text prompt feature and the acoustic hidden layer feature.
4. The method according to claim 1, wherein the extracting acoustic features of the target object to which the reference audio belongs based on the reference audio comprises:
And carrying out audio encoding and decoding on the reference audio to obtain a reference acoustic token of the reference audio, wherein the reference acoustic token is used for representing acoustic characteristics of a target object to which the reference audio belongs.
5. The method of claim 1, wherein the target text comprises a plurality of text units arranged in sequence;
the generating, based on the text embedded feature, the text hidden feature and the fusion feature, a target voice simulating the target object to send out with the target text as voice content includes:
sequentially generating respective acoustic tokens of each text unit in the target text based on the text embedding feature, the text hidden layer feature and the fusion feature;
And generating target voice which simulates the target object to send by taking the target text as voice content according to the arrangement sequence among the text units based on the respective acoustic tokens of the text units.
6. The method of claim 5, wherein generating a target voice simulating the target object speaking the target text as voice content based on the respective acoustic tokens of each text unit in the arrangement order between each text unit, comprises:
Generating an acoustic token sequence according to the arrangement sequence among the text units based on the acoustic tokens of the text units;
And performing audio decoding processing on the acoustic token sequence to generate target voice which simulates the target object to send by taking the target text as voice content.
7. The method of claim 5, wherein generating a target voice simulating the target object speaking the target text as voice content based on the respective acoustic tokens of each text unit in the arrangement order between each text unit, comprises:
Performing audio decoding processing on the acoustic tokens of each text unit, and respectively generating unit voices which simulate the target object and are sent by taking the text units as voice contents;
And generating target voice which simulates the target object to send by taking the target text as voice content according to the arrangement sequence among the text units based on the unit voice corresponding to each text unit.
8. The method of claim 1, wherein the feature encoding process, the feature fusion process, and the speech generation process are all performed by a speech generation model;
The method for acquiring the voice generation model comprises the following steps:
acquiring a sample text and a reference sample audio, wherein the reference sample audio is a voice sent by a sample object by taking the sample text as voice content;
Based on the sample text and the reference sample audio, obtaining sample acoustic hidden layer characteristics, sample text embedded characteristics, sample text hidden layer characteristics and sample fusion characteristics through an initial voice generation model;
Generating a predicted voice which simulates the sample object to send by taking the sample text as voice content through the initial voice generation model based on the sample acoustic hidden layer feature, the sample text embedded feature, the sample text hidden layer feature and the sample fusion feature;
And adjusting model parameters of the initial voice generation model through the reference sample audio and the predicted voice so as to generate the voice generation model.
9. The method of claim 8, wherein the obtaining, based on the sample text and the reference sample audio, the sample acoustic hidden feature, the sample text embedded feature, the sample text hidden feature, and the sample fusion feature by an initial speech generation model comprises:
extracting sample acoustic features of a sample object to which the reference sample audio belongs through an initial speech generation model, and performing feature coding processing on the sample acoustic features to obtain sample acoustic hidden layer features;
Extracting sample text embedded features of the sample text through the initial speech generation model, and performing feature coding processing on the sample text embedded features to obtain sample text hidden layer features;
And carrying out feature fusion processing on the sample text hidden layer features and the sample acoustic hidden layer features through the initial speech generation model to obtain sample fusion features.
10. The method of claim 8, wherein the sample text comprises a plurality of sample text units arranged in sequence;
The generating, by the initial speech generation model, a predicted speech simulating the sample object to be sent by using the sample text as speech content based on the sample acoustic hidden feature, the sample text embedded feature, the sample text hidden feature, and the sample fusion feature includes:
Feature fusion is carried out on the learnable center features and the sample acoustic hidden layer features through the initial voice generation model, so that self-supervision sample features are obtained;
Based on the self-supervision sample feature, the sample text embedding feature, the sample text hidden layer feature and the sample fusion feature, sequentially generating a sample acoustic token of each sample text unit in the sample text through the initial voice generation model;
Based on the respective sample acoustic tokens of each sample text unit, generating, through the initial speech generation model, predicted speech which simulates the sample object to be sent by taking the sample text as speech content according to the arrangement sequence among each sample text unit.
11. The method of claim 10, wherein said adjusting model parameters of the initial speech generation model to generate the speech generation model by the reference sample audio and the predicted speech comprises:
constructing a sample acoustic token sequence based on the respective sample acoustic tokens of each sample text unit;
And adjusting model parameters of the initial speech generation model and the learnable center features through the reference sample acoustic token of the reference sample audio and the sample acoustic token sequence to generate the speech generation model.
12. A speech generating apparatus, the apparatus comprising:
The acquisition module is used for responding to the voice generation request and acquiring target text and reference audio contained in the voice generation request;
The feature coding module is used for extracting text embedded features of the target text, and carrying out feature coding processing on the text embedded features to obtain text hidden layer features;
The feature fusion module is used for extracting the acoustic features of the target object to which the reference audio belongs based on the reference audio, and carrying out feature fusion processing on the acoustic features and the text hidden layer features to obtain fusion features;
And the voice generation module is used for generating target voice which simulates the target object to send by taking the target text as voice content based on the text embedding feature, the text hidden layer feature and the fusion feature.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202410580421.XA 2024-05-11 2024-05-11 Speech generation method, device, computer equipment and storage medium Active CN118173082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410580421.XA CN118173082B (en) 2024-05-11 2024-05-11 Speech generation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410580421.XA CN118173082B (en) 2024-05-11 2024-05-11 Speech generation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN118173082A true CN118173082A (en) 2024-06-11
CN118173082B CN118173082B (en) 2024-07-30

Family

ID=91350860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410580421.XA Active CN118173082B (en) 2024-05-11 2024-05-11 Speech generation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118173082B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735373A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Speech synthesis method, apparatus, device and storage medium
CN115620699A (en) * 2022-12-19 2023-01-17 深圳元象信息科技有限公司 Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
CN117316140A (en) * 2023-10-25 2023-12-29 腾讯科技(深圳)有限公司 Speech synthesis method, apparatus, device, storage medium, and program product
KR20240007550A (en) * 2022-07-08 2024-01-16 네이버 주식회사 Method and system for converting bilingual pronunciation sequence at speed capable of real time processing
CN117935773A (en) * 2023-12-28 2024-04-26 科大讯飞股份有限公司 Speech synthesis method, training method of speech synthesis model and related device

Also Published As

Publication number Publication date
CN118173082B (en) 2024-07-30


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant