
CN115641860A - Model training method, voice conversion method and device, equipment and storage medium - Google Patents

Model training method, voice conversion method and device, equipment and storage medium

Info

Publication number
CN115641860A
CN115641860A (application CN202211101803.7A)
Authority
CN
China
Prior art keywords: sample, audio data, audio, model, phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211101803.7A
Other languages
Chinese (zh)
Inventor
Zhang Xulong (张旭龙)
Wang Jianzong (王健宗)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202211101803.7A
Publication of CN115641860A
Legal status: Pending

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a model training method, a voice conversion method and apparatus, a device and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring sample audio data of a sample speaking object; inputting the sample audio data into a neural network model comprising a coding network and a decoding network; reconstructing the sample audio data through the coding network to obtain initial audio data; carrying out voice alignment on the initial audio data to obtain a sample audio embedded vector; decoupling the sample audio embedded vector, a pre-acquired sample pitch parameter and a sample timbre feature vector through the decoding network to obtain synthesized audio data; performing loss calculation on the synthesized audio data and the sample audio data through a loss function to obtain a model loss value; and updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain the voice conversion model. The voice conversion effect can be improved.

Description

Model training method, voice conversion method and device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model training method, a speech conversion method and apparatus, a device, and a storage medium.
Background
Speech conversion generally refers to converting the voice of one speaker into the voice of another speaker without changing the spoken content. When a common speech conversion model is used for speech conversion, the actual speech content and the style characteristics of the speaker cannot be well represented, so the speech conversion effect is poor. How to improve the speech conversion effect has therefore become an urgent technical problem to be solved.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a model training method, a voice conversion method and apparatus, a device and a storage medium, so as to improve the voice conversion effect.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a training method for a model, where the training method includes:
acquiring sample audio data of a sample speaking object; wherein the sample audio data comprises sample audio content and sample acoustic features, the sample acoustic features comprising sample timbre information and sample pitch information;
inputting the sample audio data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
reconstructing the sample audio data through the coding network to obtain initial audio data, wherein the initial audio data includes the sample audio content and the sample timbre information, and the initial audio data does not include the sample pitch information;
carrying out voice alignment on the initial audio data to obtain a sample audio embedded vector;
decoupling the sample audio embedded vector, a pre-acquired sample pitch parameter and a sample timbre feature vector through the decoding network to obtain synthesized audio data, wherein the sample timbre feature vector is used for representing the speaking style characteristics of the sample speaking object;
performing loss calculation on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value;
and updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain a voice conversion model.
In some embodiments, the reconstructing, by the coding network, the sample audio data to obtain initial audio data includes:
extracting parameters of the sample audio data through the coding network to obtain initial fundamental frequency parameters, aperiodic parameters and spectrum envelope parameters of the sample audio data;
carrying out mean value calculation on the initial fundamental frequency parameters to obtain target fundamental frequency parameters;
and performing voice reconstruction on the target fundamental frequency parameter, the aperiodic parameter and the spectrum envelope parameter through the coding network to obtain the initial audio data.
In some embodiments, the speech aligning the initial audio data to obtain a sample audio embedding vector comprises:
performing phoneme feature recognition on the initial audio data to obtain phoneme feature data, and obtaining a duration time sequence of the initial audio data according to the phoneme feature data;
and performing voice alignment on the initial audio data according to the duration time sequence to obtain the sample audio embedded vector.
In some embodiments, the performing phoneme feature recognition on the initial audio data to obtain phoneme feature data, and obtaining a duration sequence of the initial audio data according to the phoneme feature data includes:
performing framing processing on the initial audio data to obtain a plurality of audio segments;
identifying the audio segments according to a preset phoneme comparison table to obtain the phoneme categories of the initial audio data and the number of phonemes in each phoneme category;
and obtaining the duration time sequence according to the phoneme category and the phoneme number.
In some embodiments, said speech aligning said initial audio data according to said time duration sequence resulting in said sample audio embedding vector comprises:
embedding the initial audio data to obtain an audio text embedded vector;
segmenting the audio text embedded vectors according to the duration time sequence to obtain intermediate embedded vectors corresponding to each phoneme type, wherein the number of the intermediate embedded vectors of each phoneme type is the same as the number of phonemes of that phoneme type;
carrying out mean value calculation on the intermediate embedding vector of each phoneme type to obtain a candidate embedding vector corresponding to each phoneme type;
copying the candidate embedding vectors according to the number of the phonemes to obtain target embedding vectors corresponding to each phoneme type, wherein the number of the target embedding vectors of each phoneme type is the same as the number of phonemes of that phoneme type;
and splicing all the target embedded vectors to obtain the sample audio embedded vector.
In some embodiments, before the decoupling processing is performed on the sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector through the decoding network to obtain the synthesized audio data, the training method further includes obtaining the sample timbre feature vector, which specifically includes:
inputting the sample audio data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises an LSTM layer and a linear layer;
performing feature extraction on the sample audio data through the LSTM layer to obtain a sample audio feature hidden vector;
and performing prediction processing on the sample audio feature hidden vector through the linear layer to obtain the sample timbre feature vector.
To achieve the above object, a second aspect of an embodiment of the present application provides a speech conversion method, including:
acquiring original audio data to be processed;
inputting the original audio data and the pre-acquired target pitch feature and target timbre feature of the target speaking object into a voice conversion model for voice conversion to obtain target audio data, wherein the voice conversion model is obtained by training according to the training method of the first aspect.
In order to achieve the above object, a third aspect of the embodiments of the present application provides a training apparatus for a model, the training apparatus including:
the audio data acquisition module is used for acquiring sample audio data of a sample speaking object; wherein the sample audio data comprises sample audio content and sample acoustic features, the sample acoustic features comprising sample timbre information and sample pitch information;
the data input module is used for inputting the sample audio data into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network;
a reconstruction module, configured to perform reconstruction processing on the sample audio data through the coding network to obtain initial audio data, where the initial audio data includes the sample audio content and the sample timbre information, and the initial audio data does not include the sample pitch information;
the voice alignment module is used for carrying out voice alignment on the initial audio data to obtain a sample audio embedded vector;
the decoupling module is used for decoupling the sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector through the decoding network to obtain synthesized audio data, wherein the sample timbre feature vector is used for representing the speaking style characteristics of the sample speaking object;
the loss calculation module is used for performing loss calculation on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value;
and the parameter updating module is used for updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain a voice conversion model.
In order to achieve the above object, a fourth aspect of the embodiments of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect or the method of the second aspect when executing the computer program.
To achieve the above object, a fifth aspect of embodiments of the present application proposes a computer-readable storage medium storing a computer program, which when executed by a processor implements the method of the first aspect or the method of the second aspect.
According to the model training method, the voice conversion method, the model training apparatus, the electronic device and the storage medium, sample audio data of a sample speaking object are obtained; the sample audio data comprise sample audio content and sample acoustic features, and the sample acoustic features comprise sample timbre information and sample pitch information. The sample audio data are input into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network. The sample audio data are reconstructed through the coding network to obtain initial audio data; the initial audio data comprise the sample audio content and the sample timbre information but do not comprise the sample pitch information, so that the pitch information in the sample audio data can be eliminated without changing the audio content and the timbre information of the sample audio data. This avoids the influence on model training caused by the differences in pitch characteristics of different sample speaking objects and improves the training effect of the model. Furthermore, voice alignment is carried out on the initial audio data to obtain a sample audio embedded vector, so that the audio length of the sample audio embedded vector is consistent with that of the initial audio data, which realizes the feature constraint of model training and strengthens the feature decoupling of the neural network model on the speech features. Furthermore, decoupling processing is carried out on the sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector through the decoding network to obtain synthesized audio data, wherein the sample timbre feature vector is used for representing the speaking style characteristics of the sample speaking object; in this way, the synthesized audio data can contain speech content, pitch information and timbre information that are close to those of the sample audio data, so that the obtained synthesized audio data has better audio quality. Finally, loss calculation is carried out on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value, and the parameters of the neural network model are updated according to the model loss value, thereby realizing the training of the neural network model and obtaining the voice conversion model. This effectively improves the training effect of the model and improves the voice conversion effect of the voice conversion model on input audio data.
Drawings
FIG. 1 is a flow chart of a method for training a model provided by an embodiment of the present application;
FIG. 2 is a flowchart of step S103 in FIG. 1;
FIG. 3 is a flowchart of step S104 in FIG. 1;
FIG. 4 is a flowchart of step S301 in FIG. 3;
FIG. 5 is a flowchart of step S302 in FIG. 3;
FIG. 6 is another flow chart of a method for training a model provided by an embodiment of the present application;
FIG. 7 is a flowchart of a method for voice conversion provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training apparatus for a model provided in an embodiment of the present application;
FIG. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is noted that while functional block divisions are provided in device diagrams and logical sequences are shown in flowcharts, in some cases, steps shown or described may be performed in sequences other than block divisions within devices or flowcharts. The terms first, second and the like in the description and in the claims, as well as in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computer to process, understand and use human language (such as chinese, english, etc.), and belongs to a branch of artificial intelligence, which is a cross discipline between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in the technical fields of machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and viewpoint mining, and relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language calculation and the like related to language processing.
Information Extraction (NER): and extracting entity, relation, event and other factual information of specified types from the natural language text, and forming a text processing technology for outputting structured data. Information extraction is a technique for extracting specific information from text data. The text data is composed of specific units, such as sentences, paragraphs and chapters, and the text information is composed of small specific units, such as words, phrases, sentences and paragraphs or combinations of these specific units. The extraction of noun phrases, names of people, names of places, etc. in the text data is text information extraction, and of course, the information extracted by the text information extraction technology can be various types of information.
Phoneme (Phone): the minimum phonetic unit is divided according to the natural attributes of the speech, and is analyzed according to the pronunciation action in the syllable, and one action forms a phoneme.
Fundamental frequency (F0): in sound, the fundamental frequency refers to the frequency of the fundamental tone in a complex tone. Among the tones constituting a complex tone, the fundamental tone has the lowest frequency and the highest intensity. The fundamental frequency determines the pitch of a tone; the pitch of speech is usually determined by the frequency of the fundamental tone.
Mel-Frequency Cepstral Coefficients (MFCC): a set of key coefficients used to create the mel-frequency cepstrum. From a segment of an audio signal, a set of cepstral coefficients sufficient to represent the signal is obtained; the mel-frequency cepstral coefficients are derived from the cepstrum (i.e. the spectrum of a spectrum) of that signal. Unlike the general cepstrum, the most distinctive feature of the mel cepstrum is that its frequency bands are uniformly distributed on the mel scale, i.e. such frequency bands are closer to the human nonlinear auditory system than the commonly used linear cepstrum representation. For example, the mel-frequency cepstrum is often used in audio compression.
Pooling (Pooling): essentially a sampling operation that selects a certain method to reduce the dimensionality of and compress an input feature map so as to accelerate computation; the most commonly used pooling method is max pooling (Max Pooling).
Activation Function (Activation Function): is a function that runs on a neuron of an artificial neural network responsible for mapping the input of the neuron to the output.
Encoding (Encoder): the input sequence is converted into a vector of fixed length.
Decoding (Decoder): converting the fixed vector generated before into an output sequence; wherein, the input sequence can be characters, voice, images and videos; the output sequence may be text, images.
Softmax function: the Softmax function is a normalized exponential function that "compresses" one K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector σ (z) such that each element ranges between (0,1) and the sum of all elements is 1, which is commonly used in multi-classification problems.
Speech conversion generally refers to converting the voice of one speaker into the voice of another speaker without changing the spoken content. When a common speech conversion model is used for speech conversion, the actual speech content and the style characteristics of the speaker cannot be well represented, so the speech conversion effect is poor. How to improve the speech conversion effect has therefore become an urgent technical problem to be solved.
Based on this, embodiments of the present application provide a model training method, a speech conversion method and apparatus, a device, and a storage medium, and aim to improve a speech conversion effect.
The model training method, the voice conversion method and apparatus, the device and the storage medium provided in the embodiments of the present application are specifically described in the following embodiments. First, the model training method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a model training method, and relates to the technical field of artificial intelligence. The model training method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured as an independent physical server, can also be configured as a server cluster or a distributed system formed by a plurality of physical servers, and can also be configured as a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content distribution network) and big data and artificial intelligence platforms; the software may be an application of a training method or the like that implements a model, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In each embodiment of the present application, when data related to the identity or characteristics of a user, such as user information, user behavior data, user history data, and user location information, is processed, permission or consent of the user is obtained, and the collection, use, and processing of the data comply with relevant laws and regulations and standards of relevant countries and regions. In addition, when the embodiment of the present application needs to acquire sensitive personal information of a user, individual permission or individual consent of the user is obtained through a pop-up window or a jump to a confirmation page, and after the individual permission or individual consent of the user is definitely obtained, necessary user-related data for enabling the embodiment of the present application to operate normally is acquired.
Fig. 1 is an alternative flowchart of a training method of a model provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, acquiring sample audio data of a sample speaking object; the sample audio data comprises sample audio content and sample acoustic features, and the sample acoustic features comprise sample timbre information and sample pitch information;
Step S102, inputting the sample audio data into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network;
Step S103, reconstructing the sample audio data through the coding network to obtain initial audio data, wherein the initial audio data comprises the sample audio content and the sample timbre information, and the initial audio data does not comprise the sample pitch information;
Step S104, performing voice alignment on the initial audio data to obtain a sample audio embedded vector;
Step S105, decoupling the sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector through the decoding network to obtain synthesized audio data, wherein the sample timbre feature vector is used for representing the speaking style characteristics of the sample speaking object;
Step S106, performing loss calculation on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value;
Step S107, updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain a voice conversion model.
In the steps S101 to S107 illustrated in the embodiment of the present application, sample audio data of a sample speaking object is obtained, and the sample audio data is input into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network. The initial audio data is obtained by reconstructing the sample audio data through the coding network, so that the pitch information in the sample audio data can be eliminated without changing the audio content and the timbre information of the sample audio data; this avoids the influence on model training caused by the differences in pitch characteristics of different sample speaking objects and improves the model training effect. Voice alignment is then performed on the initial audio data to obtain a sample audio embedded vector, so that the audio length of the sample audio embedded vector is consistent with that of the initial audio data, which realizes the feature constraint of the model training and strengthens the feature decoupling of the neural network model on the speech features. The sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector are decoupled through the decoding network to obtain synthesized audio data; the synthesized audio data can contain speech content, pitch information and timbre information that are close to those of the sample audio data, so the obtained synthesized audio data has good audio quality. Finally, loss calculation is performed on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value, and the parameters of the neural network model are updated according to the model loss value, realizing the training of the neural network model and obtaining the voice conversion model, thereby effectively improving the training effect of the model.
In step S101 of some embodiments, a web crawler may be written and a data source set, and data is then crawled in a targeted manner to obtain sample audio data of a sample speaking object. The data source may be various types of network platforms, social media, specific audio databases, and the like; the sample audio data may be music material, lecture reports, chat conversations, and the like of the sample speaking object. The sample audio data includes sample audio content and sample acoustic features, and the sample acoustic features include sample timbre information and sample pitch information.
In step S102 of some embodiments, the sample audio data is input into a preset neural network model, where the neural network model may be constructed based on an AutoVC model. The neural network model includes a coding network and a decoding network. The coding network is mainly used to perform speech reconstruction and speech alignment on the input audio data so as to eliminate the pitch information in the audio data and to adjust the sample audio data according to its phoneme features, thereby obtaining an audio embedding vector. The decoding network is mainly used to jointly decouple the audio embedding vector with the pitch feature and timbre feature of a target speaking object, so as to convert the original speaking object into the target speaking object without changing the speaking content of the input audio data; that is, the audio embedding vector and the pitch feature and timbre feature of the target speaking object are fused to form new audio data.
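The application does not disclose concrete layer configurations for the coding and decoding networks; the following PyTorch sketch only illustrates the division of labour just described (a content embedding produced by the coding network, fused with a pitch parameter and a timbre feature vector in the decoding network). All module choices, names and dimensions are assumptions made for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    """Illustrative encoder/decoder split; layer choices are assumptions."""

    def __init__(self, n_mels: int = 80, content_dim: int = 256,
                 timbre_dim: int = 256, pitch_dim: int = 1):
        super().__init__()
        # Coding network: maps acoustic features (with pitch variation already
        # flattened, e.g. by resynthesis with an averaged F0) to a content embedding.
        self.encoder = nn.LSTM(n_mels, content_dim, batch_first=True)
        # Decoding network: fuses content embedding + pitch + timbre and
        # predicts the acoustic features of the synthesized audio.
        self.decoder = nn.LSTM(content_dim + pitch_dim + timbre_dim,
                               content_dim, batch_first=True)
        self.out_proj = nn.Linear(content_dim, n_mels)

    def forward(self, mels, pitch, timbre):
        # mels:   (batch, frames, n_mels)  pitch-flattened input features
        # pitch:  (batch, frames, 1)       sample pitch parameter per frame
        # timbre: (batch, timbre_dim)      sample timbre feature vector
        content, _ = self.encoder(mels)
        timbre_exp = timbre.unsqueeze(1).expand(-1, content.size(1), -1)
        fused = torch.cat([content, pitch, timbre_exp], dim=-1)
        decoded, _ = self.decoder(fused)
        return self.out_proj(decoded)
```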
Referring to fig. 2, in some embodiments, step S103 may include, but is not limited to, step S201 to step S203:
step S201, extracting parameters of sample audio data through a coding network to obtain initial fundamental frequency parameters, aperiodic parameters and spectrum envelope parameters of the sample audio data;
step S202, carrying out mean value calculation on the initial fundamental frequency parameters to obtain target fundamental frequency parameters;
step S203, performing voice reconstruction on the target fundamental frequency parameter, the aperiodic parameter and the spectrum envelope parameter through a coding network to obtain initial audio data.
In step S201 of some embodiments, when parameter extraction is performed on the sample audio data through the coding network, a common audio parameter analysis tool (e.g., the WORLD analyzer) may be used to analyze the sample audio data and determine its audio parameters. The audio parameters are then screened through preset keywords in the coding network so as to extract the important parameter information of the sample audio data, such as the aperiodic parameter A and the spectral envelope parameter E of the sample audio data, and the initial fundamental frequency parameter F corresponding to each frame of audio, where the preset keywords may include the names, tag values and the like of the important parameters.
In step S202 of some embodiments, summing is performed on all initial fundamental frequency parameters in the sample audio data, and then division is performed on the sum of the fundamental frequencies obtained by the summing and the number of frames (i.e., the number of the initial fundamental frequency parameters) of the sample audio data, so that the initial fundamental frequency parameters are averaged, and a target fundamental frequency parameter F0 is obtained, where the target fundamental frequency parameter can clearly reflect the overall frequency condition of the sample audio data.
In step S203 of some embodiments, the target fundamental frequency parameter, the aperiodic parameter and the spectral envelope parameter are feature-fused by the coding network to obtain a fused audio feature, and a new speech waveform is reconstructed according to the fused audio feature, thereby obtaining the initial audio data. The initial audio data includes the sample audio content and the sample timbre information, but does not include the sample pitch information.
Through the above steps S201 to S203, the pitch information in the sample audio data can be eliminated without changing the audio content and the timbre information of the sample audio data, so that the influence on model training caused by the differences in pitch characteristics of different sample speaking objects is avoided, and the model training effect is improved.
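As a rough sketch of steps S201 to S203, the snippet below assumes the pyworld bindings of the WORLD analyzer and the soundfile library; the specific functions, the handling of unvoiced frames and the file-based interface are assumptions for illustration rather than details taken from the application.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def reconstruct_with_mean_f0(wav_path: str, out_path: str) -> None:
    x, fs = sf.read(wav_path)                     # sample audio data
    x = np.ascontiguousarray(x, dtype=np.float64)

    # Step S201: extract initial F0, spectral envelope and aperiodicity.
    f0, t = pw.dio(x, fs)                         # initial fundamental frequency per frame
    f0 = pw.stonemask(x, f0, t, fs)               # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)              # spectral envelope parameters
    ap = pw.d4c(x, f0, t, fs)                     # aperiodic parameters

    # Step S202: replace the per-frame F0 with its mean over voiced frames,
    # which flattens the pitch contour while keeping content and timbre.
    voiced = f0 > 0
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    target_f0 = np.where(voiced, mean_f0, 0.0)

    # Step S203: resynthesize the waveform from (mean F0, envelope, aperiodicity).
    y = pw.synthesize(target_f0, sp, ap, fs)
    sf.write(out_path, y, fs)
```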
Referring to fig. 3, in some embodiments, step S104 may include, but is not limited to, step S301 to step S302:
step S301, performing phoneme feature recognition on the initial audio data to obtain phoneme feature data, and obtaining a duration time sequence of the initial audio data according to the phoneme feature data;
step S302, performing voice alignment on the initial audio data according to the duration time sequence to obtain a sample audio embedded vector.
In step S301 of some embodiments, when performing phoneme feature recognition on the initial audio data, the initial audio data first needs to be framed to determine how many frames it contains; the phoneme of each frame-level audio segment is then recognized to determine the phoneme category corresponding to each audio segment. In this way, the phoneme categories contained in the entire initial audio data and the number of times each phoneme occurs (i.e., the number of phonemes) are counted, and the phoneme categories and the occurrence counts are used as the phoneme feature data. When constructing the duration sequence of the initial audio data, the number of phoneme categories determines the number of elements (one element per phoneme category), and the occurrence count of each phoneme determines the value of the corresponding element. For example, if the entire initial audio data contains 2 kinds of phonemes, namely a phoneme a and a phoneme b, where the phoneme a appears 3 times and the phoneme b appears 4 times, the duration sequence of the initial audio data is [3,4].
In step S302 of some embodiments, first, text feature extraction needs to be performed on initial audio data to obtain an audio text embedded vector corresponding to the initial audio data, then the audio text embedded vector is segmented according to a duration sequence, and then the audio text embedded vectors after the segmentation processing are merged according to elements of the duration sequence, so as to obtain a sample audio embedded vector corresponding to the speech length of the initial audio data.
Through the steps S301 to S302, the number of elements and the element value of the duration sequence can be determined according to the phoneme information of the initial audio data, and the text content information of the initial audio data and the audio length of the initial audio data are aligned according to the element condition of the duration sequence, so as to obtain a sample audio embedded vector which can represent the text content feature of the initial audio data and has the audio length consistent with the initial audio data.
Referring to fig. 4, in some embodiments, the phoneme feature data includes a phoneme category and a phoneme number, and step S301 may include, but is not limited to, steps S401 to S403:
step S401, performing framing processing on initial audio data to obtain a plurality of audio segments;
step S402, carrying out recognition processing on the audio segments according to a preset phoneme comparison table to obtain the phoneme type of the initial audio data and the phoneme quantity of each phoneme type;
step S403, obtaining a duration sequence according to the phoneme type and the phoneme number.
In step S401 of some embodiments, the initial audio data is framed according to the audio length of the initial audio data, so as to obtain a plurality of audio segments, where each audio segment corresponds to a certain frame of audio in the initial audio data.
For example, if the audio duration of a certain initial audio data is 7 seconds, the frame number of the initial audio data is 7, and according to the frame number of the initial audio data, the initial audio data is divided into 7 audio segments, where each audio segment corresponds to a mel-frequency cepstrum frame.
In step S402 of some embodiments, the preset phoneme comparison table includes a mapping relationship between phonemes and mel-cepstral frames, and the mapping relationship may be one-to-one or one-to-many; for example, each mel-cepstral frame corresponds to a unique phoneme, or one phoneme corresponds to a plurality of different mel-cepstral frames. In order to improve the training effect of the model, the embodiment of the present application may adopt a phoneme comparison table with a one-to-one mapping relationship. After each audio segment is subjected to a short-time Fourier transform and mel-cepstral filtering, the mel-cepstral frame corresponding to each audio segment is obtained, the phoneme corresponding to that mel-cepstral frame is looked up in the preset phoneme comparison table, and that phoneme is taken as the phoneme of the audio segment. In this way, all phonemes corresponding to the initial audio data can be obtained conveniently, and the phonemes are classified and counted to determine the phoneme categories contained in the initial audio data and the number of phonemes in each phoneme category.
For example, if the audio duration of a certain initial audio data is 9 seconds, the frame number of the initial audio data is 9, and according to the frame number of the initial audio data, the initial audio data is divided into 9 audio segments, where each audio segment corresponds to a mel-frequency cepstrum frame. After each audio clip is subjected to short-time Fourier transform and filtering by a Mel cepstrum filter, a Mel cepstrum frame corresponding to each audio clip is obtained, phonemes corresponding to the Mel cepstrum frame are obtained by referring to a preset phoneme comparison table, the phonemes are classified and counted, and initial audio data comprising two phonemes, namely a phoneme a and a phoneme b, are obtained, wherein the number of the phonemes of the phoneme a is 4, and the number of the phonemes of the phoneme b is 5.
In step S403 of some embodiments, the number of elements of the duration sequence is set according to the number of phoneme categories, and the value of the elements of the duration sequence is set according to the number of phonemes. Specifically, when the phoneme category includes two kinds, that is, phoneme a and phoneme b, the number of elements of the duration sequence is two, and if the number of phonemes a is 4 and the number of phonemes b is 5, the duration sequence may be represented as [4,5].
The number of elements and the value of the elements of the duration sequence can be determined according to the phoneme information of the initial audio data through the steps S401 to S403, and the duration of the phonemes of the initial audio data is converted into a form of sequence representation, so that the speech length can be controlled based on the duration of each phoneme in the subsequent speech conversion process, and the effect of speech conversion is improved.
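A small illustrative sketch of deriving a duration sequence from per-frame phoneme labels follows; the per-frame labels are assumed to have already been obtained via the phoneme comparison table, and the function and variable names are assumptions for illustration.

```python
from typing import List, Tuple

def duration_sequence(frame_phonemes: List[str]) -> Tuple[List[str], List[int]]:
    """Collapse per-frame phoneme labels into phoneme categories and counts.

    One element per phoneme category; the element value is how many frames
    (phonemes) belong to that category, e.g. 9 frames labelled
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'] -> (['a', 'b'], [4, 5]).
    """
    counts: dict = {}
    for phoneme in frame_phonemes:
        counts[phoneme] = counts.get(phoneme, 0) + 1
    # Python dicts preserve insertion order, so the categories keep the order
    # in which the phonemes first appear in the audio.
    return list(counts.keys()), list(counts.values())

categories, durations = duration_sequence(['a'] * 4 + ['b'] * 5)
print(categories, durations)   # ['a', 'b'] [4, 5]
```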
Referring to fig. 5, in some embodiments, step S302 may include, but is not limited to, step S501 to step S505:
step S501, embedding initial audio data to obtain an audio text embedded vector;
step S502, the audio text embedded vectors are segmented according to the duration time sequence to obtain middle embedded vectors corresponding to each phoneme type, wherein the number of the middle embedded vectors is the same as the number of the phonemes of the audio type;
step S503, carrying out mean value calculation on the intermediate embedding vector of each phoneme type to obtain a candidate embedding vector corresponding to each phoneme type;
step S504, copying the candidate embedding vectors according to the number of phonemes to obtain target embedding vectors corresponding to each phoneme type, wherein the number of the target embedding vectors is the same as the number of phonemes of the audio type;
and step S505, splicing all target embedded vectors to obtain sample audio embedded vectors.
In step S501 of some embodiments, the initial audio data is embedded by an embedding layer of the coding network, so as to implement vectorization of the initial audio data, and obtain an audio text embedding vector, where the audio text embedding vector includes text content information corresponding to the initial audio data.
In step S502 of some embodiments, the audio text embedding vector is subjected to vector segmentation processing according to the sum of the element values of the duration sequence, and an intermediate embedding vector corresponding to each phoneme class is obtained. Specifically, the elements appearing in the duration sequence are summed to obtain the sum of the element values; for example, if a certain duration sequence is [4,5], the sum of the element values is 4+5=9, so the audio text embedded vector is divided into 9 intermediate embedded vectors, each intermediate embedded vector corresponding to one frame segment of the initial audio data. Since one frame segment corresponds to one phoneme, each intermediate embedded vector also corresponds to one phoneme; that is, the number of the intermediate embedded vectors is the same as the total number of phonemes in the initial audio data.
In step S503 of some embodiments, the intermediate embedding vectors belonging to the same phoneme class are averaged to obtain a candidate embedding vector corresponding to each phoneme class. Specifically, vector summation is performed on all intermediate embedded vectors of a certain phoneme type, the number of phonemes of the phoneme type is determined at the same time, division calculation is performed on the vector summation result and the number of phonemes to obtain an average vector of the phoneme type, and the average vector is used as a candidate embedded vector.
In step S504 of some embodiments, the candidate embedding vectors are copied according to the number of phonemes to obtain target embedding vectors corresponding to each phoneme category; that is, if a phoneme category includes k phonemes, the number of phonemes is k, and the candidate embedding vector is copied k times to obtain k target embedding vectors for that phoneme category, so that the number of target embedding vectors is the same as the number of phonemes of that phoneme category, and each target embedding vector is identical to the corresponding candidate embedding vector.
In step S505 of some embodiments, vector splicing is performed on all target embedded vectors belonging to the same initial audio data according to a preset splicing order, so as to obtain a sample audio embedded vector corresponding to the initial audio data.
For example, if the audio duration of a certain piece of initial audio data is 9 seconds, the frame number of the initial audio data is 9, and the initial audio data is divided into 9 audio segments according to the frame number, where each audio segment corresponds to one mel-cepstral frame. Each audio segment is subjected to a short-time Fourier transform and mel-cepstral filtering to obtain its mel-cepstral frame, the phoneme corresponding to each mel-cepstral frame is looked up in the preset phoneme comparison table, and the phonemes are classified and counted; the initial audio data is found to contain two phonemes, namely a phoneme a and a phoneme b, where the number of phonemes a is 4 and the number of phonemes b is 5, so the duration sequence can be represented as [4,5]. Further, the audio text embedding vector is divided into 9 intermediate embedding vectors according to the sum of the element values of the duration sequence, where 4 intermediate embedding vectors (A1, A2, A3, A4) correspond to the phoneme a and 5 intermediate embedding vectors (B1, B2, B3, B4, B5) correspond to the phoneme b. The 4 intermediate embedding vectors of the phoneme a are averaged to obtain a candidate embedding vector An corresponding to the phoneme a, i.e., (A1+A2+A3+A4)/4 = An; An is copied 4 times to obtain the target embedding vectors (An, An, An, An) corresponding to the phoneme a. Similarly, the 5 intermediate embedding vectors of the phoneme b are averaged to obtain a candidate embedding vector Bn corresponding to the phoneme b, i.e., (B1+B2+B3+B4+B5)/5 = Bn; Bn is copied 5 times to obtain the target embedding vectors (Bn, Bn, Bn, Bn, Bn) corresponding to the phoneme b. Finally, the target embedding vectors (An, An, An, An) corresponding to the phoneme a and the target embedding vectors (Bn, Bn, Bn, Bn, Bn) corresponding to the phoneme b are vector-spliced to obtain the sample audio embedding vector corresponding to the initial audio data.
Through the steps S501 to S505, the text content information of the initial audio data and the audio length of the initial audio data can be aligned according to the element condition of the duration sequence, so that the text content characteristics of the initial audio data can be represented, and the sample audio embedded vector with the audio length consistent with that of the initial audio data can be obtained, so that the characteristic constraint can be better performed on the training of the model, the characteristic decoupling of the neural network model on the speech characteristics and the characteristics of the speaking object can be enhanced, and the training effect of the model can be improved.
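A minimal NumPy sketch of the segment-average-copy-splice alignment of steps S501 to S505, reproducing the [4,5] example above, follows; the function and variable names are illustrative assumptions.

```python
import numpy as np

def align_embeddings(frame_embeddings: np.ndarray, durations: list) -> np.ndarray:
    """Average the embeddings within each phoneme segment, repeat the average
    to the segment's original length, and splice the results, so the output
    keeps the same number of frames as the input (steps S502 to S505).

    frame_embeddings: (total_frames, dim), with total_frames == sum(durations)
    durations:        phonemes-per-category list, e.g. [4, 5]
    """
    assert frame_embeddings.shape[0] == sum(durations)
    pieces, start = [], 0
    for count in durations:
        segment = frame_embeddings[start:start + count]      # intermediate vectors
        candidate = segment.mean(axis=0, keepdims=True)       # candidate vector
        pieces.append(np.repeat(candidate, count, axis=0))    # target vectors
        start += count
    return np.concatenate(pieces, axis=0)                     # sample audio embedding

# 9 frames, 2 phoneme categories with 4 and 5 frames respectively.
emb = np.random.randn(9, 256)
aligned = align_embeddings(emb, [4, 5])
print(aligned.shape)   # (9, 256)
```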
Referring to fig. 6, before step S105 in some embodiments, the training method of the model further includes, but is not limited to, steps S601 to S603:
step S601, inputting sample audio data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises an LSTM layer and a linear layer;
step S602, extracting the characteristics of the sample audio data through an LSTM layer to obtain a sample audio characteristic hidden vector;
step S603, performing prediction processing on the sample audio feature implicit vector through the linear layer to obtain a sample tone feature vector.
In step S601 of some embodiments, the sample audio data is input into a preset voiceprint recognition model through a pre-written computer program or a script program, the voiceprint recognition model may be constructed based on a deep convolutional network structure or a long-short term memory network structure, the voiceprint recognition model includes an LSTM layer and a linear layer, the LSTM layer and the linear layer are used to extract a feature expression of a speaking object, the feature expression can be used to characterize the speaking style characteristics of the speaking object, and the speaking style characteristics include speech speed, pitch, timbre, and the like.
In step S602 in some embodiments, sample audio data is encoded through the LSTM layer in the order from left to right to obtain a first audio feature vector, then the sample audio data is encoded through the LSTM layer in the order from right to left to obtain a second audio feature vector, and then the first audio feature vector and the second audio feature vector are vector-spliced to obtain a sample audio feature hidden vector.
Further, the voiceprint recognition model may include a plurality of LSTM layers, where each LSTM layer performs left-to-right and right-to-left encoding processing on sample audio data, and performs splicing processing on results of the two encoding processing, inputs the results to the next LSTM layer to perform the same encoding operation and splicing operation, and uses the output of the last LSTM layer as a final sample audio feature hidden vector.
In step S603 of some embodiments, the linear layer includes a prediction function such as a softmax function. A probability distribution calculation is performed on the sample audio feature hidden vector through the prediction function of the linear layer to obtain the probability distribution of the sample audio feature hidden vector over each preset reference timbre feature label. This probability distribution clearly reflects the possibility that the sample audio feature hidden vector belongs to each reference timbre feature label; therefore, the reference timbre feature label with the largest probability is selected as the target timbre feature label of the sample audio feature hidden vector, and the target timbre feature label is converted into vector form to obtain the sample timbre feature vector of the sample audio data.
Through the above steps S601 to S603, the timbre information of the sample audio data can be conveniently extracted, the timbre feature label that can represent the timbre information of the sample audio data is screened out from the reference timbre feature labels, and the timbre feature label is converted into vector form so that it can be used for subsequent speech synthesis, thereby improving the speech synthesis effect.
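A possible PyTorch sketch of a voiceprint model with LSTM and linear layers as described in steps S601 to S603 is given below; the bidirectional setting, the layer sizes, the use of the last frame's output as the hidden vector, and the number of reference timbre labels are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class VoiceprintModel(nn.Module):
    """Illustrative LSTM-plus-linear voiceprint model (dimensions assumed)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256,
                 num_layers: int = 3, num_timbre_labels: int = 512):
        super().__init__()
        # bidirectional=True gives the left-to-right and right-to-left
        # encodings whose concatenation forms the hidden vector (step S602).
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # The linear layer predicts a distribution over reference timbre
        # labels (step S603); softmax picks out the most probable label.
        self.linear = nn.Linear(2 * hidden, num_timbre_labels)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels)
        out, _ = self.lstm(mels)            # (batch, frames, 2 * hidden)
        hidden_vec = out[:, -1, :]          # sample audio feature hidden vector
        logits = self.linear(hidden_vec)
        return torch.softmax(logits, dim=-1)
```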
In step S105 of some embodiments, the sample audio embedding vector, the sample pitch parameter and the sample timbre feature vector are first vector-spliced by the decoding network to obtain a synthesized audio vector, and then the synthesized audio vector is decoupled by the decoding network and converted into waveform form to obtain the synthesized audio data, where the sample timbre feature vector is used to characterize the speaking style of the sample speaking object and the sample pitch parameter includes the pitch feature of the sample speaking object. In this way, the synthesized audio data can include speech content, pitch information and timbre information that are closer to those of the sample audio data, so that the obtained synthesized audio data has better audio quality.
In step S106 of some embodiments, the loss calculation performed on the synthesized audio data and the sample audio data by the preset loss function may be expressed as shown in formula (1):
L_recon = ||x − x′||_1    (1)
where L_recon is the model loss value, x is the sample audio data, and x′ is the synthesized audio data. The model loss value L_recon clearly reflects the degree of similarity between the sample audio data and the synthesized audio data, and also clearly reflects the degree of training of the model.
In step S107 of some embodiments, since the sample audio data is derived from the sample speaking object, and the sample pitch parameter and the sample timbre feature vector used for synthesizing the audio data are also derived from the sample speaking object, the synthesized audio data obtained by the neural network model needs to be as close to the sample audio data as possible, that is, the model loss value needs to be as small as possible. Therefore, the parameters of the neural network model are updated according to the model loss value, so that the synthesized audio data obtained through the neural network model becomes closer to the sample audio data. When, after multiple parameter updates, the model loss value is less than or equal to a preset loss threshold, the similarity between the synthesized audio data and the sample audio data is sufficient and the voice conversion effect of the neural network model can meet the current requirement, so the training of the neural network model is stopped to obtain the voice conversion model.
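A hedged PyTorch sketch of one training step combining the L1 reconstruction loss of formula (1) with the parameter update of step S107 follows; the model interface, the optimizer and the mean-reduced form of the L1 norm are assumptions consistent with the earlier sketches, not details from the application.

```python
import torch

def training_step(model, optimizer, sample_audio, pitch, timbre) -> float:
    """One parameter update with the L1 reconstruction loss of formula (1).

    `model` is assumed to map (features, pitch, timbre) back to the same
    feature representation as `sample_audio`; all names are illustrative.
    """
    synthesized = model(sample_audio, pitch, timbre)           # x'
    # L_recon = ||x - x'||_1, here mean-reduced over all elements.
    loss = torch.mean(torch.abs(sample_audio - synthesized))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```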
According to the model training method of the embodiment of the present application, sample audio data of a sample speaking object is obtained; the sample audio data comprises sample audio content and sample acoustic features, and the sample acoustic features comprise sample timbre information and sample pitch information. The sample audio data is input into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network. The initial audio data comprises the sample audio content and the sample timbre information but does not comprise the sample pitch information, so the pitch information in the sample audio data can be eliminated without changing the audio content and the timbre information of the sample audio data, which avoids the influence on model training caused by the differences in pitch characteristics of different sample speaking objects and improves the training effect of the model. Furthermore, voice alignment is carried out on the initial audio data to obtain a sample audio embedded vector, so that the audio length of the sample audio embedded vector is consistent with that of the initial audio data, which realizes the feature constraint of model training and strengthens the feature decoupling of the neural network model on the speech features. Furthermore, decoupling processing is carried out on the sample audio embedded vector, the pre-acquired sample pitch parameter and the sample timbre feature vector through the decoding network to obtain synthesized audio data, wherein the sample timbre feature vector is used for representing the speaking style characteristics of the sample speaking object; in this way, the synthesized audio data can contain speech content, pitch information and timbre information that are close to those of the sample audio data, so the obtained synthesized audio data has better audio quality. Finally, loss calculation is carried out on the synthesized audio data and the sample audio data through a preset loss function to obtain a model loss value, and parameters of the neural network model are updated according to the model loss value, realizing the training of the neural network model and obtaining the voice conversion model, thereby effectively improving the training effect of the model and the voice conversion effect of the voice conversion model on input audio data.
Referring to fig. 7, an embodiment of the present application further provides a voice conversion method, which may include, but is not limited to, steps S701 to S702:
step S701, acquiring original audio data to be processed;
step S702, inputting the original audio data and the pre-obtained target tone characteristic and target tone color characteristic of the target speaking object into a voice conversion model for voice conversion to obtain target audio data, wherein the voice conversion model is obtained by training according to the training method of the first aspect.
In step S701 of some embodiments, a web crawler may be written and, after a data source is set, targeted data crawling may be performed to obtain the original audio data to be processed. The data source may be various types of network platforms, social media, specific audio databases, and so on, and the original audio data may be music material, lecture recordings, chat conversations, and the like of a certain speaking object. The original audio data may also be acquired in other ways, which is not limited here.
Further, audio data of the target speaking object is obtained from a network platform, social media or an audio database, and the target tone characteristic and target tone color characteristic of the target speaking object are obtained through a voiceprint recognition model or another d-vector technique, where the target tone characteristic represents the pitch of the target speaking object, and the target tone color characteristic represents the speaking style characteristic of the target speaking object.
In step S702 of some embodiments, the original audio data and the pre-obtained target tone characteristic and target tone color characteristic of the target speaking object are input into the voice conversion model for voice conversion. The voice conversion model extracts the speech content of the original audio data, removes the tone and timbre characteristics of the original audio data, and then fuses the target tone characteristic and target tone color characteristic of the target speaking object with that speech content, thereby converting the tonal information and timbre information of the original audio data and obtaining the target audio data.
In the voice conversion method of the embodiment of the application, the original audio data is reconstructed by the coding network of the voice conversion model to obtain candidate audio data; the candidate audio data does not contain the tonal information or timbre information of the original audio data and retains only its speech content. The candidate audio data is then voice-aligned, and the aligned candidate audio data is decoupled together with the target tone characteristic and target tone color characteristic of the target speaking object to form new audio data, namely the target audio data. The speech content of the target audio data is therefore the same as that of the original audio data, while the target audio data carries the tone and timbre characteristics of the target speaking object. In this way, the original speaking object corresponding to the original audio data is converted into the target speaking object without changing the speech content of the original audio data; the method better represents the speech content information and the timbre characteristics of the target speaking object and can effectively improve the voice conversion effect.
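As a usage illustration only, the following sketch shows how the conversion stage described above might be invoked, assuming a PyTorch/torchaudio environment; the model's call signature and the precomputed `target_pitch` and `target_timbre` tensors are assumptions, since the patent does not prescribe a concrete interface.

```python
import torch
import torchaudio

def convert_voice(conversion_model: torch.nn.Module,
                  source_wav: str,
                  target_pitch: torch.Tensor,
                  target_timbre: torch.Tensor,
                  output_wav: str) -> None:
    """Run a trained voice conversion model on original audio data."""
    source_audio, sample_rate = torchaudio.load(source_wav)   # original audio data to be processed

    conversion_model.eval()
    with torch.no_grad():
        # Assumed interface: (audio, target tone characteristic, target tone color characteristic).
        converted = conversion_model(source_audio, target_pitch, target_timbre)

    torchaudio.save(output_wav, converted.cpu(), sample_rate)  # target audio data
```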
Referring to fig. 8, an embodiment of the present application further provides a training apparatus for a model, which can implement the training method for the model, and the apparatus includes:
an audio data obtaining module 801, configured to obtain sample audio data of a sample speaking object; the sample audio data comprises sample audio content and sample acoustic characteristics, and the sample acoustic characteristics comprise sample timbre information and sample tonal information;
a data input module 802, configured to input sample audio data into a preset neural network model, where the neural network model includes a coding network and a decoding network;
a reconstructing module 803, configured to perform reconstruction processing on the sample audio data through an encoding network to obtain initial audio data, where the initial audio data includes the sample audio content and the sample timbre information, and does not include the sample tonal information;
a voice alignment module 804, configured to perform voice alignment on the initial audio data to obtain a sample audio embedded vector;
a decoupling module 805, configured to perform decoupling processing on the sample audio embedded vector, the pre-obtained sample pitch parameter, and the sample tone feature vector through a decoding network to obtain synthesized audio data, where the sample tone feature vector is used to characterize a speaking style characteristic of a sample speaking object;
a loss calculating module 806, configured to perform loss calculation on the synthesized audio data and the sample voice data through a preset loss function to obtain a model loss value;
the parameter updating module 807 is configured to perform parameter updating on the neural network model according to the model loss value to train the neural network model to obtain a speech conversion model.
In some embodiments, the reconstruction module 803 includes:
the parameter extraction unit is used for extracting parameters of the sample audio data through the coding network to obtain initial fundamental frequency parameters, aperiodic parameters and spectrum envelope parameters of the sample audio data;
the mean value calculation unit is used for carrying out mean value calculation on the initial fundamental frequency parameters to obtain target fundamental frequency parameters;
and the voice reconstruction unit is used for performing voice reconstruction on the target fundamental frequency parameter, the aperiodic parameter and the spectrum envelope parameter through a coding network to obtain initial audio data.
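The three units above describe a WORLD-style analysis/synthesis pass: extract the fundamental frequency, aperiodicity and spectral envelope, flatten the F0 contour to its mean, and resynthesize. A minimal sketch using the pyworld package is shown below as one possible realization; the patent does not name pyworld, so the choice of extractor is an assumption.

```python
import numpy as np
import pyworld as pw

def reconstruct_with_flat_f0(waveform, sample_rate):
    """Resynthesize audio with its pitch contour flattened to the utterance mean."""
    x = np.ascontiguousarray(waveform, dtype=np.float64)

    f0, timestamps = pw.harvest(x, sample_rate)                        # initial fundamental frequency parameters
    spectral_envelope = pw.cheaptrick(x, f0, timestamps, sample_rate)  # spectrum envelope parameters
    aperiodicity = pw.d4c(x, f0, timestamps, sample_rate)              # aperiodic parameters

    voiced = f0 > 0
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    target_f0 = np.where(voiced, mean_f0, 0.0)                         # mean-value calculation on F0

    # Voice reconstruction: content and spectral detail are kept, pitch variation is removed.
    return pw.synthesize(target_f0, spectral_envelope, aperiodicity, sample_rate)
```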
In some embodiments, the speech alignment module 804 includes:
the characteristic identification unit is used for carrying out phoneme characteristic identification on the initial audio data to obtain phoneme characteristic data and obtaining a duration time sequence of the initial audio data according to the phoneme characteristic data;
and the aligning unit is used for carrying out voice alignment on the initial audio data according to the duration time sequence to obtain a sample audio embedded vector.
In some embodiments, the phoneme feature data includes a phoneme category and a phoneme number, and the feature recognition unit includes:
the framing subunit is used for framing the initial audio data to obtain a plurality of audio segments;
the identifying subunit is used for identifying the audio segments according to a preset phoneme comparison table to obtain the phoneme categories of the initial audio data and the number of phonemes of each phoneme category;
and the sequence determining subunit is used for obtaining the duration sequence according to the phoneme type and the phoneme number.
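Assuming the phoneme comparison table yields one phoneme label per audio segment (frame), the duration sequence can be read off by run-length counting the labels, as in the illustrative sketch below; the example labels are hypothetical.

```python
from itertools import groupby

def duration_sequence(frame_phonemes):
    """Collapse frame-level phoneme labels into (phoneme category, frame count) pairs.

    `frame_phonemes` is assumed to come from matching each audio segment against
    a phoneme comparison table, e.g. ["sil", "n", "n", "i", "i", "i"].
    """
    return [(phoneme, sum(1 for _ in run)) for phoneme, run in groupby(frame_phonemes)]

# Example: three categories with durations 1, 2 and 3 frames.
print(duration_sequence(["sil", "n", "n", "i", "i", "i"]))
# [('sil', 1), ('n', 2), ('i', 3)]
```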
In some embodiments, the alignment unit includes:
the embedding subunit is used for carrying out embedding processing on the initial audio data to obtain an audio text embedding vector;
the segmentation subunit is used for carrying out segmentation processing on the audio text embedded vectors according to the duration time sequence to obtain intermediate embedded vectors corresponding to each phoneme type, wherein the number of the intermediate embedded vectors is the same as the number of phonemes of that phoneme type;
the calculation subunit is used for carrying out mean value calculation on the intermediate embedded vectors of each phoneme type to obtain a candidate embedded vector corresponding to each phoneme type;
the replication subunit is used for performing replication processing on the candidate embedded vectors according to the number of phonemes to obtain target embedded vectors corresponding to each phoneme type, wherein the number of the target embedded vectors is the same as the number of phonemes of that phoneme type;
and the splicing subunit is used for splicing all the target embedded vectors to obtain the sample audio embedded vector.
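One possible reading of the segmentation, averaging, replication and splicing steps, treating the duration sequence as per-phoneme frame counts so that the output keeps the same number of frames as the initial audio data, is sketched below; this is an interpretation under stated assumptions, not the patent's prescribed implementation.

```python
import torch

def align_embeddings(audio_text_embedding, durations):
    """Segment, average, replicate and splice frame-level embeddings.

    audio_text_embedding: tensor of shape (num_frames, dim).
    durations: (phoneme category, frame count) pairs from the duration sequence.
    Returns the sample audio embedded vector with the same number of frames.
    """
    segments = []
    start = 0
    for _, count in durations:
        chunk = audio_text_embedding[start:start + count]   # intermediate embedded vectors
        candidate = chunk.mean(dim=0, keepdim=True)          # candidate embedded vector
        segments.append(candidate.repeat(count, 1))          # replicate to the phoneme count
        start += count
    return torch.cat(segments, dim=0)                        # sample audio embedded vector
```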
In some embodiments, the training apparatus for the model further includes a sample tone characteristic obtaining module, which specifically includes:
the system comprises a data input unit, a data output unit and a voice print recognition unit, wherein the data input unit is used for inputting sample audio data into a preset voice print recognition model, and the voice print recognition model comprises an LSTM layer and a linear layer;
the extraction unit is used for extracting the characteristics of the sample audio data through the LSTM layer to obtain a sample audio characteristic hidden vector;
and the prediction unit is used for performing prediction processing on the sample audio characteristic hidden vector through the linear layer to obtain a sample tone characteristic vector.
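A minimal PyTorch sketch of an LSTM-plus-linear voiceprint model of the kind described above is given below; the feature dimensions, the number of LSTM layers and the L2 normalization are d-vector-style assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoiceprintRecognizer(nn.Module):
    """LSTM layer followed by a linear layer that predicts the sample timbre feature vector."""

    def __init__(self, n_mels: int = 80, hidden_size: int = 256, embed_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden_size, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden_size, embed_dim)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, time, n_mels) acoustic features of the sample audio data
        hidden_states, _ = self.lstm(mel_frames)   # sample audio feature hidden vectors
        last_hidden = hidden_states[:, -1]         # take the final time step
        embedding = self.linear(last_hidden)       # predict the timbre feature vector
        return F.normalize(embedding, dim=-1)
```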
The specific implementation of the training apparatus for the model is substantially the same as the specific implementation of the training method for the model, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes: a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory; when the program is executed by the processor, the model training method or the voice conversion method described above is implemented. The electronic device can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute a training method or a speech conversion method of the model of the embodiments of the present disclosure;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively connected to each other within the device via a bus 905.
Embodiments of the present application also provide a computer-readable storage medium, which stores one or more programs, where the one or more programs are executable by one or more processors to implement the model training method or the speech conversion method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The model training method, voice conversion method, model training apparatus, electronic device and computer-readable storage medium provided by the embodiments of the application obtain sample audio data of a sample speaking object, where the sample audio data comprises sample audio content and sample acoustic characteristics, and the sample acoustic characteristics comprise sample timbre information and sample tonal information. The sample audio data is input into a preset neural network model comprising an encoding network and a decoding network and reconstructed by the encoding network into initial audio data that retains the sample audio content and the sample timbre information but no longer contains the sample tonal information; the tonal information in the sample audio data is thus eliminated without changing its audio content or timbre information, which avoids the influence of differences in the tonal characteristics of different sample speaking objects on model training and improves the training effect. The initial audio data is then voice-aligned to obtain a sample audio embedded vector whose audio length is consistent with the initial audio data, constraining the features used during training and strengthening the neural network model's decoupling of speech features. The sample audio embedded vector, the pre-acquired sample pitch parameter and the sample tone color feature vector, which characterizes the speaking style of the sample speaking object, are decoupled by the decoding network to obtain synthesized audio data containing speech content, tonal information and timbre information close to those of the sample audio data, so the synthesized audio data has good audio quality. Finally, a loss calculation is performed on the synthesized audio data and the sample voice data through a preset loss function to obtain a model loss value, and the parameters of the neural network model are updated according to the model loss value, so that the neural network model is trained into the voice conversion model; this effectively improves the training effect of the model. In the application stage of the voice conversion model, the original audio data and the tone and timbre characteristics of the target speaking object are merged and converted by the voice conversion model, so that the original speaking object corresponding to the original audio data is converted into the target speaking object without changing the speech content of the original audio data; the speech content and the timbre characteristics of the target speaking object are thus well represented, and the voice conversion effect is effectively improved.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of training a model, the method comprising:
acquiring sample audio data of a sample speaking object; wherein the sample audio data comprises sample audio content and sample acoustic features, the sample acoustic features comprising sample timbre information, sample tonal information;
inputting the sample audio data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
reconstructing the sample audio data through the coding network to obtain initial audio data, wherein the initial audio data includes the sample audio content and the sample timbre information, and the initial audio data does not include the sample tonal information;
performing voice alignment on the initial audio data to obtain a sample audio embedded vector;
decoupling the sample audio embedded vector, the pre-acquired sample tone parameter and the sample tone characteristic vector through the decoding network to obtain synthesized audio data, wherein the sample tone characteristic vector is used for representing the speaking style characteristics of the sample speaking object;
performing loss calculation on the synthetic audio data and the sample voice data through a preset loss function to obtain a model loss value;
and updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain a voice conversion model.
2. The training method according to claim 1, wherein the reconstructing the sample audio data through the coding network to obtain initial audio data comprises:
extracting parameters of the sample audio data through the coding network to obtain initial fundamental frequency parameters, aperiodic parameters and spectrum envelope parameters of the sample audio data;
carrying out mean value calculation on the initial fundamental frequency parameters to obtain target fundamental frequency parameters;
and performing voice reconstruction on the target fundamental frequency parameter, the aperiodic parameter and the spectrum envelope parameter through the coding network to obtain the initial audio data.
3. The training method of claim 1, wherein the performing speech alignment on the initial audio data to obtain a sample audio embedding vector comprises:
performing phoneme feature recognition on the initial audio data to obtain phoneme feature data, and obtaining a duration time sequence of the initial audio data according to the phoneme feature data;
and performing voice alignment on the initial audio data according to the duration time sequence to obtain the sample audio embedded vector.
4. The training method of claim 3, wherein the phoneme feature data comprises a phoneme type and a phoneme number, the performing phoneme feature recognition on the initial audio data to obtain phoneme feature data, and obtaining a duration sequence of the initial audio data according to the phoneme feature data comprises:
performing framing processing on the initial audio data to obtain a plurality of audio segments;
identifying the audio segments according to a preset phoneme comparison table to obtain the phoneme categories of the initial audio data and the number of phonemes in each phoneme category;
and obtaining the duration time sequence according to the phoneme category and the phoneme number.
5. The training method of claim 3, wherein the speech aligning the initial audio data according to the time duration sequence to obtain the sample audio embedding vector comprises:
embedding the initial audio data to obtain an audio text embedded vector;
segmenting the audio text embedded vectors according to the duration time sequence to obtain intermediate embedded vectors corresponding to each phoneme type, wherein the number of the intermediate embedded vectors is the same as the number of phonemes of that phoneme type;
carrying out mean value calculation on the intermediate embedding vector of each phoneme type to obtain a candidate embedding vector corresponding to each phoneme type;
copying the candidate embedding vectors according to the number of the phonemes to obtain target embedding vectors corresponding to each phoneme type, wherein the number of the target embedding vectors is the same as the number of phonemes of that phoneme type;
and splicing all the target embedded vectors to obtain the sample audio embedded vector.
6. The training method according to any one of claims 1 to 5, wherein before the decoupling processing is performed on the sample audio embedded vector, the pre-acquired sample tone parameter and the sample tone feature vector through the decoding network to obtain the synthesized audio data, the training method further comprises obtaining the sample tone feature vector, which specifically comprises:
inputting the sample audio data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises an LSTM layer and a linear layer;
performing feature extraction on the sample audio data through the LSTM layer to obtain a sample audio feature hidden vector;
and performing prediction processing on the sample audio characteristic hidden vector through the linear layer to obtain the sample tone characteristic vector.
7. A method of speech conversion, the method comprising:
acquiring original audio data to be processed;
inputting the original audio data and the pre-acquired target tone characteristic and target tone color characteristic of the target speaking object into a voice conversion model for voice conversion to obtain target audio data, wherein the voice conversion model is obtained by training according to the training method of any one of claims 1 to 6.
8. An apparatus for training a model, the apparatus comprising:
the audio data acquisition module is used for acquiring sample audio data of a sample speaking object; wherein the sample audio data comprises sample audio content and sample acoustic features, the sample acoustic features comprising sample timbre information, sample tonal information;
the data input module is used for inputting the sample audio data into a preset neural network model, wherein the neural network model comprises a coding network and a decoding network;
a reconstruction module, configured to perform reconstruction processing on the sample audio data through the coding network to obtain initial audio data, where the initial audio data includes the sample audio content and the sample timbre information, and the initial audio data does not include the sample tonal information;
the voice alignment module is used for carrying out voice alignment on the initial audio data to obtain a sample audio embedded vector;
the decoupling module is used for decoupling the sample audio embedded vector, the pre-acquired sample tone parameter and the sample tone characteristic vector through the decoding network to obtain synthesized audio data, wherein the sample tone characteristic vector is used for representing the speaking style characteristics of the sample speaking object;
the loss calculation module is used for performing loss calculation on the synthesized audio data and the sample voice data through a preset loss function to obtain a model loss value;
and the parameter updating module is used for updating parameters of the neural network model according to the model loss value so as to train the neural network model and obtain a voice conversion model.
9. An electronic device, characterized in that the electronic device comprises a memory storing a computer program and a processor implementing the training method of any one of claims 1 to 6 or the speech conversion method of claim 7 when the computer program is executed.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the training method of any one of claims 1 to 6 or the speech conversion method of claim 7.
CN202211101803.7A 2022-09-09 2022-09-09 Model training method, voice conversion method and device, equipment and storage medium Pending CN115641860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211101803.7A CN115641860A (en) 2022-09-09 2022-09-09 Model training method, voice conversion method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211101803.7A CN115641860A (en) 2022-09-09 2022-09-09 Model training method, voice conversion method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115641860A true CN115641860A (en) 2023-01-24

Family

ID=84943253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211101803.7A Pending CN115641860A (en) 2022-09-09 2022-09-09 Model training method, voice conversion method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115641860A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631421A (en) * 2023-05-19 2023-08-22 网易(杭州)网络有限公司 Audio processing model training method, audio conversion method and device
WO2024239962A1 (en) * 2023-05-19 2024-11-28 网易(杭州)网络有限公司 Training method for audio processing model, and method and apparatus for audio conversion
CN118298837A (en) * 2024-05-29 2024-07-05 摩尔线程智能科技(北京)有限责任公司 Tone color conversion method, device, electronic apparatus, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN115641834A (en) Voice synthesis method and device, electronic equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
Alghifari et al. On the use of voice activity detection in speech emotion recognition
CN115641860A (en) Model training method, voice conversion method and device, equipment and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN116631434A (en) Video and voice synchronization method and device based on conversion system and electronic equipment
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN115273805A (en) Prosody-based speech synthesis method and apparatus, device, and medium
CN114974219A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium
CN113450756B (en) A training method for a speech synthesis model and a speech synthesis method
CN115620702A (en) Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium
CN116645961A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium
CN117668758A (en) Dialog intention recognition method and device, electronic equipment and storage medium
CN111243597A (en) Chinese-English mixed speech recognition method
CN115206333A (en) Voice conversion method, voice conversion device, electronic equipment and storage medium
Jamtsho et al. OCR and speech recognition system using machine learning
CN116665638A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115995225A (en) Model training method and device, speech synthesis method and device and storage medium
CN115376541A (en) Role separation method and device based on voice data, equipment and medium
Rajeswari et al. Speech Quality Enhancement Using Phoneme with Cepstrum Variation Features.
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination