CN115270922A - Speaking style generation method and device, electronic equipment and storage medium
- Publication number
- CN115270922A (Application No. CN202210714001.7A)
- Authority
- CN
- China
- Prior art keywords
- style
- speaking
- linear combination
- model
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
Abstract
The present disclosure relates to a speaking style generation method and apparatus, an electronic device, and a storage medium. The method includes: fitting a target style characteristic attribute based on a plurality of style characteristic attributes, and determining a fitting coefficient of each style characteristic attribute; determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one; inputting the target style characteristic vector into a speaking style model and outputting target speaking style parameters, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors; and generating a target speaking style based on the target speaking style parameters. The method enables rapid transfer of speaking styles and improves the efficiency of speaking style generation.
Description
Technical Field
The present disclosure relates to the field of computers and natural language processing, and in particular, to a method and an apparatus for generating a speaking style, an electronic device, and a storage medium.
Background
As human-computer interaction evolves from single-modal voice interaction to multi-modal interaction, voice-driven virtual digital humans have emerged. Virtual digital humans are entering a growth period, being combined with industries such as culture and tourism, finance, livestreaming, gaming, and film and television entertainment, and are developing in a more intelligent, refined, and diversified direction under the continuous promotion of artificial intelligence technology. Different people have different speaking styles: for example, some people speak with precise mouth shapes and rich expressions, while others speak with small mouth movements and serious expressions. Therefore, three-dimensional virtual digital humans with different speaking styles can be designed.
However, in the prior art, each time a new speaking style is generated, the model needs to be retrained and a large amount of data processing is required, so the generation efficiency of new speaking styles is low.
Disclosure of Invention
The present disclosure provides a speaking style generation method and apparatus, an electronic device, and a storage medium, which enable rapid transfer of speaking styles and improve the efficiency of speaking style generation.
In a first aspect, the present disclosure provides a speaking style generation method, including:
fitting a target style characteristic attribute based on a plurality of style characteristic attributes, and determining a fitting coefficient of each style characteristic attribute;
determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one;
inputting the target style characteristic vector into a speaking style model and outputting target speaking style parameters, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors;
and generating a target speaking style based on the target speaking style parameters.
In a second aspect, the present disclosure provides a speaking style generating apparatus, including:
the determining module is used for fitting a target style characteristic attribute based on a plurality of style characteristic attributes and determining a fitting coefficient of each style characteristic attribute; determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one; and inputting the target style characteristic vector into a speaking style model and outputting target speaking style parameters, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors;
and the generating module is used for generating the target speaking style based on the target speaking style parameters.
In a third aspect, the present disclosure also provides an electronic device, including: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of the speaking style generation method of any one of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the speaking style generation method according to any one of the first aspect.
In the technical solutions of the embodiments of the present disclosure, a target style characteristic attribute is fitted based on a plurality of style characteristic attributes, and a fitting coefficient of each style characteristic attribute is determined; a target style characteristic vector is determined according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one; the target style characteristic vector is input into a speaking style model and target speaking style parameters are output, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors; and a target speaking style is generated based on the target speaking style parameters. In this way, the target style characteristic vector can be fitted using the plurality of style characteristic vectors, and because the speaking style model is trained based on the plurality of style characteristic vectors, a corresponding new speaking style can be obtained directly by inputting the fitted target style characteristic vector into the speaking style model. No retraining of the speaking style model is needed, so rapid transfer of speaking styles can be achieved and the efficiency of speaking style generation is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that other drawings can be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1A is a schematic illustration of a three-dimensional virtual digital person provided by some embodiments of the present disclosure;
FIG. 1B is a schematic diagram of a three-dimensional virtual digital person provided by some embodiments of the present disclosure;
FIG. 1C is a schematic diagram of the generation of a new speech style in accordance with some embodiments of the present disclosure;
FIG. 2 is a schematic diagram of a human-computer interaction scenario, according to some embodiments of the present disclosure;
FIG. 3 is a flow chart of a speaking style generation method provided by some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of facial topology data zoning provided by some embodiments of the present disclosure;
FIG. 5 is a flow chart of a speaking style generation method provided by some embodiments of the present disclosure;
FIG. 6 is a flow chart of a speaking style generation method provided by some embodiments of the present disclosure;
FIG. 7 is a flow chart of a speaking style generation method provided by some embodiments of the present disclosure;
FIG. 8 is a block diagram of a framework of a speech style model according to some embodiments of the present disclosure;
FIG. 9 is a flow chart of a speaking style generation method provided by some embodiments of the present disclosure;
FIG. 10 is a block diagram of a framework of a speech style generation model according to some embodiments of the present disclosure;
FIG. 11 is a flow chart of a speech style generation method provided by some embodiments of the present disclosure;
FIG. 12A is a block diagram of a framework of a speech style generation model according to some embodiments of the present disclosure;
FIG. 12B is a block diagram of a framework of a speech style generation model according to some embodiments of the present disclosure;
FIG. 13 is a flow chart of a speech style generation method provided by some embodiments of the present disclosure;
FIG. 14 is a schematic structural diagram of a speech style generation apparatus according to some embodiments of the present disclosure;
FIG. 15 is a schematic structural diagram of a speech style generation apparatus according to some embodiments of the present disclosure;
FIG. 16 is a schematic structural diagram of a speech style generation apparatus according to some embodiments of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The terms "first" and "second," etc. in this disclosure are used to distinguish between different objects, rather than to describe a particular order of objects. For example, the first prediction score and the second prediction score, etc. are used to distinguish between different prediction scores, rather than to describe a particular order of prediction scores.
With the rapid development of intelligent technology and the increasing popularization of intelligent terminals, multi-modal voice interaction is becoming an increasingly important mode. In traditional voice interaction, the user is only heard and not seen: the user issues a voice instruction to the intelligent device, the intelligent device generates response information after receiving the voice instruction and plays the corresponding voice response, and the user obtains the voice response information, thereby realizing interaction between the user and the intelligent device. In the upgrading and evolution of human-computer interaction, voice-driven three-dimensional virtual digital humans have emerged. The intelligent device includes a display screen that can display a three-dimensional virtual digital human, as shown in fig. 1A. While playing the voice response information, the intelligent device can synchronously display the expressions and mouth shapes of the three-dimensional virtual digital human as it speaks, as shown in fig. 1B, so that the user can both hear the voice of the three-dimensional virtual digital human and see its expressions while it speaks, providing the user with the experience of conversing with a human.
Usually, different people are in different states when speaking: for example, some people speak with precise mouth shapes and rich expressions, while others speak with small mouth movements and serious expressions. That is, different people have different speaking styles. Therefore, three-dimensional virtual digital humans with different speaking styles can be designed, i.e., the mouth shapes and expressions of three-dimensional virtual digital humans with different speaking styles differ, and a user can converse with three-dimensional virtual humans with different speaking styles, which can improve the user experience. In the prior art, for each new speaking style of a three-dimensional virtual human, corresponding training samples are first collected, and the speaking style model is retrained based on those samples; new speaking style parameters can then be generated based on the retrained speaking style model, and a basic speaking style is driven based on the speaking style parameters, so that a new speaking style can be generated, as shown in fig. 1C. Because collecting training samples and retraining the speaking style model require a large amount of time and data processing, generating a new speaking style is time-consuming, and the generation efficiency of speaking styles is low.
In order to solve the above problem, the present disclosure fits a target style characteristic attribute based on a plurality of style characteristic attributes and determines a fitting coefficient of each style characteristic attribute; determines a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one; inputs the target style characteristic vector into a speaking style model and outputs target speaking style parameters, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors; and generates a target speaking style based on the target speaking style parameters. In this way, the target style characteristic vector can be fitted using the plurality of style characteristic vectors, and because the speaking style model is trained based on the plurality of style characteristic vectors, a corresponding new speaking style can be obtained directly by inputting the fitted target style characteristic vector into the speaking style model. No retraining of the speaking style model is needed, so rapid transfer of speaking styles can be achieved and the efficiency of speaking style generation is improved.
Fig. 2 is a schematic diagram of a human-computer interaction scenario provided by some embodiments of the present disclosure. As shown in fig. 2, in a voice interaction scenario between a user and a smart home, the intelligent devices may include a smart refrigerator 110, a smart washing machine 120, a smart display device 130, and the like. When a user wants to control an intelligent device, the user first issues a voice instruction; upon receiving the voice instruction, the intelligent device performs semantic understanding on it, determines the semantic understanding result corresponding to the voice instruction, and executes the corresponding control instruction according to the semantic understanding result, thereby meeting the user's needs. The intelligent devices in this scenario all include display screens, which may be touch screens or non-touch screens. For a terminal device with a touch screen, the user can interact with the terminal device through gestures, fingers, or a touch tool (e.g., a stylus). For a terminal device without a touch screen, interaction can be performed through an external device (such as a mouse or a keyboard). The display screen can display a three-dimensional virtual human, and the user can see the three-dimensional virtual human and its expressions while it speaks through the display screen, thereby realizing conversational interaction with the three-dimensional virtual human.
The speaking style generation method provided by the embodiment of the disclosure can be implemented based on computer equipment, or a functional module or a functional entity in the computer equipment. The computer device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not particularly limited in this disclosure.
For a more detailed description of the speaking style generation solution, an exemplary description is given below in conjunction with fig. 3. It should be understood that, in actual implementation, the speaking style generation method provided by the embodiments of the present disclosure may include more or fewer steps than those shown in fig. 3, and the order of the steps may differ.
Fig. 3 is a schematic flowchart of a speaking style generating method according to some embodiments of the present disclosure, and as shown in fig. 3, the method specifically includes the following steps:
S101, fitting the target style characteristic attribute based on the plurality of style characteristic attributes, and determining a fitting coefficient of each style characteristic attribute.
Illustratively, a face topological structure data sequence is collected while a user speaks during a time period Δt. Each frame of face topological structure data in the sequence corresponds to a dynamic face topological structure, the face topological structure includes a plurality of vertices, and each vertex of the dynamic face topological structure corresponds to vertex coordinates (x, y, z). When the user is not speaking, the vertex coordinates of each vertex in the static face topological structure are (x', y', z'). Thus, based on the difference between the coordinates of a vertex in the dynamic face topological structure and the coordinates of the same vertex in the static face topological structure, the vertex offset (Δx, Δy, Δz) of each vertex in each dynamic face topological structure can be determined, i.e., Δx = x - x', Δy = y - y', Δz = z - z'. Based on the vertex offsets (Δx, Δy, Δz) of each vertex in all dynamic face topological structures corresponding to the face topological structure data sequence, the average vertex offset of each vertex of the dynamic face topological structure can be determined.
Fig. 4 is a schematic diagram of dividing the face topological structure data into a plurality of regions according to an embodiment of the present disclosure. As shown in fig. 4, the face topological structure data may be divided into three regions S1, S2 and S3, where S1 is the facial region above the lower edge of the eyes, S2 is the facial region from the lower edge of the eyes to the upper edge of the upper lip, and S3 is the facial region from the upper edge of the upper lip to the chin. Based on the above embodiment, the mean of the average vertex offsets of all vertices of the dynamic face topological structure within region S1, the mean within region S2, and the mean within region S3 can be determined, and splicing these three per-region means yields the style characteristic attribute. In summary, one style characteristic attribute can be obtained for one user, and thus a plurality of style characteristic attributes can be obtained based on a plurality of users.
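For illustration only, the following minimal sketch (not part of the original disclosure) shows one way the style characteristic attribute described above could be computed with NumPy, assuming the vertex coordinates and per-region index lists are already available; all array shapes, variable names, and example sizes are assumptions.

```python
import numpy as np

def style_attribute(dynamic_seq, static_verts, region_indices):
    """Compute a style characteristic attribute from a face-topology sequence.

    dynamic_seq:    (T, V, 3) vertex coordinates of each frame while speaking
    static_verts:   (V, 3)    vertex coordinates of the static (non-speaking) face
    region_indices: list of 1-D index arrays, one per face region (e.g. S1, S2, S3)
    """
    offsets = dynamic_seq - static_verts[None, :, :]   # per-frame vertex offsets (Δx, Δy, Δz)
    mean_offsets = offsets.mean(axis=0)                # (V, 3) average offset of each vertex
    # Average the per-vertex offsets inside each region, then splice the
    # per-region means in a fixed (preset) order.
    parts = [mean_offsets[idx].mean(axis=0) for idx in region_indices]
    return np.concatenate(parts)                       # shape (3 * number_of_regions,)

# Hypothetical usage: a 5000-vertex topology, a 2-second clip at 30 frames/second,
# and three stand-in regions for S1, S2 and S3.
rng = np.random.default_rng(0)
static_verts = rng.normal(size=(5000, 3))
dynamic_seq = static_verts + 0.01 * rng.normal(size=(60, 5000, 3))
regions = np.array_split(np.arange(5000), 3)
attr = style_attribute(dynamic_seq, static_verts, regions)
print(attr.shape)  # (9,)
```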
According to the obtained plurality of style characteristic attributes, a new style characteristic attribute, namely the target style characteristic attribute, can be formed through fitting. For example, the target style characteristic attribute may be fitted based on the following formula:

P = a1×P1 + a2×P2 + … + an×Pn (1)

where P is the target style characteristic attribute, P1 is the style characteristic attribute of user 1, P2 is the style characteristic attribute of user 2, Pn is the style characteristic attribute of user n, a1 is the fitting coefficient of the style characteristic attribute of user 1, a2 is the fitting coefficient of the style characteristic attribute of user 2, an is the fitting coefficient of the style characteristic attribute of user n, n is the number of users, and a1 + a2 + … + an = 1.
Based on the above formula, an optimization method, such as a gradient descent method, a gauss-newton method, or the like, can be used to obtain the fitting coefficient for each style feature attribute.
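As an illustration of this fitting step, here is a minimal sketch using plain gradient descent with a renormalization so that the coefficients sum to 1; the learning rate, step count, and renormalization scheme are assumptions rather than part of the disclosure, which only requires some optimization method such as gradient descent or Gauss-Newton.

```python
import numpy as np

def fit_coefficients(style_attrs, target_attr, lr=0.01, steps=5000):
    """Fit target_attr ≈ a1*P1 + ... + an*Pn subject to a1 + ... + an = 1."""
    P = np.stack(style_attrs)                    # (n, d) style characteristic attributes
    a = np.full(P.shape[0], 1.0 / P.shape[0])    # start from a uniform combination
    for _ in range(steps):
        residual = a @ P - target_attr           # current fitting error, shape (d,)
        grad = 2.0 * (P @ residual)              # gradient of the squared error w.r.t. a
        a = a - lr * grad
        a = a / a.sum()                          # keep the coefficients summing to 1
    return a

# Hypothetical check with n = 3 users and 9-dimensional attributes.
rng = np.random.default_rng(1)
attrs = [rng.normal(size=9) for _ in range(3)]
target = 0.5 * attrs[0] + 0.3 * attrs[1] + 0.2 * attrs[2]
print(np.round(fit_coefficients(attrs, target), 3))  # expected to be close to [0.5, 0.3, 0.2]
```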
It should be noted that the present embodiment only exemplifies the division of the face topology data into three regions, and does not serve as a specific limitation to the division of the face topology data regions.
And S102, determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and the style characteristic vectors.
The style feature vectors correspond to the style feature attributes one to one.
Exemplarily, the style feature vector is a representation of a style; an embedding obtained by training a classification task model can be used as the style feature vector, or a one-hot feature vector can be designed directly as the style feature vector. For example, if the 3 style feature attributes of 3 users are represented by one-hot feature vectors, the 3 style feature vectors may be [1; 0; 0], [0; 1; 0] and [0; 0; 1].
On the basis of the embodiment, the style characteristic attributes of n users with different speaking styles are obtained, and accordingly, style characteristic vectors of n users can be obtained, wherein the n style characteristic attributes correspond to the n style characteristic vectors one by one, and the n style characteristic attributes and the respective corresponding style characteristic vectors form a style basic characteristic base. Based on the multiplication of the fitting coefficients of the n style feature attributes and the corresponding style feature vectors, the target style feature vector can be expressed in the form of a style basic feature base, as follows:
p=a1×F1+a2×F2+…+an×Fn (2)
wherein, F1 is the style feature vector of user 1, F2 is the style feature vector of user 2, fn is the style feature vector of user n, and p is the target style feature vector.
For example, when the style feature vectors are the one-hot feature vectors above, the target style feature vector p may be represented as p = [a1; a2; a3].
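Continuing the one-hot example, a short sketch of formula (2); the numeric coefficient values are illustrative assumptions.

```python
import numpy as np

F = np.eye(3)                    # one-hot style feature vectors F1, F2, F3 (the style basis)
a = np.array([0.5, 0.3, 0.2])    # fitting coefficients from formula (1), summing to 1

p = a @ F                        # p = a1*F1 + a2*F2 + a3*F3, per formula (2)
print(p)                         # [0.5 0.3 0.2] -- the target style feature vector
```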
S103, inputting the target style characteristic vector into a speaking style model, and outputting a target speaking style parameter.
The speaking style model is obtained by training a framework of the speaking style model based on the plurality of style feature vectors.
Illustratively, a framework of the speaking style model is trained according to the plurality of style feature vectors in the style basic feature base, and the trained framework of the speaking style model, namely the speaking style model, is obtained. Inputting the target style feature vector into the speaking style model can be understood as inputting the product of the plurality of style feature vectors and their respective fitting coefficients into the speaking style model, which has the same form as the training samples input when training the framework of the speaking style model. Therefore, based on the speaking style model, the target style feature vector is used as input, and the target speaking style parameters can be output directly.
The target speaking style parameters may be the vertex offsets between each vertex in the dynamic face topological structure and the corresponding vertex in the static face topological structure; alternatively, they may be the coefficients of the expression bases of the dynamic face topological structure, or other parameters, which is not specifically limited by the present disclosure.
And S104, generating a target speaking style based on the target speaking style parameters.
Illustratively, when the target speaking style parameters are the vertex offsets between each vertex in the dynamic face topological structure and the corresponding vertex in the static face topological structure, each vertex of the static face topological structure is driven, on the basis of the static face topological structure, to move to the corresponding position according to its vertex offset, so that the target speaking style can be obtained.
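A minimal sketch of this driving step, assuming the target speaking style parameters are per-vertex offsets stored in a NumPy array aligned with the vertex order of the static face topological structure; the frame count and vertex count are assumptions.

```python
import numpy as np

def drive_face(static_verts, vertex_offsets):
    """Move every vertex of the static face topology by its predicted offset."""
    return static_verts + vertex_offsets               # (V, 3) driven vertex positions

# Hypothetical usage: drive the static face frame by frame with predicted offsets.
static_verts = np.zeros((5000, 3))
offsets_per_frame = 0.01 * np.ones((60, 5000, 3))      # 60 frames of target speaking style parameters
driven_frames = [drive_face(static_verts, off) for off in offsets_per_frame]
```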
In the embodiments of the present disclosure, a target style characteristic attribute is fitted based on a plurality of style characteristic attributes, and a fitting coefficient of each style characteristic attribute is determined; a target style characteristic vector is determined according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, where the style characteristic vectors correspond to the style characteristic attributes one-to-one; the target style characteristic vector is input into a speaking style model and target speaking style parameters are output, where the speaking style model is obtained by training a framework of the speaking style model based on the plurality of style characteristic vectors; and a target speaking style is generated based on the target speaking style parameters. In this way, the target style characteristic vector can be fitted using the plurality of style characteristic vectors, and because the speaking style model is trained based on the plurality of style characteristic vectors, a corresponding new speaking style can be obtained directly by inputting the fitted target style characteristic vector into the speaking style model. No retraining of the speaking style model is needed, so rapid transfer of speaking styles can be achieved and the efficiency of speaking style generation is improved.
Fig. 5 is a schematic flowchart of a speaking style generation method according to some embodiments of the present disclosure. On the basis of the embodiment shown in fig. 3, before S101 is executed, the method further includes:
S201, collecting multi-frame face topological structure data when a plurality of preset users read a plurality of sections of voice aloud.
Illustratively, users with different speaking styles are selected as preset users, and multiple sections of voice are selected; when each preset user reads each section of voice aloud, multi-frame face topological structure data of that preset user are collected. For example, if the duration of voice 1 is t1 and the face topological structure data are collected at 30 frames per second, then after user 1 finishes reading voice 1 aloud, t1 × 30 frames of face topological structure data can be collected.
S202, aiming at each preset user: and determining the average value of the speaking style parameters of the multi-frame face topological structure data in each divided region according to the respective speaking style parameters of the multi-frame face topological structure data corresponding to the multiple sections of voices and the divided regions of the face topological structure data.
Illustratively, based on the above embodiment, for preset user 1, after preset user 1 finishes reading the m sections of voice, t1 × 30 × m frames of face topological structure data may be collected. The vertex offsets (Δx, Δy, Δz) between the vertices of the dynamic face topological structure and the corresponding vertices of the static face topological structure in each frame of face topological structure data can be used as the speaking style parameters of that frame, and the average vertex offset of each vertex of the dynamic face topological structure can be determined based on the vertex offsets of that vertex in all t1 × 30 × m frames of face topological structure data of preset user 1.
Based on the divided regions of the face topological structure data, for each divided region of preset user 1, the mean of the average vertex offsets of all vertices of the dynamic face topological structure falling within that region can be obtained. For example, when the face topological structure data are divided into three regions, the mean of the average vertex offsets of all vertices within region S1, the mean within region S2, and the mean within region S3 are obtained respectively.
S203, splicing the average values of the speaking style parameters of the multi-frame facial topological structure data in each divided region according to a preset sequence to obtain the style characteristic attribute of each preset user.
Illustratively, the preset sequence may be the top-to-bottom sequence shown in fig. 4, or the bottom-to-top sequence shown in fig. 4, which is not specifically limited by the present disclosure. If the preset sequence is the top-to-bottom sequence shown in fig. 4, then based on the above embodiment, the per-region means of the average vertex offsets of all vertices of the dynamic face topological structure in the face topological structure data can be spliced in the order of regions S1, S2 and S3, so that the style characteristic attribute of preset user 1 can be obtained.
In summary, one style characteristic attribute can be obtained for preset user 1, and thus a plurality of style characteristic attributes can be obtained for a plurality of preset users.
Fig. 6 is a schematic flowchart of a speaking style generating method according to some embodiments of the present disclosure, where fig. 6 is based on the embodiment shown in fig. 5, and before executing S101, the method further includes:
S301, collecting multi-frame target face topological structure data when the target user reads the multiple sections of voice aloud.
The target user and the preset users are different users.
Illustratively, when a target speaking style different from the speaking styles of the plurality of preset users needs to be generated, multi-frame target face topological structure data corresponding to the target speaking style are collected while the target user reads the multiple sections of voice aloud; the content of the multiple sections of voice read aloud by the target user is the same as the content read aloud by the preset users. For example, after the target user reads the m sections of voice with duration t1, t1 × 30 × m frames of target face topological structure data can be obtained.
S302, determining the average value of the speaking style parameters of the multi-frame target face topological structure data in each divided region according to the speaking style parameters of the multi-frame target face topological structure data corresponding to the multiple sections of voices and the divided regions of the face topological structure data.
Illustratively, the vertex offsets (Δx', Δy', Δz') between the vertices of the dynamic face topological structure in each frame of target face topological structure data and the corresponding vertices of the static face topological structure can be used as the speaking style parameters of that frame of target face topological structure data, and the average vertex offset of each vertex of the dynamic face topological structure in the target face topological structure data of the target user can be determined based on the vertex offsets of that vertex in all t1 × 30 × m frames of target face topological structure data.
Based on the divided regions of the face topological structure data, for each divided region of the target user, the mean of the average vertex offsets of all vertices of the dynamic face topological structure in the target face topological structure data falling within that region can be obtained. For example, when the face topological structure data are divided into three regions, the mean within region S1, the mean within region S2, and the mean within region S3 are obtained respectively.
And S303, splicing the average values of the speaking style parameters of the multi-frame target face topological structure data in each divided region according to the preset sequence to obtain the target style characteristic attribute.
For example, based on the same preset sequence as in the above embodiment, the per-region means of the average vertex offsets of all vertices of the dynamic face topological structure in the target face topological structure data are spliced; for example, based on the top-to-bottom sequence shown in fig. 4, the per-region means corresponding to the respective regions may be spliced in the order of regions S1, S2 and S3, so that the target style characteristic attribute of the target user can be obtained.
It should be noted that S201 to S203 shown in fig. 5 may be executed first, and then S301 to S303 shown in fig. 6 may be executed; alternatively, S301-S303 shown in fig. 6 may be performed first, and then S201-S203 shown in fig. 5 may be performed, which is not specifically limited by the present disclosure.
Fig. 7 is a schematic flowchart of a speech style generation method according to some embodiments of the present disclosure, where fig. 7 is based on the embodiments shown in fig. 5 and fig. 4, and before executing S103, the method further includes:
S401, a training sample set is obtained.
The training sample set comprises an input sample set and an output sample set, wherein the input sample set comprises voice features and a plurality of style feature vectors corresponding to the voice features, and the output sample set comprises the speaking style parameters.
When a preset user reads speech aloud, internal features of the speech information can be extracted, mainly features capable of expressing the speech content. For example, Mel features of the speech may be extracted as the speech features, a speech feature extraction model commonly used in the industry may be used, or the speech features may be extracted based on a designed deep network model, and the like. Using such speech feature extraction, after the preset users read the multiple sections of voice aloud, speech feature sequences can be extracted; because the content of the multiple sections of voice read by the multiple preset users is exactly the same, the same speech feature sequence can be extracted for different preset users. Therefore, for the same speech feature in the speech feature sequence, the plurality of style feature vectors of the plurality of preset users all correspond to that speech feature; one speech feature and the plurality of style feature vectors corresponding to it can be taken as one input sample, and a plurality of input samples, i.e., the input sample set, can be obtained based on all the speech features of the speech feature sequence.
Illustratively, corresponding face topological structure data can be acquired while extracting each voice feature, and vertex offsets of all vertexes of the dynamic face topological structure in the face topological structure data can be obtained based on respective vertex coordinates of all vertexes of the dynamic face topological structure in the face topological structure data. The respective vertex offsets of all the vertexes of the dynamic human face topological structure in the face topological structure data are used as a group of speaking style parameters, and the group of speaking style parameters is an output sample.
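For illustration, here is a sketch of how one training sample described above might be organized, assuming per-frame Mel features and vertex-offset outputs; the dataclass, field names, and sizes are assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    speech_feature: np.ndarray   # (d_audio,)         one frame of speech features (e.g. Mel)
    style_vectors: np.ndarray    # (n_users, d_style) style feature vector of each preset user
    style_params: np.ndarray     # (n_users, V, 3)    per-user speaking style parameters
                                 #                    (vertex offsets) for this frame

# Hypothetical sizes: 80-dimensional Mel frame, 3 preset users, 5000-vertex topology.
sample = TrainingSample(speech_feature=np.zeros(80),
                        style_vectors=np.eye(3),
                        style_params=np.zeros((3, 5000, 3)))
```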
S402, defining a framework of the speaking style model.
The framework of the speaking style model includes a linear combination unit and a network model. The linear combination unit is used for generating a linear combination style feature vector of the plurality of style feature vectors and generating a linear combination output sample of the plurality of output samples, where the input samples correspond to the output samples one-to-one; the network model is used for generating a corresponding prediction output sample according to the linear combination style feature vector.
Fig. 8 is a schematic structural diagram of a framework of a speaking style model according to some embodiments of the present disclosure. As shown in fig. 8, the framework of the speaking style model includes a linear combination unit 310 and a network model 320; the input end of the linear combination unit 310 is used for receiving training samples, the output end of the linear combination unit 310 is connected to the input end of the network model 320, and the output end of the network model 320 is the output end of the framework 300 of the speaking style model.
After the training sample is input to the linear combination unit 310, the training sample includes an input sample and an output sample, where the input sample includes speech features and a plurality of style feature vectors corresponding to the speech features, the linear combination unit 310 may perform linear combination on the plurality of style feature vectors to obtain linear combination style feature vectors, and may further perform linear combination on speech style parameters corresponding to the plurality of style feature vectors to obtain linear combination output samples. The linear combination unit 310 may output the speech features and the corresponding linear combination style feature vectors, that is, the linear combination input samples, and may also output the corresponding linear combination output samples. The linear combination training samples are input to the network model 320, the linear combination training samples include linear combination input samples and linear combination output samples, and the network model 320 is trained based on the linear combination training samples.
And S403, training the framework of the speaking style model according to the training sample set and the loss function to obtain the speaking style model.
Based on the above embodiment, the training samples in the training sample set are input to the framework of the speaking style model, and the framework of the speaking style model can output prediction output samples; the loss function is used to determine the loss value between the prediction output samples and the output samples, and the model parameters of the framework of the speaking style model are adjusted in the direction of decreasing loss value, thereby completing one iteration of training. Therefore, the trained framework of the speaking style model, namely the speaking style model, can be obtained through multiple iterations of training.
In this embodiment, a training sample set is obtained, where the training sample set includes an input sample set and an output sample set, an input sample includes speech features and a plurality of style feature vectors corresponding to the speech features, and an output sample includes speaking style parameters; a framework of the speaking style model is defined, where the framework includes a linear combination unit and a network model, the linear combination unit is used for generating a linear combination style feature vector of the plurality of style feature vectors and generating a linear combination output sample of the plurality of output samples, the input samples correspond to the output samples one-to-one, and the network model is used for generating a corresponding prediction output sample according to the linear combination style feature vector; the framework of the speaking style model is trained according to the training sample set and the loss function to obtain the speaking style model. Because the speaking style model is obtained by training the network model based on linear combinations of the plurality of style feature vectors, the diversity of the training samples of the network model can be increased and the generality of the speaking style model can be improved.
Fig. 9 is a schematic flowchart of a speech style generation method according to some embodiments of the present disclosure, and fig. 9 is a detailed description of a possible implementation manner when S403 is executed on the basis of the embodiment shown in fig. 7, as follows:
S501, inputting the training sample set into the linear combination unit, generating the linear combination style feature vector based on the style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the style feature vectors and the output samples.
The sum of the weight values of the plurality of style feature vectors is 1.
For example, after the training sample is input to the linear combination unit, based on the linear combination unit, weight values may be respectively given to the plurality of style feature vectors, a sum of the weight values of the plurality of style feature vectors is 1, and products of each style feature vector and a corresponding weight value in the plurality of style feature vectors are added to obtain a linear combination style feature vector. Each style feature vector corresponds to one output sample, and products of weighted values of the style feature vectors and the corresponding output samples are added to obtain a linear combination output sample. Thus, different linear combination style feature vectors and different linear combination output samples can be obtained based on different weight values, a linear combination input sample set can be obtained based on a plurality of voice features and linear combination style feature vectors corresponding to the voice features, and a linear combination output sample set can be obtained based on output samples corresponding to the voice features.
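A sketch of the behavior of the linear combination unit under stated assumptions: the random weights summing to 1 are drawn here from a Dirichlet distribution, which is an implementation choice not specified by the disclosure.

```python
import numpy as np

def linear_combine(style_vectors, output_samples, rng):
    """Combine per-user style vectors and output samples with weights that sum to 1.

    style_vectors:  (n_users, d_style)
    output_samples: (n_users, d_out)  flattened speaking style parameters per user
    """
    weights = rng.dirichlet(np.ones(style_vectors.shape[0]))  # random weights, sum to 1
    combined_style = weights @ style_vectors                  # linear combination style feature vector
    combined_output = weights @ output_samples                # linear combination output sample
    return combined_style, combined_output, weights

rng = np.random.default_rng(2)
styles = np.eye(3)
outputs = np.arange(18, dtype=float).reshape(3, 6)
c_style, c_output, w = linear_combine(styles, outputs, rng)
```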
S502, training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
The linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
Illustratively, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set. The linear combination training samples are input to the network model, a prediction output sample can be obtained based on the network model and the linear combination input sample, and the model parameters of the network model are adjusted in the direction of decreasing loss value of the loss function, thereby completing one iteration of training of the network model. Therefore, the trained framework of the speaking style model, namely the speaking style model, can be obtained through multiple iterations of training of the network model.
In this embodiment, a training sample set is input to a linear combination unit, a linear combination style feature vector is generated based on a plurality of style feature vectors and their respective weight values, a linear combination output sample is generated based on the respective weight values of the plurality of style feature vectors and a plurality of output samples, and the sum of the weight values of the plurality of style feature vectors is 1; training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, the linear combination input sample comprises voice characteristics and linear combination style characteristic vectors corresponding to the voice characteristics, the training samples after linear combination can be used as training samples of the network model, the number and diversity of the training samples of the network model can be improved, and the universality and accuracy of the speaking style model can be improved.
In some embodiments of the present disclosure, fig. 10 is a schematic structural diagram of a framework of another speech style generation model provided in the embodiments of the present disclosure. As shown in fig. 10, on the basis of the embodiment shown in fig. 8, the framework of the speech style model further includes a scaling unit 330. The input end of the scaling unit 330 is configured to receive the training samples, and the output end of the scaling unit 330 is connected to the input end of the linear combination unit 310. The scaling unit 330 is configured to scale the plurality of style feature vectors and the plurality of output samples based on a randomly generated scaling factor to obtain a plurality of scaling style feature vectors and a plurality of scaling output samples, and to output a scaling training sample, where the scaling training sample includes the plurality of scaling style feature vectors and the scaling output samples corresponding to the plurality of scaling style feature vectors. The scaling factor may range from 0.5 to 2 and is accurate to one decimal place.
The scaled training samples are input to the linear combination unit 310, and based on the linear combination unit 310, the multiple scaled style feature vectors may be linearly combined to obtain a linear combination style feature vector, and the scaled output samples corresponding to the multiple scaled style feature vectors may also be linearly combined to obtain a linear combination output sample. The linear combination unit 310 may output the speech features and the corresponding linear combination style feature vectors, that is, the linear combination input samples, and may also output the corresponding linear combination output samples. The linear combination training samples are input to the network model 320, the linear combination training samples include linear combination input samples and linear combination output samples, and the network model 320 is trained based on the linear combination training samples.
Fig. 11 is a schematic flowchart of a speech style generation method according to some embodiments of the present disclosure, and fig. 11 is a detailed description of another possible implementation manner when S403 is executed on the basis of the embodiment shown in fig. 7, as follows:
S5011, inputting the training sample set to the scaling unit, generating a plurality of scaled style feature vectors based on a scaling factor and the plurality of style feature vectors, and generating a plurality of scaled output samples based on the scaling factor and the plurality of output samples.
For example, after the training sample is input to the scaling unit, based on the scaling unit, the multiple style feature vectors can be scaled by the random scaling factor, so that multiple scaled style feature vectors can be obtained. Each style feature vector corresponds to an output sample, and the corresponding output sample is scaled based on the scaling factor of each style feature vector to obtain a plurality of scaled output samples. In this way, a scaled input sample set can be obtained based on the plurality of speech features and the plurality of scaled style feature vectors corresponding to the speech features, and a scaled output sample set can be obtained based on the scaled output samples corresponding to the speech features.
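A sketch of this scaling step under the stated range: a random factor between 0.5 and 2, rounded to one decimal place, scales each style feature vector and its corresponding output sample; applying an independent factor per user is an assumption.

```python
import numpy as np

def scale_samples(style_vectors, output_samples, rng):
    """Scale each style feature vector and its output sample by a random factor in [0.5, 2]."""
    n_users = style_vectors.shape[0]
    factors = np.round(rng.uniform(0.5, 2.0, size=n_users), 1)  # accurate to one decimal place
    scaled_styles = factors[:, None] * style_vectors
    scaled_outputs = factors[:, None] * output_samples
    return scaled_styles, scaled_outputs, factors

rng = np.random.default_rng(3)
scaled_styles, scaled_outputs, factors = scale_samples(np.eye(3), np.ones((3, 6)), rng)
```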
S5012, inputting the plurality of scaling style feature vectors and the plurality of scaling output samples to the linear combination unit, generating the linear combination style feature vectors based on the plurality of scaling style feature vectors and their respective weight values, and generating the linear combination output samples based on the respective weight values of the plurality of scaling style feature vectors and the plurality of scaling output samples.
The sum of the weight values of the plurality of scaling style feature vectors is 1.
Illustratively, the scaling training sample set includes a scaling input sample set and a scaling output sample set, the scaling training sample set is input to the linear combination unit, based on the linear combination unit, weight values may be respectively given to the plurality of scaling style feature vectors, a sum of the weight values of the plurality of scaling style feature vectors is 1, products of the scaling style feature vectors and the corresponding weight values in the plurality of scaling style feature vectors are added, and the linear combination style feature vector may be obtained. Each scaling style feature vector corresponds to one scaling output sample, and products of respective weight values of the scaling style feature vectors and the corresponding scaling output samples are added to obtain a linear combination output sample. Thus, different linear combination style feature vectors and different linear combination output samples can be obtained based on different weight values, a linear combination input sample set can be obtained based on a plurality of voice features and linear combination style feature vectors corresponding to the voice features, and a linear combination output sample set can be obtained based on a scaling output sample corresponding to the voice features.
S502, training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model.
The linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
Illustratively, the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set, the linear combination training samples are input to the network model, a prediction output sample can be obtained based on the network model and the linear combination input sample, model parameters of the network model are adjusted based on the direction of loss value reduction of the loss function, and thus, one iterative training of the network model is completed. Therefore, a trained frame for training the speaking style model, namely the speaking style model, can be obtained based on multiple iterative training of the network model.
In this embodiment, the framework of the speaking style model further includes a scaling unit; the training sample set is input into the scaling unit, a plurality of scaling style feature vectors are generated based on a scaling factor and the plurality of style feature vectors, and a plurality of scaling output samples are generated based on the scaling factor and the plurality of output samples; the plurality of scaling style feature vectors and the plurality of scaling output samples are input into the linear combination unit, the linear combination style feature vectors are generated based on the plurality of scaling style feature vectors and their respective weight values, and the linear combination output samples are generated based on the respective weight values of the plurality of scaling style feature vectors and the plurality of scaling output samples, where the sum of the weight values of the plurality of scaling style feature vectors is 1; the network model is trained according to the loss function and the linear combination training sample set to obtain the speaking style model, where the linear combination training sample set includes a linear combination input sample set and a linear combination output sample set, and the linear combination input samples include speech features and the linear combination style feature vectors corresponding to the speech features.
In some embodiments of the present disclosure, fig. 12A is a schematic structural diagram of a framework of a speech style generation model provided in some embodiments of the present disclosure, and fig. 12B is a schematic structural diagram of a framework of a speech style generation model provided in some embodiments of the present disclosure; fig. 12A is based on the embodiment shown in fig. 8, and fig. 12B is based on the embodiment shown in fig. 10. The network model 320 includes a primary network model 321, a secondary network model 322, and a superposition unit 323; the output end of the primary network model 321 and the output end of the secondary network model 322 are both connected to the input end of the superposition unit 323, and the output end of the superposition unit 323 is used to output the prediction output sample. The loss function includes a first loss function and a second loss function.
The linear combination training samples are input into the primary network model 321 and the secondary network model 322 respectively; a primary prediction output sample is output by the primary network model 321 and a secondary prediction output sample is output by the secondary network model 322. The primary prediction output sample and the secondary prediction output sample are input into the superposition unit 323, which superposes them to obtain the prediction output sample. The primary network model 321 may include a convolutional network and a fully connected network, whose function is to extract the single-frame correspondence between speech and face topology data. The secondary network model 322 may be a sequence-to-sequence (seq2seq) network model, for example a Long Short-Term Memory (LSTM) network model, a Gated Recurrent Unit (GRU) network model, or a Transformer network model, whose function is to enhance the temporal smoothness of the facial expressions and speaking style generated from the speech features.
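One possible arrangement of the primary network model, the secondary network model and the superposition unit is sketched below; the GRU is only one of the seq2seq options named above, and every layer size and dimension is an assumption:

```python
import torch
from torch import nn

class PrimaryNet(nn.Module):
    """Per-frame mapping from speech features (plus style vector) to speaking style parameters."""
    def __init__(self, feat_dim=80, style_dim=16, out_dim=52):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim + style_dim, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, out_dim)

    def forward(self, x):                          # x: (batch, time, feat+style)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.fc(h)                          # (batch, time, out_dim)

class SecondaryNet(nn.Module):
    """Sequence model (GRU here) that smooths the frame-wise predictions over time."""
    def __init__(self, feat_dim=80, style_dim=16, out_dim=52):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + style_dim, 128, batch_first=True)
        self.fc = nn.Linear(128, out_dim)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.fc(h)

class SpeakingStyleNet(nn.Module):
    """Superposition unit: add the two branch outputs to form the prediction output sample."""
    def __init__(self):
        super().__init__()
        self.primary, self.secondary = PrimaryNet(), SecondaryNet()

    def forward(self, x):
        p, s = self.primary(x), self.secondary(x)
        return p + s, p, s                         # prediction plus branch outputs for the losses
```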
Illustratively, the loss function is L = b1 × L1 + b2 × L2, where L1 is the first loss function, used to determine the loss value between the primary prediction output sample and the linear combination output sample; L2 is the second loss function, used to determine the loss value between the secondary prediction output sample and the linear combination output sample; b1 is the weight of the first loss function, b2 is the weight of the second loss function, and both b1 and b2 are adjustable. The primary network model 321 can be trained alone by setting b2 close to 0, and the secondary network model 322 can be trained alone by setting b1 close to 0. The primary network model and the secondary network model can therefore be trained independently in stages, which improves the convergence rate of network model training, saves training time, and thus improves the generation efficiency of the speaking style.
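The weighted loss could be computed as in the following sketch, where the use of mean squared error for L1 and L2 is an assumption:

```python
import torch
from torch import nn

mse = nn.MSELoss()                                  # assumed form of L1 and L2

def combined_loss(primary_pred, secondary_pred, combo_output, b1, b2):
    """L = b1 * L1 + b2 * L2: both branch predictions are compared with the
    linear combination output sample."""
    l1 = mse(primary_pred, combo_output)            # first loss function
    l2 = mse(secondary_pred, combo_output)          # second loss function
    return b1 * l1 + b2 * l2

# Stage 1: b2 close to 0, so gradients effectively train only the primary branch.
stage1_loss = lambda p, s, y: combined_loss(p, s, y, b1=1.0, b2=1e-6)
# Stage 2: b1 close to 0, so gradients effectively train only the secondary branch.
stage2_loss = lambda p, s, y: combined_loss(p, s, y, b1=1e-6, b2=1.0)
```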
Fig. 13 is a schematic flowchart of a speaking style generation method according to some embodiments of the present disclosure. On the basis of the embodiments shown in fig. 9 or fig. 11, fig. 13 describes in detail a possible implementation of S502, as follows:
S5021, training the primary network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model.
The intermediate speaking style model comprises the secondary network model and the trained primary network model.
Illustratively, based on the above embodiment, in the first stage the weight b2 of the second loss function is set close to 0, so the loss function of the current network model can be understood as the first loss function, and the linear combination training samples are input into the primary network model and the secondary network model respectively. The model parameters of the primary network model are adjusted in the direction that reduces the first loss value until the first loss value converges, yielding the trained primary network model; the frame of the speaking style model trained in this first stage is the intermediate speaking style model.
S5022, fixing the model parameters of the trained primary network model.
For example, after the primary network model is trained, the second stage is entered; first, the model parameters of the trained primary network model need to be fixed.
S5023, training the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model.
The speaking style model comprises the trained primary network and the trained secondary network.
Then, the weight b1 of the first loss function is set close to 0, so the loss function of the current network model can be understood as the second loss function, and the linear combination training samples are input into the secondary network model and the trained primary network model. The model parameters of the secondary network model are adjusted in the direction that reduces the second loss value until the second loss value converges, yielding the trained secondary network model; the frame of the speaking style model trained in this second stage is the speaking style model.
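A sketch of this second stage is given below; it assumes a model shaped like the earlier architecture sketch (with .primary and .secondary submodules), and fixing the primary network's parameters via requires_grad as well as the choice of optimizer are assumptions about one possible implementation:

```python
import torch

def train_second_stage(model, loader, epochs=10, lr=1e-3):
    """Stage-2 sketch: assumes model.forward returns (prediction, primary_pred, secondary_pred).
    The trained primary network's parameters are fixed and only the secondary network is
    updated against the second loss function until its loss value converges."""
    for p in model.primary.parameters():
        p.requires_grad = False                     # fix the trained primary-network parameters
    optimizer = torch.optim.Adam(model.secondary.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, combo_output in loader:
            _, _, secondary_pred = model(x)
            loss = mse(secondary_pred, combo_output)  # second loss function only (b1 -> 0)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```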
In this embodiment, the network model includes a primary network model, a secondary network model and a superposition unit; the output end of the primary network model and the output end of the secondary network model are both connected with the input end of the superposition unit, and the output end of the superposition unit is used to output the prediction output sample. The loss function includes a first loss function and a second loss function. The primary network model is trained according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, wherein the intermediate speaking style model comprises the secondary network model and the trained primary network model. The model parameters of the trained primary network model are then fixed, and the secondary network model in the intermediate speaking style model is trained according to the linear combination training sample set and the second loss function to obtain the speaking style model, wherein the speaking style model comprises the trained primary network and the trained secondary network.
Fig. 14 is a schematic structural diagram of a speaking style generation apparatus according to some embodiments of the present disclosure. The apparatus is configured in computer equipment and can implement the speaking style generation method of any embodiment of the present application. The apparatus specifically includes the following modules:
a determining module 410, configured to fit the target style characteristic attribute based on the plurality of style characteristic attributes, and determine a fitting coefficient of each style characteristic attribute; determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, wherein the style characteristic vectors correspond to the style characteristic attributes one to one; and inputting the target style characteristic vector into a speaking style model, and outputting target speaking style parameters, wherein the speaking style model is obtained by training a frame of the speaking style model based on the plurality of style characteristic vectors.
A generating module 420, configured to generate a target speaking style based on the target speaking style parameter.
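For the fitting step performed by the determining module 410 above, the construction of the target style characteristic vector could look like the following sketch; the use of ordinary least squares is an assumption, as the disclosure only states that a fitting is performed:

```python
import numpy as np

def target_style_vector(style_attrs, style_vectors, target_attr):
    """Determining-module sketch.

    style_attrs: (K, A) style characteristic attributes of the K preset users.
    style_vectors: (K, D) style feature vectors, one per style characteristic attribute.
    target_attr: (A,) target style characteristic attribute of the target user.
    """
    # Fit target_attr ~= sum_k coeff_k * style_attrs[k] by ordinary least squares.
    coeffs, *_ = np.linalg.lstsq(style_attrs.T, target_attr, rcond=None)
    # Reuse the fitting coefficients on the paired style feature vectors.
    return coeffs @ style_vectors, coeffs
```

The resulting vector would then be fed to the speaking style model to obtain the target speaking style parameters.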
As an optional implementation manner of the embodiment of the present disclosure, fig. 15 is a schematic structural diagram of a speaking style generation apparatus provided in some embodiments of the present disclosure; fig. 15 is based on the embodiment shown in fig. 14, and the speaking style generation apparatus further includes:
the collecting module 430 is configured to collect multi-frame face topology data when multiple preset users read multiple segments of voice aloud.
A determining module 410, further configured to, for each preset user: determining the average value of the speaking style parameters of the multi-frame face topological structure data in each divided region according to the respective speaking style parameters of the multi-frame face topological structure data corresponding to the multiple sections of voices and the divided regions of the face topological structure data; and splicing the average values of the speaking style parameters of the multi-frame facial topological structure data in each divided region according to a preset sequence to obtain the style characteristic attribute of each preset user.
As an optional implementation manner of the embodiment of the present disclosure, on the basis of the above embodiment, the collecting module 430 is further configured to collect multi-frame target face topology data when a target user reads the multiple segments of speech, where the target user is a different user from the multiple preset users.
The determining module 410 is further configured to determine, according to respective speaking style parameters of the multi-frame target face topological structure data corresponding to the multiple segments of speech and divided regions of the face topological structure data, an average value of the speaking style parameters of the multi-frame target face topological structure data in each divided region; and splicing the average values of the speaking style parameters of the multi-frame target face topological structure data in each divided region according to the preset sequence to obtain the target style characteristic attribute.
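The region-wise averaging and splicing performed by the determining module 410, for both the preset users and the target user, could be sketched as follows; representing the divided regions as an index array and averaging over both frames and points within a region are assumptions:

```python
import numpy as np

def style_characteristic_attribute(frame_params, region_ids, region_order):
    """frame_params: (F, V) speaking style parameters for F frames and V points of the
    face topology data.
    region_ids: (V,) divided-region label of each point.
    region_order: preset sequence of region labels used for splicing."""
    region_ids = np.asarray(region_ids)
    means = []
    for region in region_order:
        mask = region_ids == region
        # Average the speaking style parameters of the multi-frame data over this divided region.
        means.append(frame_params[:, mask].mean())
    return np.array(means)                          # averages spliced in the preset order
```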
As an optional implementation manner of the embodiment of the present disclosure, fig. 16 is a schematic structural diagram of a speaking style generation apparatus provided in some embodiments of the present disclosure; fig. 16 is based on the embodiment shown in fig. 15, and the speaking style generation apparatus further includes:
the obtaining module 440 is configured to obtain a training sample set, where the training sample set includes an input sample set and an output sample set, the input sample includes a speech feature and the plurality of style feature vectors corresponding to the speech feature, and the output sample includes the speaking style parameter.
A frame defining module 450, configured to define a frame of the speaking style model, where the frame of the speaking style model includes a linear combination unit and a network model, the linear combination unit is configured to generate linear combination style feature vectors of the multiple style feature vectors, and generate linear combination output samples of multiple output samples, where the input samples correspond to the output samples one to one; and the network model is used for generating a corresponding prediction output sample according to the linear combination style feature vector.
And the training module 460 is configured to train a frame of the speaking style model according to the training sample set and the loss function, so as to obtain the speaking style model.
As an optional implementation manner of the embodiment of the present disclosure, on the basis of the above embodiment, the training module 460 is further configured to input the training sample set to the linear combination unit, generate the linear combination style feature vector based on the plurality of style feature vectors and their respective weight values, and generate the linear combination output sample based on the respective weight values of the plurality of style feature vectors and the plurality of output samples, where a sum of the weight values of the plurality of style feature vectors is 1; and train the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
As an optional implementation manner of the embodiment of the present disclosure, on the basis of the above embodiment, the frame of the speaking style model further includes a scaling unit.
The training module 460 is further configured to input the training sample set to the scaling unit, generate a plurality of scaled style feature vectors based on a scaling factor and the plurality of style feature vectors, and generate a plurality of scaled output samples based on the scaling factor and the plurality of output samples; input the plurality of scaling style feature vectors and the plurality of scaling output samples into the linear combination unit, generate the linear combination style feature vectors based on the plurality of scaling style feature vectors and their respective weight values, and generate the linear combination output samples based on the respective weight values of the plurality of scaling style feature vectors and the plurality of scaling output samples, wherein the sum of the weight values of the plurality of scaling style feature vectors is 1; and train the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
As an optional implementation manner of the embodiment of the present disclosure, on the basis of the above embodiment, the network model includes a primary network model, a secondary network model, and a superposition unit, an output end of the primary network model and an output end of the secondary network model are both connected to an input end of the superposition unit, and an output end of the superposition unit is used to output the prediction output sample; the loss function includes a first loss function and a second loss function.
The training module 460 is further configured to train the primary network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, where the intermediate speaking style model includes the secondary network model and the trained primary network model; fix the model parameters of the trained primary network model; and train the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model, wherein the speaking style model comprises the trained primary network and the trained secondary network.
The speaking style generation apparatus provided by the embodiments of the present disclosure can execute the speaking style generation method provided by any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects for executing the method.
An embodiment of the present disclosure provides an electronic device, including: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of any method embodiment of the present disclosure.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of any of the method embodiments of the present disclosure.
Claims (10)
1. A speaking style generation method, comprising:
fitting the target style characteristic attributes based on the plurality of style characteristic attributes, and determining a fitting coefficient of each style characteristic attribute;
determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, wherein the style characteristic vectors correspond to the style characteristic attributes one to one;
inputting the target style characteristic vector into a speaking style model, and outputting target speaking style parameters, wherein the speaking style model is obtained by training a frame of the speaking style model based on the plurality of style characteristic vectors;
and generating a target speaking style based on the target speaking style parameters.
2. The method of claim 1, wherein fitting the target style attribute based on a plurality of style attributes further comprises, prior to determining a fitting coefficient for each style attribute:
collecting multi-frame face topological structure data when a plurality of preset users read a plurality of sections of voice;
for each preset user: determining the average value of the speaking style parameters of the multi-frame face topological structure data in each divided region according to the respective speaking style parameters of the multi-frame face topological structure data corresponding to the multiple sections of voices and the divided regions of the face topological structure data;
and splicing the average values of the speaking style parameters of the multi-frame face topological structure data in each divided region according to a preset sequence to obtain the style characteristic attribute of each preset user.
3. The method of claim 2, further comprising:
collecting multi-frame target face topological structure data when a target user reads the multiple sections of voice, wherein the target user and the preset users are different users;
determining the average value of the speaking style parameters of the multi-frame target face topological structure data in each divided region according to the respective speaking style parameters of the multi-frame target face topological structure data corresponding to the multiple sections of voices and the divided regions of the face topological structure data;
and splicing the average values of the speaking style parameters of the multi-frame target face topological structure data in each divided region according to the preset sequence to obtain the target style characteristic attribute.
4. The method according to claim 2, wherein before the inputting the target style characteristic vector into the speaking style model and outputting the target speaking style parameters, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises an input sample set and an output sample set, the input sample comprises voice features and a plurality of style feature vectors corresponding to the voice features, and the output sample comprises the speaking style parameters;
defining a frame of the speaking style model, wherein the frame of the speaking style model comprises a linear combination unit and a network model, the linear combination unit is used for generating linear combination style feature vectors of the style feature vectors and generating linear combination output samples of the output samples, and the input samples correspond to the output samples one to one; the network model is used for generating a corresponding prediction output sample according to the linear combination style feature vector;
and training a frame of the speaking style model according to the training sample set and the loss function to obtain the speaking style model.
5. The method according to claim 4, wherein the training the frame of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises:
inputting the training sample set into the linear combination unit, generating the linear combination style feature vector based on the style feature vectors and their respective weight values, and generating the linear combination output sample based on the respective weight values of the style feature vectors and the output samples, wherein the sum of the weight values of the style feature vectors is 1;
and training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
6. The method according to claim 4, wherein the framework of the speaking style model further comprises a scaling unit;
the training the frame of the speaking style model according to the training sample set and the loss function to obtain the speaking style model comprises the following steps:
inputting the training sample set into the scaling unit, generating a plurality of scaling style feature vectors based on a scaling factor and the plurality of style feature vectors, and generating a plurality of scaled output samples based on the scaling factor and the plurality of output samples;
inputting the plurality of scaling style feature vectors and the plurality of scaling output samples into the linear combination unit, generating the linear combination style feature vectors based on the plurality of scaling style feature vectors and their respective weight values, and generating the linear combination output samples based on the respective weight values of the plurality of scaling style feature vectors and the plurality of scaling output samples, wherein the sum of the weight values of the plurality of scaling style feature vectors is 1;
and training the network model according to the loss function and a linear combination training sample set to obtain the speaking style model, wherein the linear combination training sample set comprises a linear combination input sample set and a linear combination output sample set, and the linear combination input sample comprises the speech features and the linear combination style feature vectors corresponding to the speech features.
7. The method according to claim 5 or 6, wherein the network model comprises a primary network model, a secondary network model and a superposition unit, wherein an output of the primary network model and an output of the secondary network model are both connected to an input of the superposition unit, and an output of the superposition unit is used for outputting the predicted output samples; the loss function comprises a first loss function and a second loss function;
the training the network model according to the loss function and the linear combination training sample set to obtain the speaking style model comprises the following steps:
training the primary network model according to the linear combination training sample set and the first loss function to obtain an intermediate speaking style model, wherein the intermediate speaking style model comprises the secondary network model and the trained primary network model;
fixing model parameters of the trained primary network model;
and training the secondary network model in the intermediate speaking style model according to the linear combination training sample set and the second loss function to obtain the speaking style model, wherein the speaking style model comprises the trained primary network and the trained secondary network.
8. A speaking style generation apparatus, comprising:
the determining module is used for fitting the target style characteristic attributes based on the plurality of style characteristic attributes and determining the fitting coefficient of each style characteristic attribute; determining a target style characteristic vector according to the fitting coefficient of each style characteristic attribute and a plurality of style characteristic vectors, wherein the style characteristic vectors correspond to the style characteristic attributes one to one; inputting the target style characteristic vector into a speaking style model, and outputting target speaking style parameters, wherein the speaking style model is obtained by training a frame of the speaking style model based on the plurality of style characteristic vectors;
and the generating module is used for generating the target speaking style based on the target speaking style parameters.
9. An electronic device, comprising: a processor for executing a computer program stored in a memory, the computer program, when executed by the processor, implementing the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210714001.7A CN115270922A (en) | 2022-06-22 | 2022-06-22 | Speaking style generation method and device, electronic equipment and storage medium |
CN202380027498.8A CN118891616A (en) | 2022-06-22 | 2023-03-01 | Virtual digital person driving method, device, equipment and medium |
PCT/CN2023/079026 WO2023246163A1 (en) | 2022-06-22 | 2023-03-01 | Virtual digital human driving method, apparatus, device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210714001.7A CN115270922A (en) | 2022-06-22 | 2022-06-22 | Speaking style generation method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115270922A true CN115270922A (en) | 2022-11-01 |
Family
ID=83760671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210714001.7A Pending CN115270922A (en) | 2022-06-22 | 2022-06-22 | Speaking style generation method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115270922A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023246163A1 (en) * | 2022-06-22 | 2023-12-28 | 海信视像科技股份有限公司 | Virtual digital human driving method, apparatus, device, and medium |
Similar Documents
Publication | Title
---|---
CN110531860B | Animation image driving method and device based on artificial intelligence
US10468025B2 | Speech interaction method and apparatus
WO2020182153A1 | Method for performing speech recognition based on self-adaptive language, and related apparatus
CN110838286A | Model training method, language identification method, device and equipment
CN107340859A | The multi-modal exchange method and system of multi-modal virtual robot
CN110969682B | Virtual image switching method and device, electronic equipment and storage medium
CN114419205B | Driving method of virtual digital person and training method of pose acquisition model
CN107291690A | Punctuate adding method and device, the device added for punctuate
CN111027425A | Intelligent expression synthesis feedback interaction system and method
CN112188304A | Video generation method, device, terminal and storage medium
WO2023246163A9 | Virtual digital human driving method, apparatus, device, and medium
CN112309365A | Training method and device of speech synthesis model, storage medium and electronic equipment
JP2022502758A | Coding methods, equipment, equipment and programs
CN115953521B | Remote digital person rendering method, device and system
CN115937369A | Expression animation generation method and system, electronic equipment and storage medium
CN113035198A | Lip movement control method, device and medium for three-dimensional face
CN114173188A | Video generation method, electronic device, storage medium, and digital human server
CN115270922A | Speaking style generation method and device, electronic equipment and storage medium
Wang et al. | Design of a Virtual Reality-Based Learning System for Spoken English.
CN117171310A | Digital person interaction method and device, electronic equipment and storage medium
CN117152308A | Virtual person action expression optimization method and system
CN112149599A | Expression tracking method and device, storage medium and electronic equipment
CN115712739B | Dance motion generation method, computer device and storage medium
CN104460991A | Gesture interaction control system based on digital household equipment
WO2023206928A1 | Speech processing method and apparatus, computer device, and computer-readable storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination