
WO2021073416A1 - Method for generating virtual character video on the basis of neural network, and related device - Google Patents

Method for generating virtual character video on the basis of neural network, and related device Download PDF

Info

Publication number
WO2021073416A1
WO2021073416A1 PCT/CN2020/118373
Authority
WO
WIPO (PCT)
Prior art keywords
audio
virtual character
dimensional
mouth
text
Prior art date
Application number
PCT/CN2020/118373
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
王义文
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021073416A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method and related equipment for generating a virtual character video based on a neural network.
  • a virtual character is a character that does not exist in reality; it is a fictional character appearing in creative works such as TV series, comics, and games. Virtual characters are usually synthesized with 3D scanning and similar methods, generating the required character by setting facial parameters.
  • the inventor realized that, when a virtual character is generated, its voice cannot be kept completely consistent with its mouth movements, which results in poor fidelity and makes it impossible to achieve a playback effect realistic enough to pass for real.
  • a method and related equipment for generating a virtual character video based on a neural network are provided.
  • a device for generating a virtual character video based on a neural network, including the following modules:
  • a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
  • a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
  • a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
  • the present application also provides a computer device, including a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor executes the steps of the following method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing it into a preset text-to-speech conversion model for voice conversion to obtain audio;
  • the present application also provides a storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing it into a preset text-to-speech conversion model for voice conversion to obtain audio;
  • this application effectively converts the characters in the text into audio, and then reconstructs the facial features corresponding to the audio onto the three-dimensional facial image through adversarial neural network and memory neural network techniques.
  • As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
  • FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application
  • FIG. 2 is a schematic diagram of an audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of the present application
  • Fig. 4 is a structural diagram of an apparatus for generating a virtual character video based on a neural network in an embodiment of the application.
  • Fig. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application.
  • a method for generating a virtual character video based on a neural network includes the following steps:
  • the text to be recognized may be in languages such as Chinese, English, or Japanese.
  • the positions of the separators in the text, such as "," and ".", can be determined first. According to the positions of these separators, the text to be recognized is divided into several sub-texts, and each sub-text is imported into the text-to-speech conversion model for text-to-sound conversion.
  • the text-to-speech conversion model can be composed of the Char2Wav architecture.
  • in the Char2Wav architecture, a simple recurrent neural network and a cross recurrent network are used to convert the words in the sub-texts into speech.
  • when performing voice conversion on a sub-text, the words in the sub-text can be converted into multi-dimensional word vectors, and the feature values and dimensions of the multi-dimensional word vectors are then used as input parameters for training and conversion in the simple recurrent neural network and the cross recurrent neural network.
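  • As a rough illustration of this preprocessing, the sketch below splits a text on an assumed set of separators and converts each word of a sub-text into a placeholder vector; the separator set, embedding, and vector size are assumptions rather than the Char2Wav pipeline described above.

```python
import re
import numpy as np

def split_into_subtexts(text):
    """Split the text to be recognized on an assumed set of separators."""
    parts = re.split(r"[,，。.!？?;；]", text)
    return [p.strip() for p in parts if p.strip()]

def words_to_vectors(subtext, dim=8, seed=0):
    """Toy word-vector conversion: each word gets a deterministic random vector.
    A real system would use a trained embedding such as word2vec."""
    rng = np.random.default_rng(seed)
    vocab, vectors = {}, []
    for word in subtext.split():
        if word not in vocab:
            vocab[word] = rng.normal(size=dim)
        vectors.append(vocab[word])
    return np.stack(vectors)

subtexts = split_into_subtexts("Hello there, this is a demo. It has two sub-texts.")
for st in subtexts:
    print(st, words_to_vectors(st).shape)
```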
  • the prosody parameters of the audio include pitch, duration, pause frequency, and so on.
  • the audio generation model may adopt a hidden Markov model.
  • the frequency range value and the vibration amplitude value of the audio frequency spectrum are obtained, and the frequency range value and the vibration amplitude value are input into the hidden Markov model to extract audio feature points.
  • the formula for audio feature point extraction is:
  • D(x, y) = ∫ P(X|x) · P(X|y) dX, where D(x, y) is the value of the audio feature point in the two-dimensional coordinate system, P(X|x) is the vibration amplitude probability value, and P(X|y) is the frequency probability value.
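  • A minimal numerical reading of the formula above, assuming the two conditional probabilities are modelled as Gaussian densities over a shared variable X; the densities and their parameters are illustrative, not outputs of the trained hidden Markov model.

```python
import numpy as np

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# X is the shared variable; integrate the product of the two conditionals.
X = np.linspace(-10.0, 10.0, 2001)
p_amp = gaussian(X, mean=1.0, std=2.0)    # P(X|x): vibration-amplitude probability
p_freq = gaussian(X, mean=0.5, std=1.5)   # P(X|y): frequency probability
D_xy = np.trapz(p_amp * p_freq, X)        # D(x, y) = ∫ P(X|x) · P(X|y) dX
print(D_xy)
```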
  • the 20 mouth key points are normalized so that they are not affected by image size, face position, face rotation, or face size. Normalization is very important in this process because it makes the generated key points compatible with any video. PCA is then used to reduce the dimensionality of the normalized mouth key points from 20 × 2 = 40 dimensions to 8 dimensions.
  • a bilinear interpolation method is used to expand the mouth standard point data after PCA; then, according to the mouth opening and closing amplitude and frequency corresponding to each audio feature point, the movement trajectory of each mouth standard point is determined, and after summarizing the movement trajectories of all mouth standard points, the movement trajectory of the virtual character's mouth is obtained.
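  • A sketch of the normalization and PCA step, assuming the 20 mouth key points per frame come from a landmark detector such as dlib; the random array below is only a stand-in for real landmarks.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for dlib mouth landmarks: n_frames x 20 key points x (x, y).
rng = np.random.default_rng(0)
mouth_points = rng.normal(size=(200, 20, 2))

# Normalize each frame: remove the mouth centroid and scale by its spread,
# so the key points are independent of face position and size.
centered = mouth_points - mouth_points.mean(axis=1, keepdims=True)
scale = np.linalg.norm(centered, axis=(1, 2), keepdims=True)
normalized = centered / scale

# Flatten 20 x 2 = 40 dimensions per frame and reduce to 8 with PCA.
flat = normalized.reshape(len(normalized), 40)
reduced = PCA(n_components=8).fit_transform(flat)
print(reduced.shape)  # (200, 8)
```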
  • a two-dimensional picture of the virtual character to be generated is obtained, a two-dimensional coordinate system is established, the contours of the mouth, nose, and eyes in the two-dimensional picture are obtained from the two-dimensional coordinate system, and the coordinates of the key facial feature points, such as the nose tip and the mouth corners, are extracted from these contours.
  • according to the coordinates of these key points, the head posture of the virtual character is determined, and the correctness of the head posture is evaluated by the least squares method.
  • the least squares estimation formula uses the following quantities: c denotes the correctness estimate, n the number of feature points, pi the occurrence probability of a feature point, s the rotation parameter, R the translation parameter, t the scaling parameter, and V the distance from a feature point to the origin.
  • the Canny algorithm can be used to detect the edge of the mouth in the three-dimensional facial image.
  • the image processed by the Canny algorithm is usually a grayscale image, so if the camera acquires a color image, it must first be grayscaled.
  • to grayscale a color image, a weighted average is taken over the sampled values of its channels; a commonly used weighting is Gray = 0.299R + 0.587G + 0.114B.
  • Gaussian filtering is then applied to the gray-scale image, and finite differences of the first-order partial derivatives are used to compute the magnitude and direction of the gradient. Operators such as the Roberts operator can be used for this computation; after non-maximum suppression of the gradient magnitude, the contour of the mouth edge is obtained. Once the mouth movement trajectory is marked in sequence on this contour, multiple frames of continuous dynamic facial images can be generated.
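  • A sketch of this edge-detection step using OpenCV, assuming a BGR face render as input; the blur kernel and Canny thresholds are illustrative values, not ones fixed by this application.

```python
import cv2
import numpy as np

def mouth_edge_contour(bgr_image):
    # Grayscale via the weighted average Gray = 0.299R + 0.587G + 0.114B
    # (cv2.cvtColor applies the same standard weighting).
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Gaussian filtering to suppress noise before gradient computation.
    blurred = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.4)
    # Canny computes gradients, applies non-maximum suppression and hysteresis.
    edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
    return edges

frame = np.zeros((256, 256, 3), dtype=np.uint8)
cv2.circle(frame, (128, 180), 40, (200, 200, 200), thickness=-1)  # stand-in mouth region
print(mouth_edge_contour(frame).shape)
```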
  • the dynamic facial images are played at the preset playback speed, and the total playback duration, the initial playback node, and the end playback node of the complete sequence of dynamic facial images are recorded; the segment of the audio generated in step S1 to be played is then determined from the playback duration and the positions of the initial and end playback nodes.
  • the virtual character video is obtained by using a video encoder to synthesize the audio segment to be played with the corresponding dynamic facial images.
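  • One common way to perform this kind of synchronized audio-video synthesis is to mux the rendered frames with the selected audio segment using a standard encoder such as ffmpeg; the file names and frame rate below are assumptions.

```python
import subprocess

# Assumed inputs: frames rendered as face_0001.png, face_0002.png, ... at 25 fps,
# plus the audio segment selected from the generated audio.
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "25", "-i", "face_%04d.png",   # dynamic facial frames
    "-i", "audio_segment.wav",                   # real-time audio for those frames
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "virtual_character.mp4",
], check=True)
```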
  • in this embodiment, the characters in the text are effectively converted into audio, and the facial features corresponding to the audio are then reconstructed on the three-dimensional facial image through adversarial neural network and memory neural network techniques.
  • As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
  • FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application.
  • in S1, the text to be recognized is obtained and imported into the text-to-speech conversion model for voice conversion to obtain the audio, which includes:
  • the text to be recognized is usually a streaming data text
  • in a streaming data text, the separator can be a punctuation mark such as "." or ",", or a number such as "1" or "2".
  • equal length division or unequal length division can be used.
  • each character in the sub-text is encoded by word2vec.
  • a multi-dimensional word vector corresponding to each character can be generated.
  • the multi-dimensional word vector corresponding to each character can be marked according to the character's position in the sub-text; for example, if the word vector of the first character is [1,2,5], the marked word vector is [1,1,2,5].
  • after marking, the position of each multi-dimensional word vector can be determined, which prevents character positions from changing during speech conversion and causing the generated audio to be inconsistent with the original text.
  • the dimensionality of the multi-dimensional word vectors can be reduced with PCA, or with a vector projection method: the n-dimensional vector is projected into an (n-1)-dimensional space, the (n-1)-dimensional vector is then projected into an (n-2)-dimensional space, and so on until a two-dimensional plane is reached, yielding the two-dimensional word vector.
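  • A small sketch of the position-marking and dimensionality-reduction steps described above; the toy character vectors stand in for real word2vec embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy multi-dimensional "word2vec" vectors for the characters of one sub-text.
char_vectors = [
    [1, 2, 5],
    [3, 0, 4],
    [2, 2, 1],
]

# Mark each vector with its (1-based) position in the sub-text,
# e.g. [1, 2, 5] for the first character becomes [1, 1, 2, 5].
marked = np.array([[i + 1] + vec for i, vec in enumerate(char_vectors)], dtype=float)

# Reduce the marked multi-dimensional vectors to two dimensions with PCA.
two_dim = PCA(n_components=2).fit_transform(marked)
print(marked)
print(two_dim)
```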
  • the text-to-speech conversion model can use a bidirectional recurrent neural network model.
  • the bidirectional recurrent neural network model is applied in situations where the learning objective depends on the complete input sequence. For example, in speech recognition, the vocabulary corresponding to the current speech may correspond to vocabulary that appears later, so the complete speech is required as input.
  • the text-to-speech conversion model is used to accurately convert the input text into the corresponding audio without missing characters.
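  • A minimal sketch of a bidirectional recurrent layer of the kind mentioned above, written with PyTorch; the sizes are arbitrary and the layer is only a stand-in for the full text-to-speech conversion model.

```python
import torch
import torch.nn as nn

# Two-dimensional word vectors in, a hidden state per character out.
birnn = nn.LSTM(input_size=2, hidden_size=16, num_layers=1,
                bidirectional=True, batch_first=True)

batch = torch.randn(4, 12, 2)          # 4 sub-texts, 12 characters, 2-dim vectors
outputs, (h_n, c_n) = birnn(batch)
print(outputs.shape)                    # (4, 12, 32): forward and backward states
```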
  • Figure 3 is a schematic diagram of the audio feature point extraction process in a method for generating virtual characters based on neural networks in an embodiment of this application.
  • in S2, the prosody parameters of the audio are extracted and imported into the preset audio generation model for audio feature point extraction, which includes:
  • the level language parameters are divided into low-level language parameters and high-level language parameters, and the first prosody parameters are the prosody parameters of the audio before any processing.
  • T = argmax P(q, A|L), where T is the prosody code, P is the prosody state, q is the pitch, A is the basic rhythm feature parameter, L is the level language parameter, and argmax is the argument-of-the-maximum function.
  • the second prosody parameters are obtained by adding the encoded stream and the level language parameters.
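  • Read literally, the encoding step selects the prosody configuration that maximizes P(q, A|L); the sketch below performs this argmax over a small discrete grid with an invented probability table, purely for illustration.

```python
import numpy as np

pitches = np.array([80.0, 120.0, 200.0])        # candidate q values (Hz), assumed
rhythm_feats = np.array([0.5, 1.0, 1.5])        # candidate A values, assumed

# Hypothetical P(q, A | L) for one fixed level-language parameter L:
# rows index pitch candidates, columns index rhythm-feature candidates.
prob = np.array([
    [0.05, 0.10, 0.02],
    [0.20, 0.30, 0.08],
    [0.10, 0.10, 0.05],
])

qi, ai = np.unravel_index(np.argmax(prob), prob.shape)
print("T encodes q =", pitches[qi], "and A =", rhythm_feats[ai])
```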
  • in this embodiment, audio feature points are extracted from the generated audio through the prosody parameters, which shortens the time needed to generate the virtual character video; that is, the required virtual character video can be generated in a short time.
  • Figure 3 is a schematic diagram of the mouth trajectory generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application.
  • in S3, the trajectory of the virtual character's mouth movement is generated according to the audio feature points, which includes:
  • the preset mouth feature extraction algorithm is the dlib algorithm.
  • the mouth feature points are normalized by region clustering: since the mouth is a symmetrical structure, it is divided into four areas (upper left, lower left, upper right, and lower right), the straight lines dividing the areas are used as coordinate axes to establish a coordinate system, and the key points in any one of the four areas are clustered and enhanced.
  • the specific enhancement method is to calculate the distance between two key points; if the distance is less than a preset threshold, the midpoint of the line segment connecting the two key points is taken as an enhancement key point. This reduces the number of key points that need to be calculated.
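  • A sketch of this region-based enhancement, assuming normalized mouth key points centered at the origin; the distance threshold is illustrative.

```python
import numpy as np
from itertools import combinations

def enhance_keypoints(points, threshold=0.1):
    """Split mouth key points into four quadrants around the origin and, within
    each quadrant, add the midpoint of every pair closer than the threshold."""
    quadrant = (points[:, 0] >= 0).astype(int) * 2 + (points[:, 1] >= 0).astype(int)
    enhanced = []
    for q in range(4):
        pts = points[quadrant == q]
        for a, b in combinations(range(len(pts)), 2):
            if np.linalg.norm(pts[a] - pts[b]) < threshold:
                enhanced.append((pts[a] + pts[b]) / 2.0)
    return np.array(enhanced) if enhanced else np.empty((0, 2))

rng = np.random.default_rng(1)
mouth = rng.uniform(-0.5, 0.5, size=(20, 2))    # stand-in for normalized key points
print(enhance_keypoints(mouth, threshold=0.15).shape)
```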
  • for example, if the enhancement key points of character A are the left mouth corner and the midpoint of the upper lip, the playback frequency of character A is 30 kbps and the mouth movement amplitude during playback is 0.8 mm; if the enhancement key points of character B are the right mouth corner and the midpoint of the lower lip, the playback frequency of character B is 35 kbps and the mouth movement amplitude during playback is 0.7 mm.
  • the playback frequencies and mouth movement amplitudes of characters A and B are obtained from the existing correspondence between mouth movement and audio playback in the database; after fitting the playback frequency and the mouth movement amplitude, the trajectory of the virtual character's mouth movement is obtained.
  • in S4, a two-dimensional picture of the virtual character is acquired and imported into the facial feature generation model for processing to generate the three-dimensional facial image of the virtual character, which includes:
  • when calculating the gradient of the two-dimensional picture, the picture can be divided into multiple sub-blocks of equal size; the pixel values of each sub-block are then extracted and a binarized pixel-value matrix is built for the sub-block, and the gradient value of each sub-block is obtained from the number of times "1" and "0" alternate in each row or column of that matrix.
  • after summarizing the gradient values of all sub-blocks, the depth information of the different facial regions in the 3D facial image is obtained: an area with a large gradient value has a large 3D depth, and an area with a small gradient value has a small 3D depth.
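  • A sketch of this block-wise gradient measure; the block size and binarization threshold are assumptions, since they are not fixed here.

```python
import numpy as np

def block_gradients(gray, block=8, threshold=128):
    """Binarize each equal-sized sub-block and count 0/1 alternations
    along its rows and columns as a crude gradient value."""
    h, w = gray.shape
    grads = np.zeros((h // block, w // block))
    for bi in range(h // block):
        for bj in range(w // block):
            sub = gray[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block]
            binary = (sub >= threshold).astype(int)
            row_changes = np.abs(np.diff(binary, axis=1)).sum()
            col_changes = np.abs(np.diff(binary, axis=0)).sum()
            grads[bi, bj] = row_changes + col_changes
    return grads   # larger values -> larger assumed 3D depth

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(64, 64))
print(block_gradients(image).shape)   # (8, 8)
```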
  • the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  • for example, if the distance between the mouth corners in the preset standard three-dimensional facial image is 10 and the calculated distance is 9, the distance between the mouth corners is adjusted according to the distance value 9.
  • in this embodiment, the two-dimensional picture is converted into a three-dimensional facial image through depth processing, which effectively guarantees the authenticity of the virtual character's face, so that it can pass for a real one.
  • in S5, the mouth movement trajectory is imported into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images, which includes:
  • the three-dimensional facial image still retains the facial key points of the two-dimensional picture, but those key points have been given three-dimensional coordinates.
  • the facial key points in the 3D facial image change with the mouth movement trajectory; for example, if the mouth movement trajectory is an open mouth, the key point in the middle of the mouth in the 3D facial image is shifted upward by 5 mm, along with the key points at the mouth corners, and the facial-feature key points whose positions have changed are taken as the change features.
  • in the formula, L denotes the adversarial loss, E() the expectation, G() the adversarial generation model, D() the adversarial discrimination model, S the change feature, and T the reconstructed image.
  • the multi-frame continuous dynamic facial images are generated.
  • mouth images can be generated through the adversarial neural network model; since these mouth images are generated sequentially according to the changes in the mouth movement trajectory, sorting the reconstructed mouth images in order and playing them continuously yields the dynamic facial images.
  • the adversarial neural network is used to generate dynamic facial images, thereby ensuring the synchronization of audio and facial images.
  • the method further includes:
  • the prosody parameters corresponding to the audio generated in step S1 include pitch, duration, and so on.
  • if the pitch is greater than a preset pitch threshold, it is marked as a high pitch, and the positions of all high pitches in the audio are recorded; these positions are the positions of the key audio frames in the virtual character video. In other words, if the third second of the audio is a high pitch, then the third second of the virtual character video is a key audio frame.
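  • A simple reading of this key-frame rule, assuming a per-second pitch track is available; the pitch values and threshold below are invented for illustration.

```python
import numpy as np

pitch_per_second = np.array([180.0, 210.0, 320.0, 190.0, 350.0, 200.0])  # Hz, assumed
pitch_threshold = 300.0                                                   # assumed

# Seconds whose pitch exceeds the threshold are marked as key audio frames.
key_frame_seconds = np.flatnonzero(pitch_per_second > pitch_threshold)
print(key_frame_seconds)   # e.g. [2 4]: the 3rd and 5th seconds are key frames
```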
  • the long short-term memory neural network is a time-recurrent neural network specifically designed to solve long-term dependency problems.
  • All RNNs have a chain form of repeating neural network modules.
  • the frequency spectrum corresponding to the audio can be input into the long short-term memory neural network for memory storage in advance, and the audio can then be effectively identified from the video stream at the key frames.
  • the mouth state corresponding to the audio signal is obtained and compared with the mouth image; if they are synchronized, the virtual character video is sent to the client, otherwise the audio-video synthesis and encoding are restarted until the state of the virtual character's mouth in the virtual character video is synchronized with the mouth image.
  • a device for generating a virtual character video based on a neural network is proposed, as shown in FIG. 4, which includes the following modules:
  • a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
  • a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
  • a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
  • in one embodiment, a computer device includes a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor executes the steps of the method for generating a virtual character video based on a neural network in the foregoing embodiments.
  • a storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments.
  • the storage medium may be non-volatile or volatile.
  • the program can be stored in a computer-readable storage medium, and the storage medium can include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and particularly relates to a method for generating a virtual character video on the basis of a neural network, and a related device. The method comprises: acquiring text to be identified, and obtaining audio after the text is imported into a text-to-speech conversion model and is subjected to sound conversion; extracting a rhythm parameter of the audio, and extracting an audio feature point; generating a mouth motion trajectory of a virtual character; obtaining a two-dimensional picture of the virtual character, and generating a three-dimensional facial picture of the virtual character after processing the two-dimensional picture; importing the mouth motion trajectory into the three-dimensional facial picture to generate a dynamic facial image; and acquiring real-time audio corresponding to each frame of dynamic facial image, and synchronously performing audio and video synthesis encoding on the dynamic facial image and the real-time audio to obtain a virtual character video. According to the present application, the aim of obtaining a desired video display effect as long as text is input is achieved, such that it is ensured that the sound of a virtual character and the mouth action of the virtual character are kept completely consistent.

Description

Method and related equipment for generating a virtual character video based on a neural network
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 18, 2019, with application number 201910990742.6 and the invention title "Method and Related Equipment for Generating Virtual Character Video Based on Neural Network", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a method and related equipment for generating a virtual character video based on a neural network.
Background
A virtual character is a character that does not exist in reality; it is a fictional character appearing in creative works such as TV series, comics, and games. Virtual characters are usually synthesized with 3D scanning and similar methods, generating the required character by setting facial parameters.
However, the inventor realized that, when a virtual character is generated, its voice cannot be kept completely consistent with its mouth movements, which results in poor fidelity and makes it impossible to achieve a playback effect realistic enough to pass for real.
Summary of the Invention
On this basis, in view of the problem that the voice of the virtual character cannot be kept completely consistent with its mouth movements when the virtual character is generated, a method and related equipment for generating a virtual character video based on a neural network are provided.
A method for generating a virtual character video based on a neural network includes the following steps:
acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
A device for generating a virtual character video based on a neural network includes the following modules:
a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Further, to achieve the above object, the present application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the processor executes the steps of the following method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Further, to achieve the above object, the present application also provides a storage medium storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Compared with existing mechanisms, this application effectively converts the characters in the text into audio, and then reconstructs the facial features corresponding to the audio onto the three-dimensional facial image through adversarial neural network and memory neural network techniques. As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
Description of the Drawings
By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only intended to illustrate the preferred embodiments and are not to be considered a limitation of this application.
FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application;
FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application;
FIG. 3 is a schematic diagram of the audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of the present application;
FIG. 4 is a structural diagram of a device for generating a virtual character video based on a neural network in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
Those skilled in the art can understand that, unless specifically stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the term "comprising" used in the specification of this application refers to the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application. A method for generating a virtual character video based on a neural network includes the following steps:
S1. Obtain the text to be recognized, and import the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio.
Specifically, the text to be recognized may be in a language such as Chinese, English, or Japanese. For the text to be recognized, the positions of the separators in the text, such as "," and "。", can be determined first. According to the positions of these separators, the text to be recognized is divided into several sub-texts, and each sub-text is imported into the text-to-speech conversion model for text-to-sound conversion.
The text-to-speech conversion model can be composed of the Char2Wav architecture. In the Char2Wav architecture, a simple recurrent neural network and a cross recurrent network are used to convert the words in the sub-texts into speech.
When performing voice conversion on a sub-text, the words in the sub-text can be converted into multi-dimensional word vectors, and the feature values and dimensions of the multi-dimensional word vectors are then used as input parameters for training and conversion in the simple recurrent neural network and the cross recurrent neural network.
S2. Extract the prosody parameters of the audio, and import the prosody parameters into the audio generation model for audio feature point extraction.
Specifically, the prosody parameters of the audio include pitch, duration, pause frequency, and so on. When extracting the prosody parameters, the audio generation model may adopt a hidden Markov model.
The frequency range value and the vibration amplitude value of the audio spectrum are obtained, and the frequency range value and the vibration amplitude value are input into the hidden Markov model to extract audio feature points. The formula for audio feature point extraction is:
D(x, y) = ∫ P(X|x) · P(X|y) dX, where D(x, y) is the value of the audio feature point in the two-dimensional coordinate system, P(X|x) is the vibration amplitude probability value, and P(X|y) is the frequency probability value.
S3. Generate the trajectory of the virtual character's mouth movement according to the audio feature points.
Specifically, the 20 mouth key points extracted by the dlib algorithm are normalized so that they are not affected by image size, face position, face rotation, or face size. Normalization is very important in this process because it makes the generated key points compatible with any video. PCA is then used to reduce the dimensionality of the normalized mouth key points from 20 × 2 = 40 dimensions to 8 dimensions. A bilinear interpolation method is used to expand the mouth standard point data after PCA; then, according to the mouth opening and closing amplitude and frequency corresponding to each audio feature point, the movement trajectory of each mouth standard point is determined, and after summarizing the movement trajectories of all mouth standard points, the movement trajectory of the virtual character's mouth is obtained.
S4. Obtain a two-dimensional picture of a preset virtual character, and import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character.
Specifically, a two-dimensional picture of the virtual character to be generated is obtained, a two-dimensional coordinate system is established, the contours of the mouth, nose, and eyes in the two-dimensional picture are obtained from the two-dimensional coordinate system, and the coordinates of the key facial feature points, such as the nose tip and the mouth corners, are extracted from these contours. According to the coordinates of these key points, the head posture of the virtual character is determined, and the correctness of the head posture is evaluated by the least squares method. The least squares estimation formula is:
[Formula image: PCTCN2020118373-appb-000001]
In the formula, c denotes the correctness estimate, n the number of feature points, pi the occurrence probability of a feature point, s the rotation parameter, R the translation parameter, t the scaling parameter, and V the distance from a feature point to the origin.
S5. Import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images.
Specifically, when importing the mouth movement trajectory into the three-dimensional facial image, the Canny algorithm can be used to detect the mouth edge in the three-dimensional facial image. The image processed by the Canny algorithm is usually a grayscale image, so if the camera acquires a color image, it must first be grayscaled. To grayscale a color image, a weighted average is taken over the sampled values of its channels; a commonly used method is Gray = 0.299R + 0.587G + 0.114B. Gaussian filtering is then applied to the gray-scale image, and finite differences of the first-order partial derivatives are used to compute the magnitude and direction of the gradient. Operators such as the Roberts operator can be used for this computation; after non-maximum suppression of the gradient magnitude, the contour of the mouth edge is obtained. Once the mouth movement trajectory is marked in sequence on this contour, multiple frames of continuous dynamic facial images can be generated.
S6. Obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Specifically, the dynamic facial images are played at the preset playback speed, and the total playback duration, the initial playback node, and the end playback node of the complete sequence of dynamic facial images are recorded; the segment of the audio generated in step S1 to be played is then determined from the playback duration and the positions of the initial and end playback nodes. Finally, the virtual character video is obtained by using a video encoder to synthesize the audio segment to be played with the corresponding dynamic facial images.
In this embodiment, the characters in the text are effectively converted into audio, and the facial features corresponding to the audio are then reconstructed on the three-dimensional facial image through adversarial neural network and memory neural network techniques. As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application. As shown in the figure, S1, obtaining the text to be recognized and importing it into the text-to-speech conversion model for voice conversion to obtain audio, includes:
S11. Obtain the text to be recognized, extract the separators in the text to be recognized, and divide the text to be recognized into several sub-texts according to the separators.
Specifically, the text to be recognized is usually a streaming data text. In a streaming data text, the separator can be a punctuation mark such as "。" or ",", or a number such as "1" or "2". When dividing the text to be recognized, equal-length or unequal-length division can be used.
S12. Perform word vector encoding on the sub-texts to obtain several multi-dimensional word vectors.
Each character in the sub-text is encoded with word2vec; after word vector encoding, a multi-dimensional word vector corresponding to each character is generated. The multi-dimensional word vector corresponding to each character can be marked according to the character's position in the sub-text; for example, if the word vector of the first character is [1,2,5], the marked word vector is [1,1,2,5]. After marking, the position of each multi-dimensional word vector can be determined, which prevents character positions from changing during speech conversion and causing the generated audio to be inconsistent with the original text.
S13. Reduce the dimensionality of the multi-dimensional word vectors to obtain two-dimensional word vectors.
The dimensionality of the multi-dimensional word vectors can be reduced with PCA, or with a vector projection method: the n-dimensional vector is projected into an (n-1)-dimensional space, the (n-1)-dimensional vector is then projected into an (n-2)-dimensional space, and so on until a two-dimensional plane is reached, yielding the two-dimensional word vectors.
S14. Calculate the feature values of the two-dimensional word vectors, use the feature values of the two-dimensional word vectors as weights, and import the two-dimensional word vectors and the weights into the text-to-speech conversion model for text-to-speech conversion to obtain the audio.
The text-to-speech conversion model can use a bidirectional recurrent neural network model. The bidirectional recurrent neural network model is applied in situations where the learning objective depends on the complete input sequence. For example, in speech recognition, the vocabulary corresponding to the current speech may correspond to vocabulary that appears later, so the complete speech is required as input.
In this embodiment, the text-to-speech conversion model is used to accurately convert the input text into the corresponding audio without missing characters.
FIG. 3 is a schematic diagram of the audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of this application. As shown in the figure, S2, extracting the prosody parameters of the audio and importing the prosody parameters into the preset audio generation model for audio feature point extraction, includes:
S21. Extract the first prosody parameters and the level language parameters of the audio, and generate prosody marks according to the duration, pitch, and pause timing in the first prosody parameters.
The level language parameters are divided into low-level language parameters and high-level language parameters, and the first prosody parameters are the prosody parameters of the audio before any processing.
S22. Encode the prosody marks to generate an encoded stream.
When encoding according to the prosody parameters, the formula used is:
T = argmax P(q, A|L), where T is the prosody code, P is the prosody state, q is the pitch, A is the basic rhythm feature parameter, L is the level language parameter, and argmax is the argument-of-the-maximum function.
S23. Generate second prosody parameters according to the encoded stream and the level language parameters.
Specifically, the second prosody parameters are obtained by adding the encoded stream and the level language parameters.
S24. Import the second prosody parameters into the audio generation model to extract the audio feature points in the second prosody parameters.
In this embodiment, audio feature points are extracted from the generated audio through the prosody parameters, which shortens the time needed to generate the virtual character video; that is, the required virtual character video can be generated in a short time.
图3为本申请在一个实施例中的一种基于神经网络生成虚拟人物视频的方法中的嘴部轨迹生成过程示意图,如图所示,所述S3、根据所述音频特征点,生成虚拟人物嘴部运动轨迹,包括:Figure 3 is a schematic diagram of the mouth trajectory generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application. As shown in the figure, the S3 generates a virtual character according to the audio feature points Mouth movement trajectory, including:
S31、获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;S31. Acquire a preset virtual character image, and extract key points of the mouth from the virtual character image according to a preset mouth key point extraction algorithm;
其中,预设的嘴部特征提取算法为dlib算法。Among them, the preset mouth feature extraction algorithm is the dlib algorithm.
S32、对所述嘴部关键点进行归一化处理,得到增强关键点;S32. Perform normalization processing on the key points of the mouth to obtain enhanced key points;
具体的,在归一化嘴部特征点采用区域聚类的方式,由于嘴部是对称结构,因此将嘴部划分为左上、左下、右上和右下四个区域,以划分区域的直线作为坐标轴,建立一坐标系,对坐标中四个区域中的任意一个区域中的关键点进行聚类增强。具体增强的方式是,计算两个关键点之间的距离,若距离小于预设阈值,则以这连个关键点连线的线段中点作为一个增强关键点。这样可以减少需要进行计算的关键点数量。Specifically, the normalized mouth feature point adopts the method of region clustering. Since the mouth is a symmetrical structure, the mouth is divided into four areas: upper left, lower left, upper right, and lower right, and the straight line dividing the area is used as the coordinates. Axis, a coordinate system is established, and the key points in any one of the four areas in the coordinates are clustered and enhanced. The specific enhancement method is to calculate the distance between two key points, and if the distance is less than a preset threshold, the midpoint of the line segment connecting the two key points is used as an enhancement key point. This can reduce the number of key points that need to be calculated.
S33. Obtain the playback frequency of the audio and the mouth movement amplitude during playback according to the enhanced key points, and fit the playback frequency and the mouth movement amplitude to obtain the mouth movement trajectory of the virtual character.
Specifically, the frequency and amplitude are determined by the positions of the enhanced key points on the mouth. For example, if the enhanced key points of character A are the left mouth corner and the midpoint of the upper lip, the playback frequency of character A is 30 kbps and the mouth movement amplitude during playback is 0.8 mm; if the enhanced key points of character B are the right mouth corner and the midpoint of the lower lip, the playback frequency of character B is 35 kbps and the mouth movement amplitude during playback is 0.7 mm. The playback frequencies and mouth movement amplitudes of characters A and B are obtained from the existing correspondence between mouth movements and audio playback in the database. After fitting the playback frequency and the mouth movement amplitude, the mouth movement trajectory of the virtual character is obtained.
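A sketch of the lookup-and-fit step under strong assumptions: the lookup table, the sampling rate, and the sinusoidal opening curve below are illustrative only; the embodiment states merely that frequency and amplitude come from a database of mouth-movement/audio correspondences and are then fitted into a trajectory.

```python
import numpy as np

# Hypothetical lookup table: enhanced-key-point signature -> (playback kbps, amplitude mm).
MOUTH_DB = {
    ("left_corner", "upper_lip_mid"): (30, 0.8),
    ("right_corner", "lower_lip_mid"): (35, 0.7),
}

def mouth_trajectory(signature, duration_s=1.0, fps=25):
    """Fit a simple opening/closing curve from the looked-up frequency and amplitude."""
    kbps, amplitude = MOUTH_DB[signature]
    t = np.linspace(0.0, duration_s, int(duration_s * fps))
    # Assumed model: mouth opening oscillates with an amplitude scaled by the bit rate.
    opening = amplitude * np.abs(np.sin(2 * np.pi * (kbps / 10.0) * t))
    return list(zip(t, opening))
```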
本实施例,通过对嘴部关键点进行有效提取,从而保证了音频和虚拟人物画面中嘴部运动结合时的准确性。In this embodiment, by effectively extracting the key points of the mouth, the accuracy of the combination of the audio and the mouth movement in the virtual character picture is ensured.
在一个实施例中,所述S4、获取虚拟人物的二维图片,将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图,包括:In one embodiment, the step S4, acquiring a two-dimensional picture of a virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character includes:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
Here, when computing the gradient of the two-dimensional picture, the picture can be divided into several sub-blocks of equal size, the pixel values of each sub-block can be extracted, and a binarized pixel-value matrix can be built for each sub-block. The gradient value of each sub-block is obtained from the number of times "1" and "0" alternate in each row or column of the binarized pixel-value matrix. After the gradient values of all sub-blocks are aggregated, the depth information of the different facial regions in the three-dimensional facial image is obtained: regions with large gradient values have a large depth in the three-dimensional map, and regions with small gradient values have a small depth.
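A minimal sketch of the sub-block gradient computation: the binarized picture is split into equal blocks, the 0/1 alternations along each block's rows and columns are counted, and the count is used as the block's gradient (larger values meaning greater depth). The block size is an assumption.

```python
import numpy as np

def block_depth_map(binary_img, block=8):
    """binary_img: 2-D array of 0/1 integer values. Returns a per-block 'gradient' map
    whose values count the 0/1 alternations in each block's rows and columns; larger
    values are treated as deeper regions in the three-dimensional face map."""
    binary_img = np.asarray(binary_img, dtype=np.int8)
    h, w = binary_img.shape
    rows, cols = h // block, w // block
    depth = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sub = binary_img[i * block:(i + 1) * block, j * block:(j + 1) * block]
            row_flips = np.abs(np.diff(sub, axis=1)).sum()   # alternations along rows
            col_flips = np.abs(np.diff(sub, axis=0)).sum()   # alternations along columns
            depth[i, j] = row_flips + col_flips
    return depth
```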
以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
Specifically, for example, if the distance between the mouth corners in the preset standard three-dimensional facial image is 10 while the calculated distance is 9, the mouth-corner distance is adjusted according to the distance value 9.
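A minimal sketch of the adjustment, assuming the standard three-dimensional face is given as an array of landmark positions and that a pair of landmarks is rescaled symmetrically about its midpoint to match the measured distance; the embodiment states only that positions are adjusted according to the computed distances.

```python
import numpy as np

def adjust_landmark_pair(std_face, idx_a, idx_b, measured_dist):
    """std_face: (N, 3) array of standard 3-D landmark positions.
    Rescales landmarks idx_a and idx_b about their midpoint so that their distance
    equals measured_dist (e.g. standard mouth-corner distance 10 -> measured 9)."""
    face = np.asarray(std_face, dtype=float).copy()
    a, b = face[idx_a], face[idx_b]
    mid = (a + b) / 2.0
    scale = measured_dist / np.linalg.norm(a - b)
    face[idx_a] = mid + (a - mid) * scale
    face[idx_b] = mid + (b - mid) * scale
    return face
```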
In this embodiment, the two-dimensional image is converted into a three-dimensional facial image through depth processing, which effectively guarantees the realism of the virtual character's face and achieves a lifelike effect.
在一个实施例中,所述S5、将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面,包括:In an embodiment, the S5. importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images includes:
将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Specifically, the three-dimensional facial image still retains the face key points of the two-dimensional picture; these key points are simply given three-dimensional positions. After the mouth movement trajectory is imported into the three-dimensional facial image, the face key points in the three-dimensional facial image change position along with the mouth movement trajectory. For example, if the mouth movement trajectory is an opening mouth, the key points in the middle of the mouth in the three-dimensional facial image shift upward by 5 mm and the mouth-corner key points shift accordingly; these facial-feature key points whose positions have changed are taken as the change features.
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
其中,在利用对抗神经网络模型(Edge-connect)进行嘴部图像重构时,可以采用下面公式减少对抗误差:Among them, when using the adversarial neural network model (Edge-connect) to reconstruct the mouth image, the following formula can be used to reduce the adversarial error:
[Adversarial-error formula, published as the image PCTCN2020118373-appb-000002.]
In the formula, L denotes the adversarial error, E() denotes expectation, G() denotes the adversarial-error generation model, D() denotes the adversarial-error model, S denotes the change features, and T denotes the reconstructed image.
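Because the adversarial-error formula is published only as an image, the sketch below uses the standard non-saturating GAN losses as an assumption consistent with the legend above, with G acting as the generation model fed with the change features S and D as the model judging reconstructed images T.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, S, T_real):
    """Standard GAN losses (an assumption; the application's exact formula is
    reproduced only as an image). S: change features, T_real: reference mouth images."""
    T_fake = G(S)

    # Discriminator: reference images toward 1, reconstructed images toward 0.
    real_logits = D(T_real)
    fake_logits = D(T_fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Generator: push reconstructions to be judged as real.
    gen_logits = D(T_fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```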
将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
Specifically, several mouth images can be generated by the adversarial neural network model. Since these mouth images are generated in sequence according to the changes of the mouth movement trajectory, sorting the reconstructed mouth images in order and playing them continuously yields the dynamic facial picture.
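A minimal sketch of assembling the time-ordered reconstructed mouth images into a continuous clip with OpenCV; the codec, frame rate, and the (timestamp, frame) input format are assumptions.

```python
import cv2

def frames_to_video(frames_with_time, out_path="face.mp4", fps=25):
    """frames_with_time: list of (timestamp, BGR uint8 ndarray) pairs produced by the
    mouth-image reconstruction; they are sorted by generation time and written out."""
    frames = [f for _, f in sorted(frames_with_time, key=lambda p: p[0])]
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
```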
本实施例,利用对抗神经网络生成动态人脸面部画面,从而保证了音频与人脸图像的同步性。In this embodiment, the adversarial neural network is used to generate dynamic facial images, thereby ensuring the synchronization of audio and facial images.
In one embodiment, after S6, acquiring the real-time audio corresponding to each frame of the dynamic facial picture, playing the dynamic facial picture and the real-time audio synchronously, and obtaining the virtual character video, the method further includes:
定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
Specifically, the prosody parameters corresponding to the audio generated in step S1 include pitch, duration, and so on. When the pitch is greater than a preset pitch threshold, it is marked as a high pitch, and the positions of all high pitches in the audio are recorded; these positions are the positions of the key audio frames in the virtual character video. That is, if the third second of the audio is a high pitch, then the position at the third second of the virtual character video is a key audio frame.
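A minimal sketch of locating key audio frames, assuming a per-frame pitch contour is already available (for example from the prosody parameters of step S1) and using an illustrative threshold.

```python
def key_audio_frames(pitch_contour, frame_duration_s, pitch_threshold=300.0):
    """pitch_contour: list of per-frame pitch values (Hz). Returns the timestamps
    (seconds) of frames whose pitch exceeds the threshold, i.e. the key audio frames."""
    return [i * frame_duration_s
            for i, pitch in enumerate(pitch_contour)
            if pitch > pitch_threshold]

# e.g. a high pitch around second 3 marks second 3 of the video as a key audio frame:
# key_audio_frames(pitch_contour, frame_duration_s=0.02)
```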
根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
Feeding the spectral features of the audio signal into a pre-stored long short-term memory (LSTM) network model for speech recognition;
Here, the long short-term memory network is a kind of recurrent neural network designed specifically to handle long-term dependency problems; all RNNs have the form of a chain of repeating neural network modules. The spectrum corresponding to the audio can be input into the LSTM network in advance for memory storage, and the audio can then be effectively recognized from the video stream at the key frames.
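A minimal sketch of an LSTM classifier over spectral features in PyTorch; the architecture, feature dimension, and output classes are assumptions, since the embodiment does not specify the structure of the pre-stored model.

```python
import torch
import torch.nn as nn

class SpeechLSTM(nn.Module):
    """Maps a sequence of spectral feature vectors to per-utterance class logits
    (e.g. mouth-state labels used for the synchronization check)."""
    def __init__(self, feat_dim=40, hidden=128, num_classes=40):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, spectra):            # spectra: (batch, time, feat_dim)
        out, _ = self.lstm(spectra)
        return self.head(out[:, -1, :])    # logits from the last time step

# Illustrative usage on a dummy batch of 100-frame spectrograms.
model = SpeechLSTM()
logits = model(torch.randn(2, 100, 40))
```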
According to the speech recognition result, obtain the mouth state corresponding to the audio signal and compare the mouth state with the mouth image. If they are synchronized, send the virtual character video to the client; otherwise, perform the audio-video synthesis and encoding again until the mouth state of the virtual character in the virtual character video is synchronized with the mouth image.
In this embodiment, the key frames of the virtual character video are effectively analyzed, thereby verifying the synchronization between the sound and the picture of the virtual character video.
在一个实施例中,提出了一种基于神经网络生成虚拟人物视频的装置,如图4所示,包括如下模块:In one embodiment, a device for generating a virtual character video based on a neural network is proposed, as shown in FIG. 4, which includes the following modules:
A trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the mouth movement trajectory of the virtual character according to the audio feature points;
A picture generation module, configured to obtain a two-dimensional picture of a preset virtual character, import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures;
A video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial picture, and perform synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video.
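A minimal sketch of how the three modules could be wired together; every model and helper referenced below is a placeholder assumption, not taken from the embodiment.

```python
class VirtualCharacterVideoDevice:
    """Wires the trajectory, picture, and video generation modules together (S1-S6)."""
    def __init__(self, tts_model, audio_model, face_model, renderer):
        self.tts_model = tts_model          # text-to-speech conversion model
        self.audio_model = audio_model      # audio generation model (feature points)
        self.face_model = face_model        # facial feature generation model
        self.renderer = renderer            # produces frames and muxes audio/video

    def generate(self, text, character_picture):
        audio = self.tts_model.synthesize(text)
        feature_points = self.audio_model.extract(audio)
        mouth_trajectory = self.renderer.mouth_trajectory(feature_points)
        face_3d = self.face_model.lift_to_3d(character_picture)
        frames = self.renderer.animate(face_3d, mouth_trajectory)
        return self.renderer.mux(frames, audio)   # the virtual character video
```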
In one embodiment, a computer device is provided. The computer device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments. The storage medium may be non-volatile or volatile.
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can include: Read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
The technical features of the above embodiments can be combined arbitrarily. For conciseness, not all possible combinations of these technical features have been described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
以上所述实施例仅表达了本申请一些示例性实施例,其中描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express some exemplary embodiments of the present application. The descriptions are more specific and detailed, but they should not be interpreted as a limitation on the patent scope of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims (20)

  1. 一种基于神经网络生成虚拟人物视频的方法,其中,包括:A method for generating a virtual character video based on a neural network, which includes:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
  2. 根据权利要求1所述的基于神经网络生成虚拟人物视频的方法,其中,所述获取待识别文本,将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频,包括:The method for generating a virtual character video based on a neural network according to claim 1, wherein said obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio includes:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After dimensionality reduction is performed on the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
  3. 根据权利要求1所述的基于神经网络生成虚拟人物视频的方法,其中,所述提取所述音频的韵律参数,将所述韵律参数导入到音频生成模型中进行音频特征点提取,包括:The method for generating a virtual character video based on a neural network according to claim 1, wherein the extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction comprises:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoding stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
  4. 根据权利要求3所述的基于神经网络生成虚拟人物视频的方法,其中,所述根据所述音频特征点,生成虚拟人物嘴部运动轨迹,包括:The method for generating a virtual character video based on a neural network according to claim 3, wherein the generating a trajectory of the virtual character's mouth according to the audio feature points comprises:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
5. The method for generating a virtual character video based on a neural network according to claim 1, wherein said obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character includes:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  6. 根据权利要求5所述的基于神经网络生成虚拟人物视频的方法,其中,所述将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面,包括:The method for generating a virtual character video based on a neural network according to claim 5, wherein said importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images comprises:
    将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
    将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
7. The method for generating a virtual character video based on a neural network according to any one of claims 1 to 6, wherein after said acquiring the real-time audio corresponding to each frame of the dynamic facial picture and performing synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video, the method further includes:
    定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
    根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
    将所述音频信号的谱特征入参到预存的长短期记忆网络模型中,进行语音识别;Incorporating the spectral characteristics of the audio signal into a pre-stored long and short-term memory network model to perform speech recognition;
    根据语音识别结果,得到所述音频信号对应的嘴部状态,将所述嘴部状态与所述嘴部图像进行比较,若同步,则发送所述虚拟人物视频至客户端,否则重新进行音视频合成编码,直到所述虚拟人物视频中虚拟人物的嘴部状态与所述嘴部图像同步。According to the voice recognition result, the mouth state corresponding to the audio signal is obtained, and the mouth state is compared with the mouth image. If synchronized, the virtual character video is sent to the client, otherwise the audio and video are restarted Synthesize encoding until the state of the mouth of the virtual character in the virtual character video is synchronized with the mouth image.
8. A computer device, including a memory and a processor, the memory storing computer-readable instructions, wherein, when the computer-readable instructions are executed by the processor, the processor is caused to perform the following steps of a method for generating a virtual character video based on a neural network:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
9. The computer device according to claim 8, wherein, when the program for generating a virtual character video based on a neural network is executed by the processor to perform the step of obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio, the following steps are included:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After reducing the dimensions of the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
10. The computer device according to claim 8, wherein, when the program of the computer device is executed by the processor to perform the step of extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction, the following steps are included:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoded stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
11. The computer device according to claim 10, wherein, when the program of the computer device is executed by the processor to perform the step of generating the mouth movement trajectory of the virtual character according to the audio feature points, the following steps are included:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
12. The computer device according to claim 8, wherein, when the program of the computer device is executed by the processor to perform the step of obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, the following steps are included:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
13. The computer device according to claim 12, wherein, when the program of the computer device is executed by the processor to perform the step of importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures, the following steps are included:
    将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
    将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
14. The computer device according to any one of claims 8 to 13, wherein, after the program of the computer device is executed by the processor to perform the step of acquiring the real-time audio corresponding to each frame of the dynamic facial picture and performing synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video, the following steps are further performed:
    定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
    根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
    将所述音频信号的谱特征入参到预存的长短期记忆网络模型中,进行语音识别;Incorporating the spectral characteristics of the audio signal into a pre-stored long and short-term memory network model to perform speech recognition;
    根据语音识别结果,得到所述音频信号对应的嘴部状态,将所述嘴部状态与所述嘴部图像进行比较,若同步,则发送所述虚拟人物视频至客户端,否则重新进行音视频合成编码,直到所述虚拟人物视频中虚拟人物的嘴部状态与所述嘴部图像同步。According to the voice recognition result, the mouth state corresponding to the audio signal is obtained, and the mouth state is compared with the mouth image. If synchronized, the virtual character video is sent to the client, otherwise the audio and video are restarted Synthesize encoding until the state of the mouth of the virtual character in the virtual character video is synchronized with the mouth image.
15. A storage medium storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps of a method for generating a virtual character video based on a neural network:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
16. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio, the following steps are included:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After dimensionality reduction is performed on the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
17. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction, the following steps are included:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoding stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
18. The storage medium storing computer-readable instructions according to claim 17, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of generating the mouth movement trajectory of the virtual character according to the audio feature points, the following steps are included:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
19. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, the following steps are included:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  20. 一种基于神经网络生成虚拟人物视频的装置,其中,包括以下模块:A device for generating a virtual character video based on a neural network, which includes the following modules:
A trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the mouth movement trajectory of the virtual character according to the audio feature points;
A picture generation module, configured to obtain a two-dimensional picture of a preset virtual character, import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures;
A video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial picture, and perform synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video.
PCT/CN2020/118373 2019-10-18 2020-09-28 Method for generating virtual character video on the basis of neural network, and related device WO2021073416A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910990742.6A CN110866968A (en) 2019-10-18 2019-10-18 Method for generating virtual character video based on neural network and related equipment
CN201910990742.6 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073416A1 true WO2021073416A1 (en) 2021-04-22

Family

ID=69652464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118373 WO2021073416A1 (en) 2019-10-18 2020-09-28 Method for generating virtual character video on the basis of neural network, and related device

Country Status (2)

Country Link
CN (1) CN110866968A (en)
WO (1) WO2021073416A1 (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN111369967B (en) * 2020-03-11 2021-03-05 北京字节跳动网络技术有限公司 Virtual character-based voice synthesis method, device, medium and equipment
CN113689879B (en) * 2020-05-18 2024-05-14 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual person in real time
CN113691833B (en) * 2020-05-18 2023-02-03 北京搜狗科技发展有限公司 Virtual anchor face changing method and device, electronic equipment and storage medium
CN113689880B (en) * 2020-05-18 2024-05-28 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual person in real time
CN113761988A (en) * 2020-06-05 2021-12-07 北京灵汐科技有限公司 Image processing method, image processing apparatus, storage medium, and electronic device
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112164128B (en) * 2020-09-07 2024-06-11 广州汽车集团股份有限公司 Vehicle-mounted multimedia music visual interaction method and computer equipment
CN112150638B (en) * 2020-09-14 2024-01-26 北京百度网讯科技有限公司 Virtual object image synthesis method, device, electronic equipment and storage medium
CN112383721B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating video
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112560622B (en) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 Virtual object action control method and device and electronic equipment
CN112669417B (en) * 2020-12-18 2024-04-16 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment
CN112770062B (en) * 2020-12-22 2024-03-08 北京奇艺世纪科技有限公司 Image generation method and device
CN112735371B (en) * 2020-12-28 2023-08-04 北京羽扇智信息科技有限公司 Method and device for generating speaker video based on text information
CN112785671B (en) * 2021-01-07 2024-05-17 中国科学技术大学 Virtual dummy face animation synthesis method
CN112785669B (en) * 2021-02-01 2024-04-23 北京字节跳动网络技术有限公司 Virtual image synthesis method, device, equipment and storage medium
CN112954235B (en) * 2021-02-04 2021-10-29 读书郎教育科技有限公司 Early education panel interaction method based on family interaction
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video synthesis method, system medium and application
CN113194348B (en) * 2021-04-22 2022-07-22 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113344770B (en) * 2021-04-30 2024-11-08 螳螂慧视科技有限公司 Virtual model, construction method thereof, interaction method and electronic equipment
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks
CN114040126B (en) * 2021-09-22 2022-09-09 西安深信科创信息技术有限公司 Character-driven character broadcasting video generation method and device
CN113891150B (en) * 2021-09-24 2024-10-11 北京搜狗科技发展有限公司 Video processing method, device and medium
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium
CN113987268A (en) * 2021-09-30 2022-01-28 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN113886644A (en) * 2021-09-30 2022-01-04 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 Virtual character driving method and device, electronic equipment and readable storage medium
CN114299204B (en) * 2021-12-22 2023-04-18 深圳市海清视讯科技有限公司 Three-dimensional cartoon character model generation method and device
CN114401431B (en) * 2022-01-19 2024-04-09 中国平安人寿保险股份有限公司 Virtual person explanation video generation method and related device
CN114913280A (en) * 2022-05-12 2022-08-16 杭州倒映有声科技有限公司 Voice-driven digital human broadcasting method capable of customizing content
CN115409920A (en) * 2022-08-30 2022-11-29 重庆爱车天下科技有限公司 Virtual object lip driving system
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
CN115426536B (en) * 2022-11-02 2023-01-20 北京优幕科技有限责任公司 Audio and video generation method and device
CN115511704B (en) * 2022-11-22 2023-03-10 成都新希望金融信息有限公司 Virtual customer service generation method and device, electronic equipment and storage medium
CN117292030B (en) * 2023-10-27 2024-10-29 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510437B (en) * 2018-04-04 2022-05-17 科大讯飞股份有限公司 Virtual image generation method, device, equipment and readable storage medium
CN110321789A (en) * 2019-05-21 2019-10-11 平安普惠企业管理有限公司 Method and relevant device based on living things feature recognition interview fraud

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1971621A (en) * 2006-11-10 2007-05-30 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
US20160098622A1 (en) * 2013-06-27 2016-04-07 Sitaram Ramachandrula Authenticating A User By Correlating Speech and Corresponding Lip Shape
US20190057533A1 (en) * 2017-08-16 2019-02-21 Td Ameritrade Ip Company, Inc. Real-Time Lip Synchronization Animation
CN109377539A (en) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243095A (en) * 2021-04-30 2022-10-25 百度在线网络技术(北京)有限公司 Method and device for pushing data to be broadcasted and method and device for broadcasting data
CN113329190B (en) * 2021-05-27 2022-04-08 深圳市点维文化传播有限公司 Animation design video production analysis management method, equipment, system and computer storage medium
CN113329190A (en) * 2021-05-27 2021-08-31 武汉连岳传媒有限公司 Animation design video production analysis management method, equipment, system and computer storage medium
CN113408449A (en) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN113538644A (en) * 2021-07-19 2021-10-22 北京百度网讯科技有限公司 Method and device for generating character dynamic video, electronic equipment and storage medium
CN113538644B (en) * 2021-07-19 2023-08-29 北京百度网讯科技有限公司 Character dynamic video generation method, device, electronic equipment and storage medium
CN113641836A (en) * 2021-08-20 2021-11-12 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN113903067A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Virtual object video generation method, device, equipment and medium
CN113873324A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Audio processing method, device, storage medium and equipment
CN113873297B (en) * 2021-10-18 2024-04-30 深圳追一科技有限公司 Digital character video generation method and related device
CN113873297A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Method and related device for generating digital character video
CN114007091A (en) * 2021-10-27 2022-02-01 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN114332671A (en) * 2021-11-08 2022-04-12 深圳追一科技有限公司 Processing method, device, equipment and medium based on video data
CN114202605A (en) * 2021-12-07 2022-03-18 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
CN114202605B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
US12125131B2 (en) 2021-12-07 2024-10-22 Beijing Baidu Netcom Science Technology Co., Ltd. Method of generating 3D video, method of training model, electronic device, and storage medium
CN114356084A (en) * 2021-12-24 2022-04-15 阿里巴巴(中国)有限公司 Image processing method and system and electronic equipment
WO2023125844A1 (en) * 2021-12-31 2023-07-06 中科寒武纪科技股份有限公司 Method for video encoding, method for video decoding, and related product
CN115052197B (en) * 2022-03-24 2024-05-28 北京沃丰时代数据科技有限公司 Virtual portrait video generation method and device
CN115052197A (en) * 2022-03-24 2022-09-13 北京沃丰时代数据科技有限公司 Virtual portrait video generation method and device
WO2023197979A1 (en) * 2022-04-13 2023-10-19 腾讯科技(深圳)有限公司 Data processing method and apparatus, and computer device and storage medium
CN115033690A (en) * 2022-05-31 2022-09-09 国网江苏省电力有限公司信息通信分公司 Communication defect study and judgment knowledge base construction method, defect identification method and system
CN115209180A (en) * 2022-06-02 2022-10-18 阿里巴巴(中国)有限公司 Video generation method and device
CN115375802A (en) * 2022-06-17 2022-11-22 北京百度网讯科技有限公司 Method and device for generating dynamic image, storage medium and electronic equipment
WO2024056078A1 (en) * 2022-09-16 2024-03-21 腾讯科技(深圳)有限公司 Video generation method and apparatus and computer-readable storage medium
CN115661908A (en) * 2022-12-09 2023-01-31 凝动万生医疗科技(武汉)有限公司 Method and device for desensitizing facial dynamic data
CN115690280B (en) * 2022-12-28 2023-03-21 山东金东数字创意股份有限公司 Three-dimensional image pronunciation mouth shape simulation method
CN115690280A (en) * 2022-12-28 2023-02-03 山东金东数字创意股份有限公司 Three-dimensional image pronunciation mouth shape simulation method
WO2024164909A1 (en) * 2023-02-08 2024-08-15 华为技术有限公司 Video generation method, apparatus and storage medium
CN116400806B (en) * 2023-04-03 2023-10-17 中国科学院心理研究所 Personalized virtual person generation method and system
CN116400806A (en) * 2023-04-03 2023-07-07 中国科学院心理研究所 Personalized virtual person generation method and system
CN116546252A (en) * 2023-04-28 2023-08-04 南京硅基智能科技有限公司 Mouth shape data processing method and content expression device in network live broadcast scene
CN116301481A (en) * 2023-05-12 2023-06-23 北京天图万境科技有限公司 Multi-multiplexing visual bearing interaction method and device
CN116385604B (en) * 2023-06-02 2023-12-19 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116385604A (en) * 2023-06-02 2023-07-04 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116934930A (en) * 2023-07-18 2023-10-24 杭州一知智能科技有限公司 Multilingual lip data generation method and system based on virtual 2d digital person
CN116993918B (en) * 2023-08-11 2024-02-13 无锡芯算智能科技有限公司 Modeling system and method for anchor image based on deep learning
CN116993918A (en) * 2023-08-11 2023-11-03 无锡芯算智能科技有限公司 Modeling system and method for anchor image based on deep learning
CN118660117A (en) * 2024-08-13 2024-09-17 浩神科技(北京)有限公司 Virtual person video clip synthesis method and system for intelligent video generation

Also Published As

Publication number Publication date
CN110866968A (en) 2020-03-06

Similar Documents

Publication Title
WO2021073416A1 (en) Method for generating virtual character video on the basis of neural network, and related device
US11935166B2 (en) Training method and apparatus for image processing model, image processing method and apparatus for image processing model, and storage medium
US11682153B2 (en) System and method for synthesizing photo-realistic video of a speech
CN112887698B (en) High-quality face voice driving method based on neural radiance field
CN113077537B (en) Video generation method, storage medium and device
CN113901894A (en) Video generation method, device, server and storage medium
CN110738153B (en) Heterogeneous face image conversion method and device, electronic equipment and storage medium
KR102409988B1 (en) Method and apparatus for face swapping using deep learning network
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
KR20210086744A (en) System and method for producing video contents based on deep learning
KR20220102905A (en) Apparatus, method and computer program for generating facial video
KR20210105159A (en) Apparatus, method and computer program for generating personalized avatar video
CN115631274B (en) Face image generation method, device, equipment and storage medium
Korshunov et al. Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes
Jha et al. Cross-language speech dependent lip-synchronization
US20240265606A1 (en) Method and apparatus for generating mouth shape by using deep learning network
CN115052197B (en) Virtual portrait video generation method and device
Zahedi et al. Robust sign language recognition system using ToF depth cameras
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN113160799B (en) Video generation method and device, computer-readable storage medium and electronic equipment
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.
Zimmermann et al. Combining multiple views for visual speech recognition
CN111260602B (en) Ultrasonic image analysis method for SSI
Mattos et al. Towards view-independent viseme recognition based on CNNs and synthetic data

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20876809
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 20876809
    Country of ref document: EP
    Kind code of ref document: A1