
WO2021073416A1 - Method for generating virtual character video on the basis of neural network, and related device - Google Patents

Method for generating virtual character video on the basis of neural network, and related device Download PDF

Info

Publication number
WO2021073416A1
WO2021073416A1 PCT/CN2020/118373
Authority
WO
WIPO (PCT)
Prior art keywords
audio
virtual character
dimensional
mouth
text
Prior art date
Application number
PCT/CN2020/118373
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
王义文
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021073416A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method and related equipment for generating a virtual character video based on a neural network.
  • a virtual character is a character that does not exist in reality; it is a fictional character appearing in creative works such as TV series, comics, and games. Virtual characters are usually synthesized with 3D scanning and similar methods, generating the required character by setting facial parameters.
  • the inventor realized that, when a virtual character is generated, its voice cannot be kept completely consistent with its mouth movements, which results in poor fidelity and makes it impossible to achieve a playback effect realistic enough to pass for real.
  • a method and related equipment for generating a virtual character video based on a neural network are provided.
  • a device for generating a virtual character video based on a neural network, including the following modules:
  • a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
  • a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
  • a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
  • the present application also provides a computer device, including a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor executes the steps of the following method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing it into a preset text-to-speech conversion model for voice conversion to obtain audio;
  • the present application also provides a storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing it into a preset text-to-speech conversion model for voice conversion to obtain audio;
  • this application effectively converts the characters in the text into audio, and then reconstructs the facial features corresponding to the audio onto the three-dimensional facial image through adversarial neural network and memory neural network techniques.
  • As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
  • FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application
  • FIG. 2 is a schematic diagram of an audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of the present application
  • Fig. 4 is a structural diagram of an apparatus for generating a virtual character video based on a neural network in an embodiment of the application.
  • Fig. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application.
  • a method for generating a virtual character video based on a neural network includes the following steps:
  • the text to be recognized may be in languages such as Chinese, English, or Japanese.
  • the positions of the separators in the text, such as "," and ".", can be determined first. According to the positions of these separators, the text to be recognized is divided into several sub-texts, and each sub-text is imported into the text-to-speech conversion model for text-to-sound conversion.
  • the text-to-speech conversion model can be composed of the Char2Wav architecture.
  • in the Char2Wav architecture, a simple recurrent neural network and a cross recurrent network are used to convert the words in the sub-texts into speech.
  • when performing voice conversion on a sub-text, the words in the sub-text can be converted into multi-dimensional word vectors, and the feature values and dimensions of the multi-dimensional word vectors are then used as input parameters for training and conversion in the simple recurrent neural network and the cross recurrent neural network.
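  • As a rough illustration of this preprocessing, the sketch below splits a text on an assumed set of separators and converts each word of a sub-text into a placeholder vector; the separator set, embedding, and vector size are assumptions rather than the Char2Wav pipeline described above.

```python
import re
import numpy as np

def split_into_subtexts(text):
    """Split the text to be recognized on an assumed set of separators."""
    parts = re.split(r"[,，。.!？?;；]", text)
    return [p.strip() for p in parts if p.strip()]

def words_to_vectors(subtext, dim=8, seed=0):
    """Toy word-vector conversion: each word gets a deterministic random vector.
    A real system would use a trained embedding such as word2vec."""
    rng = np.random.default_rng(seed)
    vocab, vectors = {}, []
    for word in subtext.split():
        if word not in vocab:
            vocab[word] = rng.normal(size=dim)
        vectors.append(vocab[word])
    return np.stack(vectors)

subtexts = split_into_subtexts("Hello there, this is a demo. It has two sub-texts.")
for st in subtexts:
    print(st, words_to_vectors(st).shape)
```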
  • the prosody parameters of the audio include pitch, duration, pause frequency, and so on.
  • the audio generation model may adopt a hidden Markov model.
  • the frequency range value and the vibration amplitude value of the audio frequency spectrum are obtained, and the frequency range value and the vibration amplitude value are input into the hidden Markov model to extract audio feature points.
  • the formula for audio feature point extraction is:
  • D(x, y) = ∫ P(X|x) · P(X|y) dX, where D(x, y) is the value of the audio feature point in the two-dimensional coordinate system, P(X|x) is the vibration amplitude probability value, and P(X|y) is the frequency probability value.
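  • A minimal numerical reading of the formula above, assuming the two conditional probabilities are modelled as Gaussian densities over a shared variable X; the densities and their parameters are illustrative, not outputs of the trained hidden Markov model.

```python
import numpy as np

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# X is the shared variable; integrate the product of the two conditionals.
X = np.linspace(-10.0, 10.0, 2001)
p_amp = gaussian(X, mean=1.0, std=2.0)    # P(X|x): vibration-amplitude probability
p_freq = gaussian(X, mean=0.5, std=1.5)   # P(X|y): frequency probability
D_xy = np.trapz(p_amp * p_freq, X)        # D(x, y) = ∫ P(X|x) · P(X|y) dX
print(D_xy)
```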
  • the 20 mouth key points are normalized so that they are not affected by image size, face position, face rotation, or face size. Normalization is very important in this process because it makes the generated key points compatible with any video. PCA is then used to reduce the dimensionality of the normalized mouth key points from 20 × 2 = 40 dimensions to 8 dimensions.
  • a bilinear interpolation method is used to expand the mouth standard point data after PCA; then, according to the mouth opening and closing amplitude and frequency corresponding to each audio feature point, the movement trajectory of each mouth standard point is determined, and after summarizing the movement trajectories of all mouth standard points, the movement trajectory of the virtual character's mouth is obtained.
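  • A sketch of the normalization and PCA step, assuming the 20 mouth key points per frame come from a landmark detector such as dlib; the random array below is only a stand-in for real landmarks.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for dlib mouth landmarks: n_frames x 20 key points x (x, y).
rng = np.random.default_rng(0)
mouth_points = rng.normal(size=(200, 20, 2))

# Normalize each frame: remove the mouth centroid and scale by its spread,
# so the key points are independent of face position and size.
centered = mouth_points - mouth_points.mean(axis=1, keepdims=True)
scale = np.linalg.norm(centered, axis=(1, 2), keepdims=True)
normalized = centered / scale

# Flatten 20 x 2 = 40 dimensions per frame and reduce to 8 with PCA.
flat = normalized.reshape(len(normalized), 40)
reduced = PCA(n_components=8).fit_transform(flat)
print(reduced.shape)  # (200, 8)
```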
  • a two-dimensional picture of the virtual character to be generated is obtained, a two-dimensional coordinate system is established, the contours of the mouth, nose, and eyes in the two-dimensional picture are obtained from the two-dimensional coordinate system, and the coordinates of the key facial feature points, such as the nose tip and the mouth corners, are extracted from these contours.
  • according to the coordinates of these key points, the head posture of the virtual character is determined, and the correctness of the head posture is evaluated by the least squares method.
  • the least squares estimation formula uses the following quantities: c denotes the correctness estimate, n the number of feature points, pi the occurrence probability of a feature point, s the rotation parameter, R the translation parameter, t the scaling parameter, and V the distance from a feature point to the origin.
  • the Canny algorithm can be used to detect the edge of the mouth in the three-dimensional facial image.
  • the image processed by the Canny algorithm is usually a grayscale image, so if the camera acquires a color image, it must first be grayscaled.
  • to grayscale a color image, a weighted average is taken over the sampled values of its channels; a commonly used weighting is Gray = 0.299R + 0.587G + 0.114B.
  • Gaussian filtering is then applied to the gray-scale image, and finite differences of the first-order partial derivatives are used to compute the magnitude and direction of the gradient. Operators such as the Roberts operator can be used for this computation; after non-maximum suppression of the gradient magnitude, the contour of the mouth edge is obtained. Once the mouth movement trajectory is marked in sequence on this contour, multiple frames of continuous dynamic facial images can be generated.
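  • A sketch of this edge-detection step using OpenCV, assuming a BGR face render as input; the blur kernel and Canny thresholds are illustrative values, not ones fixed by this application.

```python
import cv2
import numpy as np

def mouth_edge_contour(bgr_image):
    # Grayscale via the weighted average Gray = 0.299R + 0.587G + 0.114B
    # (cv2.cvtColor applies the same standard weighting).
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Gaussian filtering to suppress noise before gradient computation.
    blurred = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.4)
    # Canny computes gradients, applies non-maximum suppression and hysteresis.
    edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
    return edges

frame = np.zeros((256, 256, 3), dtype=np.uint8)
cv2.circle(frame, (128, 180), 40, (200, 200, 200), thickness=-1)  # stand-in mouth region
print(mouth_edge_contour(frame).shape)
```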
  • the dynamic facial images are played at the preset playback speed, and the total playback duration, the initial playback node, and the end playback node of the complete sequence of dynamic facial images are recorded; the segment of the audio generated in step S1 to be played is then determined from the playback duration and the positions of the initial and end playback nodes.
  • the virtual character video is obtained by using a video encoder to synthesize the audio segment to be played with the corresponding dynamic facial images.
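  • One common way to perform this kind of synchronized audio-video synthesis is to mux the rendered frames with the selected audio segment using a standard encoder such as ffmpeg; the file names and frame rate below are assumptions.

```python
import subprocess

# Assumed inputs: frames rendered as face_0001.png, face_0002.png, ... at 25 fps,
# plus the audio segment selected from the generated audio.
subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "25", "-i", "face_%04d.png",   # dynamic facial frames
    "-i", "audio_segment.wav",                   # real-time audio for those frames
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "virtual_character.mp4",
], check=True)
```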
  • in this embodiment, the characters in the text are effectively converted into audio, and the facial features corresponding to the audio are then reconstructed on the three-dimensional facial image through adversarial neural network and memory neural network techniques.
  • As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
  • FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application.
  • in S1, the text to be recognized is obtained and imported into the text-to-speech conversion model for voice conversion to obtain the audio, which includes:
  • the text to be recognized is usually a streaming data text
  • in a streaming data text, the separator can be a punctuation mark such as "." or ",", or a number such as "1" or "2".
  • equal length division or unequal length division can be used.
  • each character in the sub-text is encoded by word2vec.
  • a multi-dimensional word vector corresponding to each character can be generated.
  • the multi-dimensional word vector corresponding to each character can be marked according to the character's position in the sub-text; for example, if the word vector of the first character is [1,2,5], the marked word vector is [1,1,2,5].
  • after marking, the position of each multi-dimensional word vector can be determined, which prevents character positions from changing during speech conversion and causing the generated audio to be inconsistent with the original text.
  • the dimensionality of the multi-dimensional word vectors can be reduced with PCA, or with a vector projection method: the n-dimensional vector is projected into an (n-1)-dimensional space, the (n-1)-dimensional vector is then projected into an (n-2)-dimensional space, and so on until a two-dimensional plane is reached, yielding the two-dimensional word vector.
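  • A small sketch of the position-marking and dimensionality-reduction steps described above; the toy character vectors stand in for real word2vec embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy multi-dimensional "word2vec" vectors for the characters of one sub-text.
char_vectors = [
    [1, 2, 5],
    [3, 0, 4],
    [2, 2, 1],
]

# Mark each vector with its (1-based) position in the sub-text,
# e.g. [1, 2, 5] for the first character becomes [1, 1, 2, 5].
marked = np.array([[i + 1] + vec for i, vec in enumerate(char_vectors)], dtype=float)

# Reduce the marked multi-dimensional vectors to two dimensions with PCA.
two_dim = PCA(n_components=2).fit_transform(marked)
print(marked)
print(two_dim)
```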
  • the text-to-speech conversion model can use a bidirectional recurrent neural network model.
  • the bidirectional recurrent neural network model is applied in situations where the learning objective depends on the complete input sequence. For example, in speech recognition, the vocabulary corresponding to the current speech may correspond to vocabulary that appears later, so the complete speech is required as input.
  • the text-to-speech conversion model is used to accurately convert the input text into the corresponding audio without missing characters.
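  • A minimal sketch of a bidirectional recurrent layer of the kind mentioned above, written with PyTorch; the sizes are arbitrary and the layer is only a stand-in for the full text-to-speech conversion model.

```python
import torch
import torch.nn as nn

# Two-dimensional word vectors in, a hidden state per character out.
birnn = nn.LSTM(input_size=2, hidden_size=16, num_layers=1,
                bidirectional=True, batch_first=True)

batch = torch.randn(4, 12, 2)          # 4 sub-texts, 12 characters, 2-dim vectors
outputs, (h_n, c_n) = birnn(batch)
print(outputs.shape)                    # (4, 12, 32): forward and backward states
```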
  • Figure 3 is a schematic diagram of the audio feature point extraction process in a method for generating virtual characters based on neural networks in an embodiment of this application.
  • in S2, the prosody parameters of the audio are extracted and imported into the preset audio generation model for audio feature point extraction, which includes:
  • the level language parameters are divided into low-level language parameters and high-level language parameters, and the first prosody parameters are the prosody parameters of the audio before any processing.
  • T = argmax P(q, A|L), where T is the prosody code, P is the prosody state, q is the pitch, A is the basic rhythm feature parameter, L is the level language parameter, and argmax is the argument-of-the-maximum function.
  • the second prosody parameters are obtained by adding the encoded stream and the level language parameters.
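  • Read literally, the encoding step selects the prosody configuration that maximizes P(q, A|L); the sketch below performs this argmax over a small discrete grid with an invented probability table, purely for illustration.

```python
import numpy as np

pitches = np.array([80.0, 120.0, 200.0])        # candidate q values (Hz), assumed
rhythm_feats = np.array([0.5, 1.0, 1.5])        # candidate A values, assumed

# Hypothetical P(q, A | L) for one fixed level-language parameter L:
# rows index pitch candidates, columns index rhythm-feature candidates.
prob = np.array([
    [0.05, 0.10, 0.02],
    [0.20, 0.30, 0.08],
    [0.10, 0.10, 0.05],
])

qi, ai = np.unravel_index(np.argmax(prob), prob.shape)
print("T encodes q =", pitches[qi], "and A =", rhythm_feats[ai])
```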
  • in this embodiment, audio feature points are extracted from the generated audio through the prosody parameters, which shortens the time needed to generate the virtual character video; that is, the required virtual character video can be generated in a short time.
  • Figure 3 is a schematic diagram of the mouth trajectory generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application.
  • in S3, the trajectory of the virtual character's mouth movement is generated according to the audio feature points, which includes:
  • the preset mouth feature extraction algorithm is the dlib algorithm.
  • the mouth feature points are normalized by region clustering: since the mouth is a symmetrical structure, it is divided into four areas (upper left, lower left, upper right, and lower right), the straight lines dividing the areas are used as coordinate axes to establish a coordinate system, and the key points in any one of the four areas are clustered and enhanced.
  • the specific enhancement method is to calculate the distance between two key points; if the distance is less than a preset threshold, the midpoint of the line segment connecting the two key points is taken as an enhancement key point. This reduces the number of key points that need to be calculated.
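  • A sketch of this region-based enhancement, assuming normalized mouth key points centered at the origin; the distance threshold is illustrative.

```python
import numpy as np
from itertools import combinations

def enhance_keypoints(points, threshold=0.1):
    """Split mouth key points into four quadrants around the origin and, within
    each quadrant, add the midpoint of every pair closer than the threshold."""
    quadrant = (points[:, 0] >= 0).astype(int) * 2 + (points[:, 1] >= 0).astype(int)
    enhanced = []
    for q in range(4):
        pts = points[quadrant == q]
        for a, b in combinations(range(len(pts)), 2):
            if np.linalg.norm(pts[a] - pts[b]) < threshold:
                enhanced.append((pts[a] + pts[b]) / 2.0)
    return np.array(enhanced) if enhanced else np.empty((0, 2))

rng = np.random.default_rng(1)
mouth = rng.uniform(-0.5, 0.5, size=(20, 2))    # stand-in for normalized key points
print(enhance_keypoints(mouth, threshold=0.15).shape)
```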
  • for example, if the enhancement key points of character A are the left mouth corner and the midpoint of the upper lip, the playback frequency of character A is 30 kbps and the mouth movement amplitude during playback is 0.8 mm; if the enhancement key points of character B are the right mouth corner and the midpoint of the lower lip, the playback frequency of character B is 35 kbps and the mouth movement amplitude during playback is 0.7 mm.
  • the playback frequencies and mouth movement amplitudes of characters A and B are obtained from the existing correspondence between mouth movement and audio playback in the database; after fitting the playback frequency and the mouth movement amplitude, the trajectory of the virtual character's mouth movement is obtained.
  • in S4, a two-dimensional picture of the virtual character is acquired and imported into the facial feature generation model for processing to generate the three-dimensional facial image of the virtual character, which includes:
  • when calculating the gradient of the two-dimensional picture, the picture can be divided into multiple sub-blocks of equal size; the pixel values of each sub-block are then extracted and a binarized pixel-value matrix is built for the sub-block, and the gradient value of each sub-block is obtained from the number of times "1" and "0" alternate in each row or column of that matrix.
  • after summarizing the gradient values of all sub-blocks, the depth information of the different facial regions in the 3D facial image is obtained: an area with a large gradient value has a large 3D depth, and an area with a small gradient value has a small 3D depth.
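  • A sketch of this block-wise gradient measure; the block size and binarization threshold are assumptions, since they are not fixed here.

```python
import numpy as np

def block_gradients(gray, block=8, threshold=128):
    """Binarize each equal-sized sub-block and count 0/1 alternations
    along its rows and columns as a crude gradient value."""
    h, w = gray.shape
    grads = np.zeros((h // block, w // block))
    for bi in range(h // block):
        for bj in range(w // block):
            sub = gray[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block]
            binary = (sub >= threshold).astype(int)
            row_changes = np.abs(np.diff(binary, axis=1)).sum()
            col_changes = np.abs(np.diff(binary, axis=0)).sum()
            grads[bi, bj] = row_changes + col_changes
    return grads   # larger values -> larger assumed 3D depth

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(64, 64))
print(block_gradients(image).shape)   # (8, 8)
```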
  • the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  • for example, if the distance between the mouth corners in the preset standard three-dimensional facial image is 10 and the calculated distance is 9, the distance between the mouth corners is adjusted according to the distance value 9.
  • in this embodiment, the two-dimensional picture is converted into a three-dimensional facial image through depth processing, which effectively guarantees the authenticity of the virtual character's face, so that it can pass for a real one.
  • in S5, the mouth movement trajectory is imported into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images, which includes:
  • the three-dimensional facial image still retains the facial key points of the two-dimensional picture, but those key points have been given three-dimensional coordinates.
  • the facial key points in the 3D facial image change with the mouth movement trajectory; for example, if the mouth movement trajectory is an open mouth, the key point in the middle of the mouth in the 3D facial image is shifted upward by 5 mm, along with the key points at the mouth corners, and the facial-feature key points whose positions have changed are taken as the change features.
  • in the formula, L denotes the adversarial loss, E() the expectation, G() the adversarial generation model, D() the adversarial discrimination model, S the change feature, and T the reconstructed image.
  • the multi-frame continuous dynamic facial images are generated.
  • mouth images can be generated through the adversarial neural network model; since these mouth images are generated sequentially according to the changes in the mouth movement trajectory, sorting the reconstructed mouth images in order and playing them continuously yields the dynamic facial images.
  • the adversarial neural network is used to generate dynamic facial images, thereby ensuring the synchronization of audio and facial images.
  • the method further includes:
  • the prosody parameters corresponding to the audio generated in step S1 include pitch, duration, and so on.
  • if the pitch is greater than a preset pitch threshold, it is marked as a high pitch, and the positions of all high pitches in the audio are recorded; these positions are the positions of the key audio frames in the virtual character video. In other words, if the third second of the audio is a high pitch, then the third second of the virtual character video is a key audio frame.
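  • A simple reading of this key-frame rule, assuming a per-second pitch track is available; the pitch values and threshold below are invented for illustration.

```python
import numpy as np

pitch_per_second = np.array([180.0, 210.0, 320.0, 190.0, 350.0, 200.0])  # Hz, assumed
pitch_threshold = 300.0                                                   # assumed

# Seconds whose pitch exceeds the threshold are marked as key audio frames.
key_frame_seconds = np.flatnonzero(pitch_per_second > pitch_threshold)
print(key_frame_seconds)   # e.g. [2 4]: the 3rd and 5th seconds are key frames
```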
  • the long short-term memory neural network is a time-recurrent neural network specifically designed to solve long-term dependency problems.
  • All RNNs have a chain form of repeating neural network modules.
  • the frequency spectrum corresponding to the audio can be input into the long short-term memory neural network for memory storage in advance, and the audio can then be effectively identified from the video stream at the key frames.
  • the mouth state corresponding to the audio signal is obtained and compared with the mouth image; if they are synchronized, the virtual character video is sent to the client, otherwise the audio-video synthesis and encoding are restarted until the state of the virtual character's mouth in the virtual character video is synchronized with the mouth image.
  • a device for generating a virtual character video based on a neural network is proposed, as shown in FIG. 4, which includes the following modules:
  • a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
  • a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
  • a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
  • in one embodiment, a computer device includes a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor executes the steps of the method for generating a virtual character video based on a neural network in the foregoing embodiments.
  • a storage medium storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments.
  • the storage medium may be non-volatile or volatile.
  • the program can be stored in a computer-readable storage medium, and the storage medium can include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and particularly relates to a method for generating a virtual character video on the basis of a neural network, and a related device. The method comprises: acquiring text to be identified, and obtaining audio after the text is imported into a text-to-speech conversion model and is subjected to sound conversion; extracting a rhythm parameter of the audio, and extracting an audio feature point; generating a mouth motion trajectory of a virtual character; obtaining a two-dimensional picture of the virtual character, and generating a three-dimensional facial picture of the virtual character after processing the two-dimensional picture; importing the mouth motion trajectory into the three-dimensional facial picture to generate a dynamic facial image; and acquiring real-time audio corresponding to each frame of dynamic facial image, and synchronously performing audio and video synthesis encoding on the dynamic facial image and the real-time audio to obtain a virtual character video. According to the present application, the aim of obtaining a desired video display effect as long as text is input is achieved, such that it is ensured that the sound of a virtual character and the mouth action of the virtual character are kept completely consistent.

Description

Method and related equipment for generating a virtual character video based on a neural network
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 18, 2019, with application number 201910990742.6 and the invention title "Method and Related Equipment for Generating Virtual Character Video Based on Neural Network", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a method and related equipment for generating a virtual character video based on a neural network.
Background
A virtual character is a character that does not exist in reality; it is a fictional character appearing in creative works such as TV series, comics, and games. Virtual characters are usually synthesized with 3D scanning and similar methods, generating the required character by setting facial parameters.
However, the inventor realized that, when a virtual character is generated, its voice cannot be kept completely consistent with its mouth movements, which results in poor fidelity and makes it impossible to achieve a playback effect realistic enough to pass for real.
Summary of the Invention
On this basis, in view of the problem that the voice of the virtual character cannot be kept completely consistent with its mouth movements when the virtual character is generated, a method and related equipment for generating a virtual character video based on a neural network are provided.
A method for generating a virtual character video based on a neural network includes the following steps:
acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
A device for generating a virtual character video based on a neural network includes the following modules:
a trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the trajectory of the virtual character's mouth movement according to the audio feature points;
a picture generation module, configured to obtain a two-dimensional picture of a preset virtual character and import it into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character; and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
a video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Further, to achieve the above object, the present application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the processor executes the steps of the following method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Further, to achieve the above object, the present application also provides a storage medium storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above method for generating a virtual character video based on a neural network, including: obtaining the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
generating the trajectory of the virtual character's mouth movement according to the audio feature points;
acquiring a two-dimensional picture of a preset virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
acquiring the real-time audio corresponding to each frame of the dynamic facial images, and performing synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Compared with existing mechanisms, this application effectively converts the characters in the text into audio, and then reconstructs the facial features corresponding to the audio onto the three-dimensional facial image through adversarial neural network and memory neural network techniques. As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
Description of the Drawings
By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only intended to illustrate the preferred embodiments and are not to be considered a limitation of this application.
FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application;
FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application;
FIG. 3 is a schematic diagram of the audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of the present application;
FIG. 4 is a structural diagram of a device for generating a virtual character video based on a neural network in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
Those skilled in the art can understand that, unless specifically stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the term "comprising" used in the specification of this application refers to the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
FIG. 1 is an overall flowchart of a method for generating a virtual character video based on a neural network in an embodiment of the present application. A method for generating a virtual character video based on a neural network includes the following steps:
S1. Obtain the text to be recognized, and import the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio.
Specifically, the text to be recognized may be in a language such as Chinese, English, or Japanese. For the text to be recognized, the positions of the separators in the text, such as "," and "。", can be determined first. According to the positions of these separators, the text to be recognized is divided into several sub-texts, and each sub-text is imported into the text-to-speech conversion model for text-to-sound conversion.
The text-to-speech conversion model can be composed of the Char2Wav architecture. In the Char2Wav architecture, a simple recurrent neural network and a cross recurrent network are used to convert the words in the sub-texts into speech.
When performing voice conversion on a sub-text, the words in the sub-text can be converted into multi-dimensional word vectors, and the feature values and dimensions of the multi-dimensional word vectors are then used as input parameters for training and conversion in the simple recurrent neural network and the cross recurrent neural network.
S2. Extract the prosody parameters of the audio, and import the prosody parameters into the audio generation model for audio feature point extraction.
Specifically, the prosody parameters of the audio include pitch, duration, pause frequency, and so on. When extracting the prosody parameters, the audio generation model may adopt a hidden Markov model.
The frequency range value and the vibration amplitude value of the audio spectrum are obtained, and the frequency range value and the vibration amplitude value are input into the hidden Markov model to extract audio feature points. The formula for audio feature point extraction is:
D(x, y) = ∫ P(X|x) · P(X|y) dX, where D(x, y) is the value of the audio feature point in the two-dimensional coordinate system, P(X|x) is the vibration amplitude probability value, and P(X|y) is the frequency probability value.
S3. Generate the trajectory of the virtual character's mouth movement according to the audio feature points.
Specifically, the 20 mouth key points extracted by the dlib algorithm are normalized so that they are not affected by image size, face position, face rotation, or face size. Normalization is very important in this process because it makes the generated key points compatible with any video. PCA is then used to reduce the dimensionality of the normalized mouth key points from 20 × 2 = 40 dimensions to 8 dimensions. A bilinear interpolation method is used to expand the mouth standard point data after PCA; then, according to the mouth opening and closing amplitude and frequency corresponding to each audio feature point, the movement trajectory of each mouth standard point is determined, and after summarizing the movement trajectories of all mouth standard points, the movement trajectory of the virtual character's mouth is obtained.
S4. Obtain a two-dimensional picture of a preset virtual character, and import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character.
Specifically, a two-dimensional picture of the virtual character to be generated is obtained, a two-dimensional coordinate system is established, the contours of the mouth, nose, and eyes in the two-dimensional picture are obtained from the two-dimensional coordinate system, and the coordinates of the key facial feature points, such as the nose tip and the mouth corners, are extracted from these contours. According to the coordinates of these key points, the head posture of the virtual character is determined, and the correctness of the head posture is evaluated by the least squares method. The least squares estimation formula is:
[Formula image: PCTCN2020118373-appb-000001]
In the formula, c denotes the correctness estimate, n the number of feature points, pi the occurrence probability of a feature point, s the rotation parameter, R the translation parameter, t the scaling parameter, and V the distance from a feature point to the origin.
S5. Import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images.
Specifically, when importing the mouth movement trajectory into the three-dimensional facial image, the Canny algorithm can be used to detect the mouth edge in the three-dimensional facial image. The image processed by the Canny algorithm is usually a grayscale image, so if the camera acquires a color image, it must first be grayscaled. To grayscale a color image, a weighted average is taken over the sampled values of its channels; a commonly used method is Gray = 0.299R + 0.587G + 0.114B. Gaussian filtering is then applied to the gray-scale image, and finite differences of the first-order partial derivatives are used to compute the magnitude and direction of the gradient. Operators such as the Roberts operator can be used for this computation; after non-maximum suppression of the gradient magnitude, the contour of the mouth edge is obtained. Once the mouth movement trajectory is marked in sequence on this contour, multiple frames of continuous dynamic facial images can be generated.
S6. Obtain the real-time audio corresponding to each frame of the dynamic facial images, and perform synchronized audio-video synthesis and encoding on the dynamic facial images and the real-time audio to obtain a virtual character video.
Specifically, the dynamic facial images are played at the preset playback speed, and the total playback duration, the initial playback node, and the end playback node of the complete sequence of dynamic facial images are recorded; the segment of the audio generated in step S1 to be played is then determined from the playback duration and the positions of the initial and end playback nodes. Finally, the virtual character video is obtained by using a video encoder to synthesize the audio segment to be played with the corresponding dynamic facial images.
In this embodiment, the characters in the text are effectively converted into audio, and the facial features corresponding to the audio are then reconstructed on the three-dimensional facial image through adversarial neural network and memory neural network techniques. As a whole, the conversion from text to video is realized; there is no need to separately simulate and train each link among text, audio, and video. The desired video display effect can be obtained simply by inputting text, which ensures that the virtual character's voice is kept completely consistent with its mouth movements.
FIG. 2 is a schematic diagram of the audio generation process in a method for generating a virtual character video based on a neural network in an embodiment of the present application. As shown in the figure, S1, obtaining the text to be recognized and importing it into the text-to-speech conversion model for voice conversion to obtain audio, includes:
S11. Obtain the text to be recognized, extract the separators in the text to be recognized, and divide the text to be recognized into several sub-texts according to the separators.
Specifically, the text to be recognized is usually a streaming data text. In a streaming data text, the separator can be a punctuation mark such as "。" or ",", or a number such as "1" or "2". When dividing the text to be recognized, equal-length or unequal-length division can be used.
S12. Perform word vector encoding on the sub-texts to obtain several multi-dimensional word vectors.
Each character in the sub-text is encoded with word2vec; after word vector encoding, a multi-dimensional word vector corresponding to each character is generated. The multi-dimensional word vector corresponding to each character can be marked according to the character's position in the sub-text; for example, if the word vector of the first character is [1,2,5], the marked word vector is [1,1,2,5]. After marking, the position of each multi-dimensional word vector can be determined, which prevents character positions from changing during speech conversion and causing the generated audio to be inconsistent with the original text.
S13. Reduce the dimensionality of the multi-dimensional word vectors to obtain two-dimensional word vectors.
The dimensionality of the multi-dimensional word vectors can be reduced with PCA, or with a vector projection method: the n-dimensional vector is projected into an (n-1)-dimensional space, the (n-1)-dimensional vector is then projected into an (n-2)-dimensional space, and so on until a two-dimensional plane is reached, yielding the two-dimensional word vectors.
S14. Calculate the feature values of the two-dimensional word vectors, use the feature values of the two-dimensional word vectors as weights, and import the two-dimensional word vectors and the weights into the text-to-speech conversion model for text-to-speech conversion to obtain the audio.
The text-to-speech conversion model can use a bidirectional recurrent neural network model. The bidirectional recurrent neural network model is applied in situations where the learning objective depends on the complete input sequence. For example, in speech recognition, the vocabulary corresponding to the current speech may correspond to vocabulary that appears later, so the complete speech is required as input.
In this embodiment, the text-to-speech conversion model is used to accurately convert the input text into the corresponding audio without missing characters.
FIG. 3 is a schematic diagram of the audio feature point extraction process in a method for generating a virtual character based on a neural network in an embodiment of this application. As shown in the figure, S2, extracting the prosody parameters of the audio and importing the prosody parameters into the preset audio generation model for audio feature point extraction, includes:
S21. Extract the first prosody parameters and the level language parameters of the audio, and generate prosody marks according to the duration, pitch, and pause timing in the first prosody parameters.
The level language parameters are divided into low-level language parameters and high-level language parameters, and the first prosody parameters are the prosody parameters of the audio before any processing.
S22. Encode the prosody marks to generate an encoded stream.
When encoding according to the prosody parameters, the formula used is:
T = argmax P(q, A|L), where T is the prosody code, P is the prosody state, q is the pitch, A is the basic rhythm feature parameter, L is the level language parameter, and argmax is the argument-of-the-maximum function.
S23. Generate second prosody parameters according to the encoded stream and the level language parameters.
Specifically, the second prosody parameters are obtained by adding the encoded stream and the level language parameters.
S24. Import the second prosody parameters into the audio generation model to extract the audio feature points in the second prosody parameters.
In this embodiment, audio feature points are extracted from the generated audio through the prosody parameters, which shortens the time needed to generate the virtual character video; that is, the required virtual character video can be generated in a short time.
图3为本申请在一个实施例中的一种基于神经网络生成虚拟人物视频的方法中的嘴部轨迹生成过程示意图,如图所示,所述S3、根据所述音频特征点,生成虚拟人物嘴部运动轨迹,包括:Figure 3 is a schematic diagram of the mouth trajectory generation process in a method for generating a virtual character video based on a neural network in an embodiment of the application. As shown in the figure, the S3 generates a virtual character according to the audio feature points Mouth movement trajectory, including:
S31、获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;S31. Acquire a preset virtual character image, and extract key points of the mouth from the virtual character image according to a preset mouth key point extraction algorithm;
其中,预设的嘴部特征提取算法为dlib算法。Among them, the preset mouth feature extraction algorithm is the dlib algorithm.
S32、对所述嘴部关键点进行归一化处理,得到增强关键点;S32. Perform normalization processing on the key points of the mouth to obtain enhanced key points;
具体的,在归一化嘴部特征点采用区域聚类的方式,由于嘴部是对称结构,因此将嘴部划分为左上、左下、右上和右下四个区域,以划分区域的直线作为坐标轴,建立一坐标系,对坐标中四个区域中的任意一个区域中的关键点进行聚类增强。具体增强的方式是,计算两个关键点之间的距离,若距离小于预设阈值,则以这连个关键点连线的线段中点作为一个增强关键点。这样可以减少需要进行计算的关键点数量。Specifically, the normalized mouth feature point adopts the method of region clustering. Since the mouth is a symmetrical structure, the mouth is divided into four areas: upper left, lower left, upper right, and lower right, and the straight line dividing the area is used as the coordinates. Axis, a coordinate system is established, and the key points in any one of the four areas in the coordinates are clustered and enhanced. The specific enhancement method is to calculate the distance between two key points, and if the distance is less than a preset threshold, the midpoint of the line segment connecting the two key points is used as an enhancement key point. This can reduce the number of key points that need to be calculated.
S33. Obtain the playback frequency of the audio and the mouth movement amplitude during playback according to the enhanced key points, and fit the playback frequency and the mouth movement amplitude to obtain the mouth movement trajectory of the virtual character.
Specifically, the frequency and amplitude are determined by the positions of the enhanced key points on the mouth. For example, if the enhanced key points of character A are the left mouth corner and the midpoint of the upper lip, the playback frequency of character A is 30 kbps and the mouth movement amplitude during playback is 0.8 mm; if the enhanced key points of character B are the right mouth corner and the midpoint of the lower lip, the playback frequency of character B is 35 kbps and the mouth movement amplitude during playback is 0.7 mm. The playback frequencies and mouth movement amplitudes of characters A and B are obtained from the existing correspondence between mouth movements and audio playback in the database. After fitting the playback frequency and the mouth movement amplitude, the mouth movement trajectory of the virtual character is obtained.
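A sketch of the lookup-and-fit step under strong assumptions: the lookup table, the sampling rate, and the sinusoidal opening curve below are illustrative only; the embodiment states merely that frequency and amplitude come from a database of mouth-movement/audio correspondences and are then fitted into a trajectory.

```python
import numpy as np

# Hypothetical lookup table: enhanced-key-point signature -> (playback kbps, amplitude mm).
MOUTH_DB = {
    ("left_corner", "upper_lip_mid"): (30, 0.8),
    ("right_corner", "lower_lip_mid"): (35, 0.7),
}

def mouth_trajectory(signature, duration_s=1.0, fps=25):
    """Fit a simple opening/closing curve from the looked-up frequency and amplitude."""
    kbps, amplitude = MOUTH_DB[signature]
    t = np.linspace(0.0, duration_s, int(duration_s * fps))
    # Assumed model: mouth opening oscillates with an amplitude scaled by the bit rate.
    opening = amplitude * np.abs(np.sin(2 * np.pi * (kbps / 10.0) * t))
    return list(zip(t, opening))
```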
本实施例,通过对嘴部关键点进行有效提取,从而保证了音频和虚拟人物画面中嘴部运动结合时的准确性。In this embodiment, by effectively extracting the key points of the mouth, the accuracy of the combination of the audio and the mouth movement in the virtual character picture is ensured.
在一个实施例中,所述S4、获取虚拟人物的二维图片,将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图,包括:In one embodiment, the step S4, acquiring a two-dimensional picture of a virtual character, and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character includes:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
Here, when computing the gradient of the two-dimensional picture, the picture can be divided into several sub-blocks of equal size, the pixel values of each sub-block can be extracted, and a binarized pixel-value matrix can be built for each sub-block. The gradient value of each sub-block is obtained from the number of times "1" and "0" alternate in each row or column of the binarized pixel-value matrix. After the gradient values of all sub-blocks are aggregated, the depth information of the different facial regions in the three-dimensional facial image is obtained: regions with large gradient values have a large depth in the three-dimensional map, and regions with small gradient values have a small depth.
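A minimal sketch of the sub-block gradient computation: the binarized picture is split into equal blocks, the 0/1 alternations along each block's rows and columns are counted, and the count is used as the block's gradient (larger values meaning greater depth). The block size is an assumption.

```python
import numpy as np

def block_depth_map(binary_img, block=8):
    """binary_img: 2-D array of 0/1 integer values. Returns a per-block 'gradient' map
    whose values count the 0/1 alternations in each block's rows and columns; larger
    values are treated as deeper regions in the three-dimensional face map."""
    binary_img = np.asarray(binary_img, dtype=np.int8)
    h, w = binary_img.shape
    rows, cols = h // block, w // block
    depth = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sub = binary_img[i * block:(i + 1) * block, j * block:(j + 1) * block]
            row_flips = np.abs(np.diff(sub, axis=1)).sum()   # alternations along rows
            col_flips = np.abs(np.diff(sub, axis=0)).sum()   # alternations along columns
            depth[i, j] = row_flips + col_flips
    return depth
```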
以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
Specifically, for example, if the distance between the mouth corners in the preset standard three-dimensional facial image is 10 while the calculated distance is 9, the mouth-corner distance is adjusted according to the distance value 9.
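A minimal sketch of the adjustment, assuming the standard three-dimensional face is given as an array of landmark positions and that a pair of landmarks is rescaled symmetrically about its midpoint to match the measured distance; the embodiment states only that positions are adjusted according to the computed distances.

```python
import numpy as np

def adjust_landmark_pair(std_face, idx_a, idx_b, measured_dist):
    """std_face: (N, 3) array of standard 3-D landmark positions.
    Rescales landmarks idx_a and idx_b about their midpoint so that their distance
    equals measured_dist (e.g. standard mouth-corner distance 10 -> measured 9)."""
    face = np.asarray(std_face, dtype=float).copy()
    a, b = face[idx_a], face[idx_b]
    mid = (a + b) / 2.0
    scale = measured_dist / np.linalg.norm(a - b)
    face[idx_a] = mid + (a - mid) * scale
    face[idx_b] = mid + (b - mid) * scale
    return face
```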
In this embodiment, the two-dimensional image is converted into a three-dimensional facial image through depth processing, which effectively guarantees the realism of the virtual character's face and achieves a lifelike effect.
在一个实施例中,所述S5、将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面,包括:In an embodiment, the S5. importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images includes:
将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Specifically, the three-dimensional facial image still retains the face key points of the two-dimensional picture; these key points are simply given three-dimensional positions. After the mouth movement trajectory is imported into the three-dimensional facial image, the face key points in the three-dimensional facial image change position along with the mouth movement trajectory. For example, if the mouth movement trajectory is an opening mouth, the key points in the middle of the mouth in the three-dimensional facial image shift upward by 5 mm and the mouth-corner key points shift accordingly; these facial-feature key points whose positions have changed are taken as the change features.
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
其中,在利用对抗神经网络模型(Edge-connect)进行嘴部图像重构时,可以采用下面公式减少对抗误差:Among them, when using the adversarial neural network model (Edge-connect) to reconstruct the mouth image, the following formula can be used to reduce the adversarial error:
[Adversarial-error formula, published as the image PCTCN2020118373-appb-000002.]
In the formula, L denotes the adversarial error, E() denotes expectation, G() denotes the adversarial-error generation model, D() denotes the adversarial-error model, S denotes the change features, and T denotes the reconstructed image.
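Because the adversarial-error formula is published only as an image, the sketch below uses the standard non-saturating GAN losses as an assumption consistent with the legend above, with G acting as the generation model fed with the change features S and D as the model judging reconstructed images T.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, S, T_real):
    """Standard GAN losses (an assumption; the application's exact formula is
    reproduced only as an image). S: change features, T_real: reference mouth images."""
    T_fake = G(S)

    # Discriminator: reference images toward 1, reconstructed images toward 0.
    real_logits = D(T_real)
    fake_logits = D(T_fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Generator: push reconstructions to be judged as real.
    gen_logits = D(T_fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```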
将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
Specifically, several mouth images can be generated by the adversarial neural network model. Since these mouth images are generated in sequence according to the changes of the mouth movement trajectory, sorting the reconstructed mouth images in order and playing them continuously yields the dynamic facial picture.
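A minimal sketch of assembling the time-ordered reconstructed mouth images into a continuous clip with OpenCV; the codec, frame rate, and the (timestamp, frame) input format are assumptions.

```python
import cv2

def frames_to_video(frames_with_time, out_path="face.mp4", fps=25):
    """frames_with_time: list of (timestamp, BGR uint8 ndarray) pairs produced by the
    mouth-image reconstruction; they are sorted by generation time and written out."""
    frames = [f for _, f in sorted(frames_with_time, key=lambda p: p[0])]
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
```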
本实施例,利用对抗神经网络生成动态人脸面部画面,从而保证了音频与人脸图像的同步性。In this embodiment, the adversarial neural network is used to generate dynamic facial images, thereby ensuring the synchronization of audio and facial images.
In one embodiment, after S6, acquiring the real-time audio corresponding to each frame of the dynamic facial picture, playing the dynamic facial picture and the real-time audio synchronously, and obtaining the virtual character video, the method further includes:
定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
Specifically, the prosody parameters corresponding to the audio generated in step S1 include pitch, duration, and so on. When the pitch is greater than a preset pitch threshold, it is marked as a high pitch, and the positions of all high pitches in the audio are recorded; these positions are the positions of the key audio frames in the virtual character video. That is, if the third second of the audio is a high pitch, then the position at the third second of the virtual character video is a key audio frame.
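A minimal sketch of locating key audio frames, assuming a per-frame pitch contour is already available (for example from the prosody parameters of step S1) and using an illustrative threshold.

```python
def key_audio_frames(pitch_contour, frame_duration_s, pitch_threshold=300.0):
    """pitch_contour: list of per-frame pitch values (Hz). Returns the timestamps
    (seconds) of frames whose pitch exceeds the threshold, i.e. the key audio frames."""
    return [i * frame_duration_s
            for i, pitch in enumerate(pitch_contour)
            if pitch > pitch_threshold]

# e.g. a high pitch around second 3 marks second 3 of the video as a key audio frame:
# key_audio_frames(pitch_contour, frame_duration_s=0.02)
```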
根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
Feeding the spectral features of the audio signal into a pre-stored long short-term memory (LSTM) network model for speech recognition;
Here, the long short-term memory network is a kind of recurrent neural network designed specifically to handle long-term dependency problems; all RNNs have the form of a chain of repeating neural network modules. The spectrum corresponding to the audio can be input into the LSTM network in advance for memory storage, and the audio can then be effectively recognized from the video stream at the key frames.
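A minimal sketch of an LSTM classifier over spectral features in PyTorch; the architecture, feature dimension, and output classes are assumptions, since the embodiment does not specify the structure of the pre-stored model.

```python
import torch
import torch.nn as nn

class SpeechLSTM(nn.Module):
    """Maps a sequence of spectral feature vectors to per-utterance class logits
    (e.g. mouth-state labels used for the synchronization check)."""
    def __init__(self, feat_dim=40, hidden=128, num_classes=40):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, spectra):            # spectra: (batch, time, feat_dim)
        out, _ = self.lstm(spectra)
        return self.head(out[:, -1, :])    # logits from the last time step

# Illustrative usage on a dummy batch of 100-frame spectrograms.
model = SpeechLSTM()
logits = model(torch.randn(2, 100, 40))
```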
According to the speech recognition result, obtain the mouth state corresponding to the audio signal and compare the mouth state with the mouth image. If they are synchronized, send the virtual character video to the client; otherwise, perform the audio-video synthesis and encoding again until the mouth state of the virtual character in the virtual character video is synchronized with the mouth image.
In this embodiment, the key frames of the virtual character video are effectively analyzed, thereby verifying the synchronization between the sound and the picture of the virtual character video.
在一个实施例中,提出了一种基于神经网络生成虚拟人物视频的装置,如图4所示,包括如下模块:In one embodiment, a device for generating a virtual character video based on a neural network is proposed, as shown in FIG. 4, which includes the following modules:
A trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the mouth movement trajectory of the virtual character according to the audio feature points;
A picture generation module, configured to obtain a two-dimensional picture of a preset virtual character, import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures;
A video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial picture, and perform synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video.
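A minimal sketch of how the three modules could be wired together; every model and helper referenced below is a placeholder assumption, not taken from the embodiment.

```python
class VirtualCharacterVideoDevice:
    """Wires the trajectory, picture, and video generation modules together (S1-S6)."""
    def __init__(self, tts_model, audio_model, face_model, renderer):
        self.tts_model = tts_model          # text-to-speech conversion model
        self.audio_model = audio_model      # audio generation model (feature points)
        self.face_model = face_model        # facial feature generation model
        self.renderer = renderer            # produces frames and muxes audio/video

    def generate(self, text, character_picture):
        audio = self.tts_model.synthesize(text)
        feature_points = self.audio_model.extract(audio)
        mouth_trajectory = self.renderer.mouth_trajectory(feature_points)
        face_3d = self.face_model.lift_to_3d(character_picture)
        frames = self.renderer.animate(face_3d, mouth_trajectory)
        return self.renderer.mux(frames, audio)   # the virtual character video
```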
In one embodiment, a computer device is provided. The computer device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the method for generating a virtual character video based on a neural network described in the foregoing embodiments. The storage medium may be non-volatile or volatile.
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can include: Read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
The technical features of the above embodiments can be combined arbitrarily. For conciseness, not all possible combinations of these technical features have been described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
以上所述实施例仅表达了本申请一些示例性实施例,其中描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express some exemplary embodiments of the present application. The descriptions are more specific and detailed, but they should not be interpreted as a limitation on the patent scope of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims (20)

  1. 一种基于神经网络生成虚拟人物视频的方法,其中,包括:A method for generating a virtual character video based on a neural network, which includes:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
  2. 根据权利要求1所述的基于神经网络生成虚拟人物视频的方法,其中,所述获取待识别文本,将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频,包括:The method for generating a virtual character video based on a neural network according to claim 1, wherein said obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio includes:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After dimensionality reduction is performed on the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
  3. 根据权利要求1所述的基于神经网络生成虚拟人物视频的方法,其中,所述提取所述音频的韵律参数,将所述韵律参数导入到音频生成模型中进行音频特征点提取,包括:The method for generating a virtual character video based on a neural network according to claim 1, wherein the extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction comprises:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoding stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
  4. 根据权利要求3所述的基于神经网络生成虚拟人物视频的方法,其中,所述根据所述音频特征点,生成虚拟人物嘴部运动轨迹,包括:The method for generating a virtual character video based on a neural network according to claim 3, wherein the generating a trajectory of the virtual character's mouth according to the audio feature points comprises:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
5. The method for generating a virtual character video based on a neural network according to claim 1, wherein said obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character includes:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  6. 根据权利要求5所述的基于神经网络生成虚拟人物视频的方法,其中,所述将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面,包括:The method for generating a virtual character video based on a neural network according to claim 5, wherein said importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images comprises:
    将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
    将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
7. The method for generating a virtual character video based on a neural network according to any one of claims 1 to 6, wherein after said acquiring the real-time audio corresponding to each frame of the dynamic facial picture and performing synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video, the method further includes:
    定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
    根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
    将所述音频信号的谱特征入参到预存的长短期记忆网络模型中,进行语音识别;Incorporating the spectral characteristics of the audio signal into a pre-stored long and short-term memory network model to perform speech recognition;
    根据语音识别结果,得到所述音频信号对应的嘴部状态,将所述嘴部状态与所述嘴部图像进行比较,若同步,则发送所述虚拟人物视频至客户端,否则重新进行音视频合成编码,直到所述虚拟人物视频中虚拟人物的嘴部状态与所述嘴部图像同步。According to the voice recognition result, the mouth state corresponding to the audio signal is obtained, and the mouth state is compared with the mouth image. If synchronized, the virtual character video is sent to the client, otherwise the audio and video are restarted Synthesize encoding until the state of the mouth of the virtual character in the virtual character video is synchronized with the mouth image.
8. A computer device, including a memory and a processor, the memory storing computer-readable instructions, wherein, when the computer-readable instructions are executed by the processor, the processor is caused to perform the following steps of a method for generating a virtual character video based on a neural network:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
9. The computer device according to claim 8, wherein, when the program for generating a virtual character video based on a neural network is executed by the processor to perform the step of obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio, the following steps are included:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After reducing the dimensions of the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
10. The computer device according to claim 8, wherein, when the program of the computer device is executed by the processor to perform the step of extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction, the following steps are included:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoded stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
11. The computer device according to claim 10, wherein, when the program of the computer device is executed by the processor to perform the step of generating the mouth movement trajectory of the virtual character according to the audio feature points, the following steps are included:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
12. The computer device according to claim 8, wherein, when the program of the computer device is executed by the processor to perform the step of obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, the following steps are included:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
13. The computer device according to claim 12, wherein, when the program of the computer device is executed by the processor to perform the step of importing the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures, the following steps are included:
    将所述嘴部运动轨迹导入到所述三维面部图,并提取三维面部图中发生位置变化的人脸五官关键点作为变化特征;Importing the movement trajectory of the mouth into the three-dimensional facial image, and extracting the key points of facial features that have changed positions in the three-dimensional facial image as change features;
Feeding the change features into a preset adversarial neural network model to reconstruct the mouth image;
    将重构后的数张嘴部图像按照生成时间进行排序后,生成所述多帧连续的动态人脸面部画面。After the reconstructed mouth images are sorted according to the generation time, the multi-frame continuous dynamic facial images are generated.
14. The computer device according to any one of claims 8 to 13, wherein, after the program of the computer device is executed by the processor to perform the step of acquiring the real-time audio corresponding to each frame of the dynamic facial picture and performing synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video, the following steps are further performed:
    定位所述虚拟人物视频中所述韵律参数对应的关键音频帧的位置;Locate the position of the key audio frame corresponding to the prosody parameter in the virtual character video;
    根据所述关键音频帧的位置,分别从所述虚拟人物视频中提取所述关键音频帧对应的嘴部图像和音频信号;Extracting respectively the mouth image and audio signal corresponding to the key audio frame from the virtual character video according to the position of the key audio frame;
    将所述音频信号的谱特征入参到预存的长短期记忆网络模型中,进行语音识别;Incorporating the spectral characteristics of the audio signal into a pre-stored long and short-term memory network model to perform speech recognition;
    根据语音识别结果,得到所述音频信号对应的嘴部状态,将所述嘴部状态与所述嘴部图像进行比较,若同步,则发送所述虚拟人物视频至客户端,否则重新进行音视频合成编码,直到所述虚拟人物视频中虚拟人物的嘴部状态与所述嘴部图像同步。According to the voice recognition result, the mouth state corresponding to the audio signal is obtained, and the mouth state is compared with the mouth image. If synchronized, the virtual character video is sent to the client, otherwise the audio and video are restarted Synthesize encoding until the state of the mouth of the virtual character in the virtual character video is synchronized with the mouth image.
15. A storage medium storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps of a method for generating a virtual character video based on a neural network:
    获取待识别文本,并将所述待识别文本导入到预置文本语音转换模型中进行声音转换,得到音频;Acquiring the text to be recognized, and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio;
    提取所述音频的韵律参数,并将所述韵律参数导入到预置音频生成模型中进行音频特征点提取;Extracting the prosody parameters of the audio, and importing the prosody parameters into a preset audio generation model for audio feature point extraction;
    根据所述音频特征点,生成虚拟人物的嘴部运动轨迹;According to the audio feature points, generating the trajectory of the mouth movement of the virtual character;
    获取预置虚拟人物的二维图片,并将所述二维图片导入到面部特征生成模型进行处理后生成虚拟人物的三维面部图;Acquiring a preset two-dimensional picture of the virtual character, and importing the two-dimensional picture into the facial feature generation model for processing to generate a three-dimensional facial image of the virtual character;
    将所述嘴部运动轨迹导入到所述三维面部图,生成多帧连续的动态人脸面部画面;Importing the movement trajectory of the mouth into the three-dimensional facial image to generate multiple frames of continuous dynamic facial images;
    获取所述每一帧动态人脸面部画对应的实时音频,并对所述动态人脸面部画面和所述实时音频同步进行音视频合成编码,得到虚拟人物视频。Acquire the real-time audio corresponding to each frame of the dynamic face and facial picture, and perform audio-video synthesis and coding on the dynamic face and facial picture and the real-time audio in synchronization to obtain a virtual character video.
16. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of obtaining the text to be recognized and importing the text to be recognized into a preset text-to-speech conversion model for voice conversion to obtain audio, the following steps are included:
    获取待识别文本,提取所述待识别文本中的分割符,根据所述分隔符,将所述待识别文本划分为数个子文本;Acquiring the text to be recognized, extracting the separator in the text to be recognized, and dividing the text to be recognized into several sub-texts according to the separator;
    将所述子文本进行词向量编码,得到数个多维词向量;Performing word vector encoding on the sub-text to obtain several multi-dimensional word vectors;
    将所述多维词向量进行降维后,得到二维词向量;After dimensionality reduction is performed on the multi-dimensional word vector, a two-dimensional word vector is obtained;
    计算所述二维词向量的特征值,以所述二维词向量的特征值为权重,将所述二维词向量和所述权重导入到所述文本语音转换模型中进行文本声音转换,得到所述音频。Calculate the feature value of the two-dimensional word vector, take the feature value of the two-dimensional word vector as a weight, and import the two-dimensional word vector and the weight into the text-to-speech conversion model to perform text-to-speech conversion to obtain The audio.
17. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of extracting the prosody parameters of the audio and importing the prosody parameters into an audio generation model for audio feature point extraction, the following steps are included:
Extracting the first prosody parameter and the level language parameter of the audio, and generating a prosody mark according to the duration, pitch, and pause timing in the first prosody parameter;
    对所述韵律标记进行编码,生成编码串流;Encoding the prosody mark to generate an encoded stream;
    根据所述编码串流和所述级别语言参数,生成第二韵律参数;Generating a second prosody parameter according to the encoding stream and the level language parameter;
    将所述第二韵律参数导入到所述音频生成模型,以提取所述第二韵律参数中的音频特征点。The second prosody parameter is imported into the audio generation model to extract audio feature points in the second prosody parameter.
18. The storage medium storing computer-readable instructions according to claim 17, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of generating the mouth movement trajectory of the virtual character according to the audio feature points, the following steps are included:
    获取预置虚拟人物图像,根据预设的嘴部关键点提取算法,从所述虚拟人物图像中提取嘴部关键点;Acquiring a preset virtual character image, and extracting the mouth key points from the virtual character image according to a preset mouth key point extraction algorithm;
    对所述嘴部关键点进行归一化处理,得到增强关键点;Normalize the key points of the mouth to obtain enhanced key points;
    根据所述增强关键点,得到所述音频的播放频率和播放时的嘴部运动幅度,并对所述播放频率和所述嘴部运动幅度进行拟合,得到所述虚拟人物嘴部运动轨迹。According to the enhancement key points, the audio playback frequency and the mouth motion amplitude during playback are obtained, and the playback frequency and the mouth motion amplitude are fitted to obtain the virtual character's mouth motion trajectory.
19. The storage medium storing computer-readable instructions according to claim 15, wherein, when the program for generating a virtual character video based on a neural network is executed by a processor to perform the step of obtaining a two-dimensional picture of a preset virtual character and importing the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, the following steps are included:
Obtaining the two-dimensional picture of the virtual character and performing grayscale processing on it to obtain a binarized two-dimensional picture, and obtaining the depth information of the three-dimensional facial image according to the gradient of the binarized two-dimensional picture;
    以所述二维图片的左下角为坐标原点,建立人脸特征点坐标系;Using the lower left corner of the two-dimensional picture as the origin of coordinates to establish a coordinate system of facial feature points;
    从所述人脸特征点坐标系中获取所述二维图片中人脸五官关键点的坐标,并计算所述各人脸五官关键点之间的距离;Acquiring the coordinates of the key points of the facial features in the two-dimensional picture from the coordinate system of the facial feature points, and calculating the distance between the key points of the facial features;
    根据所述距离,调整预置标准三维面部图中人脸五官的位置,得到虚拟人物的三维面部图。According to the distance, the positions of the facial features in the preset standard three-dimensional facial image are adjusted to obtain the three-dimensional facial image of the virtual character.
  20. 一种基于神经网络生成虚拟人物视频的装置,其中,包括以下模块:A device for generating a virtual character video based on a neural network, which includes the following modules:
A trajectory generation module, configured to obtain the text to be recognized and import it into a preset text-to-speech conversion model for voice conversion to obtain audio; extract the prosody parameters of the audio and import them into a preset audio generation model for audio feature point extraction; and generate the mouth movement trajectory of the virtual character according to the audio feature points;
A picture generation module, configured to obtain a two-dimensional picture of a preset virtual character, import the two-dimensional picture into a facial feature generation model for processing to generate a three-dimensional facial image of the virtual character, and import the mouth movement trajectory into the three-dimensional facial image to generate multiple frames of continuous dynamic facial pictures;
A video generation module, configured to obtain the real-time audio corresponding to each frame of the dynamic facial picture, and perform synchronized audio-video synthesis and encoding on the dynamic facial pictures and the real-time audio to obtain the virtual character video.
PCT/CN2020/118373 2019-10-18 2020-09-28 Method for generating virtual character video on the basis of neural network, and related device WO2021073416A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910990742.6A CN110866968A (en) 2019-10-18 2019-10-18 Method for generating virtual character video based on neural network and related equipment
CN201910990742.6 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073416A1 true WO2021073416A1 (en) 2021-04-22

Family

ID=69652464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118373 WO2021073416A1 (en) 2019-10-18 2020-09-28 Method for generating virtual character video on the basis of neural network, and related device

Country Status (2)

Country Link
CN (1) CN110866968A (en)
WO (1) WO2021073416A1 (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN111369967B (en) * 2020-03-11 2021-03-05 北京字节跳动网络技术有限公司 Virtual character-based voice synthesis method, device, medium and equipment
CN113689879B (en) * 2020-05-18 2024-05-14 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual person in real time
CN113691833B (en) * 2020-05-18 2023-02-03 北京搜狗科技发展有限公司 Virtual anchor face changing method and device, electronic equipment and storage medium
CN113689880B (en) * 2020-05-18 2024-05-28 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual person in real time
CN113761988A (en) * 2020-06-05 2021-12-07 北京灵汐科技有限公司 Image processing method, image processing apparatus, storage medium, and electronic device
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112164128B (en) * 2020-09-07 2024-06-11 广州汽车集团股份有限公司 Vehicle-mounted multimedia music visual interaction method and computer equipment
CN112150638B (en) * 2020-09-14 2024-01-26 北京百度网讯科技有限公司 Virtual object image synthesis method, device, electronic equipment and storage medium
CN112383721B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating video
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112560622B (en) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 Virtual object action control method and device and electronic equipment
CN112669417B (en) * 2020-12-18 2024-04-16 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment
CN112770062B (en) * 2020-12-22 2024-03-08 北京奇艺世纪科技有限公司 Image generation method and device
CN112735371B (en) * 2020-12-28 2023-08-04 北京羽扇智信息科技有限公司 Method and device for generating speaker video based on text information
CN112785671B (en) * 2021-01-07 2024-05-17 中国科学技术大学 Virtual dummy face animation synthesis method
CN112785669B (en) * 2021-02-01 2024-04-23 北京字节跳动网络技术有限公司 Virtual image synthesis method, device, equipment and storage medium
CN112954235B (en) * 2021-02-04 2021-10-29 读书郎教育科技有限公司 Early education panel interaction method based on family interaction
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video synthesis method, system medium and application
CN113194348B (en) * 2021-04-22 2022-07-22 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113344770B (en) * 2021-04-30 2024-11-08 螳螂慧视科技有限公司 Virtual model, construction method thereof, interaction method and electronic equipment
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks
CN114040126B (en) * 2021-09-22 2022-09-09 西安深信科创信息技术有限公司 Character-driven character broadcasting video generation method and device
CN113891150B (en) * 2021-09-24 2024-10-11 北京搜狗科技发展有限公司 Video processing method, device and medium
CN113870395A (en) * 2021-09-29 2021-12-31 平安科技(深圳)有限公司 Animation video generation method, device, equipment and storage medium
CN113987268A (en) * 2021-09-30 2022-01-28 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN113886644A (en) * 2021-09-30 2022-01-04 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 Virtual character driving method and device, electronic equipment and readable storage medium
CN114299204B (en) * 2021-12-22 2023-04-18 深圳市海清视讯科技有限公司 Three-dimensional cartoon character model generation method and device
CN114401431B (en) * 2022-01-19 2024-04-09 中国平安人寿保险股份有限公司 Virtual person explanation video generation method and related device
CN114913280A (en) * 2022-05-12 2022-08-16 杭州倒映有声科技有限公司 Voice-driven digital human broadcasting method capable of customizing content
CN115409920A (en) * 2022-08-30 2022-11-29 重庆爱车天下科技有限公司 Virtual object lip driving system
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
CN115426536B (en) * 2022-11-02 2023-01-20 北京优幕科技有限责任公司 Audio and video generation method and device
CN115511704B (en) * 2022-11-22 2023-03-10 成都新希望金融信息有限公司 Virtual customer service generation method and device, electronic equipment and storage medium
CN117292030B (en) * 2023-10-27 2024-10-29 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510437B (en) * 2018-04-04 2022-05-17 科大讯飞股份有限公司 Virtual image generation method, device, equipment and readable storage medium
CN110321789A (en) * 2019-05-21 2019-10-11 平安普惠企业管理有限公司 Method and relevant device based on living things feature recognition interview fraud

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1971621A (en) * 2006-11-10 2007-05-30 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
US20160098622A1 (en) * 2013-06-27 2016-04-07 Sitaram Ramachandrula Authenticating A User By Correlating Speech and Corresponding Lip Shape
US20190057533A1 (en) * 2017-08-16 2019-02-21 Td Ameritrade Ip Company, Inc. Real-Time Lip Synchronization Animation
CN109377539A (en) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243095A (en) * 2021-04-30 2022-10-25 百度在线网络技术(北京)有限公司 Method and device for pushing data to be broadcasted and method and device for broadcasting data
CN113329190B (en) * 2021-05-27 2022-04-08 深圳市点维文化传播有限公司 Animation design video production analysis management method, equipment, system and computer storage medium
CN113329190A (en) * 2021-05-27 2021-08-31 武汉连岳传媒有限公司 Animation design video production analysis management method, equipment, system and computer storage medium
CN113408449A (en) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN113538644A (en) * 2021-07-19 2021-10-22 北京百度网讯科技有限公司 Method and device for generating character dynamic video, electronic equipment and storage medium
CN113538644B (en) * 2021-07-19 2023-08-29 北京百度网讯科技有限公司 Character dynamic video generation method, device, electronic equipment and storage medium
CN113641836A (en) * 2021-08-20 2021-11-12 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN113903067A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Virtual object video generation method, device, equipment and medium
CN113873324A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Audio processing method, device, storage medium and equipment
CN113873297B (en) * 2021-10-18 2024-04-30 深圳追一科技有限公司 Digital character video generation method and related device
CN113873297A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Method and related device for generating digital character video
CN114007091A (en) * 2021-10-27 2022-02-01 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN114332671A (en) * 2021-11-08 2022-04-12 深圳追一科技有限公司 Processing method, device, equipment and medium based on video data
CN114202605A (en) * 2021-12-07 2022-03-18 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
CN114202605B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
US12125131B2 (en) 2021-12-07 2024-10-22 Beijing Baidu Netcom Science Technology Co., Ltd. Method of generating 3D video, method of training model, electronic device, and storage medium
CN114356084A (en) * 2021-12-24 2022-04-15 阿里巴巴(中国)有限公司 Image processing method and system and electronic equipment
WO2023125844A1 (en) * 2021-12-31 2023-07-06 中科寒武纪科技股份有限公司 Method for video encoding, method for video decoding, and related product
CN115052197B (en) * 2022-03-24 2024-05-28 北京沃丰时代数据科技有限公司 Virtual portrait video generation method and device
CN115052197A (en) * 2022-03-24 2022-09-13 北京沃丰时代数据科技有限公司 Virtual portrait video generation method and device
WO2023197979A1 (en) * 2022-04-13 2023-10-19 腾讯科技(深圳)有限公司 Data processing method and apparatus, and computer device and storage medium
CN115033690A (en) * 2022-05-31 2022-09-09 国网江苏省电力有限公司信息通信分公司 Communication defect study and judgment knowledge base construction method, defect identification method and system
CN115209180A (en) * 2022-06-02 2022-10-18 阿里巴巴(中国)有限公司 Video generation method and device
CN115375802A (en) * 2022-06-17 2022-11-22 北京百度网讯科技有限公司 Method and device for generating dynamic image, storage medium and electronic equipment
WO2024056078A1 (en) * 2022-09-16 2024-03-21 腾讯科技(深圳)有限公司 Video generation method and apparatus and computer-readable storage medium
CN115661908A (en) * 2022-12-09 2023-01-31 凝动万生医疗科技(武汉)有限公司 Method and device for desensitizing facial dynamic data
CN115690280B (en) * 2022-12-28 2023-03-21 山东金东数字创意股份有限公司 Three-dimensional image pronunciation mouth shape simulation method
CN115690280A (en) * 2022-12-28 2023-02-03 山东金东数字创意股份有限公司 Three-dimensional image pronunciation mouth shape simulation method
WO2024164909A1 (en) * 2023-02-08 2024-08-15 华为技术有限公司 Video generation method, apparatus and storage medium
CN116400806B (en) * 2023-04-03 2023-10-17 中国科学院心理研究所 Personalized virtual person generation method and system
CN116400806A (en) * 2023-04-03 2023-07-07 中国科学院心理研究所 Personalized virtual person generation method and system
CN116546252A (en) * 2023-04-28 2023-08-04 南京硅基智能科技有限公司 Mouth shape data processing method and content expression device in network live broadcast scene
CN116301481A (en) * 2023-05-12 2023-06-23 北京天图万境科技有限公司 Multi-multiplexing visual bearing interaction method and device
CN116385604B (en) * 2023-06-02 2023-12-19 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116385604A (en) * 2023-06-02 2023-07-04 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116934930A (en) * 2023-07-18 2023-10-24 杭州一知智能科技有限公司 Multilingual lip data generation method and system based on virtual 2d digital person
CN116993918B (en) * 2023-08-11 2024-02-13 无锡芯算智能科技有限公司 Modeling system and method for anchor image based on deep learning
CN116993918A (en) * 2023-08-11 2023-11-03 无锡芯算智能科技有限公司 Modeling system and method for anchor image based on deep learning
CN118660117A (en) * 2024-08-13 2024-09-17 浩神科技(北京)有限公司 Virtual person video clip synthesis method and system for intelligent video generation

Also Published As

Publication number Publication date
CN110866968A (en) 2020-03-06

Similar Documents

Publication Title
WO2021073416A1 (en) Method for generating virtual character video on the basis of neural network, and related device
US11935166B2 (en) Training method and apparatus for image processing model, image processing method and apparatus for image processing model, and storage medium
US11682153B2 (en) System and method for synthesizing photo-realistic video of a speech
CN112887698B (en) High-quality face voice driving method based on neural radiance field
CN113077537B (en) Video generation method, storage medium and device
CN113901894A (en) Video generation method, device, server and storage medium
CN110738153B (en) Heterogeneous face image conversion method and device, electronic equipment and storage medium
KR102409988B1 (en) Method and apparatus for face swapping using deep learning network
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
KR20210086744A (en) System and method for producing video contents based on deep learning
KR20220102905A (en) Apparatus, method and computer program for generating facial video
KR20210105159A (en) Apparatus, method and computer program for generating personalized avatar video
CN115631274B (en) Face image generation method, device, equipment and storage medium
Korshunov et al. Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes
Jha et al. Cross-language speech dependent lip-synchronization
US20240265606A1 (en) Method and apparatus for generating mouth shape by using deep learning network
CN115052197B (en) Virtual portrait video generation method and device
Zahedi et al. Robust sign language recognition system using ToF depth cameras
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN113160799B (en) Video generation method and device, computer-readable storage medium and electronic equipment
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.
Zimmermann et al. Combining multiple views for visual speech recognition
CN111260602B (en) Ultrasonic image analysis method for SSI
Mattos et al. Towards view-independent viseme recognition based on CNNs and synthetic data

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20876809
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 20876809
    Country of ref document: EP
    Kind code of ref document: A1