
CN115439614B - Virtual image generation method and device, electronic equipment and storage medium - Google Patents

Virtual image generation method and device, electronic equipment and storage medium

Info

Publication number
CN115439614B
Authority
CN
China
Prior art keywords
facial expression
phoneme
features
target user
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211326579.1A
Other languages
Chinese (zh)
Other versions
CN115439614A (en)
Inventor
刘聪
胡诗卉
何山
周良
胡金水
殷兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211326579.1A priority Critical patent/CN115439614B/en
Publication of CN115439614A publication Critical patent/CN115439614A/en
Application granted granted Critical
Publication of CN115439614B publication Critical patent/CN115439614B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides an avatar generation method and apparatus, an electronic device and a storage medium, which can extract phoneme features from audio data of a target user and extract facial expression features from video data synchronized with the audio data. A facial expression parameter sequence of the target user is then calculated based on the phoneme features and the facial expression features. Because the parts of the face with complex motion are concentrated in the lip region of the lower half of the face, and the phonemes of a user's speech are strongly correlated with lip motion, phoneme features are introduced and the facial expression parameter sequence is calculated from both the phoneme features and the facial expression features; the avatar generated by driving the three-dimensional avatar model corresponding to the target user with this facial expression parameter sequence can therefore accurately restore the facial motion of the target user.

Description

Virtual image generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of avatar technologies, and in particular, to a method and an apparatus for generating an avatar, an electronic device, and a storage medium.
Background
By capturing the facial expression of a user and synchronizing it to the face of a three-dimensional virtual image, the facial expression of the three-dimensional virtual image can be kept consistent with the facial expression of the user. In the prior art, the facial expression is usually captured with the front camera of a mobile phone or with a standalone camera. Because the captured data is only a two-dimensional video and lacks accurate three-dimensional information, the three-dimensional motion of the user's real face cannot be restored, so the generated virtual image generally suffers from low precision.
Disclosure of Invention
Based on the above requirements, the present application provides a method and an apparatus for generating an avatar, an electronic device, and a storage medium, so as to solve the problem of low accuracy of the avatar in the prior art.
The technical scheme provided by the application is as follows:
in one aspect, the present application provides a method for generating an avatar, including:
extracting phoneme features from audio data of a target user, and extracting facial expression features from video data synchronized with the audio data;
calculating to obtain a facial expression parameter sequence of the target user based on the phoneme characteristics and the facial expression characteristics;
and driving the three-dimensional virtual image model corresponding to the target user by using the facial expression parameter sequence to generate the virtual image corresponding to the target user.
As an optional implementation manner, in the method described above, extracting the phoneme feature from the audio data of the target user includes:
extracting a voice data segment and a mute data segment from the audio data;
performing phoneme coding on the voice data segment and the mute data segment to obtain a phoneme code of the voice data segment and a phoneme code of the mute data segment;
and splicing the phoneme codes of the voice data segment and the silence data segment to obtain phoneme characteristics.
As an optional implementation manner, in the method described above, extracting facial expression features from video data synchronized with the audio data includes:
inputting video data synchronous with the audio data into a pre-trained facial expression feature extraction model to obtain facial expression features;
the facial expression feature extraction model is obtained by training with sample video data as a first training sample and with facial feature points and facial expression categories corresponding to the sample video data as first labels.
As an optional implementation manner, in the method described above, the calculating a facial expression parameter sequence of the target user based on the phoneme features and the facial expression features includes:
splicing the phoneme features and the facial expression features to obtain spliced features;
inputting the splicing features into a facial expression parameter calculation model trained in advance to obtain a facial expression parameter sequence of the target user;
the facial expression parameter calculation model is obtained by training with the spliced features of the phoneme features and the facial expression features extracted from sample audio and video data as second training samples, and with the facial expression parameter sequence corresponding to the sample audio and video data as second labels.
As an optional implementation manner, in the method described above, the sample audio and video data is audio and video data of a designated speaker; the content of the designated speaker's speech covers all phonemes.
As an optional implementation manner, after extracting facial expression features from the video data synchronized with the audio data, the method further includes: and performing feature dimension reduction processing on the facial expression features.
As an optional implementation manner, in the method described above, performing feature dimension reduction processing on the facial expression features includes:
reducing the dimension of the facial expression features by using a principal component analysis method to obtain facial expression principal component codes;
and carrying out nonlinear mapping on the facial expression principal component codes by utilizing a polynomial kernel function to obtain the processed facial expression characteristics.
On the other hand, the application also provides a device for generating the virtual image, which comprises:
the extraction module is used for extracting phoneme characteristics from audio data of a target user and extracting facial expression characteristics from video data synchronous with the audio data;
the calculation module is used for calculating a facial expression parameter sequence of the target user based on the phoneme characteristics and the facial expression characteristics;
and the generating module is used for driving the three-dimensional virtual image model corresponding to the target user by using the facial expression parameter sequence to generate the virtual image corresponding to the target user.
In another aspect, the present application further provides an electronic device, including:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to implement the method for generating an avatar according to any one of the above aspects by running the program in the memory.
In another aspect, the present application further provides a storage medium, including: the storage medium stores thereon a computer program which, when executed by a processor, implements the avatar generation method of any of the above.
The avatar generation method provided by the application can extract phoneme features from audio data of a target user and extract facial expression features from video data synchronized with the audio data. A facial expression parameter sequence of the target user is then calculated based on the phoneme features and the facial expression features. Because the parts of the face with complex motion are concentrated in the lip region of the lower half of the face, and the phonemes of a user's speech are strongly correlated with lip motion, phoneme features are introduced and the facial expression parameter sequence is calculated from both the phoneme features and the facial expression features; the avatar generated by driving the three-dimensional avatar model corresponding to the target user with this facial expression parameter sequence can therefore accurately restore the facial motion of the target user.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for generating an avatar according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a process for generating a sequence of facial expression parameters according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of generating phoneme features provided by an embodiment of the present application;
fig. 4 is a schematic flowchart of another process for generating a sequence of facial expression parameters according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another process for generating a sequence of facial expression parameters according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an avatar generation apparatus provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Summary of the application
The technical solution of the embodiment of the application is applicable to application scenarios in which an avatar is created; the avatar generated by the technical solution of the embodiment of the application can accurately restore the facial motion of the target user.
In recent years, with the development of industries such as games, movies, and cartoons, and the popularization of the concept of the meta universe, the technology for creating an avatar has attracted more attention and is applied. The avatar generation technique captures and maps facial expression features of a user to an avatar by means of a facial motion capture device, so that the facial expression of the avatar is consistent with the facial expression of the user.
Prior-art facial motion capture devices fall into low-precision consumer-grade devices and high-precision film and television industry-grade devices. Low-precision consumer-grade devices include cameras and smart electronic products with a camera function, such as smart phones and smart tablet computers. A facial expression image of the user is shot with such a device, facial expression features are extracted from the image by a corresponding algorithm, and the features are mapped onto the avatar. High-precision film and television industry-grade equipment generally uses a camera matrix to capture real three-dimensional data, places markers on the user's face to improve the precision of facial expression capture, extracts the user's facial expression features from the three-dimensional data and maps them onto the avatar, and finally refines the avatar manually to meet film and television precision and effect requirements.
Due to its high cost, high-precision film and television industry-grade equipment is generally only used in the film and television industry. Low-precision consumer-grade equipment, on the other hand, cannot capture motion in the depth direction because the captured data is only a two-dimensional video, and therefore cannot restore the three-dimensional motion of the user's real face, so the precision of the virtual image is low.
Based on this, the application provides an avatar generation method and apparatus, an electronic device and a storage medium. The technical solution can solve for a facial expression parameter sequence from the phoneme features and facial expression features extracted from a two-dimensional video, and the avatar generated by using the facial expression parameter sequence can accurately restore the facial motion of the target user, thereby solving the problem of low avatar precision in the prior art.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Exemplary method
The embodiment of the application provides a method for generating an avatar, which can be executed by an electronic device, and the electronic device can be any device with data and instruction processing functions, such as a computer, an intelligent terminal, a server, and the like. Referring to fig. 1, the method includes:
s101, extracting phoneme features from audio data of a target user, and extracting facial expression features from video data synchronized with the audio data.
The target user refers to a subject for performing expression redirection. In this embodiment, the facial expression of the target user is redirected, so that the expression of the avatar corresponding to the target user can be consistent with the expression of the target user.
The video data refers to a video containing a face image of a target user; the audio data refers to audio including a target user voice. For example, to avoid noise interference, the video data may only contain images of the target user's face and the audio data may only include the target user's voice. If the acquired video includes the facial image of the non-target user, the facial image of the target user may be extracted as video data based on the facial features of the target user.
The audio and video data of the target user can be collected by using equipment capable of simultaneously collecting audio and video, and the audio data and the video data are extracted from the audio and video data. The device capable of simultaneously acquiring audio and video can adopt a smart phone, a smart tablet computer, a camera and the like, and the embodiment is not limited. It should be noted that, when the audio data and the video data are extracted from the audio and video data, time stamps of the audio data and the video data should be reserved to ensure that the audio data and the video data can be synchronously aligned.
The video data can be collected by a video collecting device capable of collecting videos, and the audio data can be collected by an audio data collecting device capable of collecting audios. The video acquisition device capable of acquiring videos can adopt a smart phone, a smart tablet computer, a camera and the like, and the embodiment is not limited. The audio data acquisition device capable of acquiring audio can adopt a smart phone, a smart tablet computer, a recorder and the like, and the embodiment is not limited. It should be noted that the captured audio data and video data should retain time stamps, and the audio frames and video frames captured at the same time should have time stamps consistent to ensure synchronous alignment between the audio data and the video data.
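To make the timestamp-based synchronization concrete, the following Python sketch pairs each video frame with the nearest audio frame by timestamp. It only illustrates the alignment idea described above; the frame rates, tolerance value and function name are assumptions, not part of the patent.

```python
from bisect import bisect_left

def align_audio_to_video(audio_ts, video_ts, tolerance=0.02):
    """For each video timestamp, find the index of the closest audio timestamp.

    audio_ts, video_ts: sorted lists of timestamps in seconds.
    Returns (video_index, audio_index) pairs whose timestamps differ
    by at most `tolerance` seconds.
    """
    pairs = []
    for vi, vt in enumerate(video_ts):
        ai = bisect_left(audio_ts, vt)
        # compare the neighbour on each side and keep the closer one
        candidates = [i for i in (ai - 1, ai) if 0 <= i < len(audio_ts)]
        best = min(candidates, key=lambda i: abs(audio_ts[i] - vt))
        if abs(audio_ts[best] - vt) <= tolerance:
            pairs.append((vi, best))
    return pairs

# Example: 25 fps video frames aligned to 100 Hz audio feature frames.
video_ts = [i / 25.0 for i in range(100)]
audio_ts = [i / 100.0 for i in range(400)]
print(align_audio_to_video(audio_ts, video_ts)[:3])  # [(0, 0), (1, 4), (2, 8)]
```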
Phoneme features are extracted from the audio data. Phonemes are the smallest units of speech, divided according to the natural properties of speech. From the acoustic point of view, a phoneme is the smallest unit of speech divided according to sound quality. From the articulatory point of view, one articulatory action forms one phoneme. For example, "ma" contains the two articulatory actions "m" and "a" and is therefore two phonemes. Sounds produced by the same articulatory action are the same phoneme, and sounds produced by different articulatory actions are different phonemes. For example, in "ma-mi", the two "m" articulations are the same and are the same phoneme, while "a" and "i" are produced by different articulations and are different phonemes. The analysis of phonemes is generally described in terms of articulatory actions. The articulatory action for "m", for instance, is: the upper and lower lips close, the vocal cords vibrate, and the airflow exits through the nasal cavity to produce sound. Since the parts of the face with more complex motion are concentrated in the lip region of the lower half of the face, and the phonemes of the user's speech are strongly correlated with lip motion, the influence of phoneme features is introduced when generating the avatar corresponding to the target user, which improves the redirection precision of the lip region of the lower half of the face.
The phoneme feature refers to data capable of representing phoneme content, and in this embodiment, the format of the phoneme feature is not limited, and may be an encoding format, a vector format, or the like.
Illustratively, the phoneme features may be extracted from the audio data using a phoneme feature extraction model. Specifically, the audio data may be input into the phoneme feature extraction model to obtain a phoneme feature output by the phoneme feature extraction model. The phoneme feature extraction model may adopt an acoustic model and a language model that are mature in the prior art, and those skilled in the art may refer to the prior art, which is not described herein again.
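As a rough illustration of what a phoneme feature in vector form can look like, the sketch below converts a recognized phoneme sequence with per-phoneme time spans into frame-level one-hot codes. The phoneme inventory, frame rate and recognizer output format are hypothetical; in practice the phoneme sequence would come from the acoustic and language models mentioned above.

```python
import numpy as np

# Hypothetical phoneme inventory; a real system would use a full inventory.
PHONEMES = ["sil", "m", "a", "i"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}

def phoneme_sequence_to_features(segments, frame_rate=100):
    """Convert (phoneme, start_sec, end_sec) segments to frame-level one-hot codes."""
    total = max(end for _, _, end in segments)
    n_frames = int(round(total * frame_rate))
    feats = np.zeros((n_frames, len(PHONEMES)), dtype=np.float32)
    for phoneme, start, end in segments:
        lo, hi = int(start * frame_rate), int(end * frame_rate)
        feats[lo:hi, PHONEME_TO_ID[phoneme]] = 1.0
    return feats

segments = [("m", 0.00, 0.08), ("a", 0.08, 0.25), ("sil", 0.25, 0.40)]
print(phoneme_sequence_to_features(segments).shape)  # (40, 4)
```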
Facial expression features are extracted from the video data synchronized with the audio data. The facial expression features refer to data capable of representing facial expressions of the target user, and the format of the facial expression features is not limited in this embodiment, and may be an encoding format, a vector format, or the like.
Facial feature points of the target user can be extracted from the video data, and the facial feature points of the target user can be determined as facial expression features. The facial feature points and the facial expression categories of the target user can be extracted from the video data, and the facial feature points and the facial expression categories of the target user are spliced to obtain the facial expression features.
The facial feature point extraction technique mature in the prior art can be adopted for extracting the facial feature points, and the embodiment is not limited.
The facial expression categories can be extracted with a facial expression category extraction model, whose training samples are sample video data and whose labels are the facial expression categories corresponding to the sample video data. When the facial expression category extraction model is trained, a training sample is input into the model to obtain an output result, a loss value of the model is determined from the output result and the training label, and the parameters of the model are adjusted in the direction that reduces the loss value. The training process is repeated until the loss value of the model is less than a set value, at which point training is complete. The set value may be chosen according to actual conditions, and this embodiment is not limited thereto.
The facial expression categories include neutral expressions, disgust expressions, angry expressions, fear expressions, happy expressions, sad expressions, surprised expressions, and the like.
In this embodiment, the phoneme features may be extracted from the audio data of the target user in the order of the timestamps, and the facial expression features may be extracted from the video data synchronized with the audio data.
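The following minimal sketch shows one way to splice facial feature points and a facial expression category into a single facial expression feature vector, as described above. The landmark count, expression list and function name are illustrative assumptions.

```python
import numpy as np

EXPRESSIONS = ["neutral", "disgust", "angry", "fear", "happy", "sad", "surprised"]

def build_expression_feature(landmarks_xy, expression):
    """Concatenate flattened facial feature points with a one-hot expression class.

    landmarks_xy: array of shape (num_points, 2) from any landmark detector.
    expression: one of EXPRESSIONS.
    """
    one_hot = np.zeros(len(EXPRESSIONS), dtype=np.float32)
    one_hot[EXPRESSIONS.index(expression)] = 1.0
    return np.concatenate([np.asarray(landmarks_xy, dtype=np.float32).ravel(), one_hot])

feat = build_expression_feature(np.random.rand(68, 2), "happy")
print(feat.shape)  # (143,) for 68 landmarks plus 7 expression classes
```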
And S102, calculating to obtain a facial expression parameter sequence of the target user based on the phoneme characteristics and the facial expression characteristics.
The facial expression parameter sequence is used for driving a three-dimensional virtual image model corresponding to the target user so that the three-dimensional virtual image model can generate a virtual image corresponding to the target user. The three-dimensional virtual image model is constructed in advance and used for generating the virtual image model corresponding to the target user.
And calculating the facial expression parameter sequence of the target user based on the phoneme characteristics and the facial expression characteristics determined by the embodiment.
Specifically, the phoneme features and the facial expression features at the same time may be fused to obtain fused features. The fusion can be performed by splicing (concatenation); this embodiment is not limited thereto. The fused features are then solved in time order to obtain the facial expression parameter sequence.
The solution process may be performed using a pre-trained network model. For example, a facial expression parameter calculation model is trained in advance, and the fusion features are input into the facial expression parameter calculation model in sequence to obtain a facial expression parameter sequence output by the facial expression parameter calculation model. The phoneme features and the facial expression features can be extracted from the sample audio and video data, the phoneme features and the facial expression features are fused to obtain sample fusion features, the sample fusion features are used as training samples of a facial expression parameter calculation model, and a facial expression parameter sequence corresponding to the sample audio and video data is used as a label.
A Lasso regression model or a Recurrent Neural Network (RNN) model may be employed as the base model of the facial expression parameter calculation model. When the facial expression parameter calculation model is trained, a training sample is input into the model to obtain an output result, a loss value of the model is determined from the output result and the training label, and the parameters of the model are adjusted in the direction that reduces the loss value. The training process is repeated until the loss value of the model is less than a set value, at which point training is complete. The set value may be chosen according to actual conditions, and this embodiment is not limited thereto.
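For illustration, the sketch below shows an RNN-style base model of the kind mentioned above: per-frame phoneme features and facial expression features are concatenated (fused) and mapped to a facial expression parameter sequence. The feature dimensions, hidden size and the 52-dimensional parameter output are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn

class ExpressionParamRNN(nn.Module):
    """Maps a sequence of fused (phoneme + expression) features to
    a sequence of facial expression parameters."""
    def __init__(self, phoneme_dim, expr_dim, param_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(phoneme_dim + expr_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, param_dim)

    def forward(self, phoneme_feats, expr_feats):
        fused = torch.cat([phoneme_feats, expr_feats], dim=-1)  # (B, T, P+E)
        hidden_seq, _ = self.rnn(fused)
        return self.head(hidden_seq)                            # (B, T, param_dim)

model = ExpressionParamRNN(phoneme_dim=64, expr_dim=143, param_dim=52)
params = model(torch.randn(1, 100, 64), torch.randn(1, 100, 143))
print(params.shape)  # torch.Size([1, 100, 52])
```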
It should be noted that the target user may not be speaking continuously, that is, the target user may not be outputting speech all the time. The audio data may therefore include two parts: voice data segments in which the target user outputs speech, and mute data segments in which the target user outputs no speech.
For the parts where the target user outputs speech, when the phoneme features and the facial expression features are solved, the weight of the phoneme features can be appropriately increased, strengthening the influence of the phoneme features on the facial expression of the generated avatar, so that the generated avatar accurately restores the facial motion of the target user.
For the parts where the target user outputs no speech, the phoneme features are empty, and from the audio data alone it would be concluded that the target user is not speaking and that the target user's lips are not moving. In reality, however, the target user may move the lips without speaking. For example, if the target user does not speak but only moves the lips, the phoneme features indicate that the lips are not moving, while the facial expression features synchronized with the audio data indicate that the lips are moving; the lip motion inferred from the phoneme features clearly does not match the actual situation. In such cases the phoneme features become noise and reduce the accuracy of the avatar restoration.
Based on this, for the parts where the target user outputs no speech, when the phoneme features and the facial expression features are solved, the weight of the phoneme features can be appropriately reduced to limit the influence of this noise on the facial expression of the generated avatar, so that the generated avatar accurately restores the facial motion of the target user.
In a specific embodiment, as shown in FIG. 2, phoneme features are extracted from the audio data, including the phoneme features of the voice data segments and the phoneme features of the mute data segments, and facial expression features are extracted from the video data. The phoneme features and the facial expression features are fused to obtain fused features, and the fused features are then solved to obtain the facial expression parameter sequence.
S103, driving a three-dimensional virtual image model corresponding to the target user by using the facial expression parameter sequence to generate a virtual image corresponding to the target user.
In this embodiment, the target user is modeled to obtain a three-dimensional virtual image model of the target user. After the facial expression parameter sequence is input into the three-dimensional virtual image model of the target user, the facial expression parameter sequence can drive the three-dimensional virtual image model corresponding to the target user to generate a virtual image corresponding to the target user.
Modeling a target user to obtain a three-dimensional avatar model of the target user is a very mature prior art in the field, for example, modeling is performed by three-dimensional software such as MAYA, and a person skilled in the art can refer to the prior art, which is not described herein again.
Specifically, the three-dimensional virtual image model is drivable, after the facial expression parameter sequence is input into the three-dimensional virtual image model, the three-dimensional virtual image model generates a video frame sequence, and a complete three-dimensional virtual image video can be obtained after audio data is added. The three-dimensional virtual image can be rendered at a free visual angle through a three-dimensional rendering engine, displayed in a two-dimensional plane or three-dimensional AR/VR form and used for real-time scenes such as virtual live broadcast, news broadcast and program interaction.
In the above embodiment, phoneme features can be extracted from the audio data of the target user, facial expression features can be extracted from the video data synchronized with the audio data, and a facial expression parameter sequence of the target user is calculated based on the phoneme features and the facial expression features. Because the parts of the face with complex motion are concentrated in the lip region of the lower half of the face, and the phonemes of the user's speech are strongly correlated with lip motion, phoneme features are introduced and the facial expression parameter sequence is calculated from both the phoneme features and the facial expression features; the avatar generated by driving the three-dimensional avatar model corresponding to the target user with this facial expression parameter sequence can therefore accurately restore the facial motion of the target user.
As an alternative implementation manner, as shown in fig. 3, another embodiment of the present application discloses that the steps of the foregoing embodiment extract phoneme features from audio data of a target user, and specifically may include the following steps:
s301, extracting a voice data segment and a mute data segment from the audio data.
The voice data segments are the segments of the audio data in which the target user outputs speech; the mute data segments are the segments in which the target user outputs no speech. In this embodiment, the audio data is segmented according to whether the target user is outputting speech, and the voice data segments and mute data segments are extracted from the audio data.
A voice activity detection (VAD, also called voice endpoint detection) algorithm from the prior art may be used to detect the audio data and extract the voice data segments and mute data segments from it; this embodiment is not limited thereto.
S302, phoneme coding is carried out on the voice data section and the mute data section, and phoneme coding of the voice data section and phoneme coding of the mute data section are obtained.
After the voice data segments and the mute data segments are obtained, they can be encoded separately. The voice data segments may be encoded with an acoustic model and a language model from the prior art; this embodiment is not limited thereto. The mute data segments can be encoded according to a set rule, so that when the phoneme features and the facial expression features are solved, it can be distinguished whether a phoneme feature belongs to a mute data segment or a voice data segment, which makes it possible to reduce the weight of the phoneme features in the mute data segments and increase the weight of the phoneme features in the voice data segments. For example, to reduce the noise caused by the phoneme features of the mute data segments, their weight may be reduced to 0.
The encoding rule for the mute data segments may be determined according to actual conditions, and this embodiment is not limited thereto. Illustratively, the mute data segments may be uniformly encoded as empty, for example represented by 0 or null, so that the weight of the phoneme features of the mute data segments is reduced to 0 and the noise they would otherwise introduce is avoided.
It should be noted that, to ensure that the audio features and the facial expression features can be aligned, that is, to ensure that the audio features and the facial expression features at the same time can be spliced together, timestamps may be retained when the features are generated, and the audio features and facial expression features with the same timestamp are spliced together. Alternatively, it may be specified that the duration of an audio feature equals the duration of the corresponding audio data, that is, the duration of the phoneme features of a mute data segment equals the duration of that mute data segment, the duration of the phoneme features of a voice data segment equals the duration of that voice data segment, and the duration of the facial expression features equals the duration of the corresponding video data.
S303, splicing the phoneme codes of the voice data segments and the phoneme codes of the mute data segments to obtain phoneme characteristics.
And splicing the phoneme codes of the voice data segment and the phoneme codes of the mute data segment together to obtain the phoneme characteristics.
In the above embodiment, the noise influence caused by the mute data segment can be reduced by respectively encoding the voice data segment and the mute data segment, and the expression precision of the three-dimensional virtual image can be improved.
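The sketch below illustrates the idea of steps S301 to S303 under simplifying assumptions: a crude energy-threshold detector stands in for a real VAD algorithm, the phoneme codes of the voice data segments are taken as given, and the mute data segments are encoded as all-zero vectors before the two are spliced in time order. All names and thresholds are hypothetical.

```python
import numpy as np

def simple_vad(samples, sr, frame_ms=20, threshold=1e-3):
    """Crude energy-based stand-in for a VAD algorithm: returns one boolean
    per frame, True where speech is assumed to be present."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    return energy > threshold

def splice_phoneme_codes(is_speech, speech_codes, code_dim):
    """Build frame-level phoneme features: speech frames take the codes produced
    by the acoustic/language models (passed in here as `speech_codes`), while
    silence frames are encoded as all-zero vectors (weight 0)."""
    feats = np.zeros((len(is_speech), code_dim), dtype=np.float32)
    speech_idx = np.flatnonzero(is_speech)
    feats[speech_idx] = speech_codes[: len(speech_idx)]
    return feats  # voice and mute segment codes spliced in time order

sr = 16000
audio = np.concatenate([np.zeros(sr // 2), 0.1 * np.random.randn(sr // 2)])
mask = simple_vad(audio, sr)
codes = np.ones((mask.sum(), 8), dtype=np.float32)  # dummy phoneme codes
print(splice_phoneme_codes(mask, codes, 8).shape)   # (50, 8)
```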
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the steps of the foregoing embodiment extract facial expression features from video data synchronized with audio data, and specifically may include the following steps:
and inputting the video data synchronous with the audio data into a pre-trained facial expression feature extraction model to obtain facial expression features.
The facial expression feature extraction model is obtained by taking sample video data as a first training sample and taking facial feature points and facial expression categories corresponding to the sample video data as first labels for training.
In the facial motion capture methods currently common on the market, the facial expression features are simply the facial key points. The drawback of this approach is that facial key-point information is sparse and cannot fully express rich facial expressions, so an avatar obtained by redirecting facial key points cannot accurately restore the facial motion of the target user.
Based on this, the present embodiment takes the facial feature points and the facial expression categories together as factors that affect the facial expression features. Specifically, the facial expression feature extraction model in this embodiment uses the human face feature points and the human face expression categories as first labels, and uses the sample video data as a first training sample. The specific training process is the same as the model training process in the above embodiment, and is not described herein again. It should be noted that the present embodiment outputs the features before the classification layer in the facial expression feature extraction model as facial expression features.
Specifically, video data synchronized with audio data is input into a facial expression feature extraction model, and features before a classification layer in the facial expression feature extraction model are used as output of the facial expression feature extraction model to obtain facial expression features.
In this embodiment, the facial feature points and the facial expression categories are used as factors affecting the facial expression features together, so that abundant facial expression features can be provided, and the accuracy of the virtual image is improved.
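A minimal PyTorch-style sketch of such a two-head facial expression feature extraction model is shown below: it is trained to predict facial feature points and a facial expression category, and the shared feature before the classification layer is returned as the facial expression feature. The network architecture, input size and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ExpressionFeatureNet(nn.Module):
    """Trained with two heads (facial feature points + expression class); at
    inference time the shared feature before the classification layer is used
    as the facial expression feature."""
    def __init__(self, n_landmarks=68, n_classes=7, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.landmark_head = nn.Linear(feat_dim, n_landmarks * 2)
        self.class_head = nn.Linear(feat_dim, n_classes)

    def forward(self, frames, return_feature=False):
        feat = self.backbone(frames)
        if return_feature:
            return feat                      # used as the facial expression feature
        return self.landmark_head(feat), self.class_head(feat)

net = ExpressionFeatureNet()
frames = torch.randn(4, 3, 112, 112)            # 4 video frames
print(net(frames, return_feature=True).shape)   # torch.Size([4, 256])
```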
As an optional implementation manner, as shown in fig. 4, another embodiment of the present application discloses that the steps of the foregoing embodiment are based on the phoneme feature and the facial expression feature, and the step of calculating to obtain the facial expression parameter sequence of the target user specifically includes the following steps:
s401, splicing the phoneme features and the facial expression features to obtain spliced features.
In this embodiment, a splicing and fusion mode is adopted, and the phoneme features and the facial expression features are spliced to obtain splicing features.
S402, inputting the splicing features into a facial expression parameter calculation model trained in advance to obtain a facial expression parameter sequence of the target user.
The facial expression parameter calculation model is obtained by taking the phoneme characteristics extracted from the sample audio and video data and the splicing characteristics of the facial expression characteristics as second training samples and taking the facial expression parameter sequence corresponding to the sample audio and video data as a second label for training.
The sample audio and video data is the audio and video data of a designated speaker, and the content of the speech must cover all phonemes. The language or dialect of the speaker can be chosen according to the needs of the target user group. For example, if the target user group speaks Mandarin Chinese, the speaker's language is Mandarin and the speech content covers all Mandarin phonemes; if the target user group speaks Cantonese, the speaker's language is Cantonese and the speech content covers all Cantonese phonemes; if the target user group speaks English, the speaker's language is English and the speech content covers all English phonemes. After the second training samples are obtained, the facial expression parameter sequences may be extracted from them manually as the second labels.
A Lasso regression model may be employed as the base model of the facial expression parameter calculation model. When the facial expression parameter calculation model is trained, the second training samples are input into the model to obtain the model output, the loss value of the model is determined from the output and the second labels, and the parameters of the model are adjusted and optimized in the direction that reduces the loss value. Denoting the parameters of the facial expression parameter calculation model by \(\theta\), the expression for adjusting and optimizing the model parameters is:

\[
\theta^{*} = \arg\min_{\theta}\; \lVert Y - f(X;\theta) \rVert_{2}^{2} + \lambda \lVert \theta \rVert_{1}
\]

where \(Y\) denotes the second labels, \(X\) denotes the second training samples input to the facial expression parameter calculation model, \(f(X;\theta)\) denotes the output of the facial expression parameter calculation model, and \(\lambda \lVert \theta \rVert_{1}\) denotes the regularization term. The quantity \(\lVert Y - f(X;\theta) \rVert_{2}^{2} + \lambda \lVert \theta \rVert_{1}\) is the loss value of the model, and the parameters \(\theta\) are adjusted in the direction that decreases this loss value.
The training process is repeated until the loss value of the facial expression parameter calculation model is less than a set value, at which point training is complete. The set value may be chosen according to actual conditions, and this embodiment is not limited thereto.
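As an illustrative sketch of fitting such a Lasso regression model, the toy example below uses scikit-learn's Lasso estimator on random stand-in data; the feature and parameter dimensions, the alpha value and the data itself are placeholders, not the patent's actual training setup.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy stand-in data: one row per frame of spliced (phoneme + expression)
# features as the second training samples, and one row of expression
# parameters per frame as the second label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64 + 256))   # spliced features
Y = rng.normal(size=(500, 52))         # facial expression parameters per frame

# alpha plays the role of the L1 regularization weight in the objective above.
model = Lasso(alpha=0.01, max_iter=5000)
model.fit(X, Y)

pred = model.predict(X[:10])
print(pred.shape)  # (10, 52): one expression parameter vector per frame
```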
As shown in fig. 5, in a specific embodiment, phoneme features are extracted from the audio data, including the phoneme features of the voice data segments and the phoneme features of the mute data segments, with the phoneme features of the mute data segments encoded as empty and represented by 0; facial expression features are extracted from the video data. The phoneme features and the facial expression features are spliced and fused to obtain the spliced features, which are then solved with the Lasso regression model to obtain the facial expression parameter sequence.
In the above embodiment, the facial expression parameter sequence of the target user is calculated based on the facial expression parameter calculation model, and the virtual image generated by the three-dimensional virtual image model corresponding to the target user is driven by using the facial expression parameter sequence, so that the facial actions of the target user can be accurately restored.
As an optional implementation manner, in another embodiment of the present application, after the steps of the foregoing embodiment extract facial expression features from video data synchronized with audio data, the method may specifically include the following steps: and performing feature dimension reduction processing on the facial expression features.
Specifically, since the data dimension of the video features extracted by the neural network is generally high, in order to increase the processing speed, feature dimension reduction processing may be performed on the facial expression features.
Further, the step of performing feature dimension reduction processing on the facial expression features is as follows:
reducing the dimension of the facial expression features by using a principal component analysis method to obtain facial expression principal component codes; and carrying out nonlinear mapping on the facial expression principal component codes by using a polynomial kernel function to obtain the processed facial expression features.
Specifically, the facial expression features may be reduced in dimension by Principal Component Analysis (PCA) to obtain facial expression principal component codes. The principal component codes are then non-linearly mapped with a polynomial kernel function to obtain the processed facial expression features. The processed facial expression features are spliced with the phoneme features and then solved to obtain the facial expression parameter sequence.
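The following sketch illustrates the two dimension-reduction steps on toy data: PCA to obtain principal component codes, followed by an explicit degree-2 polynomial feature expansion used here as a stand-in for the polynomial kernel mapping (the patent does not specify the kernel degree or how the mapping is realized). Dimensions are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Toy facial expression features: 500 frames, 256-dimensional.
rng = np.random.default_rng(0)
expr_feats = rng.normal(size=(500, 256))

# Step 1: PCA dimension reduction -> facial expression principal component codes.
pca = PCA(n_components=16)
pc_code = pca.fit_transform(expr_feats)            # (500, 16)

# Step 2: explicit degree-2 polynomial feature map, one way to realize a
# polynomial-kernel-style non-linear mapping of the principal component codes.
poly = PolynomialFeatures(degree=2, include_bias=False)
processed = poly.fit_transform(pc_code)            # (500, 16 + 16*17/2) = (500, 152)

print(pc_code.shape, processed.shape)
```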
In the above embodiment, the feature dimension reduction processing is performed on the facial expression features, so that the processing speed can be increased.
Exemplary devices
Corresponding to the above method for generating an avatar, an embodiment of the present application further discloses an apparatus for generating an avatar, as shown in fig. 6, the apparatus includes:
an extraction module 100, configured to extract a phoneme feature from audio data of a target user, and extract a facial expression feature from video data synchronized with the audio data;
the calculation module 110 is configured to calculate a facial expression parameter sequence of the target user based on the phoneme features and the facial expression features;
and the generating module 120 is configured to drive the three-dimensional avatar model corresponding to the target user by using the facial expression parameter sequence, and generate an avatar corresponding to the target user.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the extraction module 100 of the above embodiment includes:
the extraction unit is used for extracting a voice data segment and a mute data segment from the audio data;
the encoding unit is used for carrying out phoneme encoding on the voice data section and the mute data section to obtain phoneme encoding of the voice data section and phoneme encoding of the mute data section;
and the first splicing unit is used for splicing the phoneme codes of the voice data segments and the phoneme codes of the mute data segments to obtain phoneme characteristics.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the extraction module 100 of the above embodiment includes:
the first input unit is used for inputting video data synchronous with audio data into a facial expression feature extraction model trained in advance to obtain facial expression features;
the facial expression feature extraction model is obtained by taking sample video data as a first training sample and taking facial feature points and facial expression categories corresponding to the sample video data as first labels for training.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the calculation module 110 of the above embodiment includes:
the second splicing unit is used for splicing the phoneme characteristics and the facial expression characteristics to obtain splicing characteristics;
the second input unit is used for inputting the splicing characteristics into a pre-trained facial expression parameter calculation model to obtain a facial expression parameter sequence of the target user;
the facial expression parameter calculation model is obtained by taking a phoneme feature and a splicing feature of a facial expression feature extracted from sample audio and video data as a second training sample and taking a facial expression parameter sequence corresponding to the sample audio and video data as a second label for training.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the sample audio/video data is audio/video data of a speech of a set person; wherein the content of the set person speech covers all phonemes.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the avatar generation apparatus of the above embodiment includes:
and the dimension reduction module is used for performing feature dimension reduction processing on the facial expression features.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the dimension reduction module of the above embodiment includes:
the dimensionality reduction unit is used for reducing dimensionality of the facial expression features by utilizing a principal component analysis method to obtain facial expression principal component codes;
and the mapping unit is used for carrying out nonlinear mapping on the facial expression principal component codes by utilizing the polynomial kernel function to obtain the processed facial expression characteristics.
Specifically, please refer to the contents of the above method embodiment for the specific working contents of each unit of the above avatar generation apparatus, which is not described herein again.
Exemplary electronic device, storage Medium, and computing product
Another embodiment of the present application further provides an electronic device, as shown in fig. 7, the electronic device includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected with the processor 210 for storing programs;
a processor 210 for implementing the method for generating an avatar disclosed in any of the above embodiments by executing the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may comprise a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present disclosure. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, a modem, and the like.
The memory 200 stores programs for executing the technical solution of the present application, and may also store an operating system and other key services. In particular, the program may include program code comprising computer operating instructions. More specifically, memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer or gravity sensor, etc.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, printer, speakers, etc.
Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the program stored in the memory 200 and calls other devices, which can be used to implement the steps of the avatar generation method provided in the above-described embodiment of the present application.
In addition to the above-described methods and apparatuses, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by the processor 210, cause the processor 210 to perform the steps of the avatar generation method provided by the above-described embodiments.
The computer program product may include program code for carrying out operations for embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor 210 to perform the steps of the avatar generation method provided by the above-described embodiments.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Specifically, for the working contents of each part of the electronic device and the storage medium, and for the processing performed when the computer program product or the computer program on the storage medium is executed by the processor, reference may be made to the embodiments of the avatar generation method described above; details are not repeated here.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
It should be noted that, in this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical or similar portions among the embodiments may be cross-referenced. Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments for relevant details.
The steps in the methods of the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in the embodiments may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the present application can be combined, divided, and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the terminal embodiments described above are merely illustrative; the division into modules or sub-modules is only a logical division, and other divisions are possible in actual implementation; for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or modules may be in electrical, mechanical or other forms.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the components and steps of the various examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in a random access memory (RAM), a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating an avatar, comprising:
extracting phoneme features from audio data of a target user, and extracting facial expression features from video data synchronized with the audio data; the facial expression features comprise facial feature points and facial expression categories of the target user;
calculating a facial expression parameter sequence of the target user based on the phoneme features and the facial expression features;
and driving the three-dimensional virtual image model corresponding to the target user by using the facial expression parameter sequence to generate the virtual image corresponding to the target user.
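As a purely illustrative aid and not a limitation of the claims, the three steps recited in claim 1 may be viewed as a simple data-flow pipeline. The Python sketch below is a minimal, non-authoritative illustration of that flow; every function name, feature dimension, and the blendshape-style parameterization are assumptions made for the example, not details taken from the patent.

```python
import numpy as np

# Hypothetical stand-ins for the components detailed in claims 2-7;
# a real system would substitute trained models here.
def phoneme_features_of(audio_frames):
    return np.random.rand(len(audio_frames), 8)       # one phoneme embedding per frame

def facial_features_of(video_frames):
    return np.random.rand(len(video_frames), 16)      # feature points + expression category encoding

def expression_parameters_of(spliced_features):
    # placeholder linear mapping to, e.g., 52 blendshape-style weights per frame
    return spliced_features @ np.random.rand(spliced_features.shape[1], 52)

def generate_avatar(audio_frames, video_frames, drive_avatar_model):
    """Claim 1 flow: extract features, compute the facial expression parameter
    sequence, and use it to drive the target user's 3D avatar model."""
    phoneme_feat = phoneme_features_of(audio_frames)
    face_feat = facial_features_of(video_frames)              # assumed time-aligned with the audio
    spliced = np.concatenate([phoneme_feat, face_feat], axis=-1)
    params = expression_parameters_of(spliced)                 # facial expression parameter sequence
    return drive_avatar_model(params)                          # apply per-frame parameters to the 3D model
```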
2. The method of claim 1, wherein extracting the phoneme features from the audio data of the target user comprises:
extracting a voice data segment and a silence data segment from the audio data;
performing phoneme coding on the voice data segment and the silence data segment to obtain a phoneme code of the voice data segment and a phoneme code of the silence data segment;
and splicing the phoneme code of the voice data segment and the phoneme code of the silence data segment to obtain the phoneme features.
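As a hedged illustration of the segmentation-and-splicing flow in claim 2, the following sketch splits audio into voice and silence frames by a simple energy threshold, assigns each frame a phoneme code (with a reserved code for silence), and splices the codes into one feature sequence. The frame length, the energy threshold, and the reserved silence code are assumptions; any acoustic phoneme recognizer could stand in for `phoneme_encoder`.

```python
import numpy as np

SILENCE_CODE = 0  # assumed reserved phoneme code for silence segments


def extract_phoneme_features(samples, frame_len=400, energy_thresh=1e-3,
                             phoneme_encoder=None):
    """Split audio into voice/silence segments, phoneme-encode each, and splice
    the codes into a single phoneme feature sequence (cf. claim 2)."""
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame_len
    codes = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if np.mean(frame ** 2) < energy_thresh:               # silence data segment
            codes.append(SILENCE_CODE)
        else:                                                  # voice data segment
            # phoneme_encoder stands in for any model mapping a frame to a phoneme id
            codes.append(phoneme_encoder(frame) if phoneme_encoder else 1)
    return np.asarray(codes, dtype=np.int64)                   # spliced phoneme features
```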
3. The method of claim 1, wherein extracting facial expression features from video data synchronized with the audio data comprises:
inputting video data synchronous with the audio data into a pre-trained facial expression feature extraction model to obtain facial expression features;
the facial expression feature extraction model is obtained by taking sample video data as a first training sample and taking facial feature points and facial expression categories corresponding to the sample video data as first labels for training.
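For claim 3, one plausible (but purely assumed) shape of the facial expression feature extraction model is a network with two heads, one regressing the facial feature points and one classifying the facial expression category, matching the two first labels named in the claim. The PyTorch sketch below assumes per-frame image embeddings as input; all dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class FacialExpressionFeatureModel(nn.Module):
    """Illustrative two-head model: frame embedding -> facial feature points
    (regression) and facial expression category (classification), cf. claim 3."""
    def __init__(self, in_dim=512, n_points=68, n_classes=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.point_head = nn.Linear(256, n_points * 2)   # (x, y) per facial feature point
        self.class_head = nn.Linear(256, n_classes)      # facial expression category logits

    def forward(self, frame_embedding):
        h = self.backbone(frame_embedding)
        return self.point_head(h), self.class_head(h)
```

In this reading, training would fit the regression head against the labelled feature points and the classification head against the labelled expression categories, and the two head outputs together would serve as the facial expression features.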
4. The method of claim 1, wherein calculating the sequence of facial expression parameters of the target user based on the phoneme features and the facial expression features comprises:
splicing the phoneme features and the facial expression features to obtain spliced features;
inputting the spliced features into a pre-trained facial expression parameter calculation model to obtain the facial expression parameter sequence of the target user;
the facial expression parameter calculation model is obtained by training with the spliced features of the phoneme features and the facial expression features extracted from sample audio-video data as a second training sample, and with the facial expression parameter sequence corresponding to the sample audio-video data as a second label.
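The splicing step in claim 4 amounts to a frame-wise concatenation of the two feature streams before they enter the facial expression parameter calculation model. The sketch below is one assumed realization; the layer sizes, the MLP architecture, and the number of output parameters are illustrative choices rather than the patented model.

```python
import torch
import torch.nn as nn

class ExpressionParameterModel(nn.Module):
    """Illustrative facial expression parameter calculation model (cf. claim 4):
    input is the spliced (concatenated) phoneme and facial expression features,
    output is one expression parameter vector per frame."""
    def __init__(self, phoneme_dim=8, face_dim=16, param_dim=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(phoneme_dim + face_dim, 128), nn.ReLU(),
            nn.Linear(128, param_dim),
        )

    def forward(self, phoneme_feat, face_feat):
        spliced = torch.cat([phoneme_feat, face_feat], dim=-1)  # splicing step
        return self.net(spliced)                                 # facial expression parameter sequence
```

Training such a model, as recited in the claim, would use spliced features extracted from sample audio-video data as inputs and the corresponding facial expression parameter sequences as regression targets.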
5. The method according to claim 4, wherein the sample audio-video data is audio-video data of a designated person speaking, and the speech content of the designated person covers all phonemes.
6. The method of claim 1, wherein after extracting the facial expression features from the video data synchronized with the audio data, the method further comprises: performing feature dimension reduction processing on the facial expression features.
7. The method of claim 6, wherein performing feature dimension reduction processing on the facial expression features comprises:
reducing the dimension of the facial expression features by using a principal component analysis method to obtain facial expression principal component codes;
and carrying out nonlinear mapping on the facial expression principal component codes by utilizing a polynomial kernel function to obtain the processed facial expression features.
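One possible reading of claim 7 is a PCA projection followed by an explicit polynomial feature map (the feature-space counterpart of a polynomial kernel). The scikit-learn sketch below follows that reading; the number of principal components and the polynomial degree are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

def reduce_facial_expression_features(face_feat, n_components=16, degree=2):
    """PCA -> facial expression principal component codes, then a polynomial
    (kernel-style) nonlinear mapping of those codes (cf. claim 7)."""
    pca = PCA(n_components=n_components)
    pc_codes = pca.fit_transform(face_feat)              # principal component codes
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    return poly.fit_transform(pc_codes)                  # processed facial expression features

# Example: 200 frames of 64-dimensional facial expression features.
if __name__ == "__main__":
    feats = np.random.rand(200, 64)
    print(reduce_facial_expression_features(feats).shape)
```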
8. An avatar generation apparatus, comprising:
the extraction module is used for extracting phoneme features from audio data of a target user and extracting facial expression features from video data synchronized with the audio data; the facial expression features comprise facial feature points and facial expression categories of the target user;
the calculation module is used for calculating a facial expression parameter sequence of the target user based on the phoneme features and the facial expression features;
and the generating module is used for driving the three-dimensional virtual image model corresponding to the target user by using the facial expression parameter sequence to generate the virtual image corresponding to the target user.
9. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is used for implementing the avatar generation method of any one of claims 1 to 7 by executing the program in the memory.
10. A storage medium, wherein the storage medium has stored thereon a computer program which, when executed by a processor, implements the avatar generation method of any one of claims 1 to 7.
CN202211326579.1A 2022-10-27 2022-10-27 Virtual image generation method and device, electronic equipment and storage medium Active CN115439614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211326579.1A CN115439614B (en) 2022-10-27 2022-10-27 Virtual image generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211326579.1A CN115439614B (en) 2022-10-27 2022-10-27 Virtual image generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115439614A CN115439614A (en) 2022-12-06
CN115439614B true CN115439614B (en) 2023-03-14

Family

ID=84252671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211326579.1A Active CN115439614B (en) 2022-10-27 2022-10-27 Virtual image generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115439614B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118474410A (en) * 2023-02-08 2024-08-09 华为技术有限公司 Video generation method, device and storage medium
CN117714763A (en) * 2024-02-05 2024-03-15 深圳市鸿普森科技股份有限公司 Virtual object speaking video generation method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8923392B2 (en) * 2011-09-09 2014-12-30 Adobe Systems Incorporated Methods and apparatus for face fitting and editing applications
US10289903B1 (en) * 2018-02-12 2019-05-14 Avodah Labs, Inc. Visual sign language translation training device and method
CN108510437B (en) * 2018-04-04 2022-05-17 科大讯飞股份有限公司 Virtual image generation method, device, equipment and readable storage medium
CN110874557B (en) * 2018-09-03 2023-06-16 阿里巴巴集团控股有限公司 Voice-driven virtual face video generation method and device
CN111508064B (en) * 2020-04-14 2022-06-17 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN113096242A (en) * 2021-04-29 2021-07-09 平安科技(深圳)有限公司 Virtual anchor generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115439614A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN111741326B (en) Video synthesis method, device, equipment and storage medium
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
WO2022001593A1 (en) Video generation method and apparatus, storage medium and computer device
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN111988658B (en) Video generation method and device
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
US11670015B2 (en) Method and apparatus for generating video
CN111885414B (en) Data processing method, device and equipment and readable storage medium
CN115439614B (en) Virtual image generation method and device, electronic equipment and storage medium
CN112823380A (en) Matching mouth shapes and actions in digital video with substitute audio
CN110210310B (en) Video processing method and device for video processing
KR20220097121A (en) Mouth shape synthesis device and method using random nulling artificial neural network
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN112135160A (en) Virtual object control method and device in live broadcast, storage medium and electronic equipment
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN114422825A (en) Audio and video synchronization method, device, medium, equipment and program product
CN110990534A (en) Data processing method and device and data processing device
CN114255737B (en) Voice generation method and device and electronic equipment
CN113823300B (en) Voice processing method and device, storage medium and electronic equipment
CN115937726A (en) Speaker detection method, device, equipment and computer readable storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN112466306B (en) Conference summary generation method, device, computer equipment and storage medium
CN117115310A (en) Digital face generation method and system based on audio and image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant