CN114895817A

CN114895817A - Interactive information processing method, and training method and device of network model

Info

Publication number: CN114895817A
Application number: CN202210572266.8A
Authority: CN
Inventors: 郭紫垣
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-08-12
Anticipated expiration: 2042-05-24
Also published as: CN114895817B

Abstract

The disclosure provides an interactive information processing method, a training method and device of a network model, equipment, a medium and a product, relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as 3D vision, augmented reality, virtual reality and the like. The specific implementation scheme comprises the following steps: responding to the acquired interactive input information, and determining interactive response information; determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information; and applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

Description

Interactive information processing method, and training method and device of network model

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, can be applied to scenes such as 3D vision, augmented reality, virtual reality and the like, and particularly relates to an interactive information processing method, a network model training device, a network model training medium and a network model training product.

Background

The interactive information processing based on the virtual image model is widely applied to scenes such as 3D vision, augmented reality, virtual reality and the like. However, in some scenarios, the interactive information processing process has the phenomena of poor processing effect and high interactive cost.

Disclosure of Invention

The disclosure provides an interactive information processing method, a training method and device of a network model, equipment, a medium and a product.

According to an aspect of the present disclosure, there is provided an interactive information processing method, including: responding to the acquired interactive input information, and determining interactive response information; determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information; and applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

According to another aspect of the present disclosure, there is provided a training method of a network model, including: determining facial drive parameters and limb drive parameters of a target object in a sample image sequence; inputting sample audio data matched with the sample image sequence into a target network model to be trained to obtain sample audio features; and performing network model training based on the sample audio features, the face driving parameters and the limb driving parameters to obtain a trained target network model.

According to another aspect of the present disclosure, there is provided an interactive information processing method, including: outputting face driving parameters and limb driving parameters aiming at a preset virtual image model according to input interactive audio information by using the trained target network model; and applying the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive audio information, the trained target network model being trained according to the method of the above aspect.

According to another aspect of the present disclosure, there is provided an interactive information processing apparatus including: the first processing module is used for responding to the acquired interactive input information and determining interactive response information; the second processing module is used for determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information; and the third processing module is used for applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

According to another aspect of the present disclosure, there is provided a training apparatus of a network model, including: the fourth processing module is used for determining the face driving parameters and the limb driving parameters of the target object in the sample image sequence; the fifth processing module is used for inputting the sample audio data matched with the sample image sequence into a target network model to be trained to obtain sample audio features; and a sixth processing module, configured to perform network model training based on the sample audio features, the face driving parameters, and the limb driving parameters, so as to obtain a trained target network model.

According to another aspect of the present disclosure, there is provided an interactive information processing apparatus including: a seventh processing model for outputting face driving parameters and limb driving parameters for a preset avatar model according to the input interactive audio information using the trained target network model; and an eighth processing model for applying the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive audio information, the trained target network model being trained by the apparatus according to the above aspect.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the above-mentioned mutual information processing method or training method of the network model.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described mutual information processing method or training method of a network model.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described mutual information processing method or training method of a network model.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

fig. 1 schematically shows a system architecture of an interactive information processing method and apparatus according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow diagram of an interaction information processing method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a flow chart of an interaction information processing method according to yet another embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow chart of a method of training a network model according to an embodiment of the present disclosure;

FIG. 5 schematically shows a schematic diagram of a training process of a network model according to an embodiment of the present disclosure;

FIG. 6 schematically shows a schematic diagram of a three-dimensional face model according to an embodiment of the present disclosure;

FIG. 7 schematically shows a flow chart of an interaction information processing method according to yet another embodiment of the present disclosure;

FIG. 8 schematically illustrates a schematic view of an avatar model according to an embodiment of the present disclosure;

FIG. 9 schematically shows a block diagram of an interaction information processing apparatus according to an embodiment of the present disclosure;

FIG. 10 schematically illustrates a block diagram of a training apparatus for a network model according to an embodiment of the present disclosure;

fig. 11 schematically shows a block diagram of an interworking information processing apparatus according to another embodiment of the present disclosure;

FIG. 12 schematically shows a block diagram of an electronic device for mutual information processing, according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).

The embodiment of the disclosure provides an interactive information processing method. The method of the embodiment comprises the following steps: and responding to the acquired interactive input information, determining interactive response information, determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information, and applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

Fig. 1 schematically shows a system architecture of an interactive information processing method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

The system architecture 100 according to this embodiment may include a requesting terminal 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between requesting terminals 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, network services, middleware services, and the like.

The requesting terminal 101 interacts with the server 103 through the network 102 to receive or transmit data or the like. The requesting terminal 101 is for example configured to initiate an interaction request to the server 103, and the requesting terminal 101 is further configured to send interaction input information for triggering an interaction to the server 103.

The server 103 may be a server providing various services, and for example, may be a background processing server (for example only) performing an interaction process according to interaction input information transmitted by the requesting terminal 101.

For example, the server 103 determines interactive response information in response to interactive input information acquired from the requesting terminal 101, determines face driving parameters and limb driving parameters for a preset avatar model according to the interactive input information and the interactive response information, and applies the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive response information.

It should be noted that the mutual information processing method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the interactive information processing apparatus provided by the embodiment of the present disclosure may be disposed in the server 103. The mutual information processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103. Accordingly, the mutual information processing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103.

It should be understood that the number of requesting terminals, networks, and servers in fig. 1 is merely illustrative. There may be any number of requesting terminals, networks, and servers, as desired for an implementation.

An interactive information processing method is provided in an embodiment of the present disclosure, and the interactive information processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 3 in conjunction with the system architecture of fig. 1. The mutual information processing method of the embodiment of the present disclosure may be executed by the server 103 shown in fig. 1, for example.

Fig. 2 schematically shows a flowchart of an interactive information processing method according to an embodiment of the present disclosure.

As shown in fig. 2, the mutual information processing method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230.

In operation S210, interactive response information is determined in response to the acquired interactive input information.

In operation S220, a face driving parameter and a limb driving parameter for the preset avatar model are determined according to the interactive input information and the interactive response information.

In operation S230, the face driving parameters and the limb driving parameters are applied to the avatar model, resulting in an interactive response video based on the interactive response information.

The following exemplifies each operation example flow of the mutual information processing method of the present embodiment.

And determining interactive response information in response to the acquired interactive input information. Illustratively, in response to the acquired interactive input information, the interactive input information may be used as interactive response information so as to play the interactive input information through the avatar model. Or, in response to the acquired interactive input information, determining target response information matched with the interactive input information as interactive response information.

The interactive input information may include, for example, interactive input audio, interactive input video, interactive input text, and the like. The interactive response information matched with the interactive input information may include, for example, interactive response audio or interactive response text. The diversified interaction requirements of the user can be effectively met, and the intelligent degree of interaction through the virtual image model can be effectively improved.

For example, feature extraction may be performed on the interaction input information to obtain the interaction information features. And determining the interaction intention characteristics matched with the interaction input information according to the interaction information characteristics. And determining target response information matched with the interaction input information as interaction response information based on the interaction intention characteristics and a preset interaction response library. The interaction intention characteristic indicates interaction intention types matched with the interaction input information, and the interaction response library comprises reference response information matched with each interaction intention type.

And determining face driving parameters and limb driving parameters aiming at the preset virtual image model according to the interactive input information and the interactive response information. Illustratively, feature extraction can be performed on the interaction input information and the interaction response information to obtain interaction information features. And determining face driving parameters and limb driving parameters aiming at the virtual image model according to the interactive information characteristics.

Illustratively, the pose parameters of at least one feature point of the five sense organs of the avatar model may be determined as face driving parameters according to the mutual information features. For example, at least one mouth shape driving parameter of the avatar model may be determined as a face driving parameter from the mutual information features, the mouth shape driving parameter being, for example, characterizable by pose parameters of lip skeletal feature points. The mouth shape driving parameter may indicate upper and lower lip and upper and lower jaw states when phonemes matched with the interactive response information are pronounced, and may indicate information such as a lip longitudinal variation value, a lip lateral variation value, and a variation value of a degree of opening and closing between upper and lower jaws and teeth.

And applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information. And applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video with a lip movement effect and a limb movement effect. The interactive response information is played by utilizing the virtual image model with the lip movement effect and the limb movement effect, so that the naturalness of the interactive response can be effectively improved, and the intelligent interactive effect is effectively improved.

According to the embodiment of the disclosure, interactive response information is determined in response to the acquired interactive input information, the face driving parameters and the limb driving parameters for the preset virtual image model are determined according to the interactive input information and the interactive response information, and the face driving parameters and the limb driving parameters are applied to the virtual image model to obtain the interactive response video based on the interactive response information. The interactive response video can comprise facial five-sense organ actions of the virtual image model based on the facial driving parameters and limb actions of the virtual image model based on the limb driving parameters, so that the naturalness and the intelligent degree of interactive response through the virtual image model can be effectively improved, and the diversified interactive requirements of users can be favorably met. In addition, the dependence degree on hardware such as image acquisition equipment can be effectively reduced, the interactive response cost is favorably reduced, and the interactive response efficiency is effectively improved.

Fig. 3 schematically shows a flowchart of an interactive information processing method according to another embodiment of the present disclosure.

As shown in fig. 3, the mutual information processing method 300 of the embodiment of the present disclosure may include, for example, operations S210, operations S310 to S320, and operation S230.

In operation S310, feature extraction is performed on the interactive input information and the interactive response information to obtain an interactive information feature.

In operation S320, a face driving parameter and a limb driving parameter for the preset avatar model are determined according to the interactive information features.

An exemplary flow of each operation of the mutual information processing method of the present embodiment is illustrated below.

In an example manner, feature extraction may be performed on at least one of the interaction input information and the interaction response information to obtain the interaction information feature. According to the interactive information features, face driving parameters for the avatar model are determined. And determining the limb driving parameters aiming at the virtual image model according to the interactive information characteristics and the face driving parameters.

Illustratively, emotion feature extraction can be carried out on the interaction input information and/or the interaction response information to obtain interaction emotion features. And extracting the intention characteristics of the interactive input information and/or the interactive response information to obtain the interactive intention characteristics. And audio characteristic extraction is carried out on the interactive input information and/or the interactive response information to obtain interactive audio characteristics. Face driving parameters and limb driving parameters for the avatar model may be determined according to at least one of the interactive emotional features, the interactive intent features, and the interactive audio features.

For example, in the case that the interactive input information is an interactive input text, the interactive input text may be segmented to obtain a segmentation result. And determining whether the interactive input text contains keywords related to emotion or not and whether preset keywords such as degree adverbs, negative words and the like are contained or not based on the word segmentation result so as to obtain interactive emotional characteristics matched with the interactive input text. And determining face driving parameters and limb driving parameters aiming at the virtual image model according to the interactive emotional characteristics.

For example, the interactive input audio and the interactive response audio may be preliminarily divided by adding an audio window. And further dividing each audio window into m audio segments, and extracting n Mel cepstrum coefficient components of each audio segment to obtain interactive audio features. The interactive audio features may be m × n dimensional MFCC (Mel-scale Frequency Cepstral Coefficients) features. The MFCC features can better conform to the auditory characteristics of human ears, have better robustness, and can have better recognition performance when the signal-to-noise ratio of audio data is reduced. m, n may for example take values of 64, 32, respectively, and the audio window length may for example take values of 380 ms.

In one example approach, where the interaction input information includes interaction input audio, the interaction response information that matches the interaction input audio may include interaction response audio. The interactive spectral feature extraction method can extract the spatial domain features of the interactive input audio and the interactive response audio to obtain the interactive spectral features. And extracting time domain characteristics of the interactive frequency spectrum characteristics to obtain interactive audio characteristics. And extracting the context characteristics of the interactive audio characteristics to obtain the self-correlation audio characteristics which serve as the interactive information characteristics.

For example, the interaction input audio and the interaction response audio may be converted into a spectral feature map to obtain the interaction spectral feature. And performing time domain feature extraction on the interactive frequency spectrum features by using a full convolution neural network to obtain interactive audio features. And performing context feature extraction on the interactive audio features by using an LSTM (Long Short Term Memory) to obtain self-correlation audio features serving as interactive information features.

Through extracting abundant audio information in the interactive input audio frequency and the interactive response audio frequency, the generalization effect aiming at various timbres can be effectively improved, the naturalness and the intelligent degree of the interactive response are effectively improved, and the diversified interactive requirements of users are favorably met.

One example approach may time-sequentially frame interactive input audio and interactive response audio. And extracting the characteristics of the interactive input audio and the interactive response audio according to frames to obtain the self-correlation phoneme characteristics associated with each frame of audio. Mouth shape driving parameters for the avatar model are determined as face driving parameters according to the autocorrelation phoneme characteristics associated with each frame of audio. The phoneme is the minimum unit of syllable pronunciation action divided according to pronunciation attributes of audio. The auto-correlated phoneme features may include, for example, at least one of amplitude, frequency, and waveform features of the interactive input audio and the interactive response audio.

By capturing the characteristic change of fine granularity in the audio information, the virtual image model is driven to generate reasonable and natural half-body action and mouth shape effects, and the strong generalization effect of interactive response can be effectively ensured.

Limb driving parameters for the avatar model may be determined based on the autocorrelation phoneme features and mouth driving parameters associated with each frame of audio. The autocorrelation phoneme features and the mouth shape driving parameters may characterize the tempo information in the interactive input audio and the interactive response audio, e.g., may characterize plosive features in the interactive input audio and the interactive response audio. The limb actions with obvious rhythm and amplitude can better express the characteristics of the plosive, and the matching degree of the limb actions of the types of phonemes such as the plosive, the emergency tone and the like can be effectively improved.

The face driving parameters and the limb driving parameters can be aligned, and the aligned face driving parameters and the aligned limb driving parameters are applied to the virtual image model to obtain an interactive response video based on interactive response information. For example, in the case that the interactive response information is interactive response audio, the facial driving parameters and the limb driving parameters may be aligned according to a phoneme timestamp in the interactive response audio. When the interactive response information is an interactive response text, the face driving parameter and the limb driving parameter may be aligned according to the phoneme timestamp corresponding to each character in the interactive response text.

Through the embodiment of the disclosure, the virtual image model can be effectively driven to generate reasonable and natural limb actions and mouth shape effects, the strong generalization effect of interactive response is favorably ensured, and the naturalness and the intelligent degree of the interactive response are effectively improved. By determining the face driving parameters and the limb driving parameters of the virtual image model, the degree of dependence on hardware such as image acquisition equipment and face mouth shape capture equipment can be effectively reduced, the cost consumption of interactive response based on the virtual image model can be effectively reduced, and the digital human driving scheme with low cost and strong generalization effect is favorably realized.

FIG. 4 schematically shows a flow chart of a method of training a network model according to an embodiment of the present disclosure.

As shown in FIG. 4, the training method 400 may include operations S410-S430, for example.

In operation S410, a face driving parameter and a limb driving parameter of a target object in a sample image sequence are determined.

In operation S420, sample audio data matched with the sample image sequence is input into the target network model to be trained, so as to obtain sample audio features.

In operation S430, network model training is performed based on the sample audio features, the face driving parameters, and the limb driving parameters, resulting in a trained target network model.

An example flow of each operation of the model training method of the present embodiment is illustrated below.

Facial drive parameters and limb drive parameters of a target object in a sample image sequence are determined. By way of example, a three-dimensional virtual model may be constructed for a target object in a sample image sequence. And determining the face driving parameters and the limb driving parameters of the target object according to the preset reference virtual model and the three-dimensional virtual model aiming at the target object. For example, the face drive parameters and limb drive parameters of the target object may be determined from the reference virtual model and the pose offsets for corresponding keypoints in the three-dimensional virtual model of the target object.

For example, a three-dimensional face model for a target object in a sample image sequence may be constructed. Mouth shape driving parameters of the target object are determined as face driving parameters according to a preset reference face model and a three-dimensional face model for the target object. An object pose model for a target object in the sample image sequence is constructed. And determining the limb driving parameters of the target object according to a preset reference posture model and an object posture model aiming at the target object.

Sample audio data matched with the sample image sequence can be input into a target network model to be trained to obtain sample audio features. The sample audio data and the sample image sequence may correspond to the same time stamp sequence, and may be obtained by, for example, slicing a sample video used for network model training.

In an example manner, spatial domain feature extraction may be performed on sample audio data to obtain sample spectral features. And extracting time domain characteristics of the sample frequency spectrum characteristics to obtain initial audio characteristics. And extracting the context characteristics of the initial audio characteristics to obtain the self-correlation audio characteristics which serve as the sample audio characteristics. The autocorrelation audio features may for example comprise autocorrelation phoneme features,

aligning the autocorrelation audio features, the face driving parameters and the limb driving parameters, and performing network model training based on the aligned autocorrelation audio features, the face driving parameters and the limb driving parameters to obtain a trained target network model. In one example approach, the face driving parameters and the limb driving parameters may be used as tag data for the sample audio data. And carrying out network model training according to the autocorrelation audio features of the sample audio data and the label data to obtain a trained target network model.

Illustratively, the target network model to be trained may be utilized to generate face driven prediction parameters for the target object based on autocorrelation audio features of the sample audio data by fitting. And fitting to generate the limb driving prediction parameters of the target object according to the autocorrelation audio features and the face driving prediction parameters. And calculating a loss function value of the target network model according to the face driving prediction parameters, the limb driving prediction parameters and the label data. And adjusting the model parameters of the target network model based on the loss function values to obtain the trained target network model.

The loss function values may include, for example, distance loss function values and time-continuous loss function values. The distance loss function value may be characterized, for example, by a mean square error between a true vertex position and a predicted vertex position of the three-dimensional virtual model for the target object. The temporal continuity loss function value may be characterized, for example, by the mean square error between the true vertex displacement and the predicted vertex displacement of the three-dimensional virtual model in previous and subsequent frames.

By the embodiment of the disclosure, the face driving parameters and the limb driving parameters of the target object in the sample image sequence are determined, and the network model training is performed based on the sample audio features, the face driving parameters and the limb driving parameters to obtain the trained target network model. When the trained target network model is used for predicting the face driving parameters and the limb driving parameters based on the interactive input information, the strong generalization effect of interactive response through the virtual image model can be effectively ensured, and the naturalness and the intelligent degree of the interactive response are effectively improved. In addition, the interactive response cost can be effectively reduced, the interactive response efficiency is effectively improved, and the diversified interactive requirements of the user can be favorably met.

Fig. 5 schematically shows a schematic diagram of a training process of a network model according to an embodiment of the present disclosure.

As shown in fig. 5, a sample video 501 may be sliced into sample audio data 502 and a sample image sequence 503. Sample audio data 502 may be input into a target network model to be trained, resulting in sample audio features 504. A three-dimensional virtual model for the target object in the sample image sequence 503 may be constructed, and the face driving parameters/limb driving parameters 505 of the target object may be determined from the preset reference virtual model and the three-dimensional virtual model for the target object.

Network model training is performed based on sample audio features 504, facial driving parameters/limb driving parameters 505, resulting in a trained target network model 506. Illustratively, the face driving parameters/limb driving parameters 505 may be taken as tag data of the sample audio data 502. Network model training is performed according to the sample audio features 504 of the sample audio data 502 and the label data to obtain a trained target network model 506.

Fig. 6 schematically shows a schematic diagram of a three-dimensional face model according to an embodiment of the present disclosure.

As shown in fig. 6, three-

dimensional face models

602, 603 for a target object in a sample image sequence are constructed. Mouth shape driving parameters of the target object are determined as face driving parameters from the reference face model 601 and the three-

dimensional face models

602, 603 for the target object. For example, the mouth shape driving parameters that match the corresponding sample image may be determined from the pose parameter offsets of the lip bone feature points in the reference face model 601 and the three-

dimensional face models

602, 603.

Fig. 7 schematically shows a flowchart of an interactive information processing method according to still another embodiment of the present disclosure.

As shown in fig. 7, the mutual information processing method 700 of the embodiment of the present disclosure may include, for example, operations S710 to S720.

In operation S710, face driving parameters and limb driving parameters for a preset avatar model are output according to the input interactive audio information using the trained target network model.

In operation S720, the face driving parameters and the limb driving parameters are applied to the avatar model, resulting in an interactive response video based on the interactive audio information.

Illustratively, the interactive input audio may be made an interactive response audio in response to the acquired interactive input audio, so as to play the interactive input audio through the avatar model. Alternatively, a target response audio matching the interactive input audio may be determined as the interactive response audio. The interactive input audio and the interactive response audio constitute interactive audio information.

In addition, the interactive input text can be used as the interactive response text in response to the acquired interactive input text, so that the interactive input text can be played through the avatar model. Alternatively, a target response text matching the interactive input text may be determined as the interactive response text. The interactive input text and the interactive response text can be subjected to format conversion respectively to obtain corresponding interactive input audio and interactive response audio. The interactive input audio and the interactive response audio constitute interactive audio information.

And outputting face driving parameters and limb driving parameters for the virtual image model according to the input interactive audio information by using the trained target network model. Applying the face driving parameters and the limb driving parameters to the avatar model so as to drive the avatar model to perform mouth shape actions based on the face driving parameters and to perform limb actions based on the limb driving parameters, resulting in an interactive response video associated with the interactive audio information.

By utilizing the trained target network model and according to the input interactive audio information, determining face driving parameters for driving the virtual image model to generate mouth shape actions and determining limb driving parameters for driving the virtual image model to generate limb actions, the naturalness and the intelligent degree of interactive response can be effectively improved, and the digital human driving scheme with low cost and strong generalization effect can be realized.

Fig. 8 schematically shows a schematic view of an avatar model according to an embodiment of the present disclosure.

As shown in fig. 8, the face driving parameters and the limb driving parameters obtained from the interactive input information may be applied to the avatar model 801 to drive the avatar model 801 to perform a mouth shape motion based on the face driving parameters and to perform a limb motion based on the limb driving parameters, resulting in an interactive response video matching the interactive input information.

Fig. 9 schematically shows a block diagram of an interaction information processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 9, the mutual information processing apparatus 900 of the embodiment of the present disclosure includes, for example, a first processing module 910, a second processing module 920, and a third processing module 930.

A first processing module 910, configured to determine interaction response information in response to the acquired interaction input information; the second processing module 920 is configured to determine a face driving parameter and a limb driving parameter for the preset avatar model according to the interactive input information and the interactive response information; and a third processing module 930 for applying the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive response information.

The interactive response video can comprise facial five-sense organ actions of the virtual image model based on the facial driving parameters and limb actions of the virtual image model based on the limb driving parameters, so that the naturalness and the intelligent degree of interactive response through the virtual image model can be effectively improved, and the diversified interactive requirements of users can be favorably met. In addition, the dependence degree on hardware such as image acquisition equipment can be effectively reduced, the interactive response cost is favorably reduced, and the interactive response efficiency is effectively improved.

According to an embodiment of the present disclosure, a first processing module includes: and the first processing submodule is used for responding to the acquired interactive input information and determining target response information matched with the interactive input information as interactive response information.

According to an embodiment of the present disclosure, the second processing module includes: the second processing submodule is used for extracting the characteristics of the interactive input information and the interactive response information to obtain interactive information characteristics; the third processing submodule is used for determining face driving parameters aiming at the virtual image model according to the interactive information characteristics; and the fourth processing submodule is used for determining the limb driving parameters according to the interactive information characteristics and the face driving parameters.

According to an embodiment of the present disclosure, the interactive input information includes an interactive input audio, and the interactive response information matched with the interactive input audio includes an interactive response audio; the second processing submodule includes: the first processing unit is used for extracting the spatial domain characteristics of the interactive input audio and the interactive response audio to obtain interactive spectrum characteristics; the second processing unit is used for extracting time domain characteristics of the interactive frequency spectrum characteristics to obtain interactive audio characteristics; and the third processing unit is used for extracting the context characteristics of the interactive audio characteristics to obtain the self-correlation audio characteristics which are used as the interactive information characteristics.

According to an embodiment of the disclosure, the autocorrelation audio features include autocorrelation phoneme features, and the third processing submodule includes: a fourth processing unit for determining mouth shape driving parameters for the avatar model as face driving parameters based on the auto-correlation phoneme characteristics, the phoneme being a minimum unit of syllable pronunciation action divided according to pronunciation attributes of the audio.

Fig. 10 schematically shows a block diagram of a training apparatus of a network model according to an embodiment of the present disclosure.

As shown in fig. 10, the training apparatus 1000 for a network model according to an embodiment of the present disclosure includes, for example, a fourth processing module 1010, a fifth processing module 1020, and a sixth processing module 1030.

A fourth processing module 1010 for determining face driving parameters and limb driving parameters of a target object in the sample image sequence; a fifth processing module 1020, configured to input sample audio data matched with the sample image sequence into a target network model to be trained, to obtain a sample audio feature; and a sixth processing module 1030, configured to perform network model training based on the sample audio features, the face driving parameters, and the limb driving parameters, so as to obtain a trained target network model.

According to an embodiment of the present disclosure, the fourth processing module includes: a fifth processing submodule for constructing a three-dimensional virtual model for the target object in the sample image sequence; and a sixth processing submodule, configured to determine a face driving parameter and a limb driving parameter of the target object according to the preset reference virtual model and the three-dimensional virtual model for the target object.

According to an embodiment of the present disclosure, the fifth processing submodule includes: a fifth processing unit for constructing a three-dimensional face model for a target object in the sample image sequence; and the sixth processing submodule comprises: a sixth processing unit configured to determine mouth shape driving parameters of the target object as the face driving parameters, based on a preset reference face model and a three-dimensional face model for the target object.

According to an embodiment of the present disclosure, the fifth processing module includes: the seventh processing submodule is used for extracting the spatial domain characteristics of the sample audio data to obtain sample spectral characteristics; the eighth processing submodule is used for extracting time domain characteristics of the sample frequency spectrum characteristics to obtain initial audio frequency characteristics; and the ninth processing submodule is used for extracting the context characteristic of the initial audio characteristic to obtain the self-correlation audio characteristic which is used as the sample audio characteristic.

Fig. 11 schematically shows a block diagram of an interworking information processing apparatus according to another embodiment of the present disclosure.

As shown in fig. 11, the mutual information processing apparatus 1100 of the embodiment of the present disclosure includes, for example, a seventh processing module 1110 and an eighth processing module 1120.

A seventh processing module 1110, configured to output face driving parameters and limb driving parameters for the preset avatar model according to the input interactive audio information using the trained target network model; and an eighth processing module 1120, configured to apply the face driving parameters and the limb driving parameters to the avatar model, so as to obtain an interactive response video based on the interactive audio information.

It should be noted that in the technical solutions of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of the related information are all in accordance with the regulations of the related laws and regulations, and do not violate the customs of the public order.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 12 illustrates a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. The electronic device 1200 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 12, the apparatus 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM)1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.

Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running deep learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 executes the respective methods and processes described above, such as the mutual information processing method. For example, in some embodiments, the interactive information processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the above-described mutual information processing method may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the mutual information processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model training apparatus, such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with an object, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to an object; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which objects can provide input to the computer. Other kinds of devices may also be used to provide for interaction with an object; for example, feedback provided to the subject can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the object may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., an object computer having a graphical object interface or a web browser through which objects can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. An interactive information processing method, comprising:

responding to the acquired interactive input information, and determining interactive response information;

determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information; and

and applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

2. The method of claim 1, wherein the determining interactive response information in response to the obtained interactive input information comprises:

and responding to the acquired interaction input information, and determining target response information matched with the interaction input information as the interaction response information.

3. The method of claim 1, wherein the determining face driving parameters and limb driving parameters for a preset avatar model according to the interactive input information and the interactive response information comprises:

extracting the characteristics of the interactive input information and the interactive response information to obtain interactive information characteristics;

determining face driving parameters for the avatar model according to the interaction information features; and

and determining the limb driving parameters according to the interactive information characteristics and the face driving parameters.

4. The method of claim 3, wherein the interaction input information comprises interaction input audio, and the interaction response information matching the interaction input audio comprises interaction response audio;

the extracting the characteristics of the interactive input information and the interactive response information to obtain the interactive information characteristics comprises the following steps:

extracting spatial domain features of the interactive input audio and the interactive response audio to obtain interactive spectrum features;

extracting time domain features of the interactive frequency spectrum features to obtain interactive audio features; and

and extracting the context characteristics of the interactive audio characteristics to obtain self-correlation audio characteristics serving as the interactive information characteristics.

5. The method of claim 4, wherein the auto-correlation audio features include auto-correlation phoneme features, and the determining face driving parameters for the avatar model according to the interaction information features includes:

determining mouth shape driving parameters for the avatar model as the face driving parameters according to the auto-correlation phoneme features,

wherein, the phoneme is the minimum unit of syllable pronunciation action divided according to the pronunciation attribute of the audio.

6. A training method of a network model comprises the following steps:

determining facial drive parameters and limb drive parameters of a target object in a sample image sequence;

inputting sample audio data matched with the sample image sequence into a target network model to be trained to obtain sample audio features; and

and carrying out network model training based on the sample audio features, the face driving parameters and the limb driving parameters to obtain a trained target network model.

7. The method of claim 6, wherein the determining facial drive parameters and limb drive parameters of a target object in a sample sequence of images comprises:

constructing a three-dimensional virtual model for the target object in the sample image sequence; and

determining the face driving parameters and the limb driving parameters of the target object according to a preset reference virtual model and a three-dimensional virtual model aiming at the target object.

8. The method of claim 7, wherein the constructing a three-dimensional virtual model for the target object in the sample image sequence comprises:

constructing a three-dimensional face model for the target object in the sample image sequence; and

the determining the face driving parameters of the target object according to a preset reference virtual model and a three-dimensional virtual model for the target object comprises:

determining a mouth shape driving parameter of the target object as the face driving parameter according to a preset reference face model and a three-dimensional face model for the target object.

9. The method of claim 7, wherein the inputting sample audio data matched with the sample image sequence into a target network model to be trained to obtain sample audio features comprises:

carrying out spatial domain feature extraction on the sample audio data to obtain sample frequency spectrum features;

extracting time domain characteristics of the sample spectrum characteristics to obtain initial audio characteristics; and

and extracting context characteristics of the initial audio characteristics to obtain self-correlation audio characteristics serving as the sample audio characteristics.

10. An interactive information processing method, comprising:

outputting face driving parameters and limb driving parameters aiming at a preset virtual image model according to input interactive audio information by using the trained target network model; and

applying the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive audio information,

wherein the trained target network model is trained according to the method of any one of claims 6 to 9.

11. An interactive information processing apparatus comprising:

the first processing module is used for responding to the acquired interactive input information and determining interactive response information;

the second processing module is used for determining face driving parameters and limb driving parameters aiming at a preset virtual image model according to the interactive input information and the interactive response information; and

and the third processing module is used for applying the face driving parameters and the limb driving parameters to the virtual image model to obtain an interactive response video based on the interactive response information.

12. The apparatus of claim 11, wherein the first processing module comprises:

and the first processing submodule is used for responding to the acquired interaction input information and determining target response information matched with the interaction input information to serve as the interaction response information.

13. The apparatus of claim 11, wherein the second processing module comprises:

the second processing submodule is used for extracting the characteristics of the interactive input information and the interactive response information to obtain interactive information characteristics;

a third processing submodule, configured to determine face driving parameters for the avatar model according to the interaction information features; and

and the fourth processing submodule is used for determining the limb driving parameters according to the interactive information characteristics and the face driving parameters.

14. The apparatus of claim 13, wherein the interaction input information comprises interaction input audio, and the interaction response information matching the interaction input audio comprises interaction response audio;

the second processing sub-module comprises:

the first processing unit is used for extracting the spatial domain characteristics of the interactive input audio and the interactive response audio to obtain interactive spectrum characteristics;

the second processing unit is used for extracting time domain characteristics of the interactive frequency spectrum characteristics to obtain interactive audio characteristics; and

and the third processing unit is used for extracting the context characteristics of the interactive audio characteristics to obtain the self-correlation audio characteristics which are used as the interactive information characteristics.

15. The apparatus of claim 14, wherein the autocorrelation audio features comprise autocorrelation phoneme features, the third processing sub-module comprising:

a fourth processing unit for determining mouth shape driving parameters for the avatar model as the face driving parameters according to the auto-correlation phoneme features,

16. An apparatus for training a network model, comprising:

the fourth processing module is used for determining the face driving parameters and the limb driving parameters of the target object in the sample image sequence;

the fifth processing module is used for inputting the sample audio data matched with the sample image sequence into a target network model to be trained to obtain sample audio features; and

and the sixth processing module is used for carrying out network model training based on the sample audio features, the face driving parameters and the limb driving parameters to obtain a trained target network model.

17. The apparatus of claim 16, wherein the fourth processing module comprises:

a fifth processing sub-module for constructing a three-dimensional virtual model for the target object in the sequence of sample images; and

a sixth processing sub-module, configured to determine the face driving parameter and the limb driving parameter of the target object according to a preset reference virtual model and a three-dimensional virtual model for the target object.

18. The apparatus of claim 17, wherein the fifth processing sub-module comprises:

a fifth processing unit for constructing a three-dimensional face model for the target object in the sample image sequence; and

the sixth processing submodule includes:

a sixth processing unit configured to determine, as the face driving parameter, a mouth shape driving parameter of the target object based on a preset reference face model and a three-dimensional face model for the target object.

19. The apparatus of claim 17, wherein the fifth processing module comprises:

the seventh processing submodule is used for extracting the spatial domain characteristics of the sample audio data to obtain sample spectral characteristics;

the eighth processing submodule is used for extracting time domain characteristics of the sample spectrum characteristics to obtain initial audio characteristics; and

and the ninth processing submodule is used for extracting the context characteristic of the initial audio characteristic to obtain the self-correlation audio characteristic which is used as the sample audio characteristic.

20. An interactive information processing apparatus comprising:

the seventh processing module is used for outputting face driving parameters and limb driving parameters aiming at the preset virtual image model according to the input interactive audio information by utilizing the trained target network model; and

an eighth processing module, configured to apply the face driving parameters and the limb driving parameters to the avatar model to obtain an interactive response video based on the interactive audio information,

wherein the trained target network model is trained by the apparatus according to any one of claims 16 to 19.

21. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the mutual information processing method of any one of claims 1 to 5, or to perform the training method of the network model of any one of claims 6 to 9, or to perform the mutual information processing method of claim 10.

22. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the mutual information processing method of any one of claims 1 to 5, or execute the training method of the network model of any one of claims 6 to 9, or execute the mutual information processing method of claim 10.

23. A computer program product comprising a computer program which, when executed by a processor, implements the mutual information processing method of any one of claims 1 to 5, or implements the training method of the network model of any one of claims 6 to 9, or implements the mutual information processing method of claim 10.