
CN114529637A - Method, device and equipment for determining expression coefficient and model training and live broadcast system - Google Patents

Method, device and equipment for determining expression coefficient and model training and live broadcast system

Info

Publication number
CN114529637A
Authority
CN
China
Prior art keywords
image
expression coefficient
target object
virtual
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210157073.6A
Other languages
Chinese (zh)
Inventor
卫华威
韩欣彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Huya Huxin Technology Co ltd
Original Assignee
Foshan Huya Huxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Huya Huxin Technology Co ltd filed Critical Foshan Huya Huxin Technology Co ltd
Priority to CN202210157073.6A priority Critical patent/CN114529637A/en
Publication of CN114529637A publication Critical patent/CN114529637A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44218Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of this specification determine the expression coefficient of a real target object by acquiring an image of a virtual target object corresponding to an image of the real target object and using an expression coefficient prediction model to determine the expression coefficient of the virtual target object in that image; because the expression coefficient of the real target object corresponds to the expression coefficient of the virtual target object, the expression coefficient of the real target object can then be derived. The expression coefficient prediction model is trained only with pre-generated sample expression coefficients and sample images of a virtual object whose expression is driven by those sample expression coefficients, so the expression coefficient of the real target object can be determined without collecting a large number of real sample images at high cost and without a computation-heavy solving algorithm.

Description

Method, device and equipment for determining expression coefficient and model training and live broadcast system
Technical Field
The specification relates to the field of computer vision, in particular to a method, a device and equipment for determining an expression coefficient and training a model and a live broadcast system.
Background
In the field of computer vision, it is often desirable to drive an avatar based on the expression coefficients of a target object. In some scenarios, it is necessary to determine the expression coefficient of the target object from an image of the target object and then drive the avatar with that expression coefficient. Obtaining the expression coefficients can be implemented with a prediction model; however, such an expression coefficient prediction model needs to be trained on a large number of sample images, and the training cost is very high.
Disclosure of Invention
In order to overcome the problems in the related art, the specification provides a method, a device and equipment for determining an expression coefficient and training a model, and a live broadcast system.
According to a first aspect of embodiments herein, there is provided a method of determining an expression coefficient, comprising:
acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained by image conversion based on a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient.
According to a second aspect of embodiments herein, there is provided an avatar live broadcast system, comprising an anchor client, a viewer client, and a server:
the anchor client is configured to: acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained by converting an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient;
driving the expression of the virtual image based on the first expression coefficient;
sending the avatar to the server;
the server is configured to: receiving the virtual image sent by the anchor client, and sending the virtual image to the viewer client;
the viewer client is configured to: receiving and displaying the virtual image sent by the server.
According to a third aspect of embodiments herein, there is provided another avatar live broadcast system, comprising an anchor client, a viewer client, and a server:
the anchor client is configured to: acquiring an image of a real target object, and sending the image of the real target object to the server;
the server is configured to: receiving an image of the real target object sent by an anchor client, and acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained based on image conversion of the real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient;
driving the expression of the virtual image based on the first expression coefficient;
sending the avatar to the anchor client and the viewer client;
the anchor client is further configured to: receiving and displaying the virtual image sent by the server;
the viewer client is configured to: receiving and displaying the virtual image sent by the server.
According to a fourth aspect of embodiments herein, there is provided a method of model training, comprising:
driving the expression of a virtual object based on a pre-generated sample expression coefficient to generate a sample image comprising the virtual object;
training the expression coefficient prediction model based on the sample image and the sample expression coefficient;
the expression coefficient prediction model is used for acquiring a second expression coefficient of a virtual target object based on an image of the virtual target object, the image of the virtual target object is obtained through image conversion of a real target object, and the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object;
the second expression coefficient is used for determining the first expression coefficient; the first expression coefficient is used for driving the expression of the virtual image.
According to a fifth aspect of embodiments herein, there is provided an apparatus for determining an expression coefficient, including:
an acquisition module, configured to acquire an image of a virtual target object, wherein the image of the virtual target object is obtained by image conversion from an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
a prediction module, configured to input the image of the virtual target object into an expression coefficient prediction model and acquire a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
a determination module, configured to determine the first expression coefficient based on the second expression coefficient.
According to a sixth aspect of embodiments herein, there is provided an apparatus for determining an expression coefficient, including a camera; a processor; a memory for storing processor-executable instructions, the processor being configured to implement the method of any of the embodiments described above.
By acquiring an image of a virtual target object corresponding to an image of a real target object, using an expression coefficient prediction model to determine the expression coefficient of the virtual target object in that image, and relying on the correspondence between the expression coefficients of the real and virtual target objects, the expression coefficient of the real target object can be determined. The expression coefficient prediction model is trained only with pre-generated sample expression coefficients and sample images generated by driving the expression of a virtual object with those coefficients, so the expression coefficient of the real target object can be determined without collecting a large number of real sample images at high cost and without a computation-heavy algorithm.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flow chart of a prior art method for determining an expression coefficient of a target image using a model X according to an exemplary embodiment.
FIG. 2 is a flow diagram of prior art training model X provided by an exemplary embodiment.
Fig. 3 is a flowchart of a method for determining an expression coefficient according to an exemplary embodiment.
FIG. 4 is a schematic image of a virtual target object provided by an exemplary embodiment.
FIG. 5 is a flowchart of converting an image of a real object into an image of a virtual object by a conversion model provided by an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating the effect of the conversion model before and after performing the style conversion according to an exemplary embodiment.
FIG. 7 is a flow diagram of training a transformation model provided by an exemplary embodiment.
FIG. 8 is a flowchart of a method for determining a 240-dimensional expression coefficient according to an exemplary embodiment.
Fig. 9 is a system interaction diagram of a live avatar provided in an exemplary embodiment.
Fig. 10 is a system interaction diagram of another avatar live provided by an exemplary embodiment.
Fig. 11 is a flowchart of a method for training an expression coefficient prediction model according to an exemplary embodiment.
Fig. 12 is a schematic structural diagram of an apparatus for determining an expression coefficient according to an exemplary embodiment.
Fig. 13 is a schematic structural diagram of an apparatus for determining an expression coefficient according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
In the field of computer vision, expression coefficients (e.g., blendshape coefficients) may be used to characterize facial expressions; for example, a smiling, angry, crying, or surprised expression of a face, such as a real human face, a real animal face, or the face of a virtual model, can be represented with a set of expression coefficients. The dimensionality of the expression coefficients determines the precision and richness of the expression. For example, a virtual face driven by 240-dimensional expression coefficients has higher expression precision and richness than a virtual face driven by 52-dimensional expression coefficients. Of course, the method is not limited to 52-dimensional or 240-dimensional expression coefficients, and the specific dimensionality can be determined according to actual needs.
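For context, and as a general formulation in the field rather than something stated in this specification, expression coefficients of this kind are usually understood as blendshape weights: the driven face is a neutral base shape plus a weighted sum of expression basis offsets, with one weight per dimension:

$$ M(\mathbf{w}) = B_0 + \sum_{i=1}^{d} w_i\,(B_i - B_0), \qquad \mathbf{w} = (w_1, \ldots, w_d), $$

where $B_0$ is the neutral face shape, $B_i$ is the $i$-th expression basis shape, $w_i$ is the $i$-th expression coefficient, and $d$ is the dimensionality (for example 52 or 240).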
At present, common consumer-grade avatar expression driving generally uses 52-dimensional expression coefficients, and the precision of the expressions that can be expressed is limited. To drive a high-precision avatar, expression coefficients with hundreds of dimensions are generally used; for example, the commonly used 240-dimensional expression coefficients can drive a high-precision avatar so that the micro-expressions of a human face are captured and the avatar is more vivid.
In order to drive an avatar, such as a virtual human face, a virtual animal face, etc., an expression coefficient is usually determined by using a pre-trained model for determining a facial expression coefficient in an image, and then the avatar is driven according to the determined expression coefficient. As shown in fig. 1, the expression coefficient of the target object in the target image can be output by inputting the target image including the target object into a pre-trained expression coefficient prediction model (hereinafter referred to as model X).
Referring to fig. 2, in the current training process of model X, a certain number of real sample images are collected, an expression coefficient is computed for each sample image with an expression calculation algorithm, and the sample images together with their expression coefficients are then used as samples to train model X. This process often requires a large number of sample images to be collected, for example no fewer than 30,000, so the collection takes a long time and is costly; moreover, the computation required by the expression calculation algorithm is very large.
Particularly, when a high-precision virtual image is driven by using a high-dimensional expression coefficient, for example, when the high-precision virtual image is driven by using a 240-dimensional expression coefficient, the acquisition cost and the acquisition time for acquiring a large number of high-precision sample images are higher, and the calculation amount of an expression calculation algorithm is larger.
Therefore, in order to reduce the cost of determining the expression coefficient of a target image, this specification provides a method for determining an expression coefficient in which images of a virtual object are generated from pre-generated sample expression coefficients, and the sample expression coefficients together with the images of the virtual object are used as samples to train an expression coefficient prediction model. This reduces the sample collection cost during training and, at the same time, reduces the complexity of expression calculation and improves its efficiency, thereby improving training and prediction efficiency.
The following provides a detailed description of examples of the present specification.
As shown in fig. 3, fig. 3 is a flowchart of a method for determining an expression coefficient provided by an embodiment of this specification, which includes the following steps:
s302, acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained by image conversion based on a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
s304, inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
s306, determining the first expression coefficient based on the second expression coefficient.
In the present specification embodiment, the real object may be a real photographed object (e.g., a human face, an animal face), or the like; the real target object may be a real object for which the first expression coefficient needs to be determined. The virtual object may be a non-real photographed object (e.g., a virtual digital face, a virtual animal face); the virtual target object may be a virtual object converted from the real target object.
In S302, an image of the virtual target object obtained by image conversion based on the real target object is obtained, where the image of the real target object may be an image containing the real target object captured in real time or an image containing the real target object stored in the storage device; the real target object can be a real face with an expression, such as a real human face and a real animal face, and the expression can be represented by an expression coefficient; correspondingly, the virtual target object can be a virtual face with an expression, such as a virtual face, a virtual animal face, and the like, and the expression can also be represented by an expression coefficient. The image may be one or more images respectively collected, or may be one or more video frames in the collected video.
Taking a live scene as an example, the image of the real target object may be an image including a live face captured by a camera on the side of the live client, and the live face is the real target object.
After the image of the real target object is acquired, the image of the virtual target object may be obtained based on image conversion of the real target object. For example, the image of the real face is converted into the image of the virtual face, and the image of the real face is converted into the image of the virtual cat face.
The image of the real target object may be converted into the image of the virtual target object according to a certain rule, so there is a definite correlation between the expression of the real target object and the expression of the virtual target object before and after the conversion; that is, the expression coefficient (first expression coefficient) of the real target object and the expression coefficient (second expression coefficient) of the virtual target object correspond to each other. For example, the two may be identical, or there may be a known conversion relationship between them. The following description takes the identical case as an example.
In some embodiments, the image of the real target object may be stylized, for example, to convert the image of the real target object into a virtual reality style, or into an animation style, or the like. After conversion, the real target object can be converted into a virtual target object, for example, a real face into a digital face. An image containing a virtual face (digital face) is shown in figure 4.
The expression of the virtual target object obtained by the style conversion technique may be identical to the expression of the real target object before conversion, in which case the first expression coefficient of the target object and the second expression coefficient of the virtual target object are identical.
In some embodiments, converting the image of the real target object with the style conversion technique may be done by multiple conversion models in cooperation, or by a single conversion model. In the embodiment of this specification, in order to reduce model cost and the complexity of style conversion, the style conversion may be performed on the image of the real target object with a single conversion model. Referring to fig. 5, an image of a real object is input into the conversion model, and the conversion model outputs the image of the corresponding virtual object. The effect of the conversion model can be seen in fig. 6, where (6-1) in fig. 6 is a real face image (mosaicked here) and (6-2) in fig. 6 is the image containing a digital face obtained by performing style conversion on the real face image of (6-1) with the conversion model. Before and after the style conversion, the expressions of the real face and the digital face are consistent.
In some embodiments, various types of conversion models may be selected according to actual needs. In order to achieve the effect of converting an image of a real object into an image of a virtual object while obtaining training samples for the conversion model at low cost, in the embodiment of this specification the conversion function is implemented by the generator of a cycle-consistent generative adversarial network (CycleGAN). The structure and training mode of the CycleGAN are shown in fig. 7.
The CycleGAN includes a generator (i.e., the conversion model) for converting an image of a real object into an image of a virtual object; a reverse generator (i.e., a reverse conversion model) for converting the image of the virtual object generated by the generator back into an image of the real object; and a discriminator for discriminating whether the image produced by the reverse generator is a generated image.
During training, a first sample image of a real sample object can be input into the conversion model, where the real sample object may be a sample face or the like; style conversion is performed on this first sample image by the conversion model to obtain a sample image of a virtual sample object, where the virtual sample object may be a virtual sample face or the like. Since the sample image of the virtual sample object is style-converted from the first sample image of the real sample object, the expressions of the virtual sample object and the real sample object correspond, and thus the expression coefficient of the real sample object corresponds to the expression coefficient of the virtual sample object. The sample image of the virtual sample object is then input into a reverse conversion model whose parameters correspond to those of the conversion model, and the reverse conversion model converts it into a second sample image of the real sample object. The second sample image is input into a discriminator, which discriminates whether the second sample image is a generated image or an image of the real sample object, and the conversion model is trained based on the discrimination result. An L1 loss can be used as the training loss function, and Adam can be used as the optimizer. In some specific embodiments, the conversion model may be trained for multiple epochs so that it achieves a better conversion effect.
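For illustration only, the following is a minimal PyTorch sketch of one training step in the spirit of the procedure described above; the tiny convolutional networks, the stand-in data loader, and the learning rates are placeholders assumed for the sketch, and a full CycleGAN would normally also add adversarial and identity losses in both conversion directions.

```python
import torch
import torch.nn as nn

# Minimal placeholder networks: in practice G, F, and D would be full image-to-image
# generators and a patch discriminator; simple conv stacks are used here only so the
# sketch runs end to end.
def conv_net(out_ch):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, out_ch, 3, padding=1))

G = conv_net(3)   # conversion model: real-face image -> virtual-face image
F = conv_net(3)   # reverse conversion model: virtual-face image -> real-face image
D = conv_net(1)   # discriminator: per-pixel "real vs. generated" logits

l1 = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()
opt_gen = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=2e-4)
opt_dis = torch.optim.Adam(D.parameters(), lr=2e-4)

real_face_loader = [torch.rand(4, 3, 128, 128) for _ in range(10)]  # stand-in for a real dataset

for real_img in real_face_loader:
    fake_virtual = G(real_img)            # sample image of the virtual sample object
    recon_real = F(fake_virtual)          # second sample image of the real sample object

    # Discriminator step: genuine images labeled 1, reconstructed images labeled 0.
    pred_real = D(real_img)
    pred_fake = D(recon_real.detach())
    d_loss = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    opt_dis.zero_grad(); d_loss.backward(); opt_dis.step()

    # Generator step: fool the discriminator and keep the L1 cycle reconstruction close.
    pred_fake_g = D(recon_real)
    g_loss = bce(pred_fake_g, torch.ones_like(pred_fake_g)) + l1(recon_real, real_img)
    opt_gen.zero_grad(); g_loss.backward(); opt_gen.step()
```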
In S304, the image of the virtual target object is input into the expression coefficient prediction model so that the model predicts and outputs the second expression coefficient of the virtual target object. In an embodiment of this specification, the expression coefficient prediction model may be trained with pre-generated sample expression coefficients and sample images including a virtual object, where the expression of the virtual object is driven based on the sample expression coefficients (i.e., each sample image is generated from its sample expression coefficients); the expression coefficient prediction model is therefore able to predict the expression coefficients of a virtual object.
In some embodiments, a large number of sample expression coefficients may be randomly generated in advance, for example, a large number of 52-dimensional expression coefficients or 240-dimensional expression coefficients are generated; and generating a sample image based on the sample expression coefficient, for example, generating an image of a virtual face. And taking the sample expression coefficients and the sample images as the input of an expression coefficient prediction model to train the expression coefficient prediction model.
In some embodiments, sample expression coefficients may be generated in advance, an image of a virtual animal face may be generated by using one part of the sample expression coefficients, an image of a virtual human face may be generated by using another part of the sample expression coefficients, and an expression coefficient prediction model may be trained by combining the image of the virtual human face, the image of the virtual animal face, and the sample expression coefficients, so that the expression coefficient prediction model may predict both the expression coefficients of the virtual animal face and the expression coefficients of the virtual human face.
In some embodiments, the sample image may be obtained using a rendering engine (e.g., a general-purpose engine such as Unreal Engine). Specifically, the sample expression coefficients may be input into the rendering engine, which drives the expression of the virtual object based on the sample expression coefficients and then generates a sample image that contains the virtual object and corresponds to the sample expression coefficients. The virtual object may be one pre-stored in the rendering engine, such as a pre-stored digital human face.
For example, inputting a set of 240-dimensional expression coefficients into the rendering engine, where the set of 240-dimensional expression coefficients corresponds to an expression N, the rendering engine drives the expression N of the digital face pre-stored in the rendering engine based on the set of 240-dimensional expression coefficients, and generates a sample image containing the digital face.
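A minimal sketch of this sample-generation loop is given below, assuming Python; the render_virtual_face wrapper, the coefficient range, the render resolution, and the dataset size are assumptions standing in for whatever rendering-engine interface is actually used.

```python
import numpy as np

NUM_SAMPLES = 30000   # assumed dataset size
COEFF_DIM = 240       # dimensionality of the sample expression coefficients
IMG_SIZE = 256        # assumed render resolution

def render_virtual_face(coeffs):
    """Hypothetical wrapper around the rendering engine: drive the pre-stored virtual
    face with the given blendshape coefficients and return the rendered image.
    A random image stands in here for the actual engine call."""
    return np.random.rand(IMG_SIZE, IMG_SIZE, 3).astype(np.float32)

samples = []
for _ in range(NUM_SAMPLES):
    coeffs = np.random.uniform(0.0, 1.0, size=COEFF_DIM).astype(np.float32)  # random sample expression coefficient
    image = render_virtual_face(coeffs)       # sample image containing the virtual object
    samples.append((image, coeffs))           # the coefficients serve as the training label
```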
In some embodiments, the sample images and the sample expression coefficients may be used as follows: the sample expression coefficients serve as labels of the sample images, and the labeled sample images are input into a coefficient regression network (i.e., the expression coefficient prediction model) to train it. In some embodiments, since the sizes of the collected sample images may differ while the input size of the coefficient regression network is fixed, the sample images may also be resized to 224 x 224 pixels before being fed into the network, which improves the efficiency with which the network extracts image features. In some embodiments, ShuffleNet (an efficient lightweight network) is used as the backbone network, an L1 loss is used as the loss function, and the Adam optimizer is used to optimize the network.
Of course, the choices of backbone network, loss function, and optimizer above are only exemplary and do not limit the embodiments of the present disclosure. One skilled in the art may select other backbone networks, loss functions, and optimizers to train the coefficient regression network as described above.
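For illustration, a minimal PyTorch sketch of the coefficient-regression training described above might look as follows; the ShuffleNet V2 variant, learning rate, batch size, and the stand-in data loader are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

COEFF_DIM = 240

# ShuffleNet V2 backbone with its classification head replaced by a 240-dim coefficient regressor.
net = models.shufflenet_v2_x1_0(weights=None)
net.fc = nn.Linear(net.fc.in_features, COEFF_DIM)

criterion = nn.L1Loss()                                    # L1 loss, as described above
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)    # Adam optimizer; learning rate is assumed
resize = transforms.Resize((224, 224))                     # adjust the sample images to 224 x 224

# Stand-in loader of (rendered image batch, sample expression coefficient batch) pairs.
sample_loader = [(torch.rand(8, 3, 256, 256), torch.rand(8, COEFF_DIM)) for _ in range(10)]

net.train()
for images, coeffs in sample_loader:
    pred = net(resize(images))            # the sample expression coefficients act as labels
    loss = criterion(pred, coeffs)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```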
In S306, the first expression coefficient is determined based on the second expression coefficient. That is, the expression coefficient of the real target object is determined based on the expression coefficient of the virtual target object. Since the image of the virtual target object is converted based on the image of the real target object, the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object, and the first expression coefficient of the real target object may be determined based on the second expression coefficient of the virtual target object according to the correspondence.
In some embodiments, when the first expression coefficient is the same as the second expression coefficient, the second expression coefficient may be directly determined as the first expression coefficient; when there is a conversion relationship between the second expression coefficient and the first expression coefficient, the second expression coefficient may be converted into the first expression coefficient based on the conversion relationship.
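As a small sketch of this step (an assumed interface, not the patent's implementation), the mapping can be the identity or an assumed linear conversion relationship:

```python
import numpy as np

def to_first_coefficient(second_coeff, conversion_matrix=None):
    """Map the second expression coefficient (virtual target object) to the first
    expression coefficient (real target object)."""
    if conversion_matrix is None:            # expressions identical before and after style conversion
        return second_coeff
    return conversion_matrix @ second_coeff  # assumed linear conversion relationship
```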
In some embodiments, after determining the first expression coefficient of the real target object, the expression of the avatar may be driven based on the first expression coefficient. For example, the expression of a virtual face, a virtual animal face, or the like may be driven based on the first expression coefficient to enrich the manner of utilization of the image of the real target object.
For example, in live broadcasting, an anchor client acquires an image including the anchor's face, determines a first expression coefficient of the anchor's face, drives the expression of a virtual cat face based on the first expression coefficient, and causes the anchor client and the viewer clients to display the cat face instead of the anchor's face, thereby enabling the anchor to perform virtual live broadcasting.
The above examples illustrate the embodiments of the present disclosure. By acquiring an image of a virtual target object corresponding to the image of the real target object, using the expression coefficient prediction model to determine the expression coefficient of the virtual target object in that image, and relying on the correspondence between the expression coefficients of the real and virtual target objects, the expression coefficient of the real target object is determined. The expression coefficient prediction model is trained only with pre-generated sample expression coefficients and images of a virtual object generated from those coefficients, so the expression coefficient of the real target object can be determined without high-cost collection of a large number of real sample images and without a computation-heavy algorithm.
Fig. 8 is a flowchart of an overall method for determining an expression coefficient according to an embodiment of the present disclosure. In the embodiment shown in the figure, it is assumed that the facial expression is driven based on the 240-dimensional expression coefficient.
First, the expression coefficient prediction model is trained in advance: 240-dimensional sample expression coefficients are randomly generated and input into a rendering engine that generates rendered images from expression coefficients, so that the rendering engine generates sample images according to the sample expression coefficients, each generated sample image containing a virtual object. The sample expression coefficients and the sample images are then used as training samples and input into the expression coefficient prediction model to train it, so that the model acquires the capability of determining the expression coefficient of a virtual object in an image of the virtual object.
The first expression coefficient of the real target object in the image of the real target object may then be determined by: acquiring an image of a real target object, for example acquiring an image of a real face; inputting the image of the real target object into a conversion model for style conversion by the conversion model, and converting the image of the real target object into an image of a virtual target object, for example, converting the image of a real face into an image of a virtual face (digital face); after the conversion model outputs the image of the virtual target object, inputting the image of the virtual target object into an expression coefficient prediction model so that the expression coefficient prediction model can predict a second expression coefficient of the virtual target object; and then determining a first expression coefficient of the real target object according to the second expression coefficient of the virtual target object.
The conversion model is used for carrying out style conversion on the image of the real target object, and before and after conversion, the expression of the real target object is consistent with that of the virtual target object, so that the expression coefficient of the real target object is the same as that of the virtual target object. Therefore, in this embodiment, the second expression coefficient of the virtual target object may be taken as the first expression coefficient of the real target object.
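Putting the inference-time steps of fig. 8 together, a minimal per-image sketch might look like the following, assuming the trained conversion model and expression coefficient prediction model are available as PyTorch modules (placeholders here) and that, as stated above, the expressions before and after style conversion are identical:

```python
import torch

@torch.no_grad()
def predict_first_coefficient(real_image, conversion_model, coeff_model):
    """real_image: (1, 3, H, W) tensor containing the real target object (e.g. a real face).
    conversion_model and coeff_model are the trained style-conversion generator and
    expression coefficient prediction model (placeholders for this sketch)."""
    virtual_image = conversion_model(real_image)   # style conversion: real face -> virtual (digital) face
    second_coeff = coeff_model(virtual_image)      # second expression coefficient of the virtual target object
    return second_coeff                            # taken directly as the first expression coefficient here
```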
In this embodiment, the real object may be a real photographed human face, an animal face, or the like; the real target object may be a real object for which the first expression coefficient needs to be determined. The virtual object may be a virtual face, a virtual animal face, or the like that is not actually photographed; the virtual target object may be a virtual human face, a virtual animal face, or the like converted from the real target object, and the embodiment of the present specification is not limited thereto.
With the method of this embodiment, it is only necessary to randomly generate 240-dimensional sample expression coefficients, use a rendering engine to generate sample images from them, and train the expression coefficient prediction model with the sample expression coefficients and sample images as samples. The image of the real target object is style-converted into an image of the virtual target object by the conversion model, the second expression coefficient of the virtual target object is predicted by the expression coefficient prediction model, and the first expression coefficient of the real target object is then determined. The scheme therefore does not require costly collection of high-precision sample images labeled with 240-dimensional expression coefficients to train the expression coefficient prediction model, nor a computation-heavy algorithm to solve for the expression coefficient of the real target object in its image, which reduces the cost of determining the expression coefficient of the real target object and saves computing resources.
Referring to fig. 9, an embodiment of the present specification further provides a live broadcast system for live virtual images, where the system includes an anchor client, a viewer client, and a server:
the anchor client is configured to: acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained by image conversion based on a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient;
driving the expression of the virtual image based on the first expression coefficient;
sending the avatar to the server;
the server is configured to: receiving the virtual image sent by the anchor client, and sending the virtual image to the viewer client;
the viewer client is configured to: receiving and displaying the virtual image sent by the server.
According to S901, the anchor client acquires an image of a real target object, which may be an image of the real target object acquired by a camera, and the real target object may be a face of a real anchor, or the like.
According to S902, the anchor client may convert the image of the real target object to obtain an image of the virtual target object based on the image of the real target object; since the image of the virtual target object is converted based on the image of the real target object, the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object.
According to the S903, the anchor client inputs the image of the virtual target object into the expression coefficient prediction model to obtain a second expression coefficient output by the expression coefficient prediction model; the training process of the expression coefficient prediction model comprises the following steps: driving the expression of the virtual object based on a pre-generated sample expression coefficient so as to generate a sample image containing the virtual object, and training an expression coefficient prediction model by using the sample expression coefficient and the sample image; therefore, the trained expression coefficient prediction model can predict the expression coefficient of the virtual object.
According to S904, since the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object, the anchor client may determine the first expression coefficient based on the second expression coefficient.
According to S905, the anchor client may drive the expression of an avatar based on the determined first expression coefficient, for example, the expression of a virtual cat face, a virtual human face, or the like.
According to S906, the anchor client transmits the avatar to the server to cause the server to perform S907, transmitting the avatar to the viewer client; so that the viewer client performs S908 to display the avatar after receiving the avatar.
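For illustration only, the anchor-client side of this interaction (S901 to S906) could be organized as a per-frame loop like the sketch below, reusing the predict_first_coefficient sketch from earlier; capture_frame, drive_avatar, and push_to_server are hypothetical placeholders for the actual capture, rendering, and networking interfaces.

```python
def anchor_client_loop(camera, conversion_model, coeff_model, server_conn):
    """Per-frame loop on the anchor client; the capture, rendering, and networking
    calls below are hypothetical placeholders, not part of this specification."""
    while camera.is_open():
        real_image = capture_frame(camera)                 # S901: image of the real target object
        first_coeff = predict_first_coefficient(           # S902-S904: convert, predict, map
            real_image, conversion_model, coeff_model)
        avatar_frame = drive_avatar(first_coeff)           # S905: drive the avatar's expression
        push_to_server(server_conn, avatar_frame)          # S906: send the avatar to the server
```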
Referring to fig. 10, an embodiment of the present specification further provides another avatar live broadcast system, which includes an anchor client, a viewer client, and a server:
the anchor client is configured to: acquiring an image of a real target object, and sending the image of the real target object to the server;
the server is configured to: receiving an image of the real target object sent by an anchor client, and acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained based on image conversion of the real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient;
driving the expression of the virtual image based on the first expression coefficient;
sending the avatar to the anchor client and the viewer client;
the anchor client is further configured to: receiving and displaying the virtual image sent by the server;
the viewer client is configured to: receiving and displaying the virtual image sent by the server.
According to S1001, the anchor client acquires an image of a real target object, which may be an image of the anchor client acquiring the real target object through a camera, and the real target object may be an anchor face.
According to S1002, the anchor client sends the image of the real target object to the server after acquiring the image of the real target object.
According to S1003, after receiving the image of the real target object, the server converts it into the image of the virtual target object; since the image of the virtual target object is converted from the image of the real target object, the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object.
According to the step S1004, the server inputs the image of the virtual target object into the expression coefficient prediction model, and obtains a second expression coefficient output by the expression coefficient prediction model; the training process of the expression coefficient prediction model comprises the following steps: driving the expression of the virtual object based on a pre-generated sample expression coefficient so as to generate a sample image containing the virtual object, and training an expression coefficient prediction model by using the sample expression coefficient and the sample image; therefore, the trained expression coefficient prediction model can predict the expression coefficient of the virtual object.
According to S1005, since the first expression coefficient of the real target object and the second expression coefficient of the virtual target object correspond, the server may determine the first expression coefficient based on the second expression coefficient.
According to S1006, the server may drive the expression of an avatar based on the determined first expression coefficient, for example, the expression of a virtual cat face, a virtual human face, or the like.
According to S1007, the server sends the avatar to the anchor client and the audience client, so that the anchor client executes S1008 and displays the avatar after receiving the avatar; and after the viewer client receives the avatar, performing S1009 to display the avatar.
Referring to fig. 11, an embodiment of the present specification further provides a method for model training, including the following steps:
s1102, driving the expression of a virtual object based on a pre-generated sample expression coefficient, and generating a sample image comprising the virtual object;
the expression of the virtual object may be driven by a rendering engine based on the randomly generated sample expression coefficients, and a sample image including the virtual object may be generated.
S1104, training the expression coefficient prediction model based on the sample image and the sample expression coefficient;
the expression coefficient prediction model is used for acquiring a second expression coefficient of a virtual target object based on an image of the virtual target object, the image of the virtual target object is obtained through image conversion of a real target object, and the first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object;
the second expression coefficient is used for determining the first expression coefficient; the first expression coefficient is used for driving the expression of the virtual image.
The expression coefficient prediction model is trained with sample images including the virtual object and the corresponding sample expression coefficients, so that it can predict the expression coefficients of a virtual object. The image of the real target object can then be converted into an image of the virtual target object, the second expression coefficient of the virtual target object can be determined with the expression coefficient prediction model, and the first expression coefficient of the real target object can be determined based on the second expression coefficient.
Referring to fig. 12, an embodiment of the present specification further provides an apparatus for determining an expression coefficient, including:
an obtaining module 1201, configured to acquire an image of a virtual target object, wherein the image of the virtual target object is obtained by image conversion from an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
a prediction module 1202, configured to input the image of the virtual target object into an expression coefficient prediction model and acquire a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
a determination module 1203, configured to determine the first expression coefficient based on the second expression coefficient.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the present specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the apparatus for determining the expression coefficients can be applied to computer equipment. In some embodiments, a camera is also included on the computer device; the camera is used for acquiring an image of a target object. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for operation through the processor in which the software implementation is located. From a hardware aspect, as shown in fig. 13, a hardware structure diagram of a computer device in which an apparatus for determining an expression coefficient according to an embodiment of the present disclosure is located is shown in fig. 13, except for the processor 1310, the memory 1330, the network interface 1320, and the nonvolatile memory 1340 shown in fig. 13, a server or an electronic device in which the apparatus 1331 is located in an embodiment may also include other hardware according to an actual function of the computer device, which is not described again.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (11)

1. A method of determining an expression coefficient, the method comprising:
acquiring an image of a virtual target object, wherein the image of the virtual target object is obtained by converting an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
inputting the image of the virtual target object into an expression coefficient prediction model, and acquiring a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained with pre-generated sample expression coefficients and sample images including a virtual object, and the expression of the virtual object is driven based on the sample expression coefficients;
determining the first expression coefficient based on the second expression coefficient.
2. The method of claim 1, wherein the obtaining the image of the virtual target object comprises:
and performing style conversion on the image of the real target object to obtain the image of the virtual target object.
3. The method of claim 2, wherein performing the style conversion on the image of the real target object comprises:
and performing style conversion on the image of the real target object through a conversion model.
4. The method of claim 3, wherein the transformation model is trained by:
carrying out style conversion on the first sample image of the real sample object through a conversion model to obtain a sample image of the virtual sample object;
inputting the sample image of the virtual sample object into a reverse conversion model to obtain a second sample image of the real sample object;
and inputting the second sample image into a discriminator, wherein the discriminator is used for discriminating whether the second sample image is a generated image or an image of a real sample object, and the conversion model is trained based on the discrimination result of the discriminator.
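The training procedure in claim 4 resembles a cycle-consistent adversarial setup: the conversion model maps a real sample image into the virtual style, a reverse conversion model maps it back, and a discriminator judges whether the reconstructed image is generated or real. The sketch below shows one illustrative PyTorch training step under that reading; the placeholder networks, losses, and learning rates are assumptions, not details taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def small_cnn(out_channels: int = 3) -> nn.Module:
    """Placeholder image-to-image network; any generator architecture could be used."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, out_channels, 3, padding=1),
    )


conversion_model = small_cnn()     # real style -> virtual style
reverse_model = small_cnn()        # virtual style -> real style
discriminator = nn.Sequential(     # real sample image vs. generated reconstruction
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)

gen_params = list(conversion_model.parameters()) + list(reverse_model.parameters())
opt_gen = torch.optim.Adam(gen_params, lr=2e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=2e-4)


def train_step(first_sample_image: torch.Tensor) -> None:
    """One illustrative optimisation step over a batch of real sample images."""
    virtual_sample = conversion_model(first_sample_image)   # style conversion
    second_sample = reverse_model(virtual_sample)           # reverse conversion

    # Discriminator: real sample images are labelled 1, reconstructions 0.
    d_real = discriminator(first_sample_image)
    d_fake = discriminator(second_sample.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()

    # Conversion models: fool the discriminator and keep the cycle consistent.
    d_fake_for_gen = discriminator(second_sample)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake_for_gen, torch.ones_like(d_fake_for_gen)) +
              F.l1_loss(second_sample, first_sample_image))
    opt_gen.zero_grad()
    loss_g.backward()
    opt_gen.step()
```

The L1 cycle term is an assumed addition that keeps the reconstruction close to the original real sample image; the claim itself only requires that the discrimination result be used to train the conversion model.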
5. The method of claim 1, wherein the first expression coefficient is used for driving an expression of an avatar.
6. The method of claim 1, wherein the sample image is obtained by:
inputting the sample expression coefficient into a rendering engine, and driving the expression of the virtual object in the rendering engine based on the sample expression coefficient to generate the sample image comprising the virtual object.
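Claim 6 pairs each pre-generated sample expression coefficient with an image rendered by driving a virtual object with that coefficient. A minimal data-generation sketch follows; `render_virtual_object` is a hypothetical stand-in for the rendering-engine call, and the uniform sampling of coefficients is only one possible choice.

```python
import numpy as np

NUM_COEFFS = 52          # assumed expression coefficient dimension
NUM_SAMPLES = 10_000     # assumed dataset size


def render_virtual_object(expression_coeff: np.ndarray) -> np.ndarray:
    """Hypothetical bridge to a rendering engine: drives the virtual object's
    expression with the given coefficient and returns the rendered RGB frame."""
    raise NotImplementedError("replace with the actual rendering-engine binding")


def generate_training_set():
    """Yields (sample_image, sample_expression_coefficient) pairs for training."""
    for _ in range(NUM_SAMPLES):
        # Pre-generate a sample expression coefficient, e.g. uniformly in [0, 1].
        coeff = np.random.uniform(0.0, 1.0, size=NUM_COEFFS).astype(np.float32)
        image = render_virtual_object(coeff)   # sample image containing the virtual object
        yield image, coeff
```

The point of the sketch is simply that every sample image comes with the exact coefficient that produced it, so the training data require no manual annotation.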
7. A live broadcast system for avatar live broadcasting, comprising an anchor client, a viewer client, and a server, wherein:
the anchor client is configured to: acquire an image of a virtual target object, wherein the image of the virtual target object is obtained by converting an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
input the image of the virtual target object into an expression coefficient prediction model, and acquire a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained using a pre-generated sample expression coefficient and a sample image comprising a virtual object, and the expression of the virtual object is driven based on the sample expression coefficient;
determine the first expression coefficient based on the second expression coefficient;
drive an expression of an avatar based on the first expression coefficient; and
send the avatar to the server;
the server is configured to: receive the avatar sent by the anchor client, and send the avatar to the viewer client; and
the viewer client is configured to: receive and display the avatar sent by the server.
8. A live broadcast system for avatar live broadcasting, comprising an anchor client, a viewer client, and a server, wherein:
the anchor client is configured to: acquire an image of a real target object, and send the image of the real target object to the server;
the server is configured to: receive the image of the real target object sent by the anchor client, and acquire an image of a virtual target object, wherein the image of the virtual target object is obtained by converting the image of the real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
input the image of the virtual target object into an expression coefficient prediction model, and acquire a second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained using a pre-generated sample expression coefficient and a sample image comprising a virtual object, and the expression of the virtual object is driven based on the sample expression coefficient;
determine the first expression coefficient based on the second expression coefficient;
drive an expression of an avatar based on the first expression coefficient; and
send the avatar to the anchor client and the viewer client;
the anchor client is further configured to: receive and display the avatar sent by the server; and
the viewer client is configured to: receive and display the avatar sent by the server.
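Claims 7 and 8 deploy the same pipeline in two places: claim 7 runs it on the anchor client, while claim 8 runs it on the server after the anchor uploads the raw frames. Below is a hedged sketch of a claim-8-style server loop in Python; the injected callables (`receive_frame_from_anchor`, `broadcast_avatar`, `avatar_renderer`) are hypothetical placeholders for the system's actual networking and rendering code.

```python
import torch


def serve_avatar_stream(style_converter, predictor, avatar_renderer,
                        receive_frame_from_anchor, broadcast_avatar) -> None:
    """Server-side loop sketched after claim 8.

    All five arguments are injected callables/models (placeholders here):
      * receive_frame_from_anchor() -> real-image tensor uploaded by the anchor client
      * style_converter / predictor -> conversion model and expression coefficient prediction model
      * avatar_renderer(coeff)      -> avatar frame driven by the first expression coefficient
      * broadcast_avatar(frame)     -> push the avatar to the anchor and viewer clients
    """
    with torch.no_grad():
        while True:
            real_image = receive_frame_from_anchor()
            if real_image is None:            # anchor ended the stream
                break
            virtual_image = style_converter(real_image)
            second_coeff = predictor(virtual_image)
            first_coeff = second_coeff        # assumed identity correspondence
            avatar_frame = avatar_renderer(first_coeff)
            broadcast_avatar(avatar_frame)
```

The claim-7 variant is the same loop run on the anchor client, with the camera capture replacing `receive_frame_from_anchor` and only the rendered avatar being sent over the network.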
9. A method of model training, the method comprising:
driving the expression of a virtual object based on a pre-generated sample expression coefficient to generate a sample image comprising the virtual object;
training an expression coefficient prediction model based on the sample image and the sample expression coefficient;
wherein the expression coefficient prediction model is used for acquiring a second expression coefficient of a virtual target object based on an image of the virtual target object, the image of the virtual target object is obtained by converting an image of a real target object, and a first expression coefficient of the real target object corresponds to the second expression coefficient of the virtual target object;
the second expression coefficient is used for determining the first expression coefficient, and the first expression coefficient is used for driving an expression of an avatar.
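The model-training method of claim 9 amounts to supervised regression: the rendered sample images are the inputs and the sample expression coefficients that drove them are the targets. The following PyTorch loop is a minimal sketch of that idea; the MSE loss, Adam optimiser, and hyper-parameters are assumptions rather than requirements of the claim.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def train_expression_predictor(model: torch.nn.Module,
                               sample_images: torch.Tensor,    # (N, 3, H, W) rendered frames
                               sample_coeffs: torch.Tensor,    # (N, num_coeffs) driving coefficients
                               epochs: int = 10,
                               batch_size: int = 32) -> torch.nn.Module:
    """Supervised training of the expression coefficient prediction model (claim 9)."""
    loader = DataLoader(TensorDataset(sample_images, sample_coeffs),
                        batch_size=batch_size, shuffle=True)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for images, coeffs in loader:
            pred = model(images)              # predicted second expression coefficients
            loss = F.mse_loss(pred, coeffs)   # regress to the coefficients that drove the render
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```

A predictor such as the `ExpressionCoefficientPredictor` sketched after claim 1 could be passed in as `model`.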
10. An apparatus for determining an expression coefficient, the apparatus comprising:
an acquisition module, configured to acquire an image of a virtual target object, wherein the image of the virtual target object is obtained by converting an image of a real target object, and a first expression coefficient of the real target object corresponds to a second expression coefficient of the virtual target object;
a prediction module, configured to input the image of the virtual target object into an expression coefficient prediction model and acquire the second expression coefficient output by the expression coefficient prediction model, wherein the expression coefficient prediction model is trained using a pre-generated sample expression coefficient and a sample image comprising a virtual object, and the expression of the virtual object is driven based on the sample expression coefficient; and
a determination module, configured to determine the first expression coefficient based on the second expression coefficient.
11. An apparatus for determining an expression coefficient, comprising a camera, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 6.
CN202210157073.6A 2022-02-21 2022-02-21 Method, device and equipment for determining expression coefficient and model training and live broadcast system Pending CN114529637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210157073.6A CN114529637A (en) 2022-02-21 2022-02-21 Method, device and equipment for determining expression coefficient and model training and live broadcast system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210157073.6A CN114529637A (en) 2022-02-21 2022-02-21 Method, device and equipment for determining expression coefficient and model training and live broadcast system

Publications (1)

Publication Number Publication Date
CN114529637A true CN114529637A (en) 2022-05-24

Family

ID=81624342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210157073.6A Pending CN114529637A (en) 2022-02-21 2022-02-21 Method, device and equipment for determining expression coefficient and model training and live broadcast system

Country Status (1)

Country Link
CN (1) CN114529637A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190384967A1 (en) * 2018-06-19 2019-12-19 Beijing Kuangshi Technology Co., Ltd. Facial expression detection method, device and system, facial expression driving method, device and system, and storage medium
CN109727303A (en) * 2018-12-29 2019-05-07 广州华多网络科技有限公司 Video display method, system, computer equipment, storage medium and terminal
CN112541445A (en) * 2020-12-16 2021-03-23 中国联合网络通信集团有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN113066156A (en) * 2021-04-16 2021-07-02 广州虎牙科技有限公司 Expression redirection method, device, equipment and medium
CN113537056A (en) * 2021-07-15 2021-10-22 广州虎牙科技有限公司 Avatar driving method, apparatus, device, and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination