WO2024169207A1 - Method of displaying performance content of virtual character and related device
- Publication number
- WO2024169207A1 (PCT/CN2023/124424)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature information
- information
- streaming audio
- audio segment
- virtual character
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
Definitions
- the present application relates to the field of computer technology, and in particular to a method for displaying performance content of a virtual character and related equipment.
- the generated performance content can be applied to the virtual character based on a piece of music provided by the user.
- the performance content of the virtual character can be generated based on the algorithm.
- the solution of generating dance movements based on the algorithm is efficient and generalizable, and only needs to input the music information to generate the matching performance content.
- this method generally adopts an offline model; that is, the complete music needs to be input, and the characteristic information of the music is extracted from the complete music to generate the corresponding performance content, so the response time is relatively long.
- the present application provides a method and related equipment for displaying the performance content of a virtual character, which can directly output the performance content of the virtual character corresponding to the input streaming audio clip without inputting the complete music, thereby reducing the response time of generating the facial expression information and body movement information of the virtual character.
- the present application provides a method for displaying performance content of a virtual character, which can be applied in the field of computer technology and can be implemented by a target model.
- the method includes:
- a first streaming audio segment is obtained; then, based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, first fused feature information is obtained, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second audio segment is before the playback time of the first audio segment; finally, facial expression information and body movement information of the virtual character are generated based on the first fused feature information.
- in the existing solution, the complete music needs to be input to generate the corresponding performance content, and the response time is long.
- the performance content of the virtual character is output in real time based on the input streaming audio clip, which reduces the response time of generating the performance content.
- the action feature information generated from the second streaming audio clip is also referenced, so as to achieve continuity between different audio clips of the same streaming audio in the performance content.
- the action feature information is added when generating the first fusion feature information to enhance the display effect of the performance content of the virtual character.
- the effect of the virtual character singing and dancing is achieved.
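- as a rough illustration of this streaming scheme, the sketch below shows one inference step in Python; audio_encoder, fusion_net, expression_head and motion_head are hypothetical components standing in for the target model's sub-networks, which this application does not name.

```python
import torch

def process_segment(audio_segment, prev_action_feat,
                    audio_encoder, fusion_net, expression_head, motion_head):
    """One streaming step: fuse the current segment's audio features with the
    action feature information carried over from the previously played segment,
    then decode facial expression and body movement information."""
    audio_feat = audio_encoder(audio_segment)                              # feature info of the current segment
    fused = fusion_net(torch.cat([audio_feat, prev_action_feat], dim=-1))  # first fusion feature info
    expression = expression_head(fused)                                    # facial expression information
    motion, action_feat = motion_head(fused)                               # body movement info + action features for the next step
    return expression, motion, action_feat
```

- because each call only needs the current segment plus the action features kept from the previous call, the complete piece of music never has to be available up front.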
- the method further includes:
- a third streaming audio segment is acquired; second fusion feature information is obtained based on feature information of the third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and a playback time of the first streaming audio segment is before a playback time of the third streaming audio segment;
- Facial expression information and body movement information of the virtual character are generated based on the second fused feature information.
- the facial expression information and body movement information of the virtual character corresponding to the third streaming audio segment can be generated based on the third streaming audio segment, thereby achieving the effect of online input of streaming audio segments and real-time output of the corresponding facial expression information and body movement information of the virtual character.
- when generating the facial expression information and body movement information of the virtual character corresponding to a later streaming audio segment, the action feature information generated from the earlier-played streaming audio segment is taken into account, thereby achieving continuity of the virtual character's actions across segments of the same streaming audio and improving the user experience.
- the feature information of the first streaming audio segment includes text feature information
- generating facial expression information of the virtual character based on the first fused feature information includes:
- the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
- the correlation between facial expression and mouth shape is strengthened when generating the facial expression of the virtual character, which not only increases the display dimension of facial expression information, enhances the display effect of facial expression information, but also improves the authenticity of facial expression.
- the facial expression information of other virtual characters is referred to, thereby achieving effective collaboration and cooperation of multiple virtual characters in facial expressions.
- the method further includes:
- the expression similarity between the multiple virtual characters is calculated
- the expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
- the second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the expression similarity between each virtual character and other virtual characters is calculated based on the expression feature information corresponding to different virtual characters, so that when generating the facial expression information of each virtual character, not only the facial expression information of the virtual character is considered, but also the facial expression information of other virtual characters is considered, so as to achieve effective collaboration and cooperation of multiple virtual characters in facial expressions.
- the second associated feature information generated before the playback time of the third streaming audio segment is used as an input when generating facial expression information for the third streaming audio segment, so as to achieve effective connection in facial expressions between streaming audio segments with different playback times, maintain the coherence of the facial expression information of the virtual characters, and further learn the temporal correlation of facial expressions, which helps maintain the temporal coherence of facial expressions and avoid sudden changes and jitter in facial expressions.
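- a minimal sketch of this similarity-weighted aggregation, assuming the expression feature information is held as one vector per virtual character (the shapes and normalization are illustrative assumptions, not details specified by this application):

```python
import torch
import torch.nn.functional as F

def cross_character_expression_features(expr_feats):
    """expr_feats: (num_characters, feat_dim) expression feature information,
    one row per virtual character. Pairwise expression similarity is used as
    weights to aggregate the characters' features into associated feature
    information for each character."""
    normed = F.normalize(expr_feats, dim=-1)
    similarity = normed @ normed.t()            # expression similarity between characters
    weights = F.softmax(similarity, dim=-1)     # similarity used as weights
    return weights @ expr_feats                 # second associated feature information
```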
- generating body movement information of the virtual character based on the first fused feature information includes:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of at least one virtual character.
- without such adjustment, the generated body movement information of the virtual character would lack movement details.
- the action feature information generated by the first streaming audio segment is adjusted based on the adjusted first fusion feature information, so that the body movement information generated by each virtual character contains more action details.
- each virtual character can take into account the body movement information of other virtual characters and achieve collaboration and cooperation with other virtual characters in body movements.
- obtaining motion feature information generated by the first streaming audio segment includes:
- action coding information corresponding to the plurality of virtual characters is obtained;
- action feature information corresponding to the action coding information of multiple virtual characters is obtained.
- the action coding information corresponding to the plurality of virtual characters is obtained by performing action coding on the feature information. Then, based on the correspondence between the action coding information and the action feature information, the action feature information corresponding to the action coding information of multiple virtual characters is obtained, thereby reducing the amount of information for generating the action feature information and improving the operation efficiency.
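- one way to realize such a correspondence between action coding information and action feature information is a learned codebook lookup; the sketch below assumes discrete action codes and an embedding table, which are illustrative choices rather than details fixed by this application:

```python
import torch
import torch.nn as nn

class ActionCodebook(nn.Module):
    """Maps discrete action coding information to action feature information
    through a learned lookup table, so only compact codes need to be passed
    between streaming segments."""
    def __init__(self, num_codes=512, feat_dim=256):
        super().__init__()
        self.table = nn.Embedding(num_codes, feat_dim)   # code -> action feature correspondence

    def forward(self, action_codes):
        # action_codes: (num_characters,) integer codes produced by action coding
        return self.table(action_codes)                  # (num_characters, feat_dim) action features
```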
- the method further includes:
- the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters.
- the fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the action similarity between each virtual character and the other virtual characters is calculated based on the action feature information generated by the first streaming audio segment, so that when generating the body movement information of each virtual character, not only the body movement information of that virtual character but also the body movement information of the other virtual characters is considered, so as to achieve effective cooperation and coordination of multiple virtual characters in body movement.
- the fourth associated feature information generated before the playback time of the third streaming audio segment is used as an input when generating the body movement information for the third streaming audio segment, so as to achieve effective connection in body movement between streaming audio segments with different playback times, maintain the coherence of the body movement information of the virtual character, and further learn the temporal correlation of body movements, which helps maintain temporal coherence and avoid sudden changes and jitter in the movements.
- the method further includes:
- Tag information is acquired, wherein the tag information is used to indicate a style of the facial expression information and/or body movement information of the virtual character;
- the facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
- the facial expression information and/or body movement information can be adjusted based on the tag information to generate performance content of different styles to adapt to different types of virtual characters, so that a user-controllable style can be generated according to the stylized tags specified by the user, thereby improving interactivity with the user.
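- as a sketch of how tag information could restyle the generated content, the example below uses a FiLM-style scale-and-shift conditioned on an embedded tag; this is an assumed mechanism for illustration, not the specific adjustment defined by this application:

```python
import torch
import torch.nn as nn

class TagStyleAdjuster(nn.Module):
    """Adjusts facial expression or body movement information according to a
    style tag: the tag is embedded and mapped to a per-dimension scale and
    offset applied to the generated content."""
    def __init__(self, num_tags=8, content_dim=256, tag_dim=64):
        super().__init__()
        self.tag_embed = nn.Embedding(num_tags, tag_dim)
        self.to_scale_shift = nn.Linear(tag_dim, 2 * content_dim)

    def forward(self, content, tag_id):
        # content: (batch, content_dim) generated expression/motion information
        scale, shift = self.to_scale_shift(self.tag_embed(tag_id)).chunk(2, dim=-1)
        return content * (1 + scale) + shift
```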
- the feature information of the first streaming audio segment includes text feature information
- generating facial expression information of the virtual character based on the first fused feature information includes:
- the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
- the correlation between facial expression and mouth shape is strengthened when generating the facial expression of the virtual character, which not only increases the display dimension of facial expression information, enhances the display effect of facial expression information, but also improves the authenticity of facial expression.
- by generating the facial expression information of the virtual character corresponding to the first streaming audio segment with reference to the facial expression information of the second streaming audio segment that has an earlier playback time, streaming audio segments with different playback times are effectively connected in facial expression, thereby enhancing the display effect of the facial expression of the virtual character and improving the user experience.
- generating body movement information of the virtual character based on the first fused feature information includes:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature as an offset to obtain body action information corresponding to the virtual character.
- without such adjustment, the generated body movement information of the virtual character would lack movement details.
- the movement feature information generated by the first streaming audio segment is adjusted based on the adjusted first fused feature information, so that the body movement information generated by each virtual character contains more movement details.
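- the offset-based adjustment described above can be sketched as follows; motion_decoder is a placeholder for whatever network decodes the adjusted features into body movement information, and is not named by this application:

```python
def decode_body_motion(action_feat, fused_feat, motion_decoder, assoc_feat=None):
    """Adds the adjusted fusion feature information (plus, in the multi-character
    case, the associated feature information) to the action features as an offset
    before decoding, so the body movement information keeps more detail."""
    offset = fused_feat if assoc_feat is None else fused_feat + assoc_feat
    return motion_decoder(action_feat + offset)
```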
- the present application provides a virtual character performance content display device, which can be used in the field of computer technology.
- the device includes a streaming audio segment acquisition module, a fusion feature information generation module and a performance content generation module. Among them,
- a streaming audio segment acquisition module used to acquire a first streaming audio segment
- a fusion feature information generating module configured to obtain first fusion feature information based on feature information of a first streaming audio segment and action feature information generated according to a second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second audio segment is before a playback time of the first audio segment;
- the performance content generation module is used to generate facial expression information and body movement information of the virtual character based on the first fused feature information.
- the method further includes:
- the streaming audio segment acquisition module is further used to acquire a third streaming audio segment
- the fusion feature information generating module is further used to obtain second fusion feature information based on feature information of a third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
- the performance content generation module is also used to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
- the feature information of the first streaming audio segment includes text feature information
- the performance content generation module is further configured to:
- the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
- the performance content generation module is further configured to:
- the expression similarity between the multiple virtual characters is calculated
- the expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
- the second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the performance content generation module is further configured to:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of at least one virtual character.
- obtaining motion feature information generated by the first streaming audio segment includes:
- action coding information corresponding to the plurality of virtual characters is obtained;
- action feature information corresponding to the action coding information of multiple virtual characters is obtained.
- the performance content generation module is further configured to:
- the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters.
- the fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the method further includes:
- Tag information is acquired, wherein the tag information is used to indicate a style of the facial expression information and/or body movement information of the virtual character;
- the facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
- the feature information of the first streaming audio segment includes text feature information
- the performance content generation module is further configured to:
- the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
- the performance content generation module is further configured to:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature as an offset to obtain body action information corresponding to the virtual character.
- the various modules included in the performance content display device of the virtual character can also be used to implement the steps in the various possible implementation methods of the first aspect.
- for the specific implementation methods of the second aspect of the embodiment of the present application and its various possible implementation methods, as well as the beneficial effects brought about by each possible implementation method, reference may be made to the description of the various possible implementation methods in the first aspect, which will not be repeated here one by one.
- an embodiment of the present application provides a model training method, including:
- Acquire a first streaming audio segment; obtain first fused feature information based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second streaming audio segment is before the playback time of the first streaming audio segment; generate facial expression information and body movement information of a virtual character based on the first fused feature information; obtain a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, the target loss being used to indicate the difference between the facial expression information and body movement information and the real facial expression information and body movement information; and, based on the target loss, update the parameters of the model to be trained until the model training conditions are met, to obtain the target model.
- the target model can be used to execute the steps in the aforementioned first aspect or various possible implementation methods of the first aspect, which will not be described one by one here.
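- a condensed sketch of one training update under these definitions; the mean square error is used here as the target loss purely as an example of the losses mentioned later in this description (the application leaves the concrete loss open), and model is assumed to return the expression, the motion and the carried-over action features:

```python
import torch.nn.functional as F

def training_step(model, optimizer, audio_seg, prev_action_feat,
                  real_expression, real_motion):
    """Generate expression and motion for a streaming segment, measure the
    target loss against the real (ground-truth) values, and update the
    parameters of the model to be trained."""
    expression, motion, _ = model(audio_seg, prev_action_feat)
    target_loss = (F.mse_loss(expression, real_expression) +
                   F.mse_loss(motion, real_motion))
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()
```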
- an embodiment of the present application provides a model training device, comprising:
- a streaming audio segment acquisition module used to acquire a first streaming audio segment
- a fusion feature information generating module configured to obtain first fusion feature information based on feature information of a first streaming audio segment and action feature information generated according to a second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second audio segment is before a playback time of the first audio segment;
- a performance content generation module used to generate facial expression information and body movement information of the virtual character based on the first fused feature information
- a target loss acquisition module, used to obtain a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, wherein the target loss is used to indicate the difference between the facial expression information and body movement information and the real facial expression information and body movement information;
- the target model training module is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
- the present application provides a computing device cluster, comprising at least one computing device, each computing device comprising a processor and a memory;
- the processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method in the above-mentioned first aspect or any possible implementation manner of the first aspect.
- an embodiment of the present application provides a training device, including a processor and a memory, wherein the processor is coupled to the memory.
- the memory is used to store a program.
- the processor is used to execute the program in the memory, so that the execution device executes the method of the third aspect above.
- an embodiment of the present application provides a computer storage medium, characterized in that it includes instructions, which, when executed on a computing device, enable the computing device to execute the method in the above-mentioned first aspect or any possible implementation of the first aspect, or the method in the above-mentioned third aspect.
- the present application provides a computer program product comprising instructions, which, when executed by a computing device cluster, enable the computing device cluster to execute the method in the first aspect or any possible implementation manner of the first aspect, or the method in the third aspect.
- an embodiment of the present application provides a circuit system, the circuit system including a processing circuit, the processing circuit being configured to execute the method described in the first aspect or any possible implementation of the first aspect, or the method of the third aspect.
- the present application provides a chip system, including a processor and a memory, the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method as described in the first aspect or any possible implementation of the first aspect, or the method of the third aspect.
- the chip system can be composed of a chip, or it can include a chip and other discrete devices.
- FIG1 is a schematic diagram of a structure of an artificial intelligence main framework used in an embodiment of the present application.
- FIG2 is a system architecture diagram provided in an embodiment of the present application.
- FIG3 is a flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application
- FIG4 is another system architecture diagram provided in an embodiment of the present application.
- FIG5 is another schematic flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application.
- FIG6 is a schematic diagram of a process for generating first fusion feature information provided in an embodiment of the present application.
- FIG7 is a schematic diagram of a structure of a stylization processing module provided in an embodiment of the present application.
- FIG8 is a schematic diagram of a process for generating facial expression information of a virtual character provided in an embodiment of the present application.
- FIG9 is a schematic diagram of a structure of a multi-person action decoder provided in an embodiment of the present application.
- FIG10 is a schematic diagram of a flow chart of generating body movement information of a virtual character provided in an embodiment of the present application.
- FIG11 is another schematic flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application.
- FIG12 is another schematic diagram of a process for generating facial expression information of a virtual character provided in an embodiment of the present application.
- FIG13 is another schematic diagram of a flow chart of generating body movement information of a virtual character provided in an embodiment of the present application.
- FIG14 is a flow chart of a model training method provided in an embodiment of the present application.
- FIG15 is a schematic diagram of a structure of a device for displaying performance content of a virtual character provided in an embodiment of the present application.
- FIG16 is a schematic diagram of a structure of a model training device provided in an embodiment of the present application.
- FIG17 is a schematic diagram of a structure of a computing device provided in an embodiment of the present application.
- FIG18 is a schematic diagram of a structure of a computer device cluster provided in an embodiment of the present application.
- FIG19 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- FIG20 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- A and/or B can represent: A exists alone, both A and B exist, or B exists alone.
- the character "/" in this application generally indicates that the associated objects before and after are in an "or" relationship.
- the meaning of "at least one" refers to one or more, and the meaning of "plurality" refers to two or more. It is understood that in the present application, "when" and "if" both mean that the device performs the corresponding processing under certain objective circumstances; they do not limit the time, do not require that there must be a judgment action when the device is implemented, and do not imply any other limitation.
- the special word "exemplary" means "used as an example, embodiment or illustration". Any embodiment described as "exemplary" is not necessarily to be interpreted as superior to or better than other embodiments.
- artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
- Research in the field of artificial intelligence includes robots, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
- Figure 1 is a structural diagram of the artificial intelligence main framework applied in the embodiment of the present application.
- the above artificial intelligence main framework is explained from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
- the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data-information-knowledge-wisdom".
- the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
- the infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. The infrastructure communicates with the outside world through sensors. Computing power is provided by smart chips, such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) and other hardware acceleration chips. The basic platform includes distributed computing frameworks, networks and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, video, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
- machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
- Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing (such as image recognition and object detection), speech recognition, etc.
- Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart manufacturing, smart transportation, smart home, smart medical care, smart security, autonomous driving, smart terminals, etc.
- the embodiments of the present application involve a large number of neural network-related applications.
- the relevant terms and concepts of the neural network that may be involved in the embodiments of the present application are first introduced below.
- a neural network may be composed of neural units.
- a neural unit may refer to an operation unit that takes x_s (s = 1, 2, ..., n) as inputs, and the output of the operation unit may be: h_{W,b}(x) = f(W^T x + b) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)
- where n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into the output signal.
- the output signal of the activation function can be used as the input of the next convolution layer, and the activation function can be a sigmoid function.
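- in code, a single neural unit with a sigmoid activation amounts to the following (a generic illustration of the formula above, not code from this application):

```python
import numpy as np

def neural_unit(x, W, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b), with f a sigmoid."""
    z = float(np.dot(W, x) + b)
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation f
```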
- a neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- a recurrent neural network (RNN) is used to process sequence data.
- in a traditional neural network model, the layers are fully connected, while the nodes within each layer are unconnected.
- although this ordinary neural network has solved many difficult problems, it is still powerless for many others. For example, to predict the next word in a sentence, the previous words are generally needed, because the words in a sentence are not independent. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
- the specific manifestation is that the network remembers the previous information and applies it to the calculation of the current output; that is, the nodes in the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
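- a minimal recurrent step illustrating this dependence on the previous hidden state (tanh is an assumed activation, and the weight names are placeholders):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """The hidden layer's input combines the current input x_t with the
    hidden state h_prev from the previous moment."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
```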
- the Transformer structure is a feature extraction network (similar to a convolutional neural network) that includes an encoder and a decoder.
- Encoder: performs feature learning over a global receptive field through self-attention, for example on pixel features.
- Decoder: learns the features of the required outputs, such as the features of the output boxes, through self-attention and cross-attention.
- a multilayer perceptron (MLP) is a feedforward artificial neural network model.
- an MLP is an artificial neural network (ANN) based on a fully connected (FC) forward structure, which contains from a dozen to hundreds of artificial neurons (AN, hereinafter referred to as neurons).
- MLP organizes neurons into a multi-layer structure, and uses a full connection method between layers to form an ANN with multi-weighted connection layers connected layer by layer.
- MLP contains an input layer (this layer does not actually contain operations), one or more hidden layers, and an output layer.
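- for illustration, a small MLP with two fully connected hidden layers can be written as follows (the layer sizes are arbitrary examples):

```python
import torch.nn as nn

# Input layer (no operations), two fully connected hidden layers with
# nonlinear activations, and an output layer.
mlp = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 64),
)
```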
- the loss function can also be called the cost function, which is a metric that compares the difference between the predicted output of a machine learning model for a sample and the true value of the sample (also called the supervised value).
- the loss function can usually include mean square error, cross entropy, logarithm, exponential and other loss functions.
- for example, the mean square error can be used as the loss function, defined as MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, where y_i is the true value and \hat{y}_i is the predicted value. The specific loss function can be selected according to the actual application scenario.
- the present application can be applied to application scenarios in which facial expression information and body movement information of virtual characters are generated based on music in the above-mentioned application fields.
- facial expression information and body movement information of virtual characters corresponding to different scenes can be generated based on the soundtrack in the movie, the music in the game, the music in the large-scale evening party or humming, etc.
- the application scenarios of the embodiments of the present application are not exhaustively listed here.
- the system architecture 200 includes a database 230 and a client device 240.
- the data acquisition device 260 is used to collect data and store it in the database 230, and the training module 202 generates a target model/rule 201 based on the data maintained in the database 230.
- the target model/rule 201 is the target model mentioned in the following implementation of the present application. Please refer to the relevant descriptions in the following Figures 5-14.
- the calculation module 211 may include a training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices.
- the execution device 210 is configured with a transceiver 212, which may be a wireless transceiver, an optical transceiver, or a wired interface (such as an I/O interface), etc., to interact with external devices for data, and a "user" may input data to the transceiver 212 through a client device 240.
- the client device 240 may send a target task to the execution device 210, request the execution device to train a neural network, and send a database for training to the execution device 210.
- the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
- the calculation module 211 uses the target model/rule 201 to process the input data. Specifically, in a possible implementation, please refer to Figure 3, which is a flow chart of a method for displaying the performance content of a virtual character provided in an embodiment of the present application. The method can be executed by the calculation module 211 as shown in Figure 2, specifically, by the execution device 210.
- the execution device 210 is used to: S1, obtain a first streaming audio segment; S2, based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment, obtain first fusion feature information, the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second audio segment is before the playback time of the first audio segment; S3, generate facial expression information and body movement information of the virtual character based on the first fusion feature information.
- the transceiver 212 returns the output result of the target model to the client device 240.
- the user can input an audio clip through the client device 240, and the facial expression information and body movement information of the virtual character are output through the target model and fed back to the client device 240.
- the training module 202 can obtain corresponding target models/rules 201 for different tasks based on different data to provide users with better results.
- the data input into the execution device 210 can be determined based on the user's input data.
- the user can operate in the interface provided by the transceiver 212.
- the client device 240 can automatically input data into the transceiver 212 and obtain the result. If the automatic data input of the client device 240 requires the user's authorization, the user can set the corresponding authority in the client device 240. The user can view the result output by the execution device 210 in the client device 240.
- the specific presentation form can be specific forms such as expressions and actions.
- the client device 240 can also serve as a data collection terminal to store the collected data associated with the target task into the database 230.
- the training or updating process mentioned in the present application can be performed by the training module 202. It is understandable that the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix. The purpose of training a neural network is to make its output as close to the expected value as possible. Therefore, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the expected value (of course, the weight vectors are usually initialized before the first update, that is, the parameters of each layer of the deep neural network are pre-configured).
- for example, if the predicted value of the network is too high, the values of the weights in the weight matrix are adjusted to lower the prediction; after continuous adjustment, the value output by the neural network approaches or equals the expected value.
- the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference.
- the training of the neural network can be understood as the process of minimizing the loss as much as possible.
- the process of updating the weight of the starting network and training the serial network in the following embodiments of the present application can refer to this process, which will not be repeated below.
- a target model/rule 201 is obtained by training the training module 202.
- the target model/rule 201 in the embodiments of the present application may include, for example, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), etc.
- the neural networks mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN) or residual networks and other neural networks.
- the database 230 can be used to store sample sets for training.
- the execution device 210 generates a target model/rule 201 for processing samples, and iteratively trains the target model/rule 201 using the sample set in the database to obtain a mature target model/rule 201, which is specifically represented by a neural network.
- the neural network obtained by the execution device 210 can be applied to different systems or devices.
- the execution device 210 can call the data, code, etc. in the data storage system 250, or store the data, instructions, etc. in the data storage system 250.
- the data storage system 250 can be placed in the execution device 210, or the data storage system 250 can be an external memory relative to the execution device 210.
- the calculation module 211 can process the samples obtained by the execution device 210 through the neural network to obtain the prediction result.
- the specific expression form of the prediction result is related to the function of the neural network.
- FIG2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 250 is an external memory relative to the execution device 210. In other scenarios, the data storage system 250 can also be placed in the execution device 210.
- the target model/rule 201 trained by the training module 202 can be applied to different systems or devices, such as mobile phones, tablet computers, laptops, augmented reality (AR)/virtual reality (VR), vehicle terminals, etc., and can also be servers or cloud devices.
- the target model/rule 201 in the embodiment of the present application may be a model for executing the performance content display method of the virtual character provided in the present application, that is, the target model/rule 201 may be a neural network provided in the present application for generating facial expression information and body movement information of the virtual character.
- the target model provided in the embodiment of the present application may include one or more networks such as CNN, deep convolutional neural networks (DCNN), recurrent neural networks (RNN), etc.
- Figure 4 is another system architecture diagram provided by an embodiment of the present application.
- the execution device 210 is implemented by one or more servers, and optionally cooperates with other computing devices, such as data storage, routers, load balancers and other devices; the execution device 210 can be arranged at one physical site, or distributed at multiple physical sites.
- the execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the steps of the training method for the computing device corresponding to the following Figure 14 of the present application.
- Each local device can represent any computing device, such as a personal computer, a computer workstation, a smart phone, a tablet computer, a smart camera, a smart car or other type of cellular phone, a media consumption device, a wearable device, a set-top box, a game console, etc.
- the local device of each user can interact with the execution device 210 through a communication network of any communication mechanism/communication standard, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
- the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, etc.
- the wireless network includes but is not limited to: a fifth-generation mobile communication technology (5th-Generation, 5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM), a code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, the Zigbee protocol, radio frequency identification (RFID), long-range (Lora) wireless communication, and near-field communication (NFC), or any combination of one or more of them.
- the wired network may include an optical fiber communication network or a network composed of coaxial cables, etc.
- one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 401 may provide local data or feedback calculation results to the execution device 210.
- the local device may also be referred to as a computing device.
- the local device 401 implements the functions of the execution device 210 and provides services to its own user, or provides services to the user of the local device 402.
- the real-time motion capture solution mainly includes three steps: putting on wearable devices, real-time motion capture, and character driving.
- the dancers put on the motion capture suits in advance and complete the calibration; then, the staff starts dancing and uses the motion capture software to collect the corresponding body movement sequences in real time; finally, the collected body movement sequences are transferred to the virtual character to be driven through motion redirection, driving the virtual character to show the body movements corresponding to the music.
- Real-time motion capture uses a real-person driven approach, which allows staff to drive virtual characters in real time, and the body movements are realistic and rich.
- real-time motion capture is expensive and requires a high level of professionalism from the staff.
- the real-time motion capture method is not generalizable and requires the staff's body movements to be re-collected each time to drive the virtual character, which is not suitable for application scenarios with large demands.
- Generating body movements based on algorithms can greatly improve the efficiency of body movement generation. Compared with simply generating body movements, more attention is currently paid to generating body movements based on music. That is, input complete music and use algorithms to generate body movements that match the input music in terms of rhythm, style, and other dimensions.
- the algorithm-based solution for generating body movements is efficient and generalizable, but it is mainly an offline model, that is, it is necessary to input complete music to generate corresponding body movements. This requires extracting information such as the rhythm of the music from the complete music.
- the offline model has a long response time, which limits the potential application scenarios of music-driven body movement generation.
- the generated dance moves are of fixed style and cannot be well adapted to the styles of different types of characters.
- the performance content of the virtual character is output in real time based on the input streaming audio clip, which reduces the response time of generating the performance content.
- based on the associated features of multiple virtual characters in facial expressions and body movements, effective collaboration and coordination of the multiple virtual characters in facial expressions and body movements is achieved.
- the styles of different types of virtual characters are effectively adapted.
- the effect of the virtual character singing and dancing is achieved.
- the embodiment of the present application provides a method for displaying the performance content of a virtual character.
- the following describes the specific implementation process of the reasoning phase and the training phase of the method for displaying the performance content of a virtual character provided by the embodiment of the present application in combination with the accompanying drawings and application scenarios.
- the corresponding implementation method can be executed according to the number of virtual characters selected by the user to generate performance content information of the virtual characters controllable by the user. This allows the execution device to generate facial expression information and body movement information of multiple virtual characters according to parameters such as the number of virtual characters set by the user, thereby improving user interactivity.
- the reasoning stage describes the process of how the execution device 210 uses the target model/rule 201 to process the collected information data to generate a prediction result.
- Figure 5 is another flow chart of the method for displaying the performance content of a virtual character provided in the embodiment of the present application. The method can be implemented by the target model and includes steps 501 to 517.
- Step 501 An execution device obtains a first streaming audio segment.
- the execution device obtains the first streaming audio segment in real time.
- Streaming audio means that audio can be input in real time in a streaming manner without the need to input complete audio, thereby achieving the effect of driving the character while playing music. It can be applied to online scenarios and reduce response time.
- the execution device obtains audio information of the first streaming audio segment.
- the execution device obtains audio information of the first stream audio segment, and needs to extract corresponding text information from the audio information of the first stream audio segment.
- the execution device may use automatic speech recognition technology (ASR) to parse corresponding text information from the audio information.
- the execution device may use a variety of speech recognition technologies to parse the corresponding text information from the audio information, which is not limited here.
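- as an illustrative, non-authoritative sketch (the application does not prescribe a specific ASR engine), the open-source Whisper toolkit could be used to parse text from a buffered audio segment; the file name is a placeholder:

```python
import whisper  # openai-whisper, used here only as an example ASR toolkit

# a minimal sketch, assuming the streaming segment has been buffered to a WAV file
model = whisper.load_model("base")
result = model.transcribe("first_segment.wav")
text_information = result["text"]  # recognized lyrics / spoken text
print(text_information)
```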
- the execution device obtains the audio information and text information of the first streaming audio segment.
- the execution device can directly obtain the text information of the first stream audio segment.
- for example, when the first streaming audio segment carries lyrics text, the execution device can directly obtain both the audio information of the first streaming audio segment and the corresponding text information, and no longer needs to parse the text information from the audio information of the first streaming audio segment.
- Step 502 The execution device extracts feature information of the first streaming audio segment from the first streaming audio segment.
- after acquiring the audio information and text information of the first streaming audio segment, the execution device extracts the feature information of the first streaming audio segment, wherein the feature information of the first streaming audio segment includes the first audio feature information and the text feature information.
- the execution device uses open source libraries such as librosa and madmom to extract first audio feature information, such as onset and chromagram features, from the audio information, as sketched below.
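- a minimal sketch of this extraction step with librosa (the sampling rate and hop length are illustrative assumptions; madmom could be used similarly for beat/onset features):

```python
import librosa

y, sr = librosa.load("first_segment.wav", sr=22050)
# onset strength envelope and chromagram as first audio feature information
onset_envelope = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)
chromagram = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)
first_audio_features = (onset_envelope, chromagram)
```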
- the execution device uses a text encoder module in a Contrastive Language-Image Pre-Training (CLIP) model to extract text feature information.
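- a minimal sketch of extracting text feature information with the text encoder of a CLIP model via the Hugging Face transformers library; the checkpoint name and pooling choice are illustrative assumptions:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["twinkle twinkle little star"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
text_feature_information = outputs.pooler_output  # (batch, hidden_dim) text features
```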
- the execution device extracts the first audio feature information and the text feature information from the first streaming audio segment. In an optional implementation manner, the feature information of the first streaming audio segment further includes rhythm feature information.
- the execution device uses an MLP network to predict the rhythm feature information from the first audio feature information.
- the embodiment of the present application uses an MLP network to achieve the matching of audio clips with facial expression information and body movement information in rhythm, and continuously optimizes the network through training and learning.
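- a minimal sketch of such an MLP rhythm predictor in PyTorch; all feature dimensions are illustrative assumptions rather than values from the application:

```python
import torch
import torch.nn as nn

rhythm_predictor = nn.Sequential(
    nn.Linear(128, 256),  # maps first audio feature information to a hidden layer
    nn.ReLU(),
    nn.Linear(256, 64),   # outputs rhythm feature information
)
first_audio_features = torch.randn(1, 128)  # placeholder audio feature vector
rhythm_features = rhythm_predictor(first_audio_features)
```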
- the feature information of the first streaming audio segment further includes phoneme feature information.
- the execution device extracts phoneme feature information from text feature information using an open source library and a text alignment tool.
- the execution device can use open source libraries such as Phonemizer to extract phoneme information, and use alignment tools such as MFA (Montreal Forced Aligner) to extract timestamp information to achieve alignment of audio and text in the time dimension.
- Phoneme represents a basic unit of sound production.
- Phoneme feature information includes multiple phonemes decomposed from the text and the duration of each phoneme.
- the embodiment of the present application uses an open source library to extract phonemes from audio text, and uses an MFA tool to align text and audio to obtain the timestamp information corresponding to each phoneme.
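- a minimal sketch of the phoneme decomposition step with the Phonemizer library (the espeak backend is an illustrative choice); the per-phoneme timestamps would come from a forced aligner such as MFA, which is run separately as a command-line tool:

```python
from phonemizer import phonemize

lyrics = "twinkle twinkle little star"
phonemes = phonemize(lyrics, language="en-us", backend="espeak", strip=True)
print(phonemes)  # phoneme string; durations/timestamps come from the MFA alignment step
```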
- Step 503 The execution device obtains motion feature information generated by the second streaming audio segment.
- the execution device adds the action feature information when generating the first fusion feature information.
- the execution device obtains the action feature information generated by the second streaming audio segment to achieve the connection between different audio segments of the same streaming audio in terms of performance content.
- the second stream audio segment can be any audio segment before the first stream audio segment is played, for example, the previous audio segment of the first stream audio segment, etc., and can be set according to actual needs and is not limited here.
- the action feature vector can be set randomly, by a user-specified method, or from historical values.
- Step 504 The execution device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
- the execution device fuses the feature information of the first streaming audio segment and the feature information of the action generated by the second streaming audio segment to obtain first fused feature information.
- a specific method for the execution device to obtain the first fusion feature information includes:
- the execution device performs feature fusion on the first audio feature information and the rhythm feature information of the first streaming audio segment to obtain second audio feature information;
- the execution device concatenates the text feature information, the second audio feature information and the action feature information, and inputs the concatenated information into the first neural network to obtain the first fusion feature information.
- FIG. 6 is a flow chart of generating the first fused feature information provided in an embodiment of the present application.
- the execution device performs feature fusion on the first audio feature information and the predicted rhythm feature information to obtain the second audio feature information. Then, the execution device splices the text feature information, the second audio feature information and the action feature information, and inputs them into the first neural network to obtain the first fused feature information.
- the execution device concatenates the text feature information, the first feature information, and the action feature information, and inputs the concatenated information into a multi-layer Transformer network to output the first fused feature information.
- the type of the first neural network can be set according to actual needs, which is only used as an example and not limited here.
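- a minimal sketch of the concatenation-plus-Transformer fusion described above; all dimensions, sequence lengths, and layer counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

text_feat = torch.randn(1, 10, 512)    # (batch, frames, dim) text feature information
audio_feat = torch.randn(1, 10, 512)   # second audio feature information
action_feat = torch.randn(1, 10, 512)  # action feature information from the previous segment

fused_input = torch.cat([text_feat, audio_feat, action_feat], dim=-1)  # (1, 10, 1536)
encoder_layer = nn.TransformerEncoderLayer(d_model=1536, nhead=8, batch_first=True)
fusion_network = nn.TransformerEncoder(encoder_layer, num_layers=4)    # multi-layer Transformer
first_fused_features = fusion_network(fused_input)
```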
- after steps 501 to 504, there may be multiple implementations depending on the content to be generated.
- the following takes the execution device generating the facial expression information and body movement information of the virtual character as an example to illustrate the following steps. It should be understood that in actual application, the execution device synchronously outputs the facial expression information and body movement information of the virtual character to drive the virtual character to achieve the effect of singing and dancing.
- the execution device executes steps 505-510 and steps 511-517 in parallel.
- Step 505 The execution device obtains first associated feature information on expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment.
- the execution device obtains the first associated feature information on the expressions of the multiple virtual characters generated based on the second streaming audio segment, and generates the facial expression information corresponding to the first streaming audio segment based on the first associated feature information.
- in this way, while generating the facial expression information corresponding to each virtual character, the facial expression information of the other virtual characters is referenced, so as to achieve effective collaboration and cooperation of the multiple virtual characters in facial expressions.
- if the first streaming audio segment is the first audio segment obtained, or the method is executed for the first time, the first associated feature information defaults to the feature information corresponding to the initial expression.
- the initial expression can be set according to actual needs, for example a default expressionless face, and is not limited here.
- Step 506 The execution device obtains weight information of a basic expression corresponding to at least one virtual character based on the first fusion feature information, the phoneme feature information and the first associated feature information.
- the execution device generates weight information of a basic expression corresponding to at least one virtual character based on the first fused feature information, the phoneme feature information extracted from the text feature information, and the first associated feature information generated from the second streaming audio segment.
- the execution device inputs the first fusion feature information, the phoneme feature information and the first association feature information into a multi-layer AcLSTM network, and outputs weight information of a basic expression corresponding to at least one virtual character by autoregression.
- the execution device can output the weight information of the basic expressions corresponding to multiple virtual characters in an autoregressive manner. That is, when the execution device outputs the weight information of the basic expressions corresponding to multiple virtual characters corresponding to the first streaming audio segment, the second streaming audio segment is used to generate the first associated feature information.
- the second streaming audio segment is the previous audio segment, or an earlier audio segment, of the first streaming audio segment.
- similarly, when the execution device outputs the weight information of the basic expressions corresponding to the third streaming audio segment, the first streaming audio segment is used to generate the second associated feature information.
- the third streaming audio segment is the next audio segment, or a later audio segment, of the first streaming audio segment.
- the execution device uses the associated feature information output before the playback time of the current streaming audio segment as input for re-iteration, so as to further utilize the temporal correlation of the learned facial expressions, maintain the temporal continuity of the facial expressions, and avoid sudden changes and jitters in facial expressions.
- the AcLSTM (Auto-Conditioned LSTM) network can better handle the problem of generating long sequences, that is, it can avoid or minimize the situation where the generated facial expressions no longer change or change less as the processing time increases when generating long facial expression sequences.
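- a minimal sketch of autoregressively producing per-frame basic-expression weights with an LSTM; the auto-conditioning training trick (periodically feeding the network its own outputs instead of ground truth) is omitted, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ExpressionAutoregressor(nn.Module):
    def __init__(self, cond_dim=512, weight_dim=52, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(cond_dim + weight_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, weight_dim)

    def forward(self, cond, steps=30):
        # cond: (batch, cond_dim) fused + phoneme + first associated feature information
        prev = cond.new_zeros(cond.shape[0], self.head.out_features)
        state, outputs = None, []
        for _ in range(steps):
            x = torch.cat([cond, prev], dim=-1).unsqueeze(1)  # feed previous frame back in
            y, state = self.lstm(x, state)
            prev = self.head(y[:, -1])                        # next-frame expression weights
            outputs.append(prev)
        return torch.stack(outputs, dim=1)                    # (batch, steps, weight_dim)

weights = ExpressionAutoregressor()(torch.randn(2, 512))
```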
- Step 507 The execution device adjusts the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character to obtain facial expression information corresponding to the at least one virtual character.
- the execution device, based on the weight information of the basic expression corresponding to the at least one virtual character, adjusts the basic expression corresponding to the at least one virtual character, so as to make detailed adjustments to the basic expression corresponding to the virtual character and obtain facial expression information matching the first streaming audio segment.
- the execution device obtains basic expression information corresponding to at least one virtual character, and obtains facial expression information corresponding to the at least one virtual character by weighted summation.
- the basic expression information corresponding to at least one virtual character can be understood as a set of basic expression bases for each virtual character.
- for example, the left eye rolling upward, the right eye rolling upward, etc., can each correspond to an expression base;
- each basic expression base can represent the positions of the vertices of the virtual character's facial mesh in a specific state;
- each vertex can be represented by a three-dimensional coordinate.
- the formula by which the execution device generates the facial expression information corresponding to at least one virtual character may be: F_i = b_0 + Σ_{l=1}^{N} e_l · b_l, where:
- F_i represents the facial expression information corresponding to the virtual character i;
- N represents the number (dimension) of expression bases;
- b_l represents the basic expression information corresponding to expression base l;
- e_l represents the weight information of the basic expression corresponding to expression base l;
- b_0 represents the base expression, such as the default expressionless face.
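- a minimal numpy sketch of the weighted blendshape sum described above; the vertex count and number of expression bases are illustrative assumptions:

```python
import numpy as np

V, N = 5000, 52                       # mesh vertices and expression bases (illustrative)
b0 = np.zeros((V, 3))                 # base expression b_0 (default expressionless face)
basis = np.random.randn(N, V, 3)      # basic expression information b_l for each base l
weights = np.random.rand(N)           # predicted weight information e_l

face_i = b0 + np.tensordot(weights, basis, axes=1)  # F_i: facial expression of character i, (V, 3)
```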
- although the execution device uses feature information shared among multiple virtual characters in the process of outputting facial expression information, when the facial expression information is finally output, the number of virtual characters actually displayed can be set according to actual needs, so as to display facial expression information corresponding to one or more virtual characters.
- the execution device may obtain tag information and adjust the facial expression information corresponding to each virtual character based on the tag information to obtain the adjusted facial expression information corresponding to each virtual character.
- the tag information is used to indicate the facial expression information of the virtual character.
- the execution device may use a stylization processing module to perform stylization processing on the facial expression information of the generated virtual character according to the tag information specified by the user, and generate facial expression information of multiple styles to adapt to different types of virtual characters.
- the stylization processing module can adopt the ERD (Encoder-RNN-Decoder) architecture: the Encoder network de-stylizes the input facial expression information of the multiple virtual characters, retaining only the expression content part; then, the output of the RNN network corresponding to the label information of a specific style is combined with the output of the first RNN network to obtain facial expression features with that specific style; finally, the Decoder network decodes the stylized facial expression information.
- FIG. 7 is a schematic diagram of the structure of a stylized processing module provided in an embodiment of the present application.
- the stylized processing module includes branches corresponding to multiple stylized labels, such as pride, happiness, etc., and at least one first RNN network (used to represent neutral motion sequence features without style).
- the execution device obtains facial expression information with a specific style by adding the output of the RNN network branch of the specific stylized label to the output of the first RNN network branch.
- the execution device can add the output of the RNN network branch of the stylized label corresponding to happiness to the output of the first RNN network branch, and then add the output of the RNN network branch of the stylized label corresponding to pride to the output of the first RNN network branch. Finally, the two added results are averaged to obtain facial expression information with styles corresponding to happiness and pride.
- the first RNN network can be understood as a default RNN network, or a basic RNN network without style.
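- a minimal sketch of the branch combination described above, where each selected style branch output is added to the neutral (first RNN) branch output and the sums are averaged; the tensors are illustrative placeholders rather than real RNN outputs:

```python
import torch

neutral_out = torch.randn(30, 64)  # first (style-free) RNN branch output, (frames, dim)
happy_out = torch.randn(30, 64)    # "happiness" style branch output
proud_out = torch.randn(30, 64)    # "pride" style branch output

styled_features = ((neutral_out + happy_out) + (neutral_out + proud_out)) / 2
```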
- Step 508 The execution device extracts expression feature information corresponding to the multiple virtual characters based on the facial expression information corresponding to the multiple virtual characters.
- the execution device uses the facial expression information corresponding to the multiple virtual characters as input in an autoregressive manner and outputs expression feature information corresponding to the multiple virtual characters.
- the execution device inputs facial expression information corresponding to multiple virtual characters into the MLP network, and extracts expression feature information corresponding to each virtual character.
- a specific network model, such as a Transformer network, can be selected according to actual needs to extract the expression feature information; this is only an example and is not limited here.
- Step 509 The execution device calculates the similarity of expressions between the multiple virtual characters based on the expression feature information corresponding to the multiple virtual characters.
- after the execution device extracts the expression feature information corresponding to the multiple virtual characters based on their facial expression information, it can calculate the similarity in expression between each virtual character and the other virtual characters, thereby obtaining the associated feature information of the expressions between each virtual character and the other virtual characters.
- the execution device calculates the cosine similarity between the expression feature information of each virtual character and the expression feature information of other virtual characters as the similarity in expression between the multiple virtual characters.
- Step 510 The execution device uses the similarity of expressions between the multiple virtual characters as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
- the execution device uses the cosine similarity as a weight coefficient and performs weighted summation with the expression feature information k j of other virtual characters j to obtain a cross feature.
- the expression feature information k i of the current virtual character i and the cross feature are bitwise added to obtain the second associated feature information fi of the virtual character i in expression.
- the execution device may obtain the second associated feature information f_i of the facial expression of the virtual character i according to the following calculation formula: f_i = k_i + Σ_{j=1, j≠i}^{n} cosine_sim(k_i, k_j) · k_j, where:
- f_i represents the second associated feature information of the expressions of virtual character i and the other virtual characters j;
- k_i represents the expression feature information corresponding to virtual character i;
- n represents the number of virtual characters;
- k_j represents the expression feature information corresponding to virtual character j;
- cosine_sim() is used to calculate the cosine similarity between the expression features of virtual characters i and j.
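- a minimal PyTorch sketch of the cross-character calculation above; the number of characters and feature dimension are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def second_associated_features(k: torch.Tensor) -> torch.Tensor:
    """k: (n, d) expression feature information for n virtual characters."""
    n = k.shape[0]
    out = []
    for i in range(n):
        cross = torch.zeros_like(k[i])
        for j in range(n):
            if j == i:
                continue
            sim = F.cosine_similarity(k[i], k[j], dim=0)  # cosine similarity as the weight
            cross = cross + sim * k[j]                    # weighted summation over the others
        out.append(k[i] + cross)                          # element-wise addition with k_i
    return torch.stack(out)

f = second_associated_features(torch.randn(3, 128))       # three characters, 128-dim features
```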
- FIG. 8 is a flow chart of generating facial expression information of a virtual character provided by an embodiment of the present application; please refer to steps 505 to 510 for the specific execution steps.
- FIG. 8 is only an example structural description of one embodiment of the present application when generating facial expression information of a virtual character, and is not limited here.
- Step 511 The execution device obtains third associated feature information and action feature information generated by the second streaming audio segment.
- the execution device obtains the third associated feature information and the action feature information on the actions of the multiple virtual characters generated based on the second streaming audio segment, and generates the body movement information corresponding to the first streaming audio segment based on the third associated feature information and the action feature information, so that the execution device can refer to the body movement information of other virtual characters while generating the body movement information corresponding to each virtual character, thereby achieving effective collaboration and coordination of the multiple virtual characters in body movements.
- the third associated feature information defaults to the feature information corresponding to the initial action.
- the initial action can be set according to actual needs and is not limited here.
- Step 512 The execution device obtains action coding information corresponding to the multiple virtual characters based on the first fusion feature information, the third association feature information and the action feature information.
- the execution device obtains action coding information corresponding to multiple virtual characters based on the first fusion feature information, the third associated feature information generated by the second streaming audio segment, and the action feature information generated by the second streaming audio segment.
- the execution device can output the action coding information corresponding to multiple virtual characters in an autoregressive manner. That is, when the execution device outputs the action coding information corresponding to multiple virtual characters corresponding to the first streaming audio segment, the second streaming audio segment is used to generate the third associated feature information and action feature information.
- the second streaming audio segment is the previous audio segment, or an earlier audio segment, of the first streaming audio segment.
- similarly, when the execution device outputs the action coding information corresponding to the third streaming audio segment, the first streaming audio segment is used to generate the fourth associated feature information and action feature information.
- the third streaming audio segment is the next audio segment, or a later audio segment, of the first streaming audio segment.
- the execution device uses the associated feature information output before the playback time of the current streaming audio segment as input for re-iteration, so as to further utilize the temporal correlation of the learned body movements, which is conducive to maintaining the temporal continuity of the body movements and reducing the incoherence of the body movements.
- the execution device inputs the first fused feature information, the third associated feature information, and the action feature information generated by the second streaming audio segment into a multi-layer AcLSTM network, and outputs action encoding information corresponding to multiple virtual characters by autoregression.
- the AcLSTM (Auto-Conditioned LSTM) network can better handle the problem of generating long sequences, that is, it can avoid or minimize the situation where the generated limb movements no longer change or change less as the processing time increases when generating long limb movement sequences.
- Step 513 The execution device obtains action feature information corresponding to the action coding information of the plurality of virtual characters based on the action coding information.
- the execution device obtains the action feature information corresponding to the action coding information of multiple virtual characters based on the correspondence between the action coding information and the action feature information.
- each action encoding information corresponds to an action feature information in the code book (Code Book), which is used to represent a multi-frame full-body action sequence.
- the execution device can directly obtain the corresponding action feature information of multiple virtual characters from the Code Book based on the action encoding information of multiple virtual characters input.
- the execution device can pre-encode the position coordinates, rotation angle and speed of each joint point to obtain the corresponding action coding information.
- the execution device only needs to obtain the action coding information corresponding to each virtual character, and based on the correspondence between the action coding information and the action feature information, the action feature information corresponding to each virtual character can be obtained, thereby reducing the amount of information processing and improving the operation efficiency.
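- a minimal sketch of the code-book lookup; the code-book size, feature dimension, and code indices are illustrative assumptions:

```python
import torch

code_book = torch.randn(1024, 256)            # 1024 codes, each a 256-dim motion feature
action_codes = torch.tensor([17, 203, 640])   # action coding information for three characters
action_features = code_book[action_codes]     # (3, 256): feature lookup by index
```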
- Step 514 The execution device calculates the action similarity between the multiple virtual characters based on the action feature information generated by the first streaming audio segment.
- after the execution device extracts the action feature information corresponding to the multiple virtual characters based on their action coding information, it can obtain the associated feature information of the actions between each virtual character and the other virtual characters by calculating the similarity in actions between each virtual character and the other virtual characters.
- the execution device calculates the cosine similarity between the action feature information of each virtual character and the action feature information of other virtual characters as the action similarity between the multiple virtual characters.
- Step 515 The execution device uses the action similarities between the multiple virtual characters as weights to adjust the corresponding action feature information of each virtual character to obtain fourth associated feature information on the actions of the multiple virtual characters.
- the execution device uses the cosine similarity as a weight coefficient and performs weighted summation with the action feature information p j of other virtual characters j to obtain a cross feature.
- the action feature information p_i of the current virtual character i and the cross feature are added element-wise (bitwise) to obtain the fourth associated feature information f′_i of the virtual character i in action.
- the execution device may obtain the fourth associated feature information f′_i of the action of the virtual character i according to the following calculation formula: f′_i = p_i + Σ_{j=1, j≠i}^{n} cosine_sim(p_i, p_j) · p_j, where:
- f′_i represents the fourth associated feature information of the actions of virtual character i and the other virtual characters;
- p_i represents the action feature information corresponding to virtual character i;
- n represents the number of virtual characters;
- p_j represents the action feature information corresponding to virtual character j;
- cosine_sim() is used to calculate the cosine similarity between virtual character i and virtual character j in action.
- Step 516 The execution device performs detail adjustment on the first fused feature information to obtain adjusted first fused feature information.
- since the first fusion feature information is obtained based on the feature information of the first streaming audio segment and the action feature information generated by the second streaming audio segment, the execution device obtains the adjusted first fusion feature information by making detailed adjustments to the action features included in the first fusion feature information.
- the embodiment of the present application uses a single action fine-tuning network (such as an MLP network) to generate the offset of the limb action, so as to produce more action details.
- the execution device uses an MLP network to perform detail adjustments on the first fused feature information.
- the execution device may also use other network models to make detailed adjustments to the first fusion feature information; this is only an example and is not limited here.
- Step 517 The execution device adjusts the action feature information generated by the first streaming audio segment using the adjusted first fusion feature information and the third associated feature information as offsets to obtain body action information corresponding to at least one virtual character.
- after the execution device obtains the action feature information generated by the first streaming audio segment from the Code Book, on the one hand, for the acquired action feature information, the execution device calculates the fourth associated feature information of each virtual character in a manner similar to the second associated feature extraction, and uses it as input information for generating the action coding information corresponding to the multiple virtual characters; on the other hand, the execution device uses the adjusted first fusion feature information and the third associated feature information as offsets to adjust the action feature information generated by the first streaming audio segment, so as to obtain the body movement information corresponding to at least one virtual character.
- the execution device will, on the one hand, make detailed adjustments to the first fused feature information, and on the other hand, obtain the action coding information corresponding to the multiple virtual characters based on the first fused feature information, the third associated feature information, and the action feature information generated by the second streaming audio segment.
- the action feature information generated by the first streaming audio segment here is obtained based on the action coding information corresponding to the multiple virtual characters.
- the execution device inputs the adjusted first fusion feature information, the third associated feature information, and the action feature information generated by the first streaming audio segment into a multi-person action decoder respectively, and uses the adjusted first fusion feature information and the third associated feature information as offsets to adjust the action feature information generated by the first streaming audio segment to obtain the body movement information corresponding to at least one virtual character.
- the execution device inputs the first fused feature information, the third associated feature information, and the action feature information generated by the first streaming audio clip into the multi-person action decoder, respectively, so that the generated body movement information can not only contain more details but also take into account the body movements of other virtual characters.
- the multi-person action decoder is mainly composed of three Decoders. Please refer to Figure 9, which is a structural diagram of a multi-person action decoder provided in an embodiment of the present application.
- the multi-person action decoder outputs the body movement information of multiple virtual characters, that is, the joint point posture information of multiple virtual characters (including the position, rotation angle, speed and other information of the joint point). On the basis of generating the body movement information of multiple virtual characters, the body movement information corresponding to at least one virtual character can be displayed according to actual needs.
- the multi-person action decoder can adopt neural networks such as MLP networks and convolutional networks, which can be set according to actual needs and are not limited here.
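- a minimal sketch of the offset-based adjustment and decoding described above; the decoder structure, feature dimensions, and the 24-joint, 9-value pose layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 24 * 9))

action_features = torch.randn(3, 256)  # from the code book, for three virtual characters
fusion_offset = torch.randn(3, 256)    # adjusted first fusion feature information
assoc_offset = torch.randn(3, 256)     # third associated feature information

joint_poses = decoder(action_features + fusion_offset + assoc_offset)  # (3, 24 * 9) poses
```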
- the execution device obtains tag information, and adjusts the body movement information corresponding to each virtual character based on the tag information to obtain the adjusted body movement information corresponding to each virtual character.
- the tag information is used to indicate the body movement information of the virtual character.
- the execution device may use a stylization processing module to perform stylization processing on the generated virtual character's body movement information according to the tag information specified by the user, and generate body movement information of various styles to adapt to different types of virtual characters.
- the stylization processing module can adopt the ERD (Encoder-RNN-Decoder) architecture: the Encoder network retains only the action content part of the input body movement information of the multiple virtual characters; then, the output of the RNN network corresponding to the label information of a specific style is combined with the output of the first RNN network to obtain a motion sequence feature with that specific style; finally, the Decoder network decodes the stylized body movement information.
- the stylized processing module includes branches corresponding to multiple stylized labels, such as the elderly, young people, children, etc., and at least one first RNN network (used to represent neutral motion sequence features without style).
- the execution device obtains the body movement information with a specific style by adding the output of the RNN network branch of the specific stylized label to the output of the first RNN network branch.
- the specific structure diagram of the stylized processing module can refer to Figure 7.
- the first RNN network can be understood as a default RNN network, or a basic RNN network without style.
- steps 514 to 515 and steps 516 to 517 may be executed in parallel; the execution order is not limited here.
- FIG. 10 is a flow chart of generating the body movement information of the virtual character provided by the embodiment of the present application. Please refer to steps 511 to 517 for the specific execution steps.
- FIG. 10 is only an example structural description of one embodiment of the present application when generating the body movement information of a virtual character, and is not limited here.
- the reasoning stage describes the process of how the execution device 210 uses the target model/rule 201 to process the collected information data to generate a prediction result.
- Figure 11 is another flow chart of the method for displaying the performance content of a virtual character provided in the embodiment of the present application. The method may include steps 1101 to 1112.
- Step 1101 The execution device obtains a first streaming audio segment.
- Step 1102 The execution device extracts feature information of the first streaming audio segment from the first streaming audio segment.
- Step 1103 The execution device obtains motion feature information generated by the second streaming audio segment.
- Step 1104 The execution device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
- for the specific description of steps 1101 to 1104, reference may be made to the description of steps 501 to 504 shown in FIG. 5 in the above embodiment, which will not be repeated here.
- after steps 1101 to 1104, there may be multiple implementations depending on the content to be generated.
- the following takes the execution device generating the facial expression information and body movement information of the virtual character as an example to illustrate the following steps. It should be understood that in actual application, the execution device synchronously outputs the facial expression information and body movement information of the virtual character, and executes steps 1105-1107 and steps 1108-1112 in parallel.
- Step 1105 The execution device obtains facial expression information generated by the second streaming audio segment.
- the execution device generates facial expression information corresponding to the first streaming audio segment by acquiring facial expression information generated by the second streaming audio segment.
- Step 1106 The execution device obtains weight information of a basic expression corresponding to the virtual character based on the first fusion feature information, the phoneme feature information, and the facial expression information generated by the second streaming audio segment.
- compared with generating facial expression information corresponding to multiple virtual characters, there is no need to consider the associated feature information between multiple virtual characters; therefore, when generating the weight information of the basic expression corresponding to the virtual character, the execution device directly inputs the facial expression information generated by the second streaming audio segment.
- Step 1107 The execution device adjusts the basic expression of the virtual character based on the weight information of the basic expression corresponding to the virtual character to obtain facial expression information of the virtual character.
- the execution device adjusts the basic expression corresponding to the virtual character based on the weight information of the basic expression corresponding to the virtual character, so as to make detailed adjustments to the basic expression corresponding to the virtual character and obtain the facial expression information of the virtual character that matches the first streaming audio clip.
- for the specific description of steps 1105 to 1107, reference may be made to the description of steps 505 to 510 shown in FIG. 5 in the above embodiment, which will not be repeated here.
- FIG. 12 is another flow chart of generating facial expression information of a virtual character provided by an embodiment of the present application. Please refer to steps 1105 to 1107 for specific execution steps.
- FIG. 12 is only an example structural description of one embodiment of the present application when generating facial expression information of a virtual character, and is not limited here.
- Step 1108 The execution device obtains motion feature information of the second streaming audio segment.
- Step 1109 The execution device obtains action coding information corresponding to the virtual character based on the first fusion feature information and the action feature information generated by the second streaming audio segment.
- compared with generating action coding information corresponding to multiple virtual characters, the execution device directly inputs the first fusion feature information and the action feature information generated by the second streaming audio segment when generating the action coding information corresponding to the virtual character, without considering the associated feature information between multiple virtual characters.
- Step 1110 The execution device obtains action feature information corresponding to the action coding information of the virtual character based on the action coding information.
- Step 1111 The execution device performs detail adjustment on the first fused feature information to obtain adjusted first fused feature information.
- Step 1112 The execution device uses the adjusted first fusion feature information as an offset to adjust the action feature information generated by the first streaming audio segment to obtain body action information corresponding to the virtual character.
- the execution device inputs the adjusted first fusion feature information and the action feature information generated by the first streaming audio segment into a single-person action decoder respectively, and uses the adjusted first fusion feature information as an offset to adjust the action feature information generated by the first streaming audio segment to obtain the body movement information corresponding to a single virtual character.
- the single-person action decoder is mainly composed of two Decoders.
- the single-person action decoder outputs the body action information of a single virtual character, that is, the joint point posture information of a single virtual character (including the position, rotation angle, speed, etc. of the joint point).
- the single-person action decoder can adopt neural networks such as MLP networks and convolutional networks, which can be set according to actual needs and are not limited here.
- for the specific description of steps 1108 to 1112, reference may be made to the description of steps 511 to 517 shown in FIG. 5 in the above embodiment, which will not be repeated here.
- FIG. 13 is another flow chart of generating the body movement information of the virtual character provided by the embodiment of the present application. Please refer to steps 1108 to 1112 for the specific execution steps.
- FIG. 13 is only an example structural description of one embodiment of the present application when generating the body movement information of a virtual character, and is not limited here.
- the training phase describes the process of how the training module 202 generates a mature neural network using the data set in the database 230.
- FIG. 14 is a flow chart of a model training method provided in the embodiment of the present application.
- the model training method provided in the embodiment of the present application may include:
- Step 1401 The training device obtains a first streaming audio segment.
- Step 1402 The training device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
- Step 1403 The training device generates facial expression information and body movement information of the virtual character based on the first fused feature information.
- for the specific description of steps 1401 to 1403, reference may be made to the description of steps 501 to 517 shown in FIG. 5 in the above embodiment, which will not be repeated here.
- Step 1404 The training device obtains a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character.
- the training device obtains a target loss based on the generated facial expression information of the virtual character and the real facial expression information, and the generated body movement information of the virtual character and the real body movement information.
- the target loss is used to indicate the difference between the facial expression information and the real facial expression information, and the body movement information and the real body movement information.
- the training device may obtain the target loss based on the following method, including:
- the training device obtains the relative positions, facial orientations and limb orientations of the virtual characters based on the limb motion information of the virtual characters;
- the training device obtains a first loss based on the body movement information corresponding to each virtual character and the real body movement information;
- the training device obtains a second loss based on the relative positions of the virtual characters and the real relative positions
- the training device obtains a third loss based on the facial orientation and body orientation of each virtual character and the real facial orientation and body orientation;
- the training device obtains a target loss according to the first loss, the second loss, and the third loss.
- the training device can obtain the target loss based on the body movement information of the virtual character, the relative positions between the virtual characters, and the facial orientation and body orientation of each virtual character.
- the limb movement information corresponding to each virtual character refers to the limb movement information output by the target model
- the real limb movement information refers to the limb movement information actually collected through real-time motion capture and other methods.
- the training device obtains the first loss L_1 based on the body movement information corresponding to each virtual character and the real body movement information, i.e., a loss measuring the discrepancy between the generated and the real body movement information of each virtual character.
- Each virtual character learns the real body movement information based on the first loss, and realizes the coordination between characters in hand and body movements. For example, two virtual characters perform a heart-shaped gesture together, and by letting each virtual character learn its own action, the two virtual characters can coordinate in action and complete the heart-shaped gesture.
- the real body movement information can be obtained through dynamic capture.
- the training device obtains the second loss L_2 based on the relative positions of the virtual characters and the real relative positions, for example according to the formula L_2 = Σ_{i≠j} ‖h_ij − ĥ_ij‖, where:
- i and j represent different virtual characters
- h_ij and ĥ_ij represent the relative position and the real relative position between virtual character i and virtual character j, respectively.
- the relative position between virtual character i and virtual character j can be calculated by the limb motion information output by the target model, for example, the position information (x, y, z) of the joints contained in the limb motion information.
- the training device calculates the relative distance between the joint position information of virtual character i and that of virtual character j to obtain the relative position between virtual character i and virtual character j; if only a plane is considered, the calculation is based on the plane coordinates (x, y).
- the real relative position between virtual character i and virtual character j can be collected by real-time motion capture and other methods.
- the training device realizes the cooperation of different virtual characters in the process of movement by adding a second loss L2 , generates a motion trajectory with a collaborative effect, and realizes the alignment of the motion trajectory by constraining the relative position of any two virtual characters to be consistent with the real relative position.
- the training device obtains the third loss L_3 based on the facial orientation and limb orientation of each virtual character and the real facial orientation and limb orientation, for example according to the formula L_3 = Σ_{i≠j} (‖dr_ij − d̂r_ij‖ + ‖df_ij − d̂f_ij‖), where:
- i and j represent different virtual characters
- dr_ij and d̂r_ij represent the facial orientation between virtual character i and virtual character j and the corresponding real facial orientation, respectively;
- df_ij and d̂f_ij represent the limb orientation between virtual character i and virtual character j and the corresponding real limb orientation, respectively.
- the facial orientation and limb orientation between virtual character i and virtual character j can be calculated through the limb motion information output by the target model.
- based on the rotation angles contained in the limb motion information, the training device can obtain the rotation angles of the joint points of virtual character i and of virtual character j.
- the limb orientation of the virtual character mainly refers to the orientation of the root node.
- by adding the third loss L_3, the training device realizes the collaboration of different virtual characters in the process of movement, producing cooperative effects such as two virtual characters gazing at each other, and learns the orientations of the face and the root node of different characters to achieve orientation alignment.
- the training device obtains a target loss according to the first loss, the second loss, and the third loss, specifically:
- the training device sets different weights r 1 , r 2 , r 3 for the first loss L 1 , the second loss L 2 and the third loss L 3 respectively.
- the final target loss is obtained by multiplying each weight by its corresponding loss and summing the results, i.e., r_1 · L_1 + r_2 · L_2 + r_3 · L_3.
- the calculation method of the target loss can be set according to actual needs. This is only an example and not a limitation.
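- a minimal sketch of combining the three losses with weights r1, r2, r3; the specific norms and weight values are illustrative assumptions, since the application leaves the calculation method to be set according to actual needs:

```python
import torch

def target_loss(pred_pose, real_pose, rel_pos, real_rel_pos,
                orient, real_orient, r1=1.0, r2=0.5, r3=0.5):
    l1 = torch.mean((pred_pose - real_pose) ** 2)        # first loss: per-character motion
    l2 = torch.mean(torch.abs(rel_pos - real_rel_pos))   # second loss: relative-position alignment
    l3 = torch.mean(torch.abs(orient - real_orient))     # third loss: face/body orientation alignment
    return r1 * l1 + r2 * l2 + r3 * l3
```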
- Step 1405 The training device updates the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
- FIG. 15 is a structural schematic diagram of a performance content display device for a virtual character provided in the embodiment of the present application.
- the performance content display device 1500 for a virtual character includes a streaming audio segment acquisition module 1501, a fusion feature information generation module 1502, and a performance content generation module 1503, wherein:
- the streaming audio segment acquisition module 1501 is used to acquire a first streaming audio segment
- the fusion feature information generating module 1502 is used to obtain first fusion feature information based on feature information of a first streaming audio segment and action feature information generated according to a second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second audio segment is before a playback time of the first audio segment;
- the performance content generating module 1503 is used to generate facial expression information and body movement information of the virtual character based on the first fused feature information.
- a possible implementation also includes:
- the streaming audio segment acquisition module 1501 is further used to acquire a third streaming audio segment
- the fusion feature information generating module 1502 is further used to obtain second fusion feature information based on feature information of a third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
- the performance content generating module 1503 is further used to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
- the feature information of the first streamed audio segment includes text feature information
- the performance content generation module 1503 is further configured to:
- based on the weight information of the basic expression corresponding to the at least one virtual character, the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
- the performance content generating module 1503 is further used to:
- the expression similarity between the multiple virtual characters is calculated based on the expression feature information corresponding to the multiple virtual characters;
- the expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
- the second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the performance content generating module 1503 is further used to:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of at least one virtual character.
- obtaining motion feature information generated by the first streaming audio segment includes:
- action coding information corresponding to the plurality of virtual characters is obtained;
- action feature information corresponding to the action coding information of multiple virtual characters is obtained.
- the performance content generating module 1503 is further used to:
- the action similarities between the multiple virtual characters are used as weights, and the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters.
- the fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- a possible implementation also includes:
- tag information is obtained, where the tag information is used to indicate the facial expression information and/or body movement information of the virtual character;
- the facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
- the feature information of the first streamed audio segment includes text feature information
- the performance content generation module 1503 is further configured to:
- based on the weight information of the basic expression corresponding to the virtual character, the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
- the performance content generating module 1503 is further used to:
- the action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature as an offset to obtain body action information corresponding to the virtual character.
- FIG. 16 is a schematic diagram of a structure of a model training device provided in the present application.
- the model training device 1600 may include:
- the streaming audio segment acquisition module 1601 is used to acquire a first streaming audio segment
- the fusion feature information generating module 1602 is used to obtain first fusion feature information based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playing time of the second audio segment is before the playing time of the first audio segment;
- a performance content generating module 1603, configured to generate facial expression information and body movement information of a virtual character based on the first fused feature information
- a target loss acquisition module 1604 is used to obtain a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, wherein the target loss is used to indicate the difference between the facial expression information and body movement information and the real facial expression information and body movement information;
- the target model training module 1605 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
- a possible implementation also includes:
- the streaming audio segment acquisition module 1601 is further used to acquire a third streaming audio segment
- the fusion feature information generating module 1602 is further used to obtain second fusion feature information based on feature information of a third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
- the performance content generating module 1603 is further used to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
- the feature information of the first streamed audio segment includes text feature information
- the performance content generation module 1603 is further configured to:
- the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
- the performance content generating module 1603 is further used to:
- the expression similarity between the multiple virtual characters is calculated
- the expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
- the second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- the performance content generating module 1603 is further used to:
- the action feature information generated from the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain body movement information corresponding to at least one virtual character.
- obtaining the action feature information generated from the first streaming audio segment includes:
- action coding information corresponding to the multiple virtual characters is obtained;
- action feature information corresponding to the action coding information of the multiple virtual characters is obtained.
- the performance content generating module 1603 is further used to:
- the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters.
- the fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- a possible implementation also includes:
- tag information is obtained, where the tag information is used to indicate facial expression information and/or body movement information of the virtual character;
- the facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
- the feature information of the first streamed audio segment includes text feature information
- the performance content generation module 1603 is further configured to:
- the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
- the performance content generating module 1603 is further used to:
- the action feature information generated from the first streaming audio segment is adjusted using the adjusted first fusion feature information as an offset to obtain body movement information corresponding to the virtual character.
- FIG. 17 is a schematic diagram of a structure of a computing device provided in an embodiment of the present application, and the computing device 1700 includes: a bus 1702, a processor 1704, a memory 1706, and a communication interface 1708.
- the processor 1704, the memory 1706, and the communication interface 1708 communicate through the bus 1702.
- the computing device 1700 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 1700.
- the bus 1702 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
- the bus may be divided into an address bus, a data bus, a control bus, etc.
- although the bus is represented by only one line in FIG. 17, this does not mean that there is only one bus or only one type of bus.
- the bus 1702 may include a path for transmitting information between various components of the computing device 1700 (e.g., the memory 1706, the processor 1704, and the communication interface 1708).
- Processor 1704 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
- the memory 1706 may include a volatile memory, such as a random access memory (RAM).
- the memory 1706 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- the memory 1706 stores executable program codes, and the processor 1704 executes the executable program codes to respectively implement the functions of the aforementioned streaming audio segment acquisition module 1501, the fusion feature information generation module 1502, and the performance content generation module 1503, thereby implementing the performance content display method of the virtual character. That is, the memory 1706 stores instructions for executing the performance content display method of the virtual character.
- the communication interface 1708 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 1700 and other devices or communication networks.
- the embodiment of the present application also provides a computing device cluster.
- the computing device cluster includes at least one computing device.
- the computing device can be a server, such as a central server, an edge server, or a local server in a local data center.
- the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
- FIG. 18 is a schematic diagram of a structure of a computing device cluster provided in an embodiment of the present application, and the computing device cluster includes at least one computing device 1700.
- the memory 1706 in one or more computing devices 1700 in the computing device cluster may store the same instructions for executing the performance content display method of the virtual character.
- the memory 1706 of one or more computing devices 1700 in the computing device cluster may also store partial instructions for executing the method for displaying the performance content of the virtual character.
- the combination of one or more computing devices 1700 may jointly execute the instructions for executing the method for displaying the performance content of the virtual character.
- the memory 1706 in different computing devices 1700 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the performance content display device of the virtual character. That is, the instructions stored in the memory 1706 in different computing devices 1700 can realize the functions of one or more modules among the streaming audio segment acquisition module 1501, the fusion feature information generation module 1502 and the performance content generation module 1503.
- one or more computing devices in the computing device cluster may be connected via a network, which may be a wide area network or a local area network.
- the embodiment of the present application also provides a training device. FIG. 19 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- the training device 1900 is implemented by one or more servers. The training device 1900 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944.
- the memory 1932 and the storage medium 1930 can be short-term storage or permanent storage.
- the program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device.
- the central processor 1922 can be configured to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the training device 1900.
- the training device 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, and/or, one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
- the central processing unit 1922 is used to execute the model training method executed by the training device in the embodiment corresponding to FIG. 14. It should be noted that the specific manner in which the central processing unit 1922 executes the aforementioned steps is based on the same concept as the method embodiment corresponding to FIG. 14 in the present application, and brings about the same technical effects as the method embodiment corresponding to FIG. 14 in the present application. For specific contents, please refer to the description in the method embodiments shown in the preceding embodiments of the present application, which will not be repeated here.
- the present application also provides a computer program product including instructions.
- the computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium.
- when the computer program product runs on at least one computing device, the at least one computing device executes the method of any of the above embodiments.
- the embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be any available medium that can be stored by a computing device, or a data storage device, such as a data center, containing one or more available media.
- the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive (SSD)).
- the computer-readable storage medium includes instructions that instruct the computing device to execute the method of any of the above embodiments.
- the performance content display device, model training device, computing device, and training device for a virtual character provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit, etc.
- the processing unit may execute computer execution instructions stored in the storage unit so that the chip executes the performance content display method for a virtual character described in the above-mentioned various method embodiments, or so that the chip executes the model training method described in the embodiment shown in FIG. 14 above.
- the storage unit is a storage unit within the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
- FIG. 20 is a schematic diagram of a structure of a chip provided in an embodiment of the present application, wherein the chip may be a neural network processor NPU 2000, which is mounted on a host CPU (Host CPU) as a coprocessor and is assigned tasks by the Host CPU.
- the core part of the NPU is an operation circuit 2003, which is controlled by a controller 2004 to extract matrix data from a memory and perform multiplication operations.
- the operation circuit 2003 includes multiple processing units (Process Engine, PE) inside.
- in some implementations, the operation circuit 2003 is a two-dimensional systolic array.
- the operation circuit 2003 can also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
- in some implementations, the operation circuit 2003 is a general-purpose matrix processor.
- the operation circuit fetches the data corresponding to matrix B from the weight memory 2002 and caches it on each PE in the operation circuit.
- the operation circuit fetches the data of matrix A from the input memory 2001 and performs a matrix operation on matrix A and matrix B.
- the partial result or final result of the matrix operation is stored in the accumulator 2008.
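- as a purely illustrative aid (not part of the disclosed hardware), the following NumPy sketch mimics the data flow described above: the weight matrix B is held tile by tile, the matrix A data is streamed against it, and partial results are accumulated until the final result is available; the tile size and all names are assumptions made for this example.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 2) -> np.ndarray:
    """Functional view of the pattern above: B (the weights) is cached tile by
    tile, A is streamed in, and partial results are accumulated."""
    m, k = a.shape
    _, n = b.shape
    acc = np.zeros((m, n))                      # plays the role of the accumulator
    for k0 in range(0, k, tile):
        b_tile = b[k0:k0 + tile, :]             # weight tile held by the processing elements
        acc += a[:, k0:k0 + tile] @ b_tile      # partial result accumulated
    return acc

a = np.arange(12, dtype=float).reshape(3, 4)
b = np.arange(20, dtype=float).reshape(4, 5)
assert np.allclose(tiled_matmul(a, b), a @ b)   # matches the full matrix product
```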
- the unified memory 2006 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 2002 through the direct memory access controller (DMAC) 2005.
- the input data is also transferred to the unified memory 2006 through the DMAC.
- the bus interface unit 2010 (Bus Interface Unit, BIU) is used for the interaction between the AXI bus, the DMAC 2005 and the instruction fetch buffer (IFB) 2009; it is used by the instruction fetch buffer 2009 to obtain instructions from the external memory, and is also used by the direct memory access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- the DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 2006, to transfer weight data to the weight memory 2002, or to transfer input data to the input memory 2001.
- the vector calculation unit 2007 includes multiple operation processing units, which can further process the output of the operation circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of feature planes, etc.
- the vector calculation unit 2007 can store the processed output vector to the unified memory 2006.
- the vector calculation unit 2007 can apply a linear function and/or a nonlinear function to the output of the operation circuit 2003, for example, performing linear interpolation on the feature plane extracted by the convolution layer, or applying a nonlinear function to a vector of accumulated values to generate an activation value.
- the vector calculation unit 2007 generates a normalized value, a pixel-level summed value, or both.
- the processed output vector can be used as an activation input to the operation circuit 2003, for example, for use in a subsequent layer in a neural network.
- An instruction fetch buffer 2009 connected to the controller 2004, for storing instructions used by the controller 2004;
- the unified memory 2006, the input memory 2001, the weight memory 2002 and the instruction fetch memory 2009 are all on-chip memories.
- the external memory is private to the NPU hardware architecture.
- the operations of each layer in the above-mentioned target model can be performed by the operation circuit 2003 or the vector calculation unit 2007.
- the processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the above-mentioned first aspect method.
- the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
- the technical solution of the present application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and includes a number of instructions for enabling a computer device (which can be a personal computer, a training device, or a network device, etc.) to execute the methods described in the various embodiments of the present application.
- all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
- all or part of the embodiments may be implemented in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from a website, a computer, a training device, or a data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that a computer can store, or a data storage device, such as a training device or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Processing Or Creating Images (AREA)
Abstract
Disclosed in the present application is a method of displaying performance content of a virtual character. The method comprises: acquiring a first streaming audio clip; on the basis of feature information of the first streaming audio clip and action feature information generated according to a second streaming audio clip, obtaining first fusion feature information, the first streaming audio clip and the second streaming audio clip being included in a same streaming audio, and the playing time of the second audio clip being before the playing time of the first audio clip; and on the basis of the first fusion feature information, generating facial expression information and body movement information of a virtual character. Therefore, the present application can directly output the performance content of the virtual character corresponding to the input streaming audio clip without the need of inputting a complete music piece, thus shortening the response time of generating facial expression information and body movement information of virtual characters.
Description
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on February 16, 2023, with application number CN202310126607.3 and entitled "Method and system for generating virtual character dance movements based on cloud technology", and to the Chinese patent application filed with the State Intellectual Property Office of China on May 15, 2023, with application number CN202310544883.1 and entitled "A method for displaying performance content of a virtual character and related equipment", both of which are incorporated herein by reference in their entirety.
The present application relates to the field of computer technology, and in particular to a method of displaying performance content of a virtual character and a related device.
With the continuous development of computer technology, technologies that drive virtual characters to perform based on music have emerged and are now widely used in fields such as virtual reality and human-computer interaction. For example, performance content generated from a piece of music provided by a user can be applied to a virtual character.
In the related art, the performance content of a virtual character can be generated by an algorithm. Algorithm-based generation of dance movements is efficient and generalizes well: only the music information needs to be input to generate matching performance content. However, this approach generally uses an offline model, that is, the complete piece of music must be input and the feature information of the music extracted from the complete music before the corresponding performance content can be generated, so the response time is relatively long.
Summary of the invention
The present application provides a method of displaying performance content of a virtual character and a related device, which can directly output the performance content of the virtual character corresponding to an input streaming audio segment without requiring the complete music as input, thereby reducing the response time for generating the facial expression information and body movement information of the virtual character.
In a first aspect, the present application provides a method of displaying performance content of a virtual character. The method can be applied in the field of computer technology and can be implemented by a target model. The method includes:
First, a first streaming audio segment is acquired; then, first fusion feature information is obtained based on feature information of the first streaming audio segment and action feature information generated according to a second streaming audio segment, where the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second streaming audio segment is before the playback time of the first streaming audio segment; finally, facial expression information and body movement information of a virtual character are generated based on the first fusion feature information.
This addresses the problem that existing solutions mainly use offline models, which require the complete music as input to generate the corresponding performance content and therefore have a long response time. In the present application, first, the performance content of the virtual character is output in real time based on the input streaming audio segment, which reduces the response time for generating the performance content. Second, when generating the facial expression information and body movement information of the virtual character, the action feature information generated from the second streaming audio segment is taken into account in addition to the feature information of the first streaming audio segment, so that the performance content of different audio segments of the same streaming audio is connected smoothly. Third, considering that the amplitude and speed of the body movements of the virtual character have a certain influence on its facial expression, the action feature information is incorporated when generating the first fusion feature information, so as to enhance the display effect of the performance content of the virtual character. Fourth, by synchronously outputting the body movement information and the performance information of the virtual character, the effect of the virtual character singing and dancing at the same time is achieved.
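As a purely illustrative reading of the fusion step above (and not the claimed implementation), the following NumPy sketch fuses the audio features of the current segment with the action features carried over from the earlier segment by concatenation and a learned projection; the dimensions, the fusion scheme and all names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(audio_feat, prev_action_feat, w, b):
    """Concatenate the current segment's audio features with the action features
    generated for the earlier segment, then project them into a single fusion
    vector (one of many possible fusion schemes)."""
    x = np.concatenate([audio_feat, prev_action_feat], axis=-1)
    return np.tanh(x @ w + b)

# Illustrative dimensions: 64-d audio features, 32-d action features, 48-d fusion features.
audio_feat = rng.normal(size=64)          # feature information of the first streaming segment
prev_action_feat = rng.normal(size=32)    # action features generated from the second (earlier) segment
w = rng.normal(size=(96, 48)) * 0.1
b = np.zeros(48)

fusion_feat = fuse(audio_feat, prev_action_feat, w, b)
print(fusion_feat.shape)                  # (48,) -> first fusion feature information
```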
In a possible implementation of the first aspect, the method further includes:
acquiring a third streaming audio segment;
obtaining second fusion feature information based on feature information of the third streaming audio segment and action feature information generated according to the first streaming audio segment, where the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
generating facial expression information and body movement information of the virtual character based on the second fusion feature information.
In this possible implementation, for the same streaming audio, since the playback time of the third streaming audio segment is after the playback time of the first streaming audio segment, after the facial expression information and body movement information of the virtual character are generated based on the first streaming audio segment, the facial expression information and body movement information of the virtual character corresponding to the third streaming audio segment can be generated based on the third streaming audio segment. This achieves the effect of inputting streaming audio segments online and outputting the corresponding facial expression information and body movement information of the virtual character in real time. In addition, when the facial expression information and body movement information of the virtual character are generated for the third streaming audio segment, whose playback time is later, the action feature information generated from the earlier streaming audio segment is taken into account, so that the actions of the virtual character are connected smoothly across the same streaming audio, which improves the user experience.
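The streaming behaviour described in this implementation can be pictured with the toy loop below: each incoming segment is fused with the action features produced for the previous segment, and the newly generated action features are carried forward to condition the next segment. The stand-in generation function and the dimensions are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_performance(fusion_feat):
    """Stand-in for the generation network: returns (expression, action) features."""
    expression = np.tanh(fusion_feat[:16])
    action = np.tanh(fusion_feat[16:48])
    return expression, action

def fuse(audio_feat, prev_action_feat):
    """Toy fusion: keep part of the audio features and append the previous actions."""
    return np.concatenate([audio_feat[:16], prev_action_feat], axis=-1)

segments = [rng.normal(size=64) for _ in range(3)]   # streaming audio segments in play order
prev_action = np.zeros(32)                           # no history before the first segment

for i, seg in enumerate(segments):
    fusion = fuse(seg, prev_action)
    expression, action = generate_performance(fusion)
    prev_action = action                             # actions of this segment condition the next one
    print(f"segment {i}: expression {expression.shape}, action {action.shape}")
```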
In a possible implementation of the first aspect, the feature information of the first streaming audio segment includes text feature information, and generating the facial expression information of the virtual character based on the first fusion feature information includes:
performing phoneme extraction based on the text feature information to obtain phoneme feature information;
acquiring first associated feature information on the expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment;
obtaining weight information of a basic expression corresponding to at least one virtual character based on the first fusion feature information, the phoneme feature information and the first associated feature information;
adjusting the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character, to obtain facial expression information corresponding to the at least one virtual character.
In this possible implementation, first, the phoneme feature information is extracted from the feature information of the first streaming audio segment to strengthen the correlation between the facial expression and the mouth shape when generating the facial expression of the virtual character, which not only adds a display dimension to the facial expression information and enhances its display effect, but also improves the realism of the facial expression. Second, by referring to the first associated feature information on the expressions of the multiple virtual characters, the facial expression information of the other virtual characters is taken into account while the facial expression information of each virtual character is generated, so that the multiple virtual characters cooperate effectively in their facial expressions.
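One way to read the weighting of basic expressions is as a blendshape-style combination: the network predicts one weight per basic expression, and the facial expression is the weighted sum of the basic expressions. The NumPy sketch below illustrates this reading only; the number of basic expressions, the 52 facial control parameters and the softmax weighting are assumptions, not the claimed design.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative setup: 10 basic expressions, each a 52-d vector of facial control
# parameters; the network outputs one weight per basic expression.
n_basic, n_controls = 10, 52
basic_expressions = rng.normal(size=(n_basic, n_controls))

fusion_feat = rng.normal(size=48)         # first fusion feature information
phoneme_feat = rng.normal(size=16)        # phoneme feature information
assoc_feat = rng.normal(size=16)          # first associated feature information (multi-character)

w = rng.normal(size=(48 + 16 + 16, n_basic)) * 0.1
logits = np.concatenate([fusion_feat, phoneme_feat, assoc_feat]) @ w
weights = softmax(logits)                 # weight information of the basic expressions

# Facial expression information as a weighted combination of the basic expressions.
facial_expression = weights @ basic_expressions
print(facial_expression.shape)            # (52,)
```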
In a possible implementation of the first aspect, the method further includes:
extracting expression feature information corresponding to the multiple virtual characters based on the facial expression information corresponding to the multiple virtual characters;
calculating the expression similarity between the multiple virtual characters based on the expression feature information corresponding to the multiple virtual characters;
adjusting the expression feature information corresponding to each virtual character by using the expression similarity between the multiple virtual characters as weights, to obtain second associated feature information on the expressions of the multiple virtual characters, where the second associated feature information is used to generate the facial expression information corresponding to at least one virtual character for the third streaming audio segment.
In this possible implementation, the expression similarity between each virtual character and the other virtual characters is calculated from the expression feature information of the different virtual characters, so that when the facial expression information of each virtual character is generated, not only the facial expression information of that virtual character but also the facial expression information of the other virtual characters is considered, and the multiple virtual characters cooperate effectively in their facial expressions. In addition, using the second associated feature information generated before the playback time of the third streaming audio segment as an input for generating the facial expression information for the third streaming audio segment allows streaming audio segments with different playback times to be connected smoothly in terms of facial expression, which keeps the facial expression information of the virtual characters coherent, further exploits the temporal correlation of facial expressions, and helps avoid sudden changes and jitter in the facial expressions.
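A toy NumPy illustration of the similarity-weighted adjustment across characters follows. Using cosine similarity and a row-wise softmax normalization is an assumption made for the example; the description only requires that the expression similarity be used as a weight.

```python
import numpy as np

rng = np.random.default_rng(3)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

n_chars, d = 4, 32
expr_feats = rng.normal(size=(n_chars, d))   # expression feature information of each virtual character

# Pairwise expression similarity used as weights (rows normalized to sum to 1).
sim = np.array([[cosine_similarity(expr_feats[i], expr_feats[j])
                 for j in range(n_chars)] for i in range(n_chars)])
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Second associated feature information: each character's expression features
# adjusted by the similarity-weighted features of all characters.
second_assoc = weights @ expr_feats
print(second_assoc.shape)                    # (4, 32) -> reused when processing the next segment
```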
In a possible implementation of the first aspect, generating the body movement information of the virtual character based on the first fusion feature information includes:
performing detail adjustment on the first fusion feature information to obtain adjusted first fusion feature information;
acquiring third associated feature information on the actions of the multiple virtual characters, where the third associated feature information is generated based on the second streaming audio segment;
acquiring action feature information generated from the first streaming audio segment;
adjusting the action feature information generated from the first streaming audio segment by using the adjusted first fusion feature information and the third associated feature information as offsets, to obtain body movement information corresponding to at least one virtual character.
In this possible implementation, compared with directly classifying related body movements into the same body movement information, which causes the generated body movement information of the virtual character to lack movement detail, the detail adjustment of the first fusion feature information allows the action feature information generated from the first streaming audio segment to be adjusted based on the adjusted first fusion feature information, so that the body movement information generated for each virtual character contains more movement detail. In addition, adjusting the action feature information generated from the first streaming audio segment based on the third associated feature information on the actions of the multiple virtual characters enables each virtual character to take the body movement information of the other virtual characters into account and to cooperate with them in its body movements.
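Reading "used as offsets" as additive offsets gives the minimal sketch below; the additive form and the feature dimensions are assumptions made for illustration, not the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

n_chars, d_action = 4, 32
action_feat = rng.normal(size=(n_chars, d_action))       # generated from the first streaming segment

adjusted_fusion = rng.normal(size=d_action) * 0.1         # detail-adjusted first fusion feature information
third_assoc = rng.normal(size=(n_chars, d_action)) * 0.1  # third associated feature information (actions)

# Both terms act as offsets added to the per-character action features,
# which is one of several plausible ways to read "used as offsets".
body_motion = action_feat + adjusted_fusion + third_assoc
print(body_motion.shape)                                  # (4, 32) -> body movement information
```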
In a possible implementation of the first aspect, acquiring the action feature information generated from the first streaming audio segment includes:
obtaining action coding information corresponding to the multiple virtual characters based on the first fusion feature information, the third associated feature information and the action feature information generated from the second streaming audio segment;
obtaining, based on the action coding information, action feature information corresponding to the action coding information of the multiple virtual characters.
In this possible implementation, instead of directly extracting effective action feature information from the first fusion feature information, the third associated feature information and the second streaming audio segment, the above feature information is action-coded to obtain the action coding information corresponding to the multiple virtual characters, and then the action feature information corresponding to the action coding information of the multiple virtual characters is obtained based on the correspondence between action coding information and action feature information. This reduces the amount of information required to generate the action feature information and improves operating efficiency.
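The action coding step can be pictured as predicting one code index per character and looking that index up in a codebook of action features, which is one plausible reading of the above. In the NumPy sketch below, the encoder, the codebook size and the projection are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

n_chars, n_codes, d_code = 4, 256, 32          # assumed codebook size and feature width
codebook = rng.normal(size=(n_codes, d_code))  # each row: action features for one action code

# Conditioning features (illustrative widths).
fusion_feat = rng.normal(size=48)              # first fusion feature information
third_assoc = rng.normal(size=32)              # third associated feature information
prev_action_feat = rng.normal(size=32)         # action features from the second (earlier) segment
h = np.concatenate([fusion_feat, third_assoc, prev_action_feat])

# Stand-in encoder: a per-character projection scores every code, and the
# best-scoring code index is the "action coding information".
proj = rng.normal(size=(n_chars, n_codes, h.size)) * 0.01
codes = np.einsum('kcd,d->kc', proj, h).argmax(axis=-1)

# Action feature information is then a plain lookup into the codebook.
action_feats = codebook[codes]
print(codes.shape, action_feats.shape)          # (4,) (4, 32)
```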
In a possible implementation of the first aspect, the method further includes:
calculating the action similarity between the multiple virtual characters based on the action feature information generated from the first streaming audio segment;
adjusting the action feature information corresponding to each virtual character by using the action similarity between the multiple virtual characters as weights, to obtain fourth associated feature information on the actions of the multiple virtual characters, where the fourth associated feature information is used to generate the body movement information corresponding to at least one virtual character for the third streaming audio segment.
In this possible implementation, the action similarity between each virtual character and the other virtual characters is calculated from the action feature information generated from the first streaming audio segment, so that when the body movement information of each virtual character is generated, not only the body movement information of that virtual character but also the body movement information of the other virtual characters is considered, and the multiple virtual characters cooperate effectively in their body movements. In addition, using the fourth associated feature information generated before the playback time of the third streaming audio segment as an input for generating the performance content for the third streaming audio segment allows streaming audio segments with different playback times to be connected smoothly in terms of body movement, which keeps the performance of the virtual character coherent, further exploits the temporal correlation of the movements, and helps avoid sudden changes and jitter.
In a possible implementation of the first aspect, the method further includes:
acquiring tag information, where the tag information is used to indicate the facial expression information and/or body movement information of the virtual character;
adjusting the facial expression information and/or body movement information corresponding to each virtual character based on the tag information, to obtain adjusted facial expression information and/or body movement information corresponding to each virtual character.
In this possible implementation, after the facial expression information and/or body movement information corresponding to the multiple virtual characters is generated, it can be adjusted based on the tag information to generate performance content of different styles that suits different types of virtual characters, so that a user-controllable style can be generated according to a stylized tag specified by the user, improving interactivity with the user.
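A minimal sketch of tag-based adjustment follows, assuming that each style tag maps to offset vectors that are added to the generated expression and motion parameters; the tag names, dimensions and additive form are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical style tags mapped to offset vectors for expressions and actions.
styles = {
    "lively":     {"expr": rng.normal(size=52) * 0.2,  "action": rng.normal(size=32) * 0.2},
    "restrained": {"expr": rng.normal(size=52) * 0.05, "action": rng.normal(size=32) * 0.05},
}

def apply_tag(expr, action, tag):
    """Adjust one character's expression/body-motion information with a style tag."""
    s = styles[tag]
    return expr + s["expr"], action + s["action"]

expr = rng.normal(size=52)      # generated facial expression information
action = rng.normal(size=32)    # generated body movement information
expr_adj, action_adj = apply_tag(expr, action, "lively")
print(expr_adj.shape, action_adj.shape)   # (52,) (32,)
```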
In a possible implementation of the first aspect, the feature information of the first streaming audio segment includes text feature information, and generating the facial expression information of the virtual character based on the first fusion feature information includes:
performing phoneme extraction based on the text feature information to obtain phoneme feature information;
acquiring the facial expression information generated for the second streaming audio segment;
obtaining weight information of a basic expression corresponding to the virtual character based on the first fusion feature information, the phoneme feature information and the facial expression information generated for the second streaming audio segment;
adjusting the basic expression of the virtual character based on the weight information of the basic expression corresponding to the virtual character, to obtain the facial expression information of the virtual character.
In this possible implementation, for the scenario of generating the facial expression information of a single virtual character, first, the phoneme feature information is extracted from the feature information of the first streaming audio segment to strengthen the correlation between the facial expression and the mouth shape when generating the facial expression of the virtual character, which not only adds a display dimension to the facial expression information and enhances its display effect, but also improves the realism of the facial expression. Second, when the facial expression information of the virtual character corresponding to the first streaming audio segment is generated, the facial expression information of the second streaming audio segment, whose playback time is earlier, is taken into account, so that streaming audio segments with different playback times are connected smoothly in terms of facial expression, which enhances the display effect of the facial expression of the virtual character and improves the user experience.
In a possible implementation of the first aspect, generating the body movement information of the virtual character based on the first fusion feature information includes:
adjusting the first fusion feature information to obtain adjusted first fusion feature information;
acquiring the action feature information generated from the first streaming audio segment;
adjusting the action feature information generated from the first streaming audio segment by using the adjusted first fusion feature information as an offset, to obtain body movement information corresponding to the virtual character.
In this possible implementation, compared with directly classifying related body movements into the same body movement information, which causes the generated body movement information of the virtual character to lack movement detail, the detail adjustment of the first fusion feature information allows the action feature information generated from the first streaming audio segment to be adjusted based on the adjusted first fusion feature information, so that the body movement information generated for the virtual character contains more movement detail.
In a second aspect, the present application provides a device for displaying performance content of a virtual character, which can be used in the field of computer technology. The device includes a streaming audio segment acquisition module, a fusion feature information generation module and a performance content generation module, where:
the streaming audio segment acquisition module is used to acquire a first streaming audio segment;
the fusion feature information generation module is used to obtain first fusion feature information based on feature information of the first streaming audio segment and action feature information generated according to a second streaming audio segment, where the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second streaming audio segment is before the playback time of the first streaming audio segment;
the performance content generation module is used to generate facial expression information and body movement information of a virtual character based on the first fusion feature information.
In a possible implementation of the second aspect:
the streaming audio segment acquisition module is further used to acquire a third streaming audio segment;
the fusion feature information generation module is further used to obtain second fusion feature information based on feature information of the third streaming audio segment and action feature information generated according to the first streaming audio segment, where the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
the performance content generation module is further used to generate facial expression information and body movement information of the virtual character based on the second fusion feature information.
In a possible implementation of the second aspect, the feature information of the first streaming audio segment includes text feature information, and the performance content generation module is further used to:
perform phoneme extraction based on the text feature information to obtain phoneme feature information;
acquire first associated feature information on the expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment;
obtain weight information of a basic expression corresponding to at least one virtual character based on the first fusion feature information, the phoneme feature information and the first associated feature information;
adjust the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character, to obtain facial expression information corresponding to the at least one virtual character.
In a possible implementation of the second aspect, the performance content generation module is further used to:
extract expression feature information corresponding to the multiple virtual characters based on the facial expression information corresponding to the multiple virtual characters;
calculate the expression similarity between the multiple virtual characters based on the expression feature information corresponding to the multiple virtual characters;
adjust the expression feature information corresponding to each virtual character by using the expression similarity between the multiple virtual characters as weights, to obtain second associated feature information on the expressions of the multiple virtual characters, where the second associated feature information is used to generate the facial expression information corresponding to at least one virtual character for the third streaming audio segment.
In a possible implementation of the second aspect, the performance content generation module is further used to:
perform detail adjustment on the first fusion feature information to obtain adjusted first fusion feature information;
acquire third associated feature information on the actions of the multiple virtual characters, where the third associated feature information is generated based on the second streaming audio segment;
acquire action feature information generated from the first streaming audio segment;
adjust the action feature information generated from the first streaming audio segment by using the adjusted first fusion feature information and the third associated feature information as offsets, to obtain body movement information corresponding to at least one virtual character.
In a possible implementation of the second aspect, acquiring the action feature information generated from the first streaming audio segment includes:
obtaining action coding information corresponding to the multiple virtual characters based on the first fusion feature information, the third associated feature information and the action feature information generated from the second streaming audio segment;
obtaining, based on the action coding information, action feature information corresponding to the action coding information of the multiple virtual characters.
In a possible implementation of the second aspect, the performance content generation module is further used to:
calculate the action similarity between the multiple virtual characters based on the action feature information generated from the first streaming audio segment;
adjust the action feature information corresponding to each virtual character by using the action similarity between the multiple virtual characters as weights, to obtain fourth associated feature information on the actions of the multiple virtual characters, where the fourth associated feature information is used to generate the body movement information corresponding to at least one virtual character for the third streaming audio segment.
In a possible implementation of the second aspect, the device is further configured to:
acquire tag information, where the tag information is used to indicate the facial expression information and/or body movement information of the virtual character;
adjust the facial expression information and/or body movement information corresponding to each virtual character based on the tag information, to obtain adjusted facial expression information and/or body movement information corresponding to each virtual character.
In a possible implementation of the second aspect, the feature information of the first streaming audio segment includes text feature information, and the performance content generation module is further used to:
perform phoneme extraction based on the text feature information to obtain phoneme feature information;
acquire the facial expression information generated for the second streaming audio segment;
obtain weight information of a basic expression corresponding to the virtual character based on the first fusion feature information, the phoneme feature information and the facial expression information generated for the second streaming audio segment;
adjust the basic expression of the virtual character based on the weight information of the basic expression corresponding to the virtual character, to obtain the facial expression information of the virtual character.
In a possible implementation of the second aspect, the performance content generation module is further used to:
adjust the first fusion feature information to obtain adjusted first fusion feature information;
acquire the action feature information generated from the first streaming audio segment;
adjust the action feature information generated from the first streaming audio segment by using the adjusted first fusion feature information as an offset, to obtain body movement information corresponding to the virtual character.
In the second aspect of the present application, the modules included in the device for displaying performance content of a virtual character can also be used to implement the steps in the various possible implementations of the first aspect. For the specific implementation of certain steps in the second aspect and its various possible implementations, and for the beneficial effects of each possible implementation, reference can be made to the descriptions of the various possible implementations of the first aspect, which are not repeated here.
第三方面,本申请实施例提供了一种模型训练方法,包括:In a third aspect, an embodiment of the present application provides a model training method, including:
获取第一流式音频片段;基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息,其中,第一流式音频片段和第二流式音频片段包含于同一流式音频,第二音频片段的播放时间在第一音频片段的播放时间之前;基于第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息;基于虚拟角色的虚拟角色的面部表情信息和肢体动作信息与虚拟角色的真实的面部表情信息和肢体动作信息,得到目标损失,目标损失用于指示面部表情信息和肢体动作信息与真实的面部表情信息和肢体动作信息之间的差异;基于目标损失,更新待训练模型的参数,直至满足模型训练条件,得到目标模型。Acquire a first streaming audio segment; obtain first fused feature information based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second audio segment is before the playback time of the first audio segment; generate facial expression information and body movement information of a virtual character based on the first fused feature information; obtain a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, the target loss being used to indicate the difference between the facial expression information and body movement information and the real facial expression information and body movement information; based on the target loss, update the parameters of the model to be trained until the model training conditions are met, and obtain the target model.
本申请第三方面中,可以理解的是,该目标模型可以用于执行前述第一方面或第一方面的各种可能的实现方式中的步骤,此处不再一一赘述。In the third aspect of the present application, it can be understood that the target model can be used to execute the steps in the aforementioned first aspect or various possible implementation methods of the first aspect, which will not be described one by one here.
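The training procedure of the third aspect amounts to a standard supervised loop: predict facial expression and body movement information from the fused features, compare against the real (ground-truth) information to form the target loss, and update the model parameters. The sketch below is one possible PyTorch realization under assumed interfaces; the model signature, data-loader fields and equal loss weighting are placeholders, not the patented design.

```python
import torch
import torch.nn as nn

def train(model, data_loader, epochs=10, lr=1e-4, device="cpu"):
    """Assumed interface: model(audio_feat, prev_motion) -> (pred_expr, pred_motion)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for audio_feat, prev_motion, gt_expr, gt_motion in data_loader:
            audio_feat, prev_motion = audio_feat.to(device), prev_motion.to(device)
            gt_expr, gt_motion = gt_expr.to(device), gt_motion.to(device)

            pred_expr, pred_motion = model(audio_feat, prev_motion)
            # target loss: difference between predicted and real expressions/motions
            loss = mse(pred_expr, gt_expr) + mse(pred_motion, gt_motion)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```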
第四方面,本申请实施例提供了一种模型训练装置,包括:In a fourth aspect, an embodiment of the present application provides a model training device, comprising:
流式音频片段获取模块,用于获取第一流式音频片段;A streaming audio segment acquisition module, used to acquire a first streaming audio segment;
融合特征信息生成模块，用于基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息，得到第一融合特征信息，其中，第一流式音频片段和第二流式音频片段包含于同一流式音频，第二流式音频片段的播放时间在第一流式音频片段的播放时间之前；a fusion feature information generating module, configured to obtain first fused feature information based on feature information of a first streaming audio segment and action feature information generated according to a second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second streaming audio segment is before a playback time of the first streaming audio segment;
表演内容生成模块,用于基于第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息;A performance content generation module, used to generate facial expression information and body movement information of the virtual character based on the first fused feature information;
目标损失获取模块，用于基于虚拟角色的面部表情信息和肢体动作信息与虚拟角色的真实的面部表情信息和肢体动作信息，得到目标损失，目标损失用于指示面部表情信息和肢体动作信息与真实的面部表情信息和肢体动作信息之间的差异；a target loss acquisition module, configured to obtain a target loss based on the generated facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, wherein the target loss is used to indicate the difference between the generated facial expression information and body movement information and the real facial expression information and body movement information;
目标模型训练模块,用于基于目标损失,更新待训练模型的参数,直至满足模型训练条件,得到目标模型。The target model training module is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
第五方面,本申请提供了一种计算设备集群,包括至少一个计算设备,每个计算设备包括处理器和存储器;In a fifth aspect, the present application provides a computing device cluster, comprising at least one computing device, each computing device comprising a processor and a memory;
该至少一个计算设备的处理器用于执行该至少一个计算设备的存储器中存储的指令,以使得该计算设备集群执行上述第一方面或第一方面的任一可能的实现方式中的方法。The processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method in the above-mentioned first aspect or any possible implementation manner of the first aspect.
第六方面,本申请实施例提供了一种训练设备,包括处理器和存储器,处理器与存储器耦合。存储器,用于存储程序。处理器,用于执行存储器中的程序,使得执行设备执行上述第三方面的方法。In a sixth aspect, an embodiment of the present application provides a training device, including a processor and a memory, wherein the processor is coupled to the memory. The memory is used to store a program. The processor is used to execute the program in the memory, so that the execution device executes the method of the third aspect above.
第七方面,本申请实施例提供了一种计算机存储介质,其特征在于,包括指令,当指令在计算设备上运行时,使得计算设备执行上述第一方面或第一方面的任一可能的实现方式中的方法,或上述第三方面的方法。In the seventh aspect, an embodiment of the present application provides a computer storage medium, characterized in that it includes instructions, which, when executed on a computing device, enable the computing device to execute the method in the above-mentioned first aspect or any possible implementation of the first aspect, or the method in the above-mentioned third aspect.
第八方面，本申请提供了一种包含指令的计算机程序产品，当所述指令被计算设备集群运行时，使得所述计算设备集群执行上述第一方面或第一方面的任一可能的实现方式中的方法，或上述第三方面的方法。In an eighth aspect, the present application provides a computer program product comprising instructions which, when run by a computing device cluster, cause the computing device cluster to perform the method in the first aspect or any possible implementation of the first aspect, or the method in the third aspect.
第九方面,本申请实施例提供了一种电路系统,电路系统包括处理电路,处理电路配置为执行上述第一方面或第一方面的任一种可能的实现方式所述的方法,或上述第三方面的方法。In a ninth aspect, an embodiment of the present application provides a circuit system, the circuit system including a processing circuit, the processing circuit being configured to execute the method described in the first aspect or any possible implementation of the first aspect, or the method of the third aspect.
第十方面,本申请提供了一种芯片系统,包括处理器和存储器,存储器用于存储计算机程序,处理器用于调用并运行存储器中存储的计算机程序,以执行如上述第一方面或第一方面的任一种可能的实现方式所述的方法,或上述第三方面的方法。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。In a tenth aspect, the present application provides a chip system, including a processor and a memory, the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method as described in the first aspect or any possible implementation of the first aspect, or the method of the third aspect. The chip system can be composed of a chip, or it can include a chip and other discrete devices.
上述第二至第十方面的方案,用于实现或配合实现上述第一方面或其中任一种可能的实现方式中的方法,因此能够与第一方面达到相同或相应的有益效果,此处不再进行赘述。The solutions of the second to tenth aspects mentioned above are used to implement or cooperate with the method of the first aspect or any possible implementation method thereof, and therefore can achieve the same or corresponding beneficial effects as the first aspect, and will not be repeated here.
图1为本申请实施例应用的人工智能主体框架的一种结构示意图;FIG1 is a schematic diagram of a structure of an artificial intelligence main framework used in an embodiment of the present application;
图2为本申请实施例提供的一种系统架构图;FIG2 is a system architecture diagram provided in an embodiment of the present application;
图3为本申请实施例提供的虚拟角色的表演内容展示方法的一种流程示意图;FIG3 is a flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application;
图4为本申请实施例提供的另一种系统架构图;FIG4 is another system architecture diagram provided in an embodiment of the present application;
图5为本申请实施例提供的虚拟角色的表演内容展示方法的另一种流程示意图;FIG5 is another schematic flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application;
图6为本申请实施例提供的生成第一融合特征信息的一种流程示意图;FIG6 is a schematic diagram of a process for generating first fusion feature information provided in an embodiment of the present application;
图7为本申请实施例提供的风格化处理模块的一种结构示意图;FIG7 is a schematic diagram of a structure of a stylization processing module provided in an embodiment of the present application;
图8为本申请实施例提供的生成虚拟角色的面部表情信息的一种流程示意图;FIG8 is a schematic diagram of a process for generating facial expression information of a virtual character provided in an embodiment of the present application;
图9为本申请实施例提供的多人动作解码器的一种结构示意图;FIG9 is a schematic diagram of a structure of a multi-person action decoder provided in an embodiment of the present application;
图10为本申请实施例提供的生成虚拟角色的肢体动作信息的一种流程示意图;FIG10 is a schematic diagram of a flow chart of generating body movement information of a virtual character provided in an embodiment of the present application;
图11为本申请实施例提供的虚拟角色的表演内容展示方法的另一种流程示意图;FIG11 is another schematic flow chart of a method for displaying performance content of a virtual character provided in an embodiment of the present application;
图12为本申请实施例提供的生成虚拟角色的面部表情信息的另一种流程示意图;FIG12 is another schematic diagram of a process for generating facial expression information of a virtual character provided in an embodiment of the present application;
图13为本申请实施例提供的生成虚拟角色的肢体动作信息的另一种流程示意图;FIG13 is another schematic diagram of a flow chart of generating body movement information of a virtual character provided in an embodiment of the present application;
图14为本申请实施例提供的模型训练方法的一种流程示意图;FIG14 is a flow chart of a model training method provided in an embodiment of the present application;
图15为本申请实施例提供的虚拟角色的表演内容展示装置的一种结构示意图;FIG15 is a schematic diagram of a structure of a device for displaying performance content of a virtual character provided in an embodiment of the present application;
图16为本申请实施例提供的模型训练装置的一种结构示意图;FIG16 is a schematic diagram of a structure of a model training device provided in an embodiment of the present application;
图17为本申请实施例提供的计算设备的一种结构示意图;FIG17 is a schematic diagram of a structure of a computing device provided in an embodiment of the present application;
图18为本申请实施例提供的计算机设备集群的一种结构示意图;FIG18 is a schematic diagram of a structure of a computer device cluster provided in an embodiment of the present application;
图19为本申请实施例提供的训练设备的一种结构示意图；FIG19 is a schematic diagram of a structure of a training device provided in an embodiment of the present application;
图20为本申请实施例提供的芯片的一种结构示意图。FIG. 20 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请的部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。本领域技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。The following will be combined with the drawings in the embodiments of the present application to clearly and completely describe the technical solutions in the embodiments of the present application. Obviously, the described embodiments are only some embodiments of the present application, not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without making creative work are within the scope of protection of this application. It is known to those skilled in the art that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。The terms "first", "second", etc. in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way can be interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having" and any of their variations are intended to cover non-exclusive inclusions, for example, a process, method, system, product or device that includes a series of steps or modules is not necessarily limited to those steps or modules that are clearly listed, but may include other steps or modules that are not clearly listed or inherent to these processes, methods, products or devices.
本申请中出现的术语"和/或"，可以是一种描述关联对象的关联关系，表示可以存在三种关系，例如，A和/或B，可以表示：单独存在A，同时存在A和B，单独存在B这三种情况。另外，本申请中字符"/"，一般表示前后关联对象是一种"或"的关系。The term "and/or" in this application describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this application generally indicates an "or" relationship between the associated objects before and after it.
还应该注意的是,在一些替代实施中,所注明的功能/动作可以不按附图的顺序出现。例如,取决于涉及的功能/动作,事实上可以实质上同时发生或可以有时以相反的顺序执行连续示出的两个附图。It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order of the drawings. For example, two figures shown in succession may in fact occur substantially simultaneously or may sometimes be performed in the reverse order, depending on the functions/acts involved.
本申请实施例,除非另有说明,“至少一个”的含义是指一个或多个,“多个”的含义是指两个或两个以上。可以理解,在本申请中,“当…时”、“若”以及“如果”均指在某种客观情况下装置会做出相应的处理,并非是限定时间,且也不要求装置实现时一定要有判断的动作,也不意味着存在其它限定。另外,专用的词“示例性”意为“用作例子、实施例或说明性”。作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。In the embodiments of the present application, unless otherwise specified, the meaning of "at least one" refers to one or more, and the meaning of "plurality" refers to two or more. It is understood that in the present application, "when", "if" and "if" all refer to the device making corresponding processing under certain objective circumstances, and do not limit the time, nor do they require that there must be a judgment action when the device is implemented, nor do they mean that there are other limitations. In addition, the special word "exemplary" means "used as an example, embodiment or illustrative". Any embodiment described as "exemplary" is not necessarily interpreted as being superior or better than other embodiments.
本申请提供的虚拟角色的表演内容展示方法可以应用于计算机技术领域中,尤其是人工智能(artificial intelligence,AI)领域中。AI是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。The performance content display method of the virtual character provided in this application can be applied to the field of computer technology, especially in the field of artificial intelligence (AI). AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robots, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。The embodiments of the present application are described below in conjunction with the accompanying drawings. It is known to those skilled in the art that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
首先对人工智能系统总体工作流程进行描述，请参阅图1，本申请实施例应用的人工智能主体框架的一种结构示意图，下面从"智能信息链"(水平轴)和"IT价值链"(垂直轴)两个维度对上述人工智能主体框架进行阐述。其中，"智能信息链"反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了"数据—信息—知识—智慧"的凝练过程。"IT价值链"从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1, which is a schematic structural diagram of the artificial intelligence main framework applied in the embodiments of the present application. The main framework is explained below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementation) to the industrial ecology of the system.
(1)基础设施:(1) Infrastructure:
基础设施为人工智能系统提供计算能力支持，实现与外部世界的沟通，并通过基础平台实现支撑。通过传感器与外部沟通；计算能力由智能芯片(如中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)等硬件加速芯片)提供；基础平台包括分布式计算框架及网络等相关的平台保障和支持，可以包括云存储和计算、互联互通网络等。举例来说，传感器和外部沟通获取数据，这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips, such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA) and other hardware acceleration chips; the basic platform includes distributed computing frameworks, networks and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
(2)数据(2) Data
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、视频、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。The data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, video, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
(3)数据处理(3) Data processing
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。Among them, machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
(4)通用能力(4) General capabilities
对数据经过上面提到的数据处理后，进一步基于数据处理的结果可以形成一些通用的能力，比如可以是算法或者一个通用系统，例如，翻译，文本的分析，计算机视觉的处理(如图像识别、目标检测等)，语音识别等等。After the data undergoes the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example, an algorithm or a general-purpose system, such as translation, text analysis, computer vision processing (such as image recognition and object detection), speech recognition, and so on.
(5)智能产品及行业应用(5) Smart products and industry applications
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,智能终端等。Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart manufacturing, smart transportation, smart home, smart medical care, smart security, autonomous driving, smart terminals, etc.
本申请实施例涉及了大量神经网络的相关应用,为了更好地理解本申请实施例的方案,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。The embodiments of the present application involve a large number of neural network-related applications. In order to better understand the solutions of the embodiments of the present application, the relevant terms and concepts of the neural network that may be involved in the embodiments of the present application are first introduced below.
(1)神经网络(1) Neural Network
神经网络可以是由神经单元组成的，例如，神经单元可以是指以x_s为输入的运算单元，该运算单元的输出可以如以下所示：A neural network may be composed of neural units. For example, a neural unit may refer to an operation unit that takes x_s as input, and the output of the operation unit may be as follows:

h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。Where s=1, 2, ...n, n is a natural number greater than 1, Ws is the weight of xs , and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal. The output signal of the activation function can be used as the input of the next convolution layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.
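As a concrete illustration of the unit described by this formula, the following sketch computes the output of a single neural unit with a sigmoid activation; the input values and weights are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b), with f = sigmoid."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
w = np.array([0.8, 0.1, -0.4])   # weights W_s
print(neural_unit(x, w, b=0.2))
```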
(2)循环神经网络(2) Recurrent Neural Network
循环神经网络(recurrent neural network,RNN)是用来处理序列数据的。在传统的神经网络模型中，是从输入层到隐含层再到输出层，层与层之间是全连接的，而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题，但是仍然对很多问题无能无力。例如，你要预测句子的下一个单词是什么，一般需要用到前面的单词，因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网络，即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中，即隐含层本层之间的节点不再无连接而是有连接的，并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。A recurrent neural network (RNN) is used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer and then to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network has solved many difficult problems, it is still powerless against many problems. For example, to predict the next word in a sentence, the preceding words are generally needed, because the words in a sentence are not independent of each other. The RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Specifically, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
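The recurrence described here, in which the hidden layer receives both the current input and the hidden output of the previous moment, can be sketched as a vanilla RNN cell; the tanh activation and the matrix shapes below are illustrative choices.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """xs: list of input vectors x_t.
    Returns the hidden states h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        # the current output depends on the current input and the previous hidden state
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states
```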
(3)Transformer结构(3) Transformer structure
Transformer结构是一种包含编码器(Encoder)与解码器(Decoder)的特征提取网络(类似于卷积神经网络)。The Transformer structure is a feature extraction network (similar to a convolutional neural network) that includes an encoder and a decoder.
编码器:通过自注意力的方式在全局感受野下进行特征学习,例如像素点的特征。Encoder: Performs feature learning in the global receptive field through self-attention, such as pixel features.
解码器:通过自注意力与交叉注意力来学习所需模块的特征,例如输出框的特征。Decoder: Learn the features of the required modules, such as the features of the output box, through self-attention and cross-attention.
(4)多层感知机(Multi-Layer Perceptron,MLP)(4) Multi-Layer Perceptron (MLP)
多层感知机，也可以称为多层感知器，是一种前馈人工神经网络模型。MLP是一种基于全连接(Fully-Connected,FC)前向结构的人工神经网络(Artificial Neural Network,ANN)，其包含从十几个到成百上千不等的人工神经元(Artificial Neuron,AN,下文简称神经元)。MLP将神经元组织为多层的结构，层间采用全连接方法，形成逐层连接的多权连接层的ANN。一般来讲，MLP包含一个输入层(该层实际不包含运算)、一个或者多个隐层以及一个输出层。A multi-layer perceptron (MLP) is a feedforward artificial neural network model. An MLP is an artificial neural network (ANN) based on a fully-connected (FC) feedforward structure, containing from a dozen to hundreds or even thousands of artificial neurons (AN, hereinafter referred to as neurons). An MLP organizes neurons into a multi-layer structure with full connections between layers, forming an ANN of layer-by-layer weighted connection layers. Generally speaking, an MLP contains an input layer (which does not actually perform operations), one or more hidden layers, and an output layer.
(5)损失函数(loss function)(5) Loss function
损失函数也可以称为代价函数(cost function),一种比较机器学习模型对样本的预测输出和样本的真实值(也可以称为监督值)区别的度量,即用于衡量机器学习模型对样本的预测输出和样本的真实值之间的区别。The loss function can also be called the cost function, which is a metric that compares the difference between the predicted output of a machine learning model for a sample and the true value of the sample (also called the supervised value).
该损失函数通常可以包括均方误差、交叉熵、对数、指数等损失函数。例如，可以使用均方误差作为损失函数，定义为 MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}，其中 y_i 为样本的真实值，\hat{y}_i 为模型的预测值，N 为样本数量，具体可以根据实际应用场景选择具体的损失函数。The loss function may typically be a mean-squared-error, cross-entropy, logarithmic, or exponential loss function. For example, the mean squared error may be used as the loss function, defined as MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, where y_i is the true value of a sample, \hat{y}_i is the predicted value, and N is the number of samples; the specific loss function may be selected according to the actual application scenario.
本申请可以应用于上述各个应用领域中基于音乐生成虚拟角色的面部表情信息和肢体动作信息的应用场景中。具体的,作为示例,例如,在需要对电影、游戏、大型晚会、虚拟直播、车载大屏娱乐等场景中的音乐生成对应的虚拟角色的面部表情信息和肢体动作信息时,可以基于电影中的配乐、游戏中的音乐、大型晚会或者哼唱等场景中的音乐生成不同场景下对应的虚拟角色的面部表情信息和肢体动作信息。此处不对本申请实施例的应用场景进行穷举。The present application can be applied to application scenarios in which facial expression information and body movement information of virtual characters are generated based on music in the above-mentioned application fields. Specifically, as an example, when it is necessary to generate facial expression information and body movement information of virtual characters corresponding to music in scenes such as movies, games, large-scale evening parties, virtual live broadcasts, and in-car large-screen entertainment, facial expression information and body movement information of virtual characters corresponding to different scenes can be generated based on the soundtrack in the movie, the music in the game, the music in the large-scale evening party or humming, etc. The application scenarios of the embodiments of the present application are not exhaustively listed here.
在对本申请实施例提供的虚拟角色的表演内容展示方法进行详细介绍之前,先结合图2对本申请实施例提供的系统架构进行介绍。Before introducing in detail the method for displaying the performance content of a virtual character provided in the embodiment of the present application, the system architecture provided in the embodiment of the present application is first introduced in conjunction with Figure 2.
请参阅图2,图2为本申请实施例提供的一种系统架构图。在图2所示的实施例中,该系统架构200包括数据库230、客户设备240。数据采集设备260用于采集数据并存入数据库230,训练模块202基于数据库230中维护的数据生成目标模型/规则201。下面将更详细地描述训练模块202如何基于数据得到目标模型/规则201,目标模型/规则201即本申请以下实施方式中所提及的目标模型,具体参阅以下图5-图14中的相关描述。Please refer to Figure 2, which is a system architecture diagram provided in an embodiment of the present application. In the embodiment shown in Figure 2, the system architecture 200 includes a database 230 and a client device 240. The data acquisition device 260 is used to collect data and store it in the database 230, and the training module 202 generates a target model/rule 201 based on the data maintained in the database 230. The following will describe in more detail how the training module 202 obtains the target model/rule 201 based on the data. The target model/rule 201 is the target model mentioned in the following implementation of the present application. Please refer to the relevant descriptions in the following Figures 5-14.
计算模块211可以包括训练模块202,训练模块202得到的目标模型/规则可以应用不同的系统或设备中。在附图2中,执行设备210配置收发器212,该收发器212可以是无线收发器、光收发器或有线接口(如I/O接口)等,与外部设备进行数据交互,“用户”可以通过客户设备240向收发器212输入数据,例如,客户设备240可以向执行设备210发送目标任务,请求执行设备训练神经网络,并向执行设备210发送用于训练的数据库。The calculation module 211 may include a training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices. In FIG2 , the execution device 210 is configured with a transceiver 212, which may be a wireless transceiver, an optical transceiver, or a wired interface (such as an I/O interface), etc., to interact with external devices for data, and a "user" may input data to the transceiver 212 through a client device 240. For example, the client device 240 may send a target task to the execution device 210, request the execution device to train a neural network, and send a database for training to the execution device 210.
执行设备210可以调用数据存储系统250中的数据、代码等,也可以将数据、指令等存入数据存储系统250中。The execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
计算模块211使用目标模型/规则201对输入的数据进行处理。具体地,一种可能的实现方式中,请参阅图3,图3为本申请实施例提供的虚拟角色的表演内容展示方法的一种流程示意图。该方法可以由如图2中所示的计算模块211执行,具体的,由执行设备210执行。执行设备210用于:S1、获取第一流式音频片段;S2、基于所述第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息,所述第一流式音频片段和所述第二流式音频片段包含于同一流式音频,所述第二音频片段的播放时间在所述第一音频片段的播放时间之前;S3、基于所述第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。The calculation module 211 uses the target model/rule 201 to process the input data. Specifically, in a possible implementation, please refer to Figure 3, which is a flow chart of a method for displaying the performance content of a virtual character provided in an embodiment of the present application. The method can be executed by the calculation module 211 as shown in Figure 2, specifically, by the execution device 210. The execution device 210 is used to: S1, obtain a first streaming audio segment; S2, based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment, obtain first fusion feature information, the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playback time of the second audio segment is before the playback time of the first audio segment; S3, generate facial expression information and body movement information of the virtual character based on the first fusion feature information.
最后,收发器212将目标模型的输出结果返回给客户设备240。如用户可以通过客户设备240输入一段音频片段,通过目标模型输出虚拟角色的面部表情信息和肢体动作信息,并反馈给客户设备240。Finally, the transceiver 212 returns the output result of the target model to the client device 240. For example, the user can input an audio clip through the client device 240, and the facial expression information and body movement information of the virtual character are output through the target model and fed back to the client device 240.
更深层地,训练模块202可以针对不同的任务,基于不同的数据得到相应的目标模型/规则201,以给用户提供更佳的结果。More deeply, the training module 202 can obtain corresponding target models/rules 201 for different tasks based on different data to provide users with better results.
在附图2中所示情况下,可以根据用户的输入数据确定输入执行设备210中的数据,例如,用户可以在收发器212提供的界面中操作。另一种情况下,客户设备240可以自动地向收发器212输入数据并获得结果,若客户设备240自动输入数据需要获得用户的授权,用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果,具体的呈现形式可以是表情、动作等具体方式。客户设备240也可以作为数据采集端将采集到与目标任务关联的数据存入数据库230。In the case shown in FIG. 2 , the data input into the execution device 210 can be determined based on the user's input data. For example, the user can operate in the interface provided by the transceiver 212. In another case, the client device 240 can automatically input data into the transceiver 212 and obtain the result. If the automatic data input of the client device 240 requires the user's authorization, the user can set the corresponding authority in the client device 240. The user can view the result output by the execution device 210 in the client device 240. The specific presentation form can be specific forms such as expressions and actions. The client device 240 can also serve as a data collection terminal to store the collected data associated with the target task into the database 230.
在本申请所提及的训练或者更新过程可以由训练模块202来执行。可以理解的是,神经网络的训练过程即学习控制空间变换的方式,更具体即学习权重矩阵。训练神经网络的目的是使神经网络的输出尽可能接近期望值,因此可以通过比较当前网络的预测值和期望值,再根据两者之间的差异情况来更新神经网络中的每一层神经网络的权重向量(当然,在第一次更新之前通常可以先对权重向量进行初始化,即为深度神经网络中的各层预先配置参数)。例如,如果网络的预测值过高,则调整权重矩阵中的权重的值从而降低预测值,经过不断的调整,直到神经网络输出的值接近期望值或者等于期望值。具体地,可以通过损失函数(loss function)或目标函数(objective function)来衡量神经网络的预测值和期望值之间的差异。以损失函数举例,损失函数的输出值(loss)越高表示差异越大,神经网络的训练可以理解为尽可能缩小loss的过程。本申请以下实施方式中更新起点网络的权重以及对串行网络进行训练的过程可以参阅此过程,以下不再赘述。The training or updating process mentioned in the present application can be performed by the training module 202. It is understandable that the training process of the neural network is to learn the way to control the spatial transformation, more specifically, to learn the weight matrix. The purpose of training the neural network is to make the output of the neural network as close to the expected value as possible. Therefore, the weight vector of each layer of the neural network in the neural network can be updated according to the difference between the predicted value and the expected value of the current network by comparing the predicted value and the expected value of the current network (of course, the weight vector can usually be initialized before the first update, that is, the parameters of each layer in the deep neural network are pre-configured). For example, if the predicted value of the network is too high, the value of the weight in the weight matrix is adjusted to reduce the predicted value, and after continuous adjustment, the value output by the neural network is close to or equal to the expected value. Specifically, the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference. The training of the neural network can be understood as the process of minimizing the loss as much as possible. The process of updating the weight of the starting network and training the serial network in the following embodiments of the present application can refer to this process, which will not be repeated below.
如图2所示，根据训练模块202训练得到目标模型/规则201，该目标模型/规则201在本申请实施例中可以包括深度卷积神经网络(deep convolutional neural networks,DCNN)、循环神经网络(recurrent neural network,RNN)等等网络。本申请提及的神经网络可以包括多种类型，如深度神经网络(deep neural network,DNN)、卷积神经网络(convolutional neural network,CNN)、循环神经网络(recurrent neural networks,RNN)或残差网络等其他神经网络。As shown in FIG. 2, a target model/rule 201 is obtained through training by the training module 202. In the embodiments of the present application, the target model/rule 201 may include networks such as deep convolutional neural networks (DCNN) and recurrent neural networks (RNN). The neural networks mentioned in this application may be of various types, such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a residual network, or other neural networks.
其中,在训练阶段,数据库230可以用于存储用于训练的样本集。执行设备210生成用于处理样本的目标模型/规则201,并利用数据库中的样本集合对目标模型/规则201进行迭代训练,得到成熟的目标模型/规则201,该目标模型/规则201具体表现为神经网络。执行设备210得到的神经网络可以应用不同的系统或设备中。In the training phase, the database 230 can be used to store sample sets for training. The execution device 210 generates a target model/rule 201 for processing samples, and iteratively trains the target model/rule 201 using the sample set in the database to obtain a mature target model/rule 201, which is specifically represented by a neural network. The neural network obtained by the execution device 210 can be applied to different systems or devices.
在推理阶段,执行设备210可以调用数据存储系统250中的数据、代码等,也可以将数据、指令等存入数据存储系统250中。数据存储系统250可以置于执行设备210中,也可以为数据存储系统250相对执行设备210是外部存储器。计算模块211可以通过神经网络对执行设备210获取到的样本进行处理,得到预测结果,预测结果的具体表现形式与神经网络的功能相关。In the inference stage, the execution device 210 can call the data, code, etc. in the data storage system 250, or store the data, instructions, etc. in the data storage system 250. The data storage system 250 can be placed in the execution device 210, or the data storage system 250 can be an external memory relative to the execution device 210. The calculation module 211 can process the samples obtained by the execution device 210 through the neural network to obtain the prediction result. The specific expression form of the prediction result is related to the function of the neural network.
需要说明的是,附图2仅是本申请实施例提供的一种系统架构的示例性的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。例如,在附图2中,数据存储系统250相对执行设备210是外部存储器,在其它场景中,也可以将数据存储系统250置于执行设备210中。It should be noted that FIG2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG2, the data storage system 250 is an external memory relative to the execution device 210. In other scenarios, the data storage system 250 can also be placed in the execution device 210.
根据训练模块202训练得到的目标模型/规则201可以应用于不同的系统或设备中,如应用于手机,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端设备等。The target model/rule 201 trained by the training module 202 can be applied to different systems or devices, such as mobile phones, tablet computers, laptops, augmented reality (AR)/virtual reality (VR), vehicle terminals, etc., and can also be servers or cloud devices.
该目标模型/规则201在本申请实施例中可以是用于执行本申请提供的虚拟角色的表演内容展示方法的模型,即该目标模型/规则201可以是本申请提供的用于生成虚拟角色的面部表情信息和肢体动作信息的神经网络。具体的,本申请实施例提供的目标模型可以包括CNN,深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNN)等等中的一种或者多种网络。The target model/rule 201 in the embodiment of the present application may be a model for executing the performance content display method of the virtual character provided in the present application, that is, the target model/rule 201 may be a neural network provided in the present application for generating facial expression information and body movement information of the virtual character. Specifically, the target model provided in the embodiment of the present application may include one or more networks such as CNN, deep convolutional neural networks (DCNN), recurrent neural networks (RNN), etc.
请参阅图4,图4为本申请实施例提供的另一种系统架构图。在系统架构400中,执行设备210由一个或多个服务器实现,可选的,与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备;执行设备210可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备210可以使用数据存储系统250中的数据,或者调用数据存储系统250中的程序代码实现本申请以下图14对应的用于计算设备的训练方法的步骤。Please refer to Figure 4, which is another system architecture diagram provided by an embodiment of the present application. In the system architecture 400, the execution device 210 is implemented by one or more servers, and optionally cooperates with other computing devices, such as data storage, routers, load balancers and other devices; the execution device 210 can be arranged at one physical site, or distributed at multiple physical sites. The execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the steps of the training method for the computing device corresponding to the following Figure 14 of the present application.
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备210进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。Users can operate their respective user devices (e.g., local device 401 and local device 402) to interact with execution device 210. Each local device can represent any computing device, such as a personal computer, a computer workstation, a smart phone, a tablet computer, a smart camera, a smart car or other type of cellular phone, a media consumption device, a wearable device, a set-top box, a game console, etc.
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备210进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。具体地,该通信网络可以包括无线网络、有线网络或者无线网络与有线网络的组合等。该无线网络包括但不限于:第五代移动通信技术(5th-Generation,5G)系统,长期演进(long term evolution,LTE)系统、全球移动通信系统(global system for mobile communication,GSM)或码分多址(code division multiple access,CDMA)网络、宽带码分多址(wideband code division multiple access,WCDMA)网络、无线保真(wireless fidelity,WiFi)、蓝牙(bluetooth)、紫蜂协议(Zigbee)、射频识别技术(radio frequency identification,RFID)、远程(Long Range,Lora)无线通信、近距离无线通信(near field communication,NFC)中的任意一种或多种的组合。该有线网络可以包括光纤通信网络或同轴电缆组成的网络等。The local device of each user can interact with the execution device 210 through a communication network of any communication mechanism/communication standard, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof. Specifically, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, etc. The wireless network includes but is not limited to: a fifth-generation mobile communication technology (5th-Generation, 5G) system, a long-term evolution (long term evolution, LTE) system, a global system for mobile communication (global system for mobile communication, GSM) or a code division multiple access (code division multiple access, CDMA) network, a wideband code division multiple access (wideband code division multiple access, WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), radio frequency identification technology (radio frequency identification, RFID), long-range (Lora) wireless communication, and near-field wireless communication (NFC) Any one or more combinations. The wired network may include an optical fiber communication network or a network composed of coaxial cables, etc.
在另一种实现中,执行设备210的一个方面或多个方面可以由每个本地设备实现,例如,本地设备401可以为执行设备210提供本地数据或反馈计算结果。该本地设备也可以称为计算设备。In another implementation, one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 401 may provide local data or feedback calculation results to the execution device 210. The local device may also be referred to as a computing device.
需要注意的是，执行设备210的所有功能也可以由本地设备实现。例如，本地设备401实现执行设备210的功能并为自己的用户提供服务，或者为本地设备402的用户提供服务。It should be noted that all functions of the execution device 210 can also be implemented by the local device. For example, the local device 401 implements the functions of the execution device 210 and provides services to its own user, or provides services to the user of the local device 402.
为了更好的理解本申请实施例提供的技术方案,下面首先对目前采用的虚拟角色的表演内容展示方法进行介绍。In order to better understand the technical solution provided by the embodiments of the present application, the currently used method for displaying the performance content of a virtual character is first introduced below.
一、实时动捕1. Real-time Motion Capture
实时动捕方案主要包含穿戴设备、实时动捕以及角色驱动三个步骤。首先,由舞蹈人员提前穿戴好动捕服并完成校准;然后,工作人员开始跳舞并通过动捕软件实时采集对应的肢体运动序列;最后,将采集好的肢体运动序列通过运动重定向的方式迁移到待驱动的虚拟角色上,驱动虚拟角色展示音乐对应的肢体动作。The real-time motion capture solution mainly includes three steps: wearable devices, real-time motion capture, and character driving. First, the dancers put on the motion capture suits in advance and complete the calibration; then, the staff starts dancing and uses the motion capture software to collect the corresponding body movement sequences in real time; finally, the collected body movement sequences are transferred to the virtual character to be driven through motion redirection, driving the virtual character to show the body movements corresponding to the music.
实时动捕采用真人驱动的方式,可以实时地通过工作人员来驱动虚拟角色,而且肢体动作真实、丰富。Real-time motion capture uses a real-person driven approach, which allows staff to drive virtual characters in real time, and the body movements are realistic and rich.
但是,实时动捕的成本较高,且对工作人员的专业性要求较高。此外,实时动捕的方式不具备泛化性,每次都需要重新采集工作人员的肢体动作来驱动虚拟角色,不适用于有大量需求的应用场景。However, real-time motion capture is expensive and requires a high level of professionalism from the staff. In addition, the real-time motion capture method is not generalizable and requires the staff's body movements to be re-collected each time to drive the virtual character, which is not suitable for application scenarios with large demands.
二、基于算法生成肢体动作2. Generate body movements based on algorithms
基于算法生成肢体动作可以极大提升肢体动作生成的效率。相比较单纯生成肢体动作,目前更多的注意力放在基于音乐生成肢体动作。即,输入完整音乐,利用算法生成与输入音乐在节奏、风格等维度匹配的肢体动作。Generating body movements based on algorithms can greatly improve the efficiency of body movement generation. Compared with simply generating body movements, more attention is currently paid to generating body movements based on music. That is, input complete music and use algorithms to generate body movements that match the input music in terms of rhythm, style, and other dimensions.
在一些方案中,基于算法生成肢体动作的方案效率高而且泛化性好,但主要是离线模型,即需要输入完整的音乐来生成对应的肢体动作。这就需要从完整的音乐中提取音乐的节奏等信息。但是,离线模型的响应时间长,限制了音乐驱动肢体动作生成的潜在应用场景。In some solutions, the algorithm-based solution for generating body movements is efficient and generalizable, but it is mainly an offline model, that is, it is necessary to input complete music to generate corresponding body movements. This requires extracting information such as the rhythm of the music from the complete music. However, the offline model has a long response time, which limits the potential application scenarios of music-driven body movement generation.
在另一些方案中,主要是生成单个虚拟角色的舞蹈动作,不具备多角色间协作的能力。此外,无法同步驱动多个虚拟角色的面部和肢体。In other solutions, the dance movements of a single virtual character are mainly generated, and the ability to collaborate between multiple characters is not available. In addition, the faces and limbs of multiple virtual characters cannot be driven synchronously.
在另一些方案中,生成的舞蹈动作风格固定,无法很好地适配不同类型角色的风格。In other solutions, the generated dance moves are of fixed style and cannot be well adapted to the styles of different types of characters.
相比于上述方案，在本申请实施例中，第一方面，基于输入的流式音频片段实时输出虚拟角色的表演内容，降低了生成表演内容的响应时间。第二方面，通过参考多个虚拟角色在面部表情上和肢体动作上的关联特征，实现了多个虚拟角色在面部表情和肢体动作上的有效协作与配合。第三方面，通过添加风格化标签，有效适配不同类型的虚拟角色的风格。第四方面，通过同步输出虚拟角色的肢体动作信息和表演信息，实现了虚拟角色边唱边跳的效果。Compared with the above solutions, in the embodiments of the present application, first, the performance content of the virtual character is output in real time based on the input streaming audio segments, which reduces the response time of generating the performance content; second, by referring to the associated features of multiple virtual characters in facial expressions and body movements, effective collaboration and coordination among multiple virtual characters in facial expressions and body movements is achieved; third, by adding stylized labels, the styles of different types of virtual characters are effectively adapted to; fourth, by synchronously outputting the body movement information and performance information of the virtual character, the effect of the virtual character singing and dancing at the same time is achieved.
为了解决上述问题,本申请实施例提供了一种虚拟角色的表演内容展示方法。下面结合附图和应用场景,对本申请实施例提供的虚拟角色的表演内容展示方法的推理阶段和训练阶段的具体实现流程进行描述。In order to solve the above problems, the embodiment of the present application provides a method for displaying the performance content of a virtual character. The following describes the specific implementation process of the reasoning phase and the training phase of the method for displaying the performance content of a virtual character provided by the embodiment of the present application in combination with the accompanying drawings and application scenarios.
一、推理阶段1. Reasoning Stage
为了便于说明,以下分别对多虚拟角色和单个虚拟角色场景进行说明。For ease of explanation, the following describes multiple virtual character scenarios and a single virtual character scenario respectively.
一种可选的实现方式中,可以根据用户选定的虚拟角色的个数来执行对应的实施方式,生成用户可控的虚拟角色的表演内容信息。以使得执行设备可以根据用户设定的虚拟角色的个数等参数来生成多个虚拟角色的面部表情信息和肢体动作信息,提升用户的交互性。In an optional implementation, the corresponding implementation method can be executed according to the number of virtual characters selected by the user to generate performance content information of the virtual characters controllable by the user. This allows the execution device to generate facial expression information and body movement information of multiple virtual characters according to parameters such as the number of virtual characters set by the user, thereby improving user interactivity.
(一)多虚拟角色场景(I) Multiple virtual character scenes
本申请实施例中，推理阶段描述的是执行设备210如何利用目标模型/规则201，对采集到的信息数据进行处理以生成预测结果的过程，具体地请参阅图5，图5为本申请实施例提供的虚拟角色的表演内容展示方法的另一种流程示意图，该方法可以由目标模型实现，包括步骤501至步骤517。In the embodiment of the present application, the inference stage describes a process in which the execution device 210 processes the collected information data by using the target model/rule 201 to generate a prediction result. For details, refer to FIG. 5, which is another schematic flowchart of the method for displaying performance content of a virtual character provided in an embodiment of the present application. The method may be implemented by the target model and includes steps 501 to 517.
步骤501,执行设备获取第一流式音频片段。Step 501: An execution device obtains a first streaming audio segment.
本申请实施例中,执行设备实时获取第一流式音频片段。In the embodiment of the present application, the execution device obtains the first streaming audio segment in real time.
可以理解的是，针对输入完整的音乐来生成对应的表演内容、响应时间较长的问题，流式音频表示音频可以按照流式的方式实时输入，而不需要输入完整的音频，从而实现边播放音乐边驱动角色的效果，能够适用于在线场景，降低响应时间。It can be understood that, to address the problem of a long response time when complete music needs to be input to generate the corresponding performance content, streaming audio means that the audio can be input in real time in a streaming manner without inputting the complete audio, thereby achieving the effect of driving the character while the music is playing; this is applicable to online scenarios and reduces the response time.
一种可选的实现方式中，执行设备获取第一流式音频片段的音频信息。In an optional implementation manner, the execution device obtains audio information of the first streaming audio segment.
该种实现方式中,在哼唱等场景下,执行设备获取的是第一流式音频片段的音频信息,需要从第一流式音频片段的音频信息中提取对应的文本信息。In this implementation, in scenarios such as humming, the execution device obtains audio information of the first stream audio segment, and needs to extract corresponding text information from the audio information of the first stream audio segment.
可选的,执行设备可以采用自动语音识别技术(Automatic Speech Recognition,ASR)从音频信息中解析出对应的文本信息。Optionally, the execution device may use automatic speech recognition technology (ASR) to parse corresponding text information from the audio information.
可以理解的是,在实际应用过程中,执行设备可以采用多种语音识别技术从音频信息中解析出对应的文本信息,在此不进行限定。It is understandable that, in actual application, the execution device may use a variety of speech recognition technologies to parse the corresponding text information from the audio information, which is not limited here.
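As one concrete, non-limiting example of such a speech recognition technology, an open-source ASR model such as OpenAI Whisper could be used to parse the text information; the model size and file path below are assumptions for illustration.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")               # small general-purpose ASR model
result = model.transcribe("first_segment.wav")   # hypothetical path to the streaming audio segment
segment_text = result["text"]                    # text information parsed from the audio
print(segment_text)
```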
另一种可选的实现方式中,执行设备获取第一流式音频片段的音频信息和文本信息。In another optional implementation, the execution device obtains the audio information and text information of the first streaming audio segment.
该种实现方式中,执行设备可以直接获取第一流式音频片段的文本信息。例如,执行设备可以直接获取带有歌词文本的第一流式音频片段,直接获取到第一流式音频片段的音频信息以及第一流式音频片段对应的文本信息,而不再需要从第一流式音频片段的音频信息中解析出文本信息。In this implementation, the execution device can directly obtain the text information of the first stream audio segment. For example, the execution device can directly obtain the first stream audio segment with lyrics text, directly obtain the audio information of the first stream audio segment and the text information corresponding to the first stream audio segment, and no longer need to parse the text information from the audio information of the first stream audio segment.
步骤502,执行设备从第一流式音频片段中提取第一流式音频片段的特征信息。Step 502: The execution device extracts feature information of the first streaming audio segment from the first streaming audio segment.
本申请实施例中,执行设备在获取到第一流式音频片段的音频信息和文本信息后,从第一流式音频片段中提取的第一流式音频片段的特征信息。其中,第一流式音频片段的特征信息包括第一音频特征信息和文本特征信息。In the embodiment of the present application, after acquiring the audio information and text information of the first stream audio segment, the execution device extracts the feature information of the first stream audio segment from the first stream audio segment, wherein the feature information of the first stream audio segment includes the first audio feature information and the text feature information.
可选的，执行设备采用librosa、madmom等开源库从音频信息中提取出Onset、Chromagram等第一音频特征信息。Optionally, the execution device uses open source libraries such as librosa and madmom to extract first audio feature information such as Onset and Chromagram from the audio information.
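A minimal sketch of this feature extraction step using librosa is shown below; the sampling rate and file path are illustrative, and madmom offers comparable onset and beat processors.

```python
import librosa

# hypothetical path to the first streaming audio segment
y, sr = librosa.load("first_segment.wav", sr=16000)

onset_env = librosa.onset.onset_strength(y=y, sr=sr)   # onset strength envelope, shape (frames,)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # chromagram, shape (12, frames)

print(onset_env.shape, chroma.shape)
```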
可选的，执行设备采用对比语言-图像预训练(Contrastive Language-Image Pre-Training,CLIP)模型中的文本编码器(Text Encoder)模块来提取文本特征信息。Optionally, the execution device uses the text encoder module in a Contrastive Language-Image Pre-Training (CLIP) model to extract the text feature information.
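Text feature extraction with a pretrained CLIP text encoder could, for example, go through the Hugging Face transformers wrappers as sketched below; the particular checkpoint is a common public choice and not required by the method, and the lyric string is a placeholder.

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["shine bright like a diamond"], padding=True, return_tensors="pt")
outputs = text_encoder(**inputs)

token_features = outputs.last_hidden_state   # (1, seq_len, hidden_size) per-token text features
sentence_feature = outputs.pooler_output     # (1, hidden_size) pooled text feature
```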
在执行设备从第一流式音频片段提取出第一音频特征信息和文本特征信息的基础上,一种可选的实现方式中,第一流式音频片段的特征信息还包括节奏特征信息。On the basis that the execution device extracts the first audio feature information and the text feature information from the first streaming audio segment, in an optional implementation manner, the feature information of the first streaming audio segment further includes rhythm feature information.
可选的,执行设备采用MLP网络从第一音频特征信息中预测出节奏特征信息。Optionally, the execution device uses an MLP network to predict the rhythm feature information from the first audio feature information.
可以理解的是,为了更好的生成与音频片段对应的面部表情信息和肢体动作信息,节奏的匹配至关重要。但是目前对于长度较短的音频片段,如小于1s的音频片段,很难提取出有效的节奏特征信息或者提取的准确率较低。本申请实施例通过采用MLP网络,实现音频片段与面部表情信息和肢体动作信息在节奏上的匹配,通过训练学习,对网络进行不断优化。It is understandable that in order to better generate facial expression information and body movement information corresponding to the audio clip, the matching of rhythm is crucial. However, for audio clips of shorter length, such as audio clips of less than 1s, it is difficult to extract effective rhythm feature information or the accuracy of extraction is low. The embodiment of the present application uses an MLP network to achieve the matching of audio clips with facial expression information and body movement information in rhythm, and continuously optimizes the network through training and learning.
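The MLP used to predict rhythm features from the first audio feature information can be as small as a few fully connected layers. The sketch below assumes illustrative input/output dimensions; the actual network and its sizes are learned and chosen during training as described.

```python
import torch
import torch.nn as nn

class RhythmMLP(nn.Module):
    """Predicts a rhythm feature vector from first audio features (e.g., onset/chroma statistics)."""
    def __init__(self, in_dim=128, hidden_dim=256, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, audio_feat):       # audio_feat: (batch, in_dim)
        return self.net(audio_feat)      # rhythm features: (batch, out_dim)

rhythm = RhythmMLP()(torch.randn(4, 128))
```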
另一种可选的实现方式中,第一流式音频片段的特征信息还包括音素特征信息。In another optional implementation, the feature information of the first streaming audio segment further includes phoneme feature information.
可选的,执行设备采用开源库和文本对齐工具从文本特征信息中提取出音素特征信息。Optionally, the execution device extracts phoneme feature information from text feature information using an open source library and a text alignment tool.
示例性的,执行设备可以采用Phonemizer等开源库提取音素信息,并利用对齐工具,例如MFA(Montreal Forced Aligner,强制对齐)等工具提取时间戳信息,实现音频和文本在时间维度上对齐。音素表示一个基本的发声单元。音素特征信息包括根据文本分解出的多个音素以及每一个音素对应的时长。For example, the execution device can use open source libraries such as Phonemizer to extract phoneme information, and use alignment tools such as MFA (Montreal Forced Aligner) to extract timestamp information to achieve alignment of audio and text in the time dimension. Phoneme represents a basic unit of sound production. Phoneme feature information includes multiple phonemes decomposed from the text and the duration of each phoneme.
可以理解的是,音素特征对于面部表情中的口型驱动效果具有重要作用。本申请实施例利用开源库从音频文本中提取出音素,并利用MFA工具实现文本与音频对齐,以获取每个音素对应的时间戳信息。It is understandable that phoneme features play an important role in the lip-driven effect of facial expressions. The embodiment of the present application uses an open source library to extract phonemes from audio text, and uses an MFA tool to align text and audio to obtain the timestamp information corresponding to each phoneme.
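A rough sketch of this phoneme extraction and alignment step: phonemizer yields the phoneme sequence, while the Montreal Forced Aligner (a separate command-line tool) produces per-phoneme timestamps. The exact MFA command and model names depend on the installed version, so the CLI line in the comment is an assumption.

```python
from phonemizer import phonemize

lyrics = "shine bright like a diamond"   # placeholder lyric text
phonemes = phonemize(lyrics, language="en-us", backend="espeak", strip=True)
print(phonemes)   # space-separated phoneme string for the lyric line

# Forced alignment is typically run offline via the MFA CLI (version-dependent), e.g.:
#   mfa align ./corpus_dir english_us_arpa english_us_arpa ./aligned_out
# which writes TextGrid files containing the start/end time of every phoneme.
```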
步骤503,执行设备获取第二流式音频片段生成的动作特征信息。Step 503: The execution device obtains motion feature information generated by the second streaming audio segment.
本申请实施例中,考虑到虚拟角色的肢体动作的幅度大小以及移动速度对面部表情有一定的影响,因此执行设备在生成第一融合特征信息时加入了动作特征信息。其中,执行设备获取的是第二流式音频片段生成的动作特征信息,以实现同一流式音频的不同音频片段在表演内容上的前后衔接。In the embodiment of the present application, considering that the amplitude and speed of the virtual character's body movements have a certain influence on the facial expression, the execution device adds the action feature information when generating the first fusion feature information. The execution device obtains the action feature information generated by the second streaming audio segment to achieve the connection between different audio segments of the same streaming audio in terms of performance content.
可以理解的是,第二流式音频片段可以是第一流式音频片段播放之间之前的任一音频片段,例如,第一流式音频片段的上一个音频片段等,可以根据实际需求设定,在此不进行限定。另外,在第一流式音频片段为当前音频的第一个音频片段时,可以采用随机、用户指定的方式或者历史数值设定动作特征向量。It is understandable that the second stream audio segment can be any audio segment before the first stream audio segment is played, for example, the previous audio segment of the first stream audio segment, etc., and can be set according to actual needs and is not limited here. In addition, when the first stream audio segment is the first audio segment of the current audio, the action feature vector can be set randomly, by user-specified method or by historical value.
步骤504,执行设备基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息。Step 504: The execution device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
In the embodiment of the present application, the execution device fuses the feature information of the first streaming audio segment with the action feature information generated from the second streaming audio segment to obtain the first fusion feature information.
In an optional implementation, a specific method for the execution device to obtain the first fusion feature information includes:
执行设备对第一流式音频片段的第一音频特征信息和节奏特征信息进行特征融合,得到第二音频特征信息;The execution device performs feature fusion on the first audio feature information and the rhythm feature information of the first streaming audio segment to obtain second audio feature information;
执行设备将文本特征信息、第二音频特征信息和动作特征信息进行拼接,并输入到第一神经网络中得到第一融合特征信息。The execution device concatenates the text feature information, the second audio feature information and the action feature information, and inputs the concatenated information into the first neural network to obtain the first fusion feature information.
该种实现方式中,示例性的,请参阅图6,图6为本申请实施例提供的生成第一融合特征信息的一种流程示意图。执行设备将第一音频特征信息和预测出的节奏特征信息进行特征融合,得到第二音频特征信息。然后,执行设备将文本特征信息、第二音频特征信息和动作特征信息进行拼接,输入到第一神经网络中得到第一融合特征信息。In this implementation, for example, please refer to FIG. 6, which is a flow chart of generating the first fused feature information provided in an embodiment of the present application. The execution device performs feature fusion on the first audio feature information and the predicted rhythm feature information to obtain the second audio feature information. Then, the execution device splices the text feature information, the second audio feature information and the action feature information, and inputs them into the first neural network to obtain the first fused feature information.
Optionally, the execution device concatenates the text feature information, the second audio feature information, and the action feature information, inputs the concatenated information into a multi-layer Transformer network, and outputs the first fusion feature information.
应理解,在实际应用过程中,第一神经网络的类型可以根据实际需求设定,在此仅为举例说明,而不进行限定。It should be understood that in actual application, the type of the first neural network can be set according to actual needs, which is only used as an example and not limited here.
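As a rough illustration of this fusion step, the sketch below concatenates text, audio and action feature vectors and passes them through a small Transformer encoder standing in for the first neural network; all dimensions and the module name FusionNet are assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Illustrative fusion of text, (rhythm-enhanced) audio and action features."""
    def __init__(self, text_dim=128, audio_dim=128, action_dim=128, model_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim + audio_dim + action_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_feat, audio_feat, action_feat):
        # Each input: (batch, seq_len, dim); concatenate along the feature axis.
        x = torch.cat([text_feat, audio_feat, action_feat], dim=-1)
        x = self.proj(x)
        return self.encoder(x)  # fused feature information, (batch, seq_len, model_dim)

fused = FusionNet()(torch.randn(1, 25, 128), torch.randn(1, 25, 128), torch.randn(1, 25, 128))
```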
在执行完步骤501至步骤504后,基于生成的内容不同,可以存在多种实现方式,下面分别以执行设备生成虚拟角色的面部表情信息和肢体动作信息为例,对接下来的步骤进行说明。应理解,在实际应用时,执行设备同步输出虚拟角色的面部表情信息和肢体动作信息,以驱动虚拟角色实现边唱边跳的效果。执行设备并行执行步骤505-510与步骤511-517。After executing steps 501 to 504, there may be multiple implementations based on different generated contents. The following takes the execution device generating the facial expression information and body movement information of the virtual character as an example to illustrate the following steps. It should be understood that in actual application, the execution device synchronously outputs the facial expression information and body movement information of the virtual character to drive the virtual character to achieve the effect of singing and dancing. The execution device executes steps 505-510 and steps 511-517 in parallel.
(a)生成虚拟角色的面部表情信息(a) Generate facial expression information of virtual characters
步骤505,执行设备获取多个虚拟角色在表情上的第一关联特征信息,第一关联特征信息基于第二流式音频片段生成。Step 505: The execution device obtains first associated feature information on expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment.
In the embodiment of the present application, the execution device obtains the first associated feature information on the expressions of the multiple virtual characters generated based on the second streaming audio segment, so as to generate, based on the first associated feature information, the facial expression information corresponding to the first streaming audio segment. In this way, while generating the facial expression information corresponding to each virtual character, the facial expression information of the other virtual characters is taken into account, so that the multiple virtual characters cooperate and coordinate effectively in their facial expressions.
It is understandable that when the first streaming audio segment is the first obtained audio segment, or when the method is executed for the first time, the first associated feature information defaults to the feature information corresponding to an initial expression. The initial expression may be set according to actual needs, for example, a default neutral (expressionless) face, which is not limited here.
步骤506,执行设备基于第一融合特征信息、音素特征信息和第一关联特征信息,得到至少一个虚拟角色对应的基础表情的权重信息。Step 506: The execution device obtains weight information of a basic expression corresponding to at least one virtual character based on the first fusion feature information, the phoneme feature information and the first associated feature information.
本申请实施例中,执行设备基于第一融合特征信息,从文本特征信息中提取出的音素特征信息和第二流式音频片段生成的第一关联特征信息,生成至少一个虚拟角色对应的基础表情的权重信息。In an embodiment of the present application, the execution device generates weight information of a basic expression corresponding to at least one virtual character based on the first fused feature information, the phoneme feature information extracted from the text feature information, and the first associated feature information generated from the second streaming audio segment.
一种可选的实现方式中,执行设备将第一融合特征信息、音素特征信息和第一关联特征信息输入到多层的AcLSTM网络中,通过自回归的方式输出至少一个虚拟角色对应的基础表情的权重信息。In an optional implementation, the execution device inputs the first fusion feature information, the phoneme feature information and the first association feature information into a multi-layer AcLSTM network, and outputs weight information of a basic expression corresponding to at least one virtual character by autoregression.
In this implementation, the execution device may output the weight information of the basic expressions corresponding to the multiple virtual characters in an autoregressive manner. That is, when outputting the weight information of the basic expressions corresponding to the multiple virtual characters for the first streaming audio segment, the execution device uses the first associated feature information generated from the second streaming audio segment, where the second streaming audio segment is the immediately preceding audio segment of the first streaming audio segment or an earlier audio segment. Correspondingly, when outputting the weight information of the basic expressions corresponding to the multiple virtual characters for a third streaming audio segment, the execution device uses the second associated feature information generated from the first streaming audio segment, where the third streaming audio segment is the next audio segment of the first streaming audio segment or a later audio segment. In this way, the execution device re-iterates, as input, the associated feature information output before the playback time of the current streaming audio segment, so as to further exploit the learned temporal correlation of facial expressions, maintain the temporal continuity of the facial expressions, and avoid sudden changes and jitter in the facial expressions.
可以理解的是,相比于LSTM网络,AcLSTM(Auto-Conditioned LSTM)网络能够更好地处理长序列的生成问题,即能够避免或者尽量减少在生成长时间的面部表情序列时,出现处理时间越长生成的面部表情不再发生变化或者变化较少的情况。It is understandable that compared with the LSTM network, the AcLSTM (Auto-Conditioned LSTM) network can better handle the problem of generating long sequences, that is, it can avoid or minimize the situation where the generated facial expressions no longer change or change less as the processing time increases when generating long facial expression sequences.
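To make the autoregressive generation above concrete, the following is a minimal sketch in which an LSTM feeds its previously generated blendshape weights back as part of the next input together with the fused conditioning features; this is only a simplified stand-in for the multi-layer AcLSTM, and the class name, dimensions, and the 52-dimensional weight vector are assumptions for the example.

```python
import torch
import torch.nn as nn

class AutoRegressiveExpressionLSTM(nn.Module):
    """Illustrative autoregressive LSTM that emits basic-expression weights step by step."""
    def __init__(self, cond_dim=256, expr_dim=52, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(cond_dim + expr_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, cond_seq, init_expr):
        # cond_seq: (batch, steps, cond_dim) fused conditioning features
        # init_expr: (batch, expr_dim) weights from the previous audio segment
        outputs, state, prev = [], None, init_expr
        for t in range(cond_seq.size(1)):
            x = torch.cat([cond_seq[:, t], prev], dim=-1).unsqueeze(1)
            h, state = self.lstm(x, state)
            prev = torch.sigmoid(self.head(h[:, 0]))  # weights kept in [0, 1]
            outputs.append(prev)
        return torch.stack(outputs, dim=1)  # (batch, steps, expr_dim)

weights = AutoRegressiveExpressionLSTM()(torch.randn(1, 10, 256), torch.zeros(1, 52))
```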
步骤507,执行设备基于至少一个虚拟角色对应的基础表情的权重信息,对至少一个虚拟角色对应的基础表情进行调整,得到至少一个虚拟角色对应的面部表情信息。Step 507: The execution device adjusts the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character to obtain facial expression information corresponding to the at least one virtual character.
In the embodiment of the present application, the execution device adjusts the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character, so as to fine-tune the details of the basic expression of the virtual character and obtain facial expression information matching the first streaming audio segment.
一种可选的实现方式中,执行设备获取至少一个虚拟角色对应的基础表情信息,并采用加权求和的方式得到至少一个虚拟角色对应的面部表情信息。In an optional implementation, the execution device obtains basic expression information corresponding to at least one virtual character, and obtains facial expression information corresponding to the at least one virtual character by weighted summation.
该种实现方式中,至少一个虚拟角色对应的基础表情信息可以理解为每个虚拟角色的一组基础的表情基。比如面部左眼往上翻,右眼往上翻等,都可以对应于一个表情基,每个基础表情基可表示虚拟角色在特定状态下的面部网格(Mesh)顶点的位置,每个顶点可以为三维坐标。In this implementation, the basic expression information corresponding to at least one virtual character can be understood as a set of basic expression bases for each virtual character. For example, the left eye rolling up, the right eye rolling up, etc. can correspond to an expression base, and each basic expression base can represent the position of the vertex of the facial mesh of the virtual character in a specific state, and each vertex can be a three-dimensional coordinate.
Exemplarily, the formula used by the execution device to generate the facial expression information corresponding to the at least one virtual character may be:

F_i = b_0 + \sum_{l=1}^{N} e_l \cdot b_l

where F_i denotes the facial expression information corresponding to virtual character i, N denotes the number (or dimension) of expression bases, b_l denotes the basic expression information corresponding to expression base l, e_l denotes the weight information of the basic expression corresponding to expression base l, and b_0 denotes the base expression, for example, the default neutral (expressionless) face.
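For illustration, this weighted blendshape combination can be computed directly as in the sketch below; the number of mesh vertices and the number of expression bases are placeholder values.

```python
import numpy as np

# Illustrative sizes: V mesh vertices (3D coordinates), N expression bases.
V, N = 5000, 52
b0 = np.zeros((V, 3))                 # base (neutral) expression mesh b_0
bases = np.random.randn(N, V, 3)      # basic expression information b_l
weights = np.random.rand(N)           # predicted weights e_l in [0, 1]

# F_i = b_0 + sum_l e_l * b_l
face = b0 + np.tensordot(weights, bases, axes=1)   # (V, 3) facial expression for character i
```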
可以理解的是,虽然执行设备在输出虚拟角色对应的面部表情信息过程中采用的是多个虚拟角色之间的特征信息,但是,在最终输出虚拟角色对应的面部表情信息时,可以根据实际需求设定最终展示的虚拟角色的个数,展示一个或者多个虚拟角色对应的面部表情信息。It is understandable that although the execution device uses feature information between multiple virtual characters in the process of outputting facial expression information corresponding to the virtual characters, when finally outputting the facial expression information corresponding to the virtual characters, the number of virtual characters finally displayed can be set according to actual needs to display facial expression information corresponding to one or more virtual characters.
进一步的,在执行设备得到至少一个虚拟角色对应的面部表情信息后,在基于重定向等方式驱动虚拟角色的面部表情之前,一种可选的实现方式中,执行设备可以通过获取标签信息,并基于标签信息对每个虚拟角色各自对应的面部表情信息进行调整,得到调整后的每个虚拟角色各自对应的面部表情信息。其中,标签信息用于指示虚拟角色的面部表情信息。Furthermore, after the execution device obtains the facial expression information corresponding to at least one virtual character, before driving the facial expression of the virtual character based on redirection or other methods, in an optional implementation, the execution device may obtain tag information and adjust the facial expression information corresponding to each virtual character based on the tag information to obtain the adjusted facial expression information corresponding to each virtual character. The tag information is used to indicate the facial expression information of the virtual character.
该种实现方式中,执行设备可以采用风格化处理模块根据用户指定的标签信息对生成的虚拟角色的面部表情信息进行风格化处理,生成多种风格的面部表情信息,以适配不同类型的虚拟角色。In this implementation, the execution device may use a stylization processing module to perform stylization processing on the facial expression information of the generated virtual character according to the tag information specified by the user, and generate facial expression information of multiple styles to adapt to different types of virtual characters.
Optionally, the stylization processing module may adopt an ERD (Encoder-RNN-Decoder) architecture. The Encoder network removes the style from the input facial expression information of the multiple virtual characters, retaining only the expression content. Then, the output of the RNN network branch corresponding to the tag information of a specific style is combined with the output of the first RNN network to obtain facial expression features with the specific style, and finally the Decoder network decodes the stylized facial expression information.
示例性的,请参阅图7,图7为本申请实施例提供的风格化处理模块的一种结构示意图。风格化处理模块包含多个风格化标签对应的分支,例如自豪、开心等,以及至少一个第一RNN网络(用以表示无风格的中性的运动序列特征),执行设备通过将特定风格化标签的RNN网络分支的输出与第一RNN网络分支的输出相加,得到具有特定风格的面部表情信息。假设生成开心加自豪的风格,执行设备可以将开心对应的风格化标签的RNN网络分支的输出与第一RNN网络分支的输出相加,然后将自豪对应的风格化标签的RNN网络分支的输出与第一RNN网络分支的输出相加,最后,两个相加后的结果取平均,即可得到具有开心和自豪对应的风格的面部表情信息。Exemplarily, please refer to FIG. 7, which is a schematic diagram of the structure of a stylized processing module provided in an embodiment of the present application. The stylized processing module includes branches corresponding to multiple stylized labels, such as pride, happiness, etc., and at least one first RNN network (used to represent neutral motion sequence features without style). The execution device obtains facial expression information with a specific style by adding the output of the RNN network branch of the specific stylized label to the output of the first RNN network branch. Assuming that a style of happiness plus pride is generated, the execution device can add the output of the RNN network branch of the stylized label corresponding to happiness to the output of the first RNN network branch, and then add the output of the RNN network branch of the stylized label corresponding to pride to the output of the first RNN network branch. Finally, the two added results are averaged to obtain facial expression information with styles corresponding to happiness and pride.
可以理解的是,第一RNN网络可以理解为默认的RNN网络,或者无风格的基础RNN网络。It can be understood that the first RNN network can be understood as a default RNN network, or a basic RNN network without style.
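The sketch below illustrates the general shape of such an Encoder-RNN-Decoder pipeline, with a neutral RNN branch and per-style RNN branches whose outputs are added to it and averaged when several style tags are selected; the module names, sizes, and the use of GRU cells are assumptions for the example rather than the concrete structure of FIG. 7.

```python
import torch
import torch.nn as nn

class StylizationERD(nn.Module):
    """Illustrative ERD module: de-style, add style-branch features, decode."""
    def __init__(self, feat_dim=64, hidden_dim=128, styles=("happy", "proud")):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)                 # keeps only the content part
        self.neutral_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.style_rnns = nn.ModuleDict(
            {s: nn.GRU(hidden_dim, hidden_dim, batch_first=True) for s in styles}
        )
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, expr_seq, selected_styles):
        # expr_seq: (batch, steps, feat_dim); selected_styles: e.g. ["happy", "proud"]
        content = self.encoder(expr_seq)
        neutral, _ = self.neutral_rnn(content)                          # style-free branch
        styled = [neutral + self.style_rnns[s](content)[0] for s in selected_styles]
        mixed = torch.stack(styled).mean(dim=0)                         # average over chosen tags
        return self.decoder(mixed)                                      # stylized expression information

out = StylizationERD()(torch.randn(1, 10, 64), ["happy", "proud"])
```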
步骤508,执行设备基于多个虚拟角色对应的面部表情信息,提取多个虚拟角色对应的表情特征信息。Step 508: The execution device extracts expression feature information corresponding to the multiple virtual characters based on the facial expression information corresponding to the multiple virtual characters.
本申请实施例中,执行设备在获取到第一流式音频片段对应的多个虚拟角色对应的面部表情信息后,以自回归的方式将多个虚拟角色对应的面部表情信息作为输入,输出多个虚拟角色对应的表情特征信息。In an embodiment of the present application, after acquiring the facial expression information corresponding to multiple virtual characters corresponding to the first streaming audio segment, the execution device uses the facial expression information corresponding to the multiple virtual characters as input in an autoregressive manner and outputs expression feature information corresponding to the multiple virtual characters.
一种可选的实现方式中,执行设备将多个虚拟角色对应的面部表情信息输入MLP网络中,提取出每个虚拟角色对应的表情特征信息。In an optional implementation, the execution device inputs facial expression information corresponding to multiple virtual characters into the MLP network, and extracts expression feature information corresponding to each virtual character.
可以理解的是,在实际应用过程中,可以根据实际需求选择具体的网络模型进行表情特征信息的提取,例如,Transformer网络等,在此仅为其中一种示例说明,而不进行限定。It is understandable that in actual application, a specific network model can be selected according to actual needs to extract expression feature information, such as a Transformer network, etc. This is only an example and not limited to one example.
步骤509,执行设备基于多个虚拟角色对应的表情特征信息,计算得到多个虚拟角色之间的表情相似度。Step 509: The execution device calculates the similarity of expressions between the multiple virtual characters based on the expression feature information corresponding to the multiple virtual characters.
In the embodiment of the present application, after the execution device extracts the expression feature information corresponding to the multiple virtual characters based on the facial expression information corresponding to the multiple virtual characters, it may calculate the similarity in expression between each virtual character and the other virtual characters among the multiple virtual characters, so as to obtain the associated feature information on expressions between each virtual character and the other virtual characters.
一种可选的实现方式中,执行设备计算每个虚拟角色的表情特征信息与其他虚拟角色的表情特征信息之间的余弦相似度,作为多个虚拟角色之间在表情上的相似度。In an optional implementation, the execution device calculates the cosine similarity between the expression feature information of each virtual character and the expression feature information of other virtual characters as the similarity in expression between the multiple virtual characters.
步骤510,执行设备以多个虚拟角色之间的表情相似度作为权重,对每个虚拟角色对应的表情特征信息进行调整,得到多个虚拟角色在表情上的第二关联特征信息。Step 510: The execution device uses the similarity of expressions between the multiple virtual characters as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters.
In the embodiment of the present application, after calculating the cosine similarity between the expression feature information k_i of each virtual character i and the expression feature information k_j of every other virtual character j, the execution device uses the cosine similarity as a weight coefficient and performs a weighted sum over the expression feature information k_j of the other virtual characters j to obtain a cross feature. Finally, the expression feature information k_i of the current virtual character i and the cross feature are added element-wise to obtain the second associated feature information f_i of virtual character i in terms of expression.
In an optional implementation, the execution device may obtain the second associated feature information f_i of virtual character i in terms of expression according to the following calculation formula:

f_i = k_i + \sum_{j=1, j \neq i}^{n} \mathrm{cosine\_sim}(k_i, k_j) \cdot k_j

where f_i denotes the second associated feature information of virtual character i with respect to the other virtual characters j in terms of expression, k_i denotes the expression feature information corresponding to virtual character i, n denotes the number of virtual characters, k_j denotes the expression feature information corresponding to virtual character j, and cosine_sim() is used to calculate the cosine similarity between virtual character i and virtual character j in terms of expression.
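For illustration, this cross-feature computation can be written compactly as below (steps 514 and 515 later apply the same scheme to the action features); the number of characters and the feature dimension are placeholders.

```python
import torch
import torch.nn.functional as F

def associated_features(k):
    """k: (n, d) per-character feature vectors; returns (n, d) associated features f_i."""
    sim = F.cosine_similarity(k.unsqueeze(1), k.unsqueeze(0), dim=-1)  # (n, n) pairwise cosine
    sim = sim - torch.diag(torch.diagonal(sim))                        # drop the j == i term
    cross = sim @ k                                                    # similarity-weighted sum of the others
    return k + cross                                                   # element-wise addition

f = associated_features(torch.randn(4, 64))   # e.g. 4 virtual characters, 64-dim expression features
```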
为了便于说明生成虚拟角色的面部表情信息的过程。示例性的,请参阅图8,图8为本申请实施例提供的生成虚拟角色的面部表情信息的一种流程示意图。具体的执行步骤请参见步骤505至步骤510。For the convenience of explaining the process of generating facial expression information of a virtual character, please refer to FIG8 for example, which is a flow chart of generating facial expression information of a virtual character provided by an embodiment of the present application. Please refer to steps 505 to 510 for specific execution steps.
可以理解的是,图8所示的结构图仅为本申请实施例在生成虚拟角色的面部表情信息时的其中一种示例结构说明,在此不进行限定。It can be understood that the structural diagram shown in FIG. 8 is only an example structural description of one embodiment of the present application when generating facial expression information of a virtual character, and is not limited here.
(b)生成虚拟角色的肢体动作信息(b) Generate virtual character’s body movement information
步骤511,执行设备获取第二流式音频片段生成的第三关联特征信息和动作特征信息。Step 511: The execution device obtains third associated feature information and action feature information generated by the second streaming audio segment.
本申请实施例中,执行设备获取基于第二流式音频片段生成的多个虚拟角色在动作上的第三关联特征信息和动作特征信息,以基于该第三关联特征信息和动作特征信息,生成第一流式音频片段对应的肢体动作信息,从而使得执行设备在生成每个虚拟角色对应的肢体动作信息的同时,能够参考其他虚拟角色的肢体动作信息,实现了多个虚拟角色在肢体动作上的有效协作与配合。In an embodiment of the present application, the execution device obtains third-association feature information and action feature information on the actions of multiple virtual characters generated based on the second streaming audio segment, and generates body movement information corresponding to the first streaming audio segment based on the third-association feature information and action feature information, so that the execution device can refer to the body movement information of other virtual characters while generating the body movement information corresponding to each virtual character, thereby achieving effective collaboration and coordination of multiple virtual characters in body movements.
It is understandable that when the second streaming audio segment is the first audio segment, or when the method is executed for the first time, the third associated feature information defaults to the feature information corresponding to an initial action. The initial action may be set according to actual needs, which is not limited here.
步骤512,执行设备基于第一融合特征信息、第三关联特征信息和动作特征信息,得到多个虚拟角色对应的动作编码信息。Step 512: The execution device obtains action coding information corresponding to the multiple virtual characters based on the first fusion feature information, the third association feature information and the action feature information.
本申请实施例中,执行设备基于第一融合特征信息,第二流式音频片段生成的第三关联特征信息,以及第二流式音频片段生成的动作特征信息,得到多个虚拟角色对应的动作编码信息。In an embodiment of the present application, the execution device obtains action coding information corresponding to multiple virtual characters based on the first fusion feature information, the third associated feature information generated by the second streaming audio segment, and the action feature information generated by the second streaming audio segment.
It is understandable that the execution device may output the action coding information corresponding to the multiple virtual characters in an autoregressive manner. That is, when outputting the action coding information corresponding to the multiple virtual characters for the first streaming audio segment, the execution device uses the third associated feature information and the action feature information generated from the second streaming audio segment, where the second streaming audio segment is the immediately preceding audio segment of the first streaming audio segment or an earlier audio segment. Correspondingly, when outputting the action coding information corresponding to the multiple virtual characters for a third streaming audio segment, the execution device uses the fourth associated feature information and the action feature information generated from the first streaming audio segment, where the third streaming audio segment is the next audio segment of the first streaming audio segment or a later audio segment. In this way, the execution device re-iterates, as input, the associated feature information output before the playback time of the current streaming audio segment, so as to further exploit the learned temporal correlation of body movements, which helps maintain the temporal continuity of the body movements and reduce incoherent body movements.
可选的,执行设备将第一融合特征信息、第三关联特征信息和第二流式音频片段生成的动作特征信息输入到多层的AcLSTM网络中,通过自回归的方式输出得到多个虚拟角色对应的动作编码信息。Optionally, the execution device inputs the first fused feature information, the third associated feature information, and the action feature information generated by the second streaming audio segment into a multi-layer AcLSTM network, and outputs action encoding information corresponding to multiple virtual characters by autoregression.
It is understandable that, compared with an LSTM network, the AcLSTM (Auto-Conditioned LSTM) network can better handle the generation of long sequences, that is, it can avoid or minimize the situation in which, when generating a long body movement sequence, the generated body movements no longer change or change only slightly as the processing time increases.
步骤513,执行设备基于动作编码信息得到与多个虚拟角色的动作编码信息对应的动作特征信息。Step 513: The execution device obtains action feature information corresponding to the action coding information of the plurality of virtual characters based on the action coding information.
本申请实施例中,执行设备基于动作编码信息与动作特征信息之间的对应关系,得到多个虚拟角色的动作编码信息对应的动作特征信息。In the embodiment of the present application, the execution device obtains the action feature information corresponding to the action coding information of multiple virtual characters based on the correspondence between the action coding information and the action feature information.
可以理解的是,每个动作编码信息对应于代码簿(Code Book)中的一个动作特征信息,用于表示多帧的全身动作序列。执行设备可以基于输入的多个虚拟角色的动作编码信息,直接从Code Book中获取对应的多个虚拟角色的动作特征信息。It can be understood that each action encoding information corresponds to an action feature information in the code book (Code Book), which is used to represent a multi-frame full-body action sequence. The execution device can directly obtain the corresponding action feature information of multiple virtual characters from the Code Book based on the action encoding information of multiple virtual characters input.
相比于直接输入每个虚拟角色的位姿信息,包括关节点的位置坐标、旋转角度以及速度等,来生成每个虚拟角色对应的动作特征信息,输入的信息量较多,运行效率较低。执行设备可以预先对每个关节点的位置坐标、旋转角度以及速度等进行编码,得到对应的动作编码信息。在实际操作时,执行设备仅需获取每个虚拟角色对应的动作编码信息,并基于动作编码信息与动作特征信息之间的对应关系,即可获取到每个虚拟角色对应的动作特征信息,从而减少了信息量的处理,提高了运行效率。Compared with directly inputting the posture information of each virtual character, including the position coordinates, rotation angle and speed of the joint points, to generate the action feature information corresponding to each virtual character, the amount of input information is large and the operation efficiency is low. The execution device can pre-encode the position coordinates, rotation angle and speed of each joint point to obtain the corresponding action coding information. In actual operation, the execution device only needs to obtain the action coding information corresponding to each virtual character, and based on the correspondence between the action coding information and the action feature information, the action feature information corresponding to each virtual character can be obtained, thereby reducing the amount of information processing and improving the operation efficiency.
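As a minimal illustration of this lookup, the Code Book can be held as an embedding table indexed by the action coding information, as in the sketch below; the codebook size and feature dimension are assumptions for the example.

```python
import torch
import torch.nn as nn

# Illustrative Code Book: 512 action codes, each mapping to a 256-dim action feature
# that stands for a multi-frame full-body motion chunk.
codebook = nn.Embedding(num_embeddings=512, embedding_dim=256)

action_codes = torch.tensor([[17, 204, 43]])   # action coding information for one character
action_features = codebook(action_codes)       # (1, 3, 256) action feature information
```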
步骤514,执行设备基于第一流式音频片段生成的动作特征信息,计算得到多个虚拟角色之间的动作相似度。Step 514: The execution device calculates the action similarity between the multiple virtual characters based on the action feature information generated by the first streaming audio segment.
本申请实施例中,执行设备在基于多个虚拟角色对应的动作编码信息提取出多个虚拟角色对应的动作特征信息后,可以通过计算多个虚拟角色中每个虚拟角色与其他虚拟角色之间在动作上的相似度,来得到每个虚拟角色与其他虚拟角色之间在动作上的关联特征信息。In an embodiment of the present application, after the execution device extracts the action feature information corresponding to multiple virtual characters based on the action coding information corresponding to the multiple virtual characters, it can obtain the associated feature information on the actions between each virtual character and other virtual characters by calculating the similarity in actions between each virtual character and other virtual characters in the multiple virtual characters.
一种可选的实现方式中,执行设备计算每个虚拟角色的动作特征信息与其他虚拟角色的动作特征信息之间的余弦相似度,作为多个虚拟角色之间在动作上的相似度。In an optional implementation, the execution device calculates the cosine similarity between the action feature information of each virtual character and the action feature information of other virtual characters as the action similarity between the multiple virtual characters.
步骤515,执行设备以多个虚拟角色之间的动作相似度作为权重,对每个虚拟角色的对应的动作特征信息进行调整,得到多个虚拟角色在动作上的第四关联特征信息。Step 515: The execution device uses the action similarities between the multiple virtual characters as weights to adjust the corresponding action feature information of each virtual character to obtain fourth associated feature information on the actions of the multiple virtual characters.
In the embodiment of the present application, after calculating the cosine similarity between the action feature information p_i of each virtual character i and the action feature information p_j of every other virtual character j, the execution device uses the cosine similarity as a weight coefficient and performs a weighted sum over the action feature information p_j of the other virtual characters j to obtain a cross feature. Finally, the action feature information p_i of the current virtual character i and the cross feature are added element-wise to obtain the fourth associated feature information f′_i of virtual character i in terms of action.
In an optional implementation, the execution device may obtain the fourth associated feature information f′_i of virtual character i in terms of action according to the following calculation formula:

f′_i = p_i + \sum_{j=1, j \neq i}^{n} \mathrm{cosine\_sim}(p_i, p_j) \cdot p_j

where f′_i denotes the fourth associated feature information of virtual character i with respect to the other virtual characters in terms of action, p_i denotes the action feature information corresponding to virtual character i, n denotes the number of virtual characters, p_j denotes the action feature information corresponding to virtual character j, and cosine_sim() is used to calculate the cosine similarity between virtual character i and virtual character j in terms of action.
步骤516,执行设备对第一融合特征信息进行细节调整,得到调整后的第一融合特征信息。Step 516: The execution device performs detail adjustment on the first fused feature information to obtain adjusted first fused feature information.
本申请实施例中,由于第一融合特征是基于第一流式音频片段的特征信息和第二流式音频片段生成的动作特征信息得到的,执行设备通过对第一融合特征中包含的动作特征进行细节调整,得到调整后的第一融合特征信息。In an embodiment of the present application, since the first fusion feature is obtained based on the feature information of the first streaming audio segment and the action feature information generated by the second streaming audio segment, the execution device obtains the adjusted first fusion feature information by making detail adjustments to the action feature included in the first fusion feature.
In the related art, a Code Book obtained through model training suffers from a clustering effect, that is, similar actions are grouped under the same action coding information, so that each action coding information encodes the average of a group of similar actions. As a result, each feature vector in the Code Book lacks action details. The embodiment of the present application separately uses an action fine-tuning network (for example, an MLP network) to generate offsets for the body movements, so as to produce more action details.
一种可选的实现方式中,执行设备采用MLP网络对第一融合特征信息进行细节调整。In an optional implementation, the execution device uses an MLP network to perform detail adjustments on the first fused feature information.
可以理解的是,在实际应用过程中,执行设备也可以采用其他网络模型对第一融合特征信息进行细节调整,此处仅为其中一种示例说明,而不进行限定。It is understandable that, in actual application, the execution device may also use other network models to make detailed adjustments to the first fusion feature information. This is only an example and is not limited to this.
步骤517,执行设备以调整后的第一融合特征信息和第三关联特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到至少一个虚拟角色对应的肢体动作信息。Step 517: The execution device adjusts the action feature information generated by the first streaming audio segment using the adjusted first fusion feature information and the third associated feature information as offsets to obtain body action information corresponding to at least one virtual character.
In the embodiment of the present application, after the execution device obtains, from the Code Book, the action feature information generated for the first streaming audio segment, on the one hand, for the obtained action feature information, the execution device calculates the fourth associated feature of each virtual character in a manner similar to the extraction of the second associated feature, and uses it as input information for generating the action coding information corresponding to the multiple virtual characters; on the other hand, the execution device uses the adjusted first fusion feature information and the third associated feature information as offsets to adjust the action feature information generated for the first streaming audio segment, so as to obtain the body movement information corresponding to the at least one virtual character.
可以理解的是,执行设备在得到第一融合特征信息后,一方面,执行设备会对第一融合特征信息进行细节调整,另一方面,执行设备会将基于第一融合特征信息、所述第三关联特征信息和第二流式音频片段生成的动作特征信息,得到多个虚拟角色对应的动作编码信息。而此处的第一流式音频片段生成的动作特征信息则是根据多个虚拟角色对应的动作编码信息得到的。It is understandable that after obtaining the first fused feature information, the execution device will, on the one hand, make detailed adjustments to the first fused feature information, and on the other hand, obtain the action coding information corresponding to the multiple virtual characters based on the action feature information generated based on the first fused feature information, the third associated feature information and the second streaming audio segment. The action feature information generated by the first streaming audio segment here is obtained based on the action coding information corresponding to the multiple virtual characters.
一种可选的实现方式中,执行设备将调整后的第一融合特征信息、第三关联特征信息以及第一流式音频片段生成的动作特征信息分别输入到多人动作解码器中,以调整后的第一融合特征信息和第三关联特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到至少一个虚拟角色对应的肢体动作信息。In an optional implementation, the execution device inputs the adjusted first fusion feature information, the third associated feature information, and the action feature information generated by the first streaming audio segment into a multi-person action decoder respectively, and uses the adjusted first fusion feature information and the third associated feature information as offsets to adjust the action feature information generated by the first streaming audio segment to obtain the body movement information corresponding to at least one virtual character.
该种实现方式中,执行设备将第一融合特征信息、第三关联特征信息以及第一流式音频片段生成的动作特征信息分别输入到多人动作解码器中,使得生成的肢体动作信息不仅可以包含更多的细节而且能够兼顾到其他虚拟角色的肢体动作。示例性的,多人动作解码器主要由三个Decoder组成。请参阅图9,图9为本申请实施例提供的多人动作解码器的一种结构示意图。多人动作解码器输出的是多个虚拟角色的肢体动作信息,即多个虚拟角色的关节点位姿信息(包括关节点的位置、旋转角度、速度等信息)。在生成多个虚拟角色的肢体动作信息的基础上,可以根据实际需求,展示至少一个虚拟角色对应的肢体动作信息。In this implementation, the execution device inputs the first fused feature information, the third associated feature information, and the action feature information generated by the first streaming audio clip into the multi-person action decoder, respectively, so that the generated body movement information can not only contain more details but also take into account the body movements of other virtual characters. Exemplarily, the multi-person action decoder is mainly composed of three Decoders. Please refer to Figure 9, which is a structural diagram of a multi-person action decoder provided in an embodiment of the present application. The multi-person action decoder outputs the body movement information of multiple virtual characters, that is, the joint point posture information of multiple virtual characters (including the position, rotation angle, speed and other information of the joint point). On the basis of generating the body movement information of multiple virtual characters, the body movement information corresponding to at least one virtual character can be displayed according to actual needs.
可以理解的是,多人动作解码器可以采用MLP网络、卷积网络等神经网络,具体可以根据实际需求进行设定,在此不进行限定。It is understandable that the multi-person action decoder can adopt neural networks such as MLP networks and convolutional networks, which can be set according to actual needs and are not limited here.
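A rough sketch of such a decoder is given below: the base action features retrieved from the Code Book are adjusted by two offset branches (from the adjusted first fusion feature information and the associated feature information) and then decoded into joint poses; the exact three-decoder split of FIG. 9 is not reproduced, and all module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiPersonMotionDecoder(nn.Module):
    """Illustrative decoder: base motion features + two offset branches -> joint poses."""
    def __init__(self, feat_dim=256, pose_dim=24 * 6):
        super().__init__()
        self.fusion_offset = nn.Linear(feat_dim, feat_dim)   # offset from adjusted fused features
        self.assoc_offset = nn.Linear(feat_dim, feat_dim)    # offset from associated features
        self.pose_decoder = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, pose_dim)
        )

    def forward(self, base_motion, fused, assoc):
        # Each input: (characters, frames, feat_dim); offsets adjust the base motion features.
        adjusted = base_motion + self.fusion_offset(fused) + self.assoc_offset(assoc)
        return self.pose_decoder(adjusted)   # per-character, per-frame joint pose parameters

poses = MultiPersonMotionDecoder()(
    torch.randn(2, 30, 256), torch.randn(2, 30, 256), torch.randn(2, 30, 256)
)
```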
In another optional implementation, the execution device obtains tag information, and adjusts the body movement information corresponding to each virtual character based on the tag information, so as to obtain the adjusted body movement information corresponding to each virtual character. The tag information is used to indicate the body movement information of the virtual character.
该种实现方式中,执行设备可以采用风格化处理模块根据用户指定的标签信息对生成的虚拟角色的肢体动作信息进行风格化处理,生成多种风格的肢体动作信息,以适配不同类型的虚拟角色。In this implementation, the execution device may use a stylization processing module to perform stylization processing on the generated virtual character's body movement information according to the tag information specified by the user, and generate body movement information of various styles to adapt to different types of virtual characters.
Optionally, the stylization processing module may adopt an ERD (Encoder-RNN-Decoder) architecture. The Encoder network removes the style from the input body movement information of the multiple virtual characters, retaining only the motion content. Then, the output of the RNN network branch corresponding to the tag information of a specific style is combined with the output of the first RNN network to obtain motion sequence features with the specific style, and finally the Decoder network decodes the stylized body movement information. The stylization processing module includes branches corresponding to multiple style tags, for example, elderly, young adult, child, and the like, and at least one first RNN network (used to represent neutral, style-free motion sequence features). The execution device adds the output of the RNN network branch of a specific style tag to the output of the first RNN network branch to obtain stylized body movement information with the specific style. For the specific structure of the stylization processing module, reference may be made to FIG. 7.
可以理解的是,第一RNN网络可以理解为默认的RNN网络,或者无风格的基础RNN网络。It can be understood that the first RNN network can be understood as a default RNN network, or a basic RNN network without style.
需要说明的是,步骤514至步骤515与步骤516至步骤517的先后顺序不做要求。It should be noted that there is no requirement for the order of step 514 to step 515 and step 516 to step 517.
为了便于说明生成虚拟角色的肢体动作信息的过程。示例性的,请参阅图10,图10为本申请实施例提供的生成虚拟角色的肢体动作信息的一种流程示意图。具体的执行步骤请参见步骤511至步骤517。For the convenience of explaining the process of generating the body movement information of the virtual character, please refer to FIG. 10 for example, which is a flow chart of generating the body movement information of the virtual character provided by the embodiment of the present application. Please refer to steps 511 to 517 for the specific execution steps.
可以理解的是,图10所示的结构图仅为本申请实施例在生成虚拟角色的肢体动作信息时的其中一种示例结构说明,在此不进行限定。It can be understood that the structural diagram shown in FIG. 10 is only an example structural description of one embodiment of the present application when generating the body movement information of a virtual character, and is not limited here.
(二)单个虚拟角色场景(II) Single virtual character scene
本申请实施例中,推理阶段描述的是执行设备210如何利用目标模型/规则201,对采集到的信息数据进行处理以生成预测结果的过程,具体地请参阅图11,图11为本申请实施例提供的虚拟角色的表演内容展示方法的另一种流程示意图,该方法可以包括步骤1101至步骤1112。In the embodiment of the present application, the reasoning stage describes the process of how the execution device 210 uses the target model/rule 201 to process the collected information data to generate a prediction result. Please refer to Figure 11 for details. Figure 11 is another flow chart of the method for displaying the performance content of a virtual character provided in the embodiment of the present application. The method may include steps 1101 to 1112.
步骤1101,执行设备获取第一流式音频片段。Step 1101: The execution device obtains a first streaming audio segment.
步骤1102,执行设备从第一流式音频片段中提取第一流式音频片段的特征信息。Step 1102: The execution device extracts feature information of the first streaming audio segment from the first streaming audio segment.
步骤1103,执行设备获取第二流式音频片段生成的动作特征信息。
Step 1103: The execution device obtains motion feature information generated by the second streaming audio segment.
步骤1104,执行设备基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息。Step 1104: The execution device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
其中,步骤1101至步骤1104的具体说明,可以分别参考上述实施例中图5所示的步骤501至步骤504的描述,此处不再赘述。For the specific description of step 1101 to step 1104 , reference may be made to the description of step 501 to step 504 shown in FIG. 5 in the above embodiment, and will not be repeated here.
在执行完步骤1101至步骤1104后,基于生成的内容不同,可以存在多种实现方式,下面以执行设备分别生成虚拟角色的面部表情信息和肢体动作信息为例,对接下来的步骤进行说明。应理解,在实际应用时,执行设备同步输出虚拟角色的面部表情信息和肢体动作信息,执行设备并行执行步骤1105-1107与步骤1108-1112。After executing steps 1101 to 1104, there may be multiple implementations based on different generated contents. The following takes the example of the execution device generating the facial expression information and body movement information of the virtual character respectively as an example to illustrate the following steps. It should be understood that in actual application, the execution device synchronously outputs the facial expression information and body movement information of the virtual character, and the execution device executes steps 1105-1107 and steps 1108-1112 in parallel.
(a)生成虚拟角色的面部表情信息(a) Generate facial expression information of virtual characters
步骤1105,执行设备获取第二流式音频片段生成的面部表情信息。Step 1105: The execution device obtains facial expression information generated by the second streaming audio segment.
本申请实施例中,执行设备通过获取第二流式音频片段生成的面部表情信息,以生成第一流式音频片段对应的面部表情信息。In an embodiment of the present application, the execution device generates facial expression information corresponding to the first streaming audio segment by acquiring facial expression information generated by the second streaming audio segment.
步骤1106,执行设备基于第一融合特征信息、音素特征信息和第二流式音频片段生成的面部表情信息,得到虚拟角色对应的基础表情的权重信息。Step 1106: The execution device obtains weight information of a basic expression corresponding to the virtual character based on the first fusion feature information, the phoneme feature information, and the facial expression information generated by the second streaming audio segment.
本申请实施例中,相比于生成多个虚拟角色对应的面部表情信息,无需考虑多个虚拟角色之间的关联特征信息。因此,在生成虚拟角色对应的基础表情的权重信息,执行设备直接输入第二流式音频片段生成的面部表情信息。In the embodiment of the present application, compared with generating facial expression information corresponding to multiple virtual characters, there is no need to consider the associated feature information between the multiple virtual characters. Therefore, when generating weight information of the basic expression corresponding to the virtual character, the execution device directly inputs the facial expression information generated by the second streaming audio segment.
步骤1107,执行设备基于虚拟角色对应的基础表情的权重信息,对虚拟角色的基础表情进行调整,得到虚拟角色的面部表情信息。Step 1107: The execution device adjusts the basic expression of the virtual character based on the weight information of the basic expression corresponding to the virtual character to obtain facial expression information of the virtual character.
本申请实施例中,执行设备基于虚拟角色对应的基础表情的权重信息,对虚拟角色对应的基础表情进行调整,以对虚拟角色对应的基础表情进行细节调整,得到与第一流式音频片段匹配的虚拟角色的面部表情信息。In an embodiment of the present application, the execution device adjusts the basic expression corresponding to the virtual character based on the weight information of the basic expression corresponding to the virtual character, so as to make detailed adjustments to the basic expression corresponding to the virtual character and obtain the facial expression information of the virtual character that matches the first streaming audio clip.
其中,步骤1105至步骤1107的具体说明,可以分别参考上述实施例中图5所示的步骤505至步骤510的描述,此处不再赘述。For the specific description of steps 1105 to 1107 , reference may be made to the description of steps 505 to 510 shown in FIG. 5 in the above embodiment, and will not be repeated here.
为了便于说明生成虚拟角色的面部表情信息的过程。示例性的,请参阅图12,图12为本申请实施例提供的生成虚拟角色的面部表情信息的另一种流程示意图。具体的执行步骤请参见步骤1105至步骤1107。For the convenience of explaining the process of generating facial expression information of a virtual character, please refer to FIG. 12 for example, which is another flow chart of generating facial expression information of a virtual character provided by an embodiment of the present application. Please refer to steps 1105 to 1107 for specific execution steps.
可以理解的是,图12所示的结构图仅为本申请实施例在生成虚拟角色的面部表情信息时的其中一种示例结构说明,在此不进行限定。It can be understood that the structural diagram shown in FIG. 12 is only an example structural description of one embodiment of the present application when generating facial expression information of a virtual character, and is not limited here.
(b)生成虚拟角色的肢体动作信息(b) Generate virtual character’s body movement information
Step 1108: The execution device obtains the action feature information generated from the second streaming audio segment.
Step 1109: The execution device obtains the action coding information corresponding to the virtual character based on the first fusion feature information and the action feature information generated from the second streaming audio segment.
In the embodiment of the present application, compared with generating the action coding information corresponding to multiple virtual characters, when generating the action coding information corresponding to the single virtual character, the execution device directly inputs the first fusion feature information and the action feature information generated from the second streaming audio segment, without considering the associated feature information between multiple virtual characters.
步骤1110,执行设备基于动作编码信息得到与虚拟角色的动作编码信息对应的动作特征信息。Step 1110: The execution device obtains action feature information corresponding to the action coding information of the virtual character based on the action coding information.
步骤1111,执行设备对第一融合特征信息进行细节调整,得到调整后的第一融合特征信息。Step 1111: The execution device performs detail adjustment on the first fused feature information to obtain adjusted first fused feature information.
步骤1112,执行设备以调整后的第一融合特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到虚拟角色对应的肢体动作信息。Step 1112: The execution device uses the adjusted first fusion feature information as an offset to adjust the action feature information generated by the first streaming audio segment to obtain body action information corresponding to the virtual character.
本申请实施例中,考虑到单个虚拟角色无需考虑多个虚拟角色之间的关联特征信息,因此在进行肢体动作信息的提取时仅提取单个虚拟角色的动作特征信息,无需提取关联特征信息。In the embodiment of the present application, considering that a single virtual character does not need to consider the associated feature information between multiple virtual characters, only the action feature information of a single virtual character is extracted when extracting the body motion information, and there is no need to extract the associated feature information.
一种可选的实现方式中,执行设备将调整后的第一融合特征信息以及第一流式音频片段生成的动作特征信息分别输入到单人动作解码器中,以调整后的第一融合特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到单个虚拟角色对应的肢体动作信息。In an optional implementation, the execution device inputs the adjusted first fusion feature information and the action feature information generated by the first streaming audio segment into a single-person action decoder respectively, and uses the adjusted first fusion feature information as an offset to adjust the action feature information generated by the first streaming audio segment to obtain the body movement information corresponding to a single virtual character.
In this implementation, since there is no need to consider the associated feature information between multiple virtual characters, the single-person action decoder mainly includes two Decoders. The single-person action decoder outputs the body movement information of a single virtual character, that is, the joint pose information of the single virtual character (including the position, rotation angle, speed, and other information of the joint points).
可以理解的是,单人动作解码器可以采用MLP网络、卷积网络等神经网络,具体可以根据实际需求进行设定,在此不进行限定。It is understandable that the single-person action decoder can adopt neural networks such as MLP networks and convolutional networks, which can be set according to actual needs and are not limited here.
其中,步骤1108至步骤1112的具体说明,可以分别参考上述实施例中图5所示的步骤511至步骤517的描述,此处不再赘述。For the specific description of steps 1108 to 1112, reference may be made to the description of steps 511 to 517 shown in FIG. 5 in the above embodiment, and will not be repeated here.
为了便于说明生成虚拟角色的肢体动作信息的过程。示例性的,请参阅图13,图13为本申请实施例提供的生成虚拟角色的肢体动作信息的另一种流程示意图。具体的执行步骤请参见步骤1108至步骤1112。For the convenience of explaining the process of generating the body movement information of the virtual character, please refer to FIG. 13 for example, which is another flow chart of generating the body movement information of the virtual character provided by the embodiment of the present application. Please refer to steps 1108 to 1112 for the specific execution steps.
可以理解的是,图13所示的结构图仅为本申请实施例在生成虚拟角色的肢体动作信息时的其中一种示例结构说明,在此不进行限定。It can be understood that the structural diagram shown in FIG. 13 is only an example structural description of one embodiment of the present application when generating the body movement information of a virtual character, and is not limited here.
二、训练阶段2. Training Phase
本申请实施例中,训练阶段描述的是训练模块202如何利用数据库230中的数据集合生成成熟的神经网络的过程,具体地,请参阅图14,图14为本申请实施例提供的模型训练方法的一种流程示意图,本申请实施例提供的模型训练方法可以包括:In the embodiment of the present application, the training phase describes the process of how the training module 202 generates a mature neural network using the data set in the database 230. Specifically, please refer to FIG. 14, which is a flow chart of a model training method provided in the embodiment of the present application. The model training method provided in the embodiment of the present application may include:
步骤1401,训练设备获取第一流式音频片段。Step 1401: The training device obtains a first streaming audio segment.
步骤1402,训练设备基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息。Step 1402: The training device obtains first fusion feature information based on the feature information of the first streaming audio segment and the action feature information generated according to the second streaming audio segment.
步骤1403,训练设备基于第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。Step 1403: The training device generates facial expression information and body movement information of the virtual character based on the first fused feature information.
其中,步骤1401至步骤1403的具体说明,可以分别参考上述实施例中图5所示的步骤501至步骤517的描述,此处不再赘述。For the specific description of step 1401 to step 1403 , reference may be made to the description of step 501 to step 517 shown in FIG. 5 in the above embodiment, and will not be repeated here.
Step 1404: The training device obtains a target loss based on the generated facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character.
本申请实施例中,训练设备基于生成的虚拟角色的面部表情信息与真实的面部表情信息,以及生成的虚拟角色的肢体动作信息与真实的肢体动作信息,得到目标损失。其中,目标损失用于指示面部表情信息与真实的面部表情信息,以及肢体动作信息与真实的肢体动作信息之间的差异。In the embodiment of the present application, the training device obtains a target loss based on the generated facial expression information of the virtual character and the real facial expression information, and the generated body movement information of the virtual character and the real body movement information. The target loss is used to indicate the difference between the facial expression information and the real facial expression information, and the body movement information and the real body movement information.
一种可选的实现方式中,训练设备可以基于以下方法获取得到目标损失,包括:In an optional implementation, the training device may obtain the target loss based on the following method, including:
训练设备基于虚拟角色的肢体动作信息获取各个虚拟角色之间的相对位置、面部朝向和肢体朝向;The training device obtains the relative positions, facial orientations and limb orientations of the virtual characters based on the limb motion information of the virtual characters;
训练设备基于各个虚拟角色各自对应的肢体动作信息与真实的肢体动作信息,得到第一损失;The training device obtains a first loss based on the body movement information corresponding to each virtual character and the real body movement information;
训练设备基于各个虚拟角色之间的相对位置与真实的相对位置,得到第二损失;The training device obtains a second loss based on the relative positions of the virtual characters and the real relative positions;
训练设备基于各个虚拟角色的面部朝向和肢体朝向与真实的面部朝向和肢体朝向,得到第三损失;The training device obtains a third loss based on the facial orientation and body orientation of each virtual character and the real facial orientation and body orientation;
训练设备根据第一损失、第二损失和第三损失得到目标损失。The training device obtains a target loss according to the first loss, the second loss, and the third loss.
该种实现方式中,训练设备可以根据虚拟角色的肢体动作信息,各个虚拟角色之间的相对位置,以及各个虚拟角色的面部朝向和肢体朝向,得到目标损失。In this implementation, the training device can obtain the target loss based on the body movement information of the virtual character, the relative positions between the virtual characters, and the facial orientation and body orientation of each virtual character.
可以理解的是,各个虚拟角色各自对应的肢体动作信息是指经过目标模型输出的肢体动作信息,真实的肢体动作信息是指通过实时动捕等方式实际采集的肢体动作信息。It can be understood that the limb movement information corresponding to each virtual character refers to the limb movement information output by the target model, and the real limb movement information refers to the limb movement information actually collected through real-time motion capture and other methods.
Optionally, the training device obtains the first loss L_1 based on the body movement information corresponding to each virtual character and the real body movement information. Under the supervision of the first loss, each virtual character learns the real body movement information, so that the characters cooperate in hand movements and body movements. For example, when two virtual characters jointly perform a finger-heart gesture, each virtual character learns its own movements, and the two virtual characters thereby coordinate their movements to complete the gesture. The real body movement information may be acquired through motion capture.
可选的，训练设备基于各个虚拟角色之间的相对位置与真实的相对位置，得到第二损失L2。Optionally, the training device obtains the second loss L2 based on the relative positions between the virtual characters and the real relative positions.
其中，i和j分别代表不同的虚拟角色，Δĥij和Δhij分别代表虚拟角色i和虚拟角色j之间的相对位置和真实的相对位置。Here, i and j denote different virtual characters, and Δĥij and Δhij respectively denote the relative position and the real relative position between virtual character i and virtual character j.
可以理解的是,虚拟角色i和虚拟角色j之间的相对位置可以通过目标模型输出的肢体动作信息计算得到,例如,肢体动作信息中包含的关节点的位置信息(x,y,z),训练设备通过计算虚拟角色i的关节点的位置信息和虚拟角色j的关节点的位置信息之间的相对距离,即可以得到虚拟角色i和虚拟角色j之间的相对位置。如果是平面,基于平面坐标(x,y)计算。而虚拟角色i和虚拟角色j之间的真实相对位置可以采用实时动捕等方式采集虚拟角色i和虚拟角色j之间的真实的相对位置。训练设备通过新增第二损失L2来实现不同虚拟角色在移动过程中的配合,生成具有协作效果的运动轨迹,通过约束任意两个虚拟角色的相对位置与真实的相对位置保持一致,来实现运动轨迹的对齐。It is understandable that the relative position between virtual character i and virtual character j can be calculated by the limb motion information output by the target model, for example, the position information (x, y, z) of the joints contained in the limb motion information. The training device calculates the relative distance between the position information of the joints of virtual character i and the position information of the joints of virtual character j, that is, the relative position between virtual character i and virtual character j can be obtained. If it is a plane, it is calculated based on the plane coordinates (x, y). The real relative position between virtual character i and virtual character j can be collected by real-time motion capture and other methods. The training device realizes the cooperation of different virtual characters in the process of movement by adding a second loss L2 , generates a motion trajectory with a collaborative effect, and realizes the alignment of the motion trajectory by constraining the relative position of any two virtual characters to be consistent with the real relative position.
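A minimal sketch of the second loss under the description above: the root-joint position stands in for each character's position, pairwise relative positions are compared with the motion-captured ones, and a squared-error form is assumed (the exact formula is not reproduced in this text).

```python
import numpy as np

def second_loss(pred_joints: np.ndarray, real_joints: np.ndarray, root: int = 0) -> float:
    """Relative-position loss L2 across character pairs (illustrative sketch).

    pred_joints / real_joints: shape [num_characters, num_frames, num_joints, 3]
    with joint positions (x, y, z); for a planar scene only (x, y) would be used.
    """
    pred_pos = pred_joints[:, :, root, :]   # predicted root positions, [C, T, 3]
    real_pos = real_joints[:, :, root, :]   # motion-captured root positions
    num_chars = pred_pos.shape[0]
    loss, pairs = 0.0, 0
    for i in range(num_chars):
        for j in range(num_chars):
            if i == j:
                continue
            delta_pred = pred_pos[i] - pred_pos[j]   # predicted relative position
            delta_real = real_pos[i] - real_pos[j]   # real relative position
            loss += float(np.mean(np.sum((delta_pred - delta_real) ** 2, axis=-1)))
            pairs += 1
    return loss / max(pairs, 1)
```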
可选的，训练设备基于各个虚拟角色的面部朝向和肢体朝向与真实的面部朝向和肢体朝向，得到第三损失L3。Optionally, the training device obtains the third loss L3 based on the facial orientation and limb orientation of each virtual character and the real facial orientation and limb orientation.
其中，i和j分别代表不同的虚拟角色，Δd̂rij和Δdrij分别代表虚拟角色i和虚拟角色j之间的面部朝向和真实的面部朝向，Δd̂fij和Δdfij分别代表虚拟角色i和虚拟角色j之间的肢体朝向和真实的肢体朝向。Here, i and j denote different virtual characters, Δd̂rij and Δdrij respectively denote the facial orientation and the real facial orientation between virtual character i and virtual character j, and Δd̂fij and Δdfij respectively denote the limb orientation and the real limb orientation between virtual character i and virtual character j.
可以理解的是，虚拟角色i和虚拟角色j之间的面部朝向和肢体朝向可以通过目标模型输出的肢体动作信息计算得到，例如，基于肢体动作信息中包含的旋转角度，训练设备可以得到虚拟角色i的关节点的旋转角度和虚拟角色j的关节点的旋转角度。其中，虚拟角色的肢体朝向主要是指根节点的朝向。训练设备通过新增第三损失L3来实现不同虚拟角色在移动过程中的协作，产生“两个虚拟角色相互凝望”的配合效果，学习不同角色在面部和根节点的朝向来实现朝向的对齐，以实现多个虚拟角色在肢体动作、运动轨迹以及身体朝向等维度的有效配合。It can be understood that the facial orientations and limb orientations of virtual character i and virtual character j can be calculated from the body movement information output by the target model. For example, based on the rotation angles contained in the body movement information, the training device can obtain the rotation angles of the joint points of virtual character i and of virtual character j. The limb orientation of a virtual character mainly refers to the orientation of its root node. By adding the third loss L3, the training device realizes the collaboration of different virtual characters during movement, producing the cooperative effect of two virtual characters gazing at each other, and learns the facial and root-node orientations of the different characters to achieve orientation alignment, so as to achieve effective coordination of multiple virtual characters in dimensions such as body movement, motion trajectory, and body orientation.
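A minimal sketch of the third loss, assuming the facial and root-node orientations are represented as unit direction vectors derived from the rotation angles, and that pairwise relative orientations are matched to the real ones with a squared-error term; all of this is illustrative rather than the exact formula of the original text.

```python
import numpy as np

def third_loss(pred_face: np.ndarray, real_face: np.ndarray,
               pred_root: np.ndarray, real_root: np.ndarray) -> float:
    """Orientation loss L3 over character pairs (illustrative sketch).

    Each argument: shape [num_characters, num_frames, 3], holding unit direction
    vectors for facial orientation (face) and limb/root-node orientation (root).
    """
    def pair_term(pred_dir: np.ndarray, real_dir: np.ndarray) -> float:
        num_chars = pred_dir.shape[0]
        loss, pairs = 0.0, 0
        for i in range(num_chars):
            for j in range(num_chars):
                if i == j:
                    continue
                d_pred = pred_dir[i] - pred_dir[j]   # predicted relative orientation
                d_real = real_dir[i] - real_dir[j]   # real relative orientation
                loss += float(np.mean(np.sum((d_pred - d_real) ** 2, axis=-1)))
                pairs += 1
        return loss / max(pairs, 1)

    # Facial-orientation term plus limb (root-node) orientation term.
    return pair_term(pred_face, real_face) + pair_term(pred_root, real_root)
```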
可选的,训练设备根据第一损失、第二损失和第三损失得到目标损失,具体为:Optionally, the training device obtains a target loss according to the first loss, the second loss, and the third loss, specifically:
训练设备将第一损失L1，第二损失L2以及第三损失L3分别设定不同的权重r1、r2、r3，权重越大代表比重越大，通过将权重乘以对应的损失并相加，即r1*L1+r2*L2+r3*L3，以得到最终的目标损失。The training device sets different weights r1, r2 and r3 for the first loss L1, the second loss L2 and the third loss L3, respectively. The larger the weight, the greater the proportion. The final target loss is obtained by multiplying each weight by the corresponding loss and adding the results, i.e., r1*L1+r2*L2+r3*L3.
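A one-line sketch of the weighted combination described above; the weight values are placeholders chosen for illustration, not values specified by this application.

```python
def target_loss(l1: float, l2: float, l3: float,
                r1: float = 1.0, r2: float = 0.5, r3: float = 0.5) -> float:
    """Target loss as the weighted sum r1*L1 + r2*L2 + r3*L3 (placeholder weights)."""
    return r1 * l1 + r2 * l2 + r3 * l3
```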
可以理解的是,在实际操作过程中,目标损失的计算方法可以根据实际需求进行设定,此处仅为举例说明,不进行限定。It is understandable that in actual operation, the calculation method of the target loss can be set according to actual needs. This is only an example and not a limitation.
步骤1405,训练设备基于目标损失更新待训练模型的参数,直至满足模型训练条件,得到目标模型。Step 1405: The training device updates the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
在图1至图14所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。请参阅图15,图15为本申请实施例提供的虚拟角色的表演内容展示装置的一种结构示意图。虚拟角色的表演内容展示装置1500包括流式音频片段获取模块1501、融合特征信息生成模块1502和表演内容生成模块1503。其中:On the basis of the embodiments corresponding to FIG. 1 to FIG. 14 , in order to better implement the above-mentioned scheme of the embodiment of the present application, the following also provides related devices for implementing the above-mentioned scheme. Please refer to FIG. 15 , which is a structural schematic diagram of a performance content display device for a virtual character provided in the embodiment of the present application. The performance content display device 1500 for a virtual character includes a streaming audio segment acquisition module 1501, a fusion feature information generation module 1502, and a performance content generation module 1503. Among them:
流式音频片段获取模块1501,用于获取第一流式音频片段;The streaming audio segment acquisition module 1501 is used to acquire a first streaming audio segment;
融合特征信息生成模块1502,用于基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息,第一流式音频片段和第二流式音频片段包含于同一流式音频,第二音频片段的播放时间在第一音频片段的播放时间之前;The fusion feature information generating module 1502 is used to obtain first fusion feature information based on feature information of a first streaming audio segment and action feature information generated according to a second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second audio segment is before a playback time of the first audio segment;
表演内容生成模块1503,用于基于第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。The performance content generating module 1503 is used to generate facial expression information and body movement information of the virtual character based on the first fused feature information.
一种可能的实现方式中,还包括:A possible implementation also includes:
流式音频片段获取模块1501,还用于获取第三流式音频片段;The streaming audio segment acquisition module 1501 is further used to acquire a third streaming audio segment;
融合特征信息生成模块1502,还用于基于第三流式音频片段的特征信息和根据第一流式音频片段生成的动作特征信息,得到第二融合特征信息,第三流式音频片段和第一流式音频片段包含于同一流式音频,第一流式音频片段的播放时间在第三流式音频片段的播放时间之前;The fusion feature information generating module 1502 is further used to obtain second fusion feature information based on feature information of a third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
表演内容生成模块1503,还用于基于第二融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。
The performance content generating module 1503 is further used to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
一种可能的实现方式中,第一流式音频片段的特征信息包括文本特征信息,表演内容生成模块1503还用于:In a possible implementation, the feature information of the first streamed audio segment includes text feature information, and the performance content generation module 1503 is further configured to:
基于文本特征信息进行音素提取,得到音素特征信息;Extract phonemes based on text feature information to obtain phoneme feature information;
获取多个虚拟角色在表情上的第一关联特征信息,第一关联特征信息基于第二流式音频片段生成;Acquire first associated feature information on expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment;
基于第一融合特征信息、音素特征信息和第一关联特征信息,得到至少一个虚拟角色对应的基础表情的权重信息;Based on the first fusion feature information, the phoneme feature information and the first associated feature information, obtaining weight information of a basic expression corresponding to at least one virtual character;
基于至少一个虚拟角色对应的基础表情的权重信息,对至少一个虚拟角色对应的基础表情进行调整,得到至少一个虚拟角色对应的面部表情信息。Based on the weight information of the basic expression corresponding to the at least one virtual character, the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
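As a rough illustration of the last two steps above, the sketch below assumes the basic expressions act as blendshape-like offsets added to a neutral face, and that "adjusting with the weight information" amounts to a weighted sum; the array shapes and the neutral-face term are assumptions not taken from the original text.

```python
import numpy as np

def adjust_basic_expressions(neutral_face: np.ndarray,
                             basic_expressions: np.ndarray,
                             weights: np.ndarray) -> np.ndarray:
    """Blend a character's basic expressions with predicted weights (illustrative sketch).

    neutral_face:      [num_vertices, 3] neutral facial mesh (assumed).
    basic_expressions: [num_basic, num_vertices, 3] offsets of each basic expression.
    weights:           [num_basic] weight information obtained from the fused feature,
                       phoneme feature and first associated feature information.
    """
    blended_offset = np.tensordot(weights, basic_expressions, axes=([0], [0]))
    return neutral_face + blended_offset   # facial expression information for the character
```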
一种可能的实现方式中,表演内容生成模块1503还用于:In a possible implementation, the performance content generating module 1503 is further used to:
基于多个虚拟角色对应的面部表情信息,提取多个虚拟角色对应的表情特征信息;Extracting expression feature information corresponding to the plurality of virtual characters based on the facial expression information corresponding to the plurality of virtual characters;
基于多个虚拟角色对应的表情特征信息,计算得到多个虚拟角色之间的表情相似度;Based on the expression feature information corresponding to the multiple virtual characters, the expression similarity between the multiple virtual characters is calculated;
以多个虚拟角色之间的表情相似度作为权重,对每个虚拟角色对应的表情特征信息进行调整,得到多个虚拟角色在表情上的第二关联特征信息,第二关联特征信息用于生成第三流式音频片段对应的至少一个虚拟角色对应的面部表情信息。The expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters. The second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
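The similarity-weighted adjustment described above can be pictured with the following sketch, which assumes cosine similarity between expression features and a softmax normalization of the similarities into weights; both choices are assumptions made for illustration only.

```python
import numpy as np

def second_associated_features(expr_feats: np.ndarray) -> np.ndarray:
    """Second associated feature information across characters (illustrative sketch).

    expr_feats: [num_characters, feat_dim] expression features extracted from each
    character's facial expression information.
    """
    # Expression similarity between characters (cosine similarity is assumed here).
    normed = expr_feats / (np.linalg.norm(expr_feats, axis=1, keepdims=True) + 1e-8)
    similarity = normed @ normed.T                                     # [C, C]
    # Use the similarities as weights (softmax-normalized) to adjust each character's features.
    exp_sim = np.exp(similarity - similarity.max(axis=1, keepdims=True))
    weights = exp_sim / exp_sim.sum(axis=1, keepdims=True)
    return weights @ expr_feats                                        # [C, feat_dim]
```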
一种可能的实现方式中,表演内容生成模块1503还用于:In a possible implementation, the performance content generating module 1503 is further used to:
对第一融合特征信息进行细节调整,得到调整后的第一融合特征信息;Performing detail adjustment on the first fused feature information to obtain adjusted first fused feature information;
获取多个虚拟角色在动作上的第三关联特征信息,第三关联特征信息基于第二流式音频片段生成;Acquire third associated feature information on actions of the plurality of virtual characters, where the third associated feature information is generated based on the second streaming audio segment;
获取第一流式音频片段生成的动作特征信息;Obtaining motion feature information generated by the first streaming audio segment;
以调整后的第一融合特征信息和第三关联特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到至少一个虚拟角色的对应的肢体动作信息。The action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of at least one virtual character.
一种可能的实现方式中,获取第一流式音频片段生成的动作特征信息,包括:In a possible implementation, obtaining motion feature information generated by the first streaming audio segment includes:
基于第一融合特征信息、第三关联特征信息和第二流式音频片段生成的动作特征信息,得到多个虚拟角色对应的动作编码信息;Based on the first fusion feature information, the third association feature information, and the action feature information generated by the second streaming audio segment, action coding information corresponding to the plurality of virtual characters is obtained;
基于动作编码信息得到与多个虚拟角色的动作编码信息对应的动作特征信息。Based on the action coding information, action feature information corresponding to the action coding information of multiple virtual characters is obtained.
一种可能的实现方式中,表演内容生成模块1503还用于:In a possible implementation, the performance content generating module 1503 is further used to:
基于第一流式音频片段生成的动作特征信息,计算得到多个虚拟角色之间的动作相似度;Based on the action feature information generated by the first streaming audio segment, calculating the action similarity between the multiple virtual characters;
以多个虚拟角色之间的动作相似度作为权重,对每个虚拟角色的对应的动作特征信息进行调整,得到多个虚拟角色在动作上的第四关联特征信息,第四关联特征信息用于生成第三流式音频片段对应的至少一个虚拟角色对应的肢体动作信息。Using the action similarity between multiple virtual characters as a weight, the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters. The fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
一种可能的实现方式中,还包括:A possible implementation also includes:
获取标签信息,标签信息用于指示虚拟角色的面部表情信息和/或肢体动作信息;Acquire tag information, where the tag information is used to indicate facial expression information and/or body movement information of the virtual character;
基于标签信息对每个虚拟角色各自对应的面部表情信息和/或肢体动作信息进行调整,得到调整后的每个虚拟角色各自对应的面部表情信息和/或肢体动作信息。The facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
一种可能的实现方式中,第一流式音频片段的特征信息包括文本特征信息,表演内容生成模块1503还用于:In a possible implementation, the feature information of the first streamed audio segment includes text feature information, and the performance content generation module 1503 is further configured to:
基于文本特征进行音素提取,获取音素特征信息;Extract phonemes based on text features to obtain phoneme feature information;
获取第二流式音频片段生成的面部表情信息;Obtaining facial expression information generated by a second streaming audio segment;
基于第一融合特征信息、音素特征信息和第二流式音频片段生成的面部表情信息,得到虚拟角色对应的基础表情的权重信息;Obtaining weight information of a basic expression corresponding to the virtual character based on the first fused feature information, the phoneme feature information, and the facial expression information generated by the second streaming audio segment;
基于虚拟角色对应的基础表情的权重信息,对虚拟角色的基础表情进行调整,得到虚拟角色的面部表情信息。Based on the weight information of the basic expression corresponding to the virtual character, the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
一种可能的实现方式中,表演内容生成模块1503还用于:In a possible implementation, the performance content generating module 1503 is further used to:
对第一融合特征信息进行调整,得到调整后的第一融合特征信息;Adjusting the first fused feature information to obtain adjusted first fused feature information;
获取第一流式音频片段生成的动作特征信息;Obtaining motion feature information generated by the first streaming audio segment;
以调整后的第一融合特征作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到虚拟角色对应的肢体动作信息。
The action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature as an offset to obtain body action information corresponding to the virtual character.
需要说明的是,虚拟角色的表演内容展示装置1500中各模块/单元之间的信息交互、执行过程等内容,与本申请中图5对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。It should be noted that the information interaction, execution process, etc. between the modules/units in the virtual character's performance content display device 1500 are based on the same concept as the various method embodiments corresponding to Figure 5 in the present application. The specific contents can be found in the description of the method embodiments shown in the previous part of the present application, and will not be repeated here.
本申请实施例还提供了一种模型训练装置,请参阅图16,图16为本申请实施例提供的模型训练装置的一种结构示意图,模型训练装置1600可以包括:The present application also provides a model training device. Please refer to FIG. 16 . FIG. 16 is a schematic diagram of a structure of a model training device provided in the present application. The model training device 1600 may include:
流式音频片段获取模块1601,用于获取第一流式音频片段;The streaming audio segment acquisition module 1601 is used to acquire a first streaming audio segment;
融合特征信息生成模块1602,用于基于第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息,其中,第一流式音频片段和第二流式音频片段包含于同一流式音频,第二音频片段的播放时间在第一音频片段的播放时间之前;The fusion feature information generating module 1602 is used to obtain first fusion feature information based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and the playing time of the second audio segment is before the playing time of the first audio segment;
表演内容生成模块1603,用于基于第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息;A performance content generating module 1603, configured to generate facial expression information and body movement information of a virtual character based on the first fused feature information;
目标损失获取模块1604，用于基于虚拟角色的面部表情信息和肢体动作信息与虚拟角色的真实的面部表情信息和肢体动作信息，得到目标损失，目标损失用于指示面部表情信息和肢体动作信息与真实的面部表情信息和肢体动作信息之间的差异；A target loss acquisition module 1604, configured to obtain a target loss based on the facial expression information and body movement information of the virtual character and the real facial expression information and body movement information of the virtual character, wherein the target loss is used to indicate the difference between the facial expression information and body movement information and the real facial expression information and body movement information;
目标模型训练模块1605,用于基于目标损失,更新待训练模型的参数,直至满足模型训练条件,得到目标模型。The target model training module 1605 is used to update the parameters of the model to be trained based on the target loss until the model training conditions are met to obtain the target model.
一种可能的实现方式中,还包括:A possible implementation also includes:
流式音频片段获取模块1601,还用于获取第三流式音频片段;The streaming audio segment acquisition module 1601 is further used to acquire a third streaming audio segment;
融合特征信息生成模块1602,还用于基于第三流式音频片段的特征信息和根据第一流式音频片段生成的动作特征信息,得到第二融合特征信息,第三流式音频片段和第一流式音频片段包含于同一流式音频,第一流式音频片段的播放时间在第三流式音频片段的播放时间之前;The fusion feature information generating module 1602 is further used to obtain second fusion feature information based on feature information of a third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and the playback time of the first streaming audio segment is before the playback time of the third streaming audio segment;
表演内容生成模块1603,还用于基于第二融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。The performance content generating module 1603 is further used to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
一种可能的实现方式中,第一流式音频片段的特征信息包括文本特征信息,表演内容生成模块1603还用于:In a possible implementation, the feature information of the first streamed audio segment includes text feature information, and the performance content generation module 1603 is further configured to:
基于文本特征信息进行音素提取,得到音素特征信息;Extract phonemes based on text feature information to obtain phoneme feature information;
获取多个虚拟角色在表情上的第一关联特征信息,第一关联特征信息基于第二流式音频片段生成;Acquire first associated feature information on expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment;
基于第一融合特征信息、音素特征信息和第一关联特征信息,得到至少一个虚拟角色对应的基础表情的权重信息;Based on the first fusion feature information, the phoneme feature information and the first associated feature information, obtaining weight information of a basic expression corresponding to at least one virtual character;
基于至少一个虚拟角色对应的基础表情的权重信息,对至少一个虚拟角色对应的基础表情进行调整,得到至少一个虚拟角色对应的面部表情信息。Based on the weight information of the basic expression corresponding to the at least one virtual character, the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
一种可能的实现方式中,表演内容生成模块1603还用于:In a possible implementation, the performance content generating module 1603 is further used to:
基于多个虚拟角色对应的面部表情信息,提取多个虚拟角色对应的表情特征信息;Extracting expression feature information corresponding to the plurality of virtual characters based on the facial expression information corresponding to the plurality of virtual characters;
基于多个虚拟角色对应的表情特征信息,计算得到多个虚拟角色之间的表情相似度;Based on the expression feature information corresponding to the multiple virtual characters, the expression similarity between the multiple virtual characters is calculated;
以多个虚拟角色之间的表情相似度作为权重,对每个虚拟角色对应的表情特征信息进行调整,得到多个虚拟角色在表情上的第二关联特征信息,第二关联特征信息用于生成第三流式音频片段对应的至少一个虚拟角色对应的面部表情信息。The expression similarity between multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters. The second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
一种可能的实现方式中,表演内容生成模块1603还用于:In a possible implementation, the performance content generating module 1603 is further used to:
对第一融合特征信息进行细节调整,得到调整后的第一融合特征信息;Performing detail adjustment on the first fused feature information to obtain adjusted first fused feature information;
获取多个虚拟角色在动作上的第三关联特征信息,第三关联特征信息基于第二流式音频片段生成;Acquire third associated feature information on actions of the plurality of virtual characters, where the third associated feature information is generated based on the second streaming audio segment;
获取第一流式音频片段生成的动作特征信息;Obtaining motion feature information generated by the first streaming audio segment;
以调整后的第一融合特征信息和第三关联特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到至少一个虚拟角色的对应的肢体动作信息。The action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of at least one virtual character.
一种可能的实现方式中,获取第一流式音频片段生成的动作特征信息,包括:In a possible implementation, obtaining motion feature information generated by the first streaming audio segment includes:
基于第一融合特征信息、第三关联特征信息和第二流式音频片段生成的动作特征信息,得到多个虚拟角色对应的动作编码信息;
Based on the first fusion feature information, the third association feature information, and the action feature information generated by the second streaming audio segment, action coding information corresponding to the plurality of virtual characters is obtained;
基于动作编码信息得到与多个虚拟角色的动作编码信息对应的动作特征信息。Based on the action coding information, action feature information corresponding to the action coding information of multiple virtual characters is obtained.
一种可能的实现方式中,表演内容生成模块1603还用于:In a possible implementation, the performance content generating module 1603 is further used to:
基于第一流式音频片段生成的动作特征信息,计算得到多个虚拟角色之间的动作相似度;Based on the action feature information generated by the first streaming audio segment, calculating the action similarity between the multiple virtual characters;
以多个虚拟角色之间的动作相似度作为权重,对每个虚拟角色的对应的动作特征信息进行调整,得到多个虚拟角色在动作上的第四关联特征信息,第四关联特征信息用于生成第三流式音频片段对应的至少一个虚拟角色对应的肢体动作信息。Using the action similarity between multiple virtual characters as a weight, the corresponding action feature information of each virtual character is adjusted to obtain fourth associated feature information on the actions of the multiple virtual characters. The fourth associated feature information is used to generate body movement information corresponding to at least one virtual character corresponding to the third streaming audio segment.
一种可能的实现方式中,还包括:A possible implementation also includes:
获取标签信息,标签信息用于指示虚拟角色的面部表情信息和/或肢体动作信息;Acquire tag information, where the tag information is used to indicate facial expression information and/or body movement information of the virtual character;
基于标签信息对每个虚拟角色各自对应的面部表情信息和/或肢体动作信息进行调整,得到调整后的每个虚拟角色各自对应的面部表情信息和/或肢体动作信息。The facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain the adjusted facial expression information and/or body movement information corresponding to each virtual character.
一种可能的实现方式中,第一流式音频片段的特征信息包括文本特征信息,表演内容生成模块1603还用于:In a possible implementation, the feature information of the first streamed audio segment includes text feature information, and the performance content generation module 1603 is further configured to:
基于文本特征进行音素提取,获取音素特征信息;Extract phonemes based on text features to obtain phoneme feature information;
获取第二流式音频片段生成的面部表情信息;Obtaining facial expression information generated by a second streaming audio segment;
基于第一融合特征信息、音素特征信息和第二流式音频片段生成的面部表情信息,得到虚拟角色对应的基础表情的权重信息;Obtaining weight information of a basic expression corresponding to the virtual character based on the first fused feature information, the phoneme feature information, and the facial expression information generated by the second streaming audio segment;
基于虚拟角色对应的基础表情的权重信息,对虚拟角色的基础表情进行调整,得到虚拟角色的面部表情信息。Based on the weight information of the basic expression corresponding to the virtual character, the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
一种可能的实现方式中,表演内容生成模块1603还用于:In a possible implementation, the performance content generating module 1603 is further used to:
对第一融合特征信息进行调整,得到调整后的第一融合特征信息;Adjusting the first fused feature information to obtain adjusted first fused feature information;
获取第一流式音频片段生成的动作特征信息;Obtaining motion feature information generated by the first streaming audio segment;
以调整后的第一融合特征作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到虚拟角色对应的肢体动作信息。The action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature as an offset to obtain body action information corresponding to the virtual character.
需要说明的是,模型训练装置1600中各模块/单元之间的信息交互、执行过程等内容,与本申请中图14对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。It should be noted that the information interaction, execution process, etc. between the modules/units in the model training device 1600 are based on the same concept as the various method embodiments corresponding to Figure 14 in the present application. The specific contents can be found in the description of the method embodiments shown in the previous part of the present application, and will not be repeated here.
本申请还提供一种计算设备1700。如图17所示,图17为本申请实施例提供的计算设备的一种结构示意图,计算设备1700包括:总线1702、处理器1704、存储器1706和通信接口1708。处理器1704、存储器1706和通信接口1708之间通过总线1702通信。计算设备1700可以是服务器或终端设备。应理解,本申请不限定计算设备1700中的处理器、存储器的个数。The present application also provides a computing device 1700. As shown in FIG. 17, FIG. 17 is a schematic diagram of a structure of a computing device provided in an embodiment of the present application, and the computing device 1700 includes: a bus 1702, a processor 1704, a memory 1706, and a communication interface 1708. The processor 1704, the memory 1706, and the communication interface 1708 communicate through the bus 1702. The computing device 1700 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 1700.
总线1702可以是外设部件互连标准（peripheral component interconnect，PCI）总线或扩展工业标准结构（extended industry standard architecture，EISA）总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示，图17中仅用一条线表示，但并不表示仅有一根总线或一种类型的总线。总线1702可包括在计算设备1700各个部件（例如，存储器1706、处理器1704、通信接口1708）之间传送信息的通路。The bus 1702 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is represented by only one line in FIG. 17, but this does not mean that there is only one bus or one type of bus. The bus 1702 may include a path for transmitting information between various components of the computing device 1700 (e.g., the memory 1706, the processor 1704, and the communication interface 1708).
处理器1704可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。Processor 1704 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
存储器1706可以包括易失性存储器（volatile memory），例如随机存取存储器（random access memory，RAM）。存储器1706还可以包括非易失性存储器（non-volatile memory），例如只读存储器（read-only memory，ROM），快闪存储器，机械硬盘（hard disk drive，HDD）或固态硬盘（solid state drive，SSD）。The memory 1706 may include a volatile memory, such as a random access memory (RAM). The memory 1706 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
存储器1706中存储有可执行的程序代码,处理器1704执行该可执行的程序代码以分别实现前述流式音频片段获取模块1501、融合特征信息生成模块1502和表演内容生成模块1503的功能,从而实现虚拟角色的表演内容展示方法。也即,存储器1706上存有用于执行虚拟角色的表演内容展示方法的指令。
The memory 1706 stores executable program codes, and the processor 1704 executes the executable program codes to respectively implement the functions of the aforementioned streaming audio segment acquisition module 1501, the fusion feature information generation module 1502, and the performance content generation module 1503, thereby implementing the performance content display method of the virtual character. That is, the memory 1706 stores instructions for executing the performance content display method of the virtual character.
通信接口1708使用例如但不限于网络接口卡、收发器一类的收发模块,来实现计算设备1700与其他设备或通信网络之间的通信。The communication interface 1708 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 1700 and other devices or communication networks.
本申请实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。The embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
如图18所示,图18为本申请实施例提供的计算机设备集群的一种结构示意图,该计算设备集群包括至少一个计算设备1700。计算设备集群中的一个或多个计算设备1700中的存储器1706中可以存有相同的用于执行虚拟角色的表演内容展示方法的指令。As shown in FIG18 , FIG18 is a schematic diagram of a structure of a computer device cluster provided in an embodiment of the present application, and the computing device cluster includes at least one computing device 1700. The memory 1706 in one or more computing devices 1700 in the computing device cluster may store the same instructions for executing the performance content display method of the virtual character.
在一些可能的实现方式中,该计算设备集群中的一个或多个计算设备1700的存储器1706中也可以分别存有用于执行虚拟角色的表演内容展示方法的部分指令。换言之,一个或多个计算设备1700的组合可以共同执行用于执行虚拟角色的表演内容展示方法的指令。In some possible implementations, the memory 1706 of one or more computing devices 1700 in the computing device cluster may also store partial instructions for executing the method for displaying the performance content of the virtual character. In other words, the combination of one or more computing devices 1700 may jointly execute the instructions for executing the method for displaying the performance content of the virtual character.
需要说明的是,计算设备集群中的不同的计算设备1700中的存储器1706可以存储不同的指令,分别用于执行虚拟角色的表演内容展示装置的部分功能。也即,不同的计算设备1700中的存储器1706存储的指令可以实现流式音频片段获取模块1501、融合特征信息生成模块1502和表演内容生成模块1503中的一个或多个模块的功能。It should be noted that the memory 1706 in different computing devices 1700 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the performance content display device of the virtual character. That is, the instructions stored in the memory 1706 in different computing devices 1700 can realize the functions of one or more modules among the streaming audio segment acquisition module 1501, the fusion feature information generation module 1502 and the performance content generation module 1503.
在一些可能的实现方式中,计算设备集群中的一个或多个计算设备可以通过网络连接。其中,所述网络可以是广域网或局域网等等。In some possible implementations, one or more computing devices in the computing device cluster may be connected via a network, which may be a wide area network or a local area network.
本申请实施例还提供了一种训练设备,请参阅图19,图19为本申请实施例提供的训练设备一种结构示意图。具体地,训练设备1900由一个或多个服务器实现,训练设备1900可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1922(例如,一个或一个以上处理器)和存储器1932,一个或一个以上存储应用程序1942或数据1944的存储介质1930(例如一个或一个以上海量存储设备)。其中,存储器1932和存储介质1930可以是短暂存储或持久存储。存储在存储介质1930的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1922可以设置为与存储介质1930通信,在训练设备1900上执行存储介质1930中的一系列指令操作。The embodiment of the present application also provides a training device, please refer to Figure 19, which is a structural diagram of a training device provided by the embodiment of the present application. Specifically, the training device 1900 is implemented by one or more servers, and the training device 1900 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (CPU) 1922 (for example, one or more processors) and memory 1932, one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. Among them, the memory 1932 and the storage medium 1930 can be short-term storage or permanent storage. The program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device. Furthermore, the central processor 1922 can be configured to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the training device 1900.
训练设备1900还可以包括一个或一个以上电源1926,一个或一个以上有线或无线网络接口1950,一个或一个以上输入输出接口1958,和/或,一个或一个以上操作系统1941,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。The training device 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, and/or, one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
本申请实施例中，中央处理器1922，用于执行图14对应实施例中的训练设备执行的模型训练方法。需要说明的是，中央处理器1922执行前述各个步骤的具体方式，与本申请中图14对应的各个方法实施例基于同一构思，其带来的技术效果与本申请中图14对应的各个方法实施例相同，具体内容可参见本申请前述所示的方法实施例中的叙述，此处不再赘述。In the embodiment of the present application, the central processing unit 1922 is used to execute the model training method executed by the training device in the embodiment corresponding to FIG. 14. It should be noted that the specific manner in which the central processing unit 1922 executes the aforementioned steps is based on the same concept as the method embodiments corresponding to FIG. 14 in the present application, and the technical effects brought about are the same as those of the method embodiments corresponding to FIG. 14 in the present application. For specific contents, please refer to the description in the method embodiments shown above in the present application, which will not be repeated here.
本申请实施例还提供了一种包含指令的计算机程序产品。所述计算机程序产品可以是包含指令的,能够运行在计算设备上或被储存在任何可用介质中的软件或程序产品。当所述计算机程序产品在至少一个计算设备上运行时,使得至少一个计算设备执行上述任一实施例的方法。The present application also provides a computer program product including instructions. The computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium. When the computer program product is run on at least one computing device, the at least one computing device executes the method of any of the above embodiments.
本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行上述任一实施例的方法。The embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be stored by a computing device or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state hard disk). The computer-readable storage medium includes instructions that instruct the computing device to execute the method of any of the above embodiments.
本申请实施例提供的虚拟角色的表演内容展示装置、模型训练装置、计算设备、训练设备具体可以为芯片，芯片包括：处理单元和通信单元，所述处理单元例如可以是处理器，所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令，以使芯片执行上述各个方法实施例描述的虚拟角色的表演内容展示方法，或者，以使芯片执行上述图14所示实施例描述的模型训练方法。可选地，所述存储单元为所述芯片内的存储单元，如寄存器、缓存等，所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元，如只读存储器（read-only memory，ROM）或可存储静态信息和指令的其他类型的静态存储设备，随机存取存储器（random access memory，RAM）等。The performance content display device, model training device, computing device, and training device for a virtual character provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit, etc. The processing unit may execute computer execution instructions stored in the storage unit so that the chip executes the performance content display method for a virtual character described in the above-mentioned various method embodiments, or so that the chip executes the model training method described in the embodiment shown in FIG. 14 above. Optionally, the storage unit is a storage unit within the chip, such as a register, a cache, etc. The storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
具体地,请参阅图20,图20为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 2000,NPU 2000作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2003,通过控制器2004控制运算电路2003提取存储器中的矩阵数据并进行乘法运算。Specifically, please refer to FIG. 20 , which is a schematic diagram of a structure of a chip provided in an embodiment of the present application, wherein the chip may be a neural network processor NPU 2000, which is mounted on a host CPU (Host CPU) as a coprocessor and is assigned tasks by the Host CPU. The core part of the NPU is an operation circuit 2003, which is controlled by a controller 2004 to extract matrix data from a memory and perform multiplication operations.
在一些实现中,运算电路2003内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路2003是二维脉动阵列。运算电路2003还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2003是通用的矩阵处理器。In some implementations, the operation circuit 2003 includes multiple processing units (Process Engine, PE) inside. In some implementations, the operation circuit 2003 is a two-dimensional systolic array. The operation circuit 2003 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2003 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2002中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2001中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2008中。For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit takes the corresponding data of matrix B from the weight memory 2002 and caches it on each PE in the operation circuit. The operation circuit takes the matrix A data from the input memory 2001 and performs matrix operation with matrix B. The partial result or final result of the matrix is stored in the accumulator 2008.
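As a rough software analogue of the example above, the following sketch multiplies an input matrix A by a weight matrix B tile by tile and accumulates the partial results, mirroring the roles of the input memory 2001, the weight memory 2002 and the accumulator 2008; the tile size and the NumPy implementation are purely illustrative assumptions, not a description of the actual circuit.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 16) -> np.ndarray:
    """Illustrative tiled matrix multiplication C = A @ B with explicit accumulation."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n))                     # plays the role of the accumulator 2008
    for k0 in range(0, k, tile):
        a_tile = a[:, k0:k0 + tile]          # data streamed from the input memory 2001
        b_tile = b[k0:k0 + tile, :]          # weight tile cached from the weight memory 2002
        c += a_tile @ b_tile                 # accumulate partial results
    return c
```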
统一存储器2006用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器（Direct Memory Access Controller，DMAC）2005被搬运到权重存储器2002中。输入数据也通过DMAC被搬运到统一存储器2006中。The unified memory 2006 is used to store input data and output data. The weight data is directly transferred to the weight memory 2002 through the direct memory access controller (DMAC) 2005. The input data is also transferred to the unified memory 2006 through the DMAC.
BIU为Bus Interface Unit，即总线接口单元2010，用于AXI总线与DMAC和取指存储器（Instruction Fetch Buffer，IFB）2009的交互。BIU stands for Bus Interface Unit, i.e., the bus interface unit 2010, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 2009.
总线接口单元2010(Bus Interface Unit,简称BIU),用于取指存储器2009从外部存储器获取指令,还用于存储单元访问控制器2005从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 2010 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2009 to obtain instructions from the external memory, and is also used for the storage unit access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2006或将权重数据搬运到权重存储器2002中或将输入数据数据搬运到输入存储器2001中。DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2006 or to transfer weight data to the weight memory 2002 or to transfer input data to the input memory 2001.
向量计算单元2007包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。The vector calculation unit 2007 includes multiple operation processing units, which can further process the output of the operation circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of feature planes, etc.
在一些实现中,向量计算单元2007能将经处理的输出的向量存储到统一存储器2006。例如,向量计算单元2007可以将线性函数和/或非线性函数应用到运算电路2003的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2007生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2003的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector calculation unit 2007 can store the processed output vector to the unified memory 2006. For example, the vector calculation unit 2007 can apply a linear function and/or a nonlinear function to the output of the operation circuit 2003, such as linear interpolation of the feature plane extracted by the convolution layer, and then, for example, a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 2007 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 2003, for example, for use in a subsequent layer in a neural network.
控制器2004连接的取指存储器(instruction fetch buffer)2009,用于存储控制器2004使用的指令;An instruction fetch buffer 2009 connected to the controller 2004, for storing instructions used by the controller 2004;
统一存储器2006,输入存储器2001,权重存储器2002以及取指存储器2009均为On-Chip存储器。外部存储器私有于该NPU硬件架构。The unified memory 2006, the input memory 2001, the weight memory 2002 and the instruction fetch memory 2009 are all on-chip memories. The external memory is private to the NPU hardware architecture.
其中,上述提到的目标模型中各层的运算可以由运算电路2003或向量计算单元2007执行。Among them, the operations of each layer in the above-mentioned target model can be performed by the operation circuit 2003 or the vector calculation unit 2007.
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面方法的程序执行的集成电路。The processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the above-mentioned first aspect method.
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。It should also be noted that the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. In addition, in the drawings of the device embodiments provided by the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
通过以上的实施方式的描述，所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现，当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下，凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现，而且，用来实现同一功能的具体硬件结构也可以是多种多样的，例如模拟电路、数字电路或专用电路等。但是，对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在可读取的存储介质中，如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，训练设备，或者网络设备等）执行本申请各个实施例所述的方法。Through the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware. Of course, it can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, etc. In general, all functions performed by computer programs can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for the present application, software program implementation is a better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence, or in other words the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk or optical disk, etc., and includes a number of instructions for enabling a computer device (which can be a personal computer, training equipment, or network equipment, etc.) to execute the methods described in the various embodiments of the present application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。In the above embodiments, all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented by software, all or part of the embodiments may be implemented in the form of a computer program product.
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website site, a computer, a training device, or a data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) mode to another website site, computer, training device, or data center. The computer-readable storage medium may be any available medium that a computer can store or a data storage device such as a training device, a data center, etc. that includes one or more available media integrations. The available medium may be a magnetic medium, (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
Claims (23)
- 一种虚拟角色的表演内容展示方法,其特征在于,包括:A method for displaying performance content of a virtual character, characterized by comprising:获取第一流式音频片段;Get the first streaming audio segment;基于所述第一流式音频片段的特征信息和根据第二流式音频片段生成的动作特征信息,得到第一融合特征信息,所述第一流式音频片段和所述第二流式音频片段包含于同一流式音频,所述第二音频片段的播放时间在所述第一音频片段的播放时间之前;obtaining first fusion feature information based on feature information of the first streaming audio segment and action feature information generated according to the second streaming audio segment, wherein the first streaming audio segment and the second streaming audio segment are included in the same streaming audio, and a playback time of the second audio segment is before a playback time of the first audio segment;基于所述第一融合特征信息生成虚拟角色的面部表情信息和肢体动作信息。Facial expression information and body movement information of the virtual character are generated based on the first fused feature information.
- 根据权利要求1所述的方法,其特征在于,还包括:The method according to claim 1, further comprising:获取第三流式音频片段;Get the third streaming audio segment;基于所述第三流式音频片段的特征信息和根据第一流式音频片段生成的动作特征信息,得到第二融合特征信息,所述第三流式音频片段和所述第一流式音频片段包含于所述同一流式音频,所述第一流式音频片段的播放时间在所述第三流式音频片段的播放时间之前;obtaining second fusion feature information based on feature information of the third streaming audio segment and action feature information generated according to the first streaming audio segment, wherein the third streaming audio segment and the first streaming audio segment are included in the same streaming audio, and a playback time of the first streaming audio segment is before a playback time of the third streaming audio segment;基于所述第二融合特征信息生成所述虚拟角色的面部表情信息和肢体动作信息。Facial expression information and body movement information of the virtual character are generated based on the second fused feature information.
- 根据权利要求1或2所述的方法,其特征在于,所述第一流式音频片段的特征信息包括文本特征信息,所述基于所述第一融合特征信息生成虚拟角色的面部表情信息,包括:The method according to claim 1 or 2 is characterized in that the feature information of the first streaming audio segment includes text feature information, and the generating of the facial expression information of the virtual character based on the first fused feature information comprises:基于所述文本特征信息进行音素提取,得到音素特征信息;Perform phoneme extraction based on the text feature information to obtain phoneme feature information;获取多个虚拟角色在表情上的第一关联特征信息,所述第一关联特征信息基于所述第二流式音频片段生成;Acquire first associated feature information on expressions of multiple virtual characters, where the first associated feature information is generated based on the second streaming audio segment;基于所述第一融合特征信息、所述音素特征信息和所述第一关联特征信息,得到至少一个虚拟角色对应的基础表情的权重信息;Based on the first fusion feature information, the phoneme feature information and the first association feature information, obtaining weight information of a basic expression corresponding to at least one virtual character;基于所述至少一个虚拟角色对应的基础表情的权重信息,对所述至少一个虚拟角色对应的基础表情进行调整,得到所述至少一个虚拟角色对应的面部表情信息。Based on the weight information of the basic expression corresponding to the at least one virtual character, the basic expression corresponding to the at least one virtual character is adjusted to obtain the facial expression information corresponding to the at least one virtual character.
- 根据权利要求3所述的方法,其特征在于,还包括:The method according to claim 3, further comprising:基于多个虚拟角色对应的面部表情信息,提取所述多个虚拟角色对应的表情特征信息;Extracting facial expression feature information corresponding to the plurality of virtual characters based on facial expression information corresponding to the plurality of virtual characters;基于所述多个虚拟角色对应的表情特征信息,计算得到所述多个虚拟角色之间的表情相似度;Based on the expression feature information corresponding to the multiple virtual characters, calculating the expression similarity between the multiple virtual characters;以所述多个虚拟角色之间的表情相似度作为权重,对每个虚拟角色对应的表情特征信息进行调整,得到所述多个虚拟角色在表情上的第二关联特征信息,所述第二关联特征信息用于生成所述第三流式音频片段对应的至少一个虚拟角色对应的面部表情信息。The expression similarity between the multiple virtual characters is used as a weight to adjust the expression feature information corresponding to each virtual character to obtain second associated feature information on the expressions of the multiple virtual characters. The second associated feature information is used to generate facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- 根据权利要求1至4任一项所述的方法,其特征在于,所述基于所述第一融合特征信息生成虚拟角色的肢体动作信息,包括:The method according to any one of claims 1 to 4, characterized in that the step of generating the virtual character's body movement information based on the first fused feature information comprises:对所述第一融合特征信息进行细节调整,得到调整后的第一融合特征信息;Performing detail adjustment on the first fused feature information to obtain adjusted first fused feature information;获取所述多个虚拟角色在动作上的第三关联特征信息,所述第三关联特征信息基于所述第二流式音频片段生成;Acquire third associated feature information on the actions of the plurality of virtual characters, wherein the third associated feature information is generated based on the second streaming audio segment;获取第一流式音频片段生成的动作特征信息;Obtaining motion feature information generated by the first streaming audio segment;以所述调整后的第一融合特征信息和所述第三关联特征信息作为偏移量,对第一流式音频片段生成的动作特征信息进行调整,得到所述至少一个虚拟角色的对应的肢体动作信息。The action feature information generated by the first streaming audio segment is adjusted using the adjusted first fusion feature information and the third associated feature information as offsets to obtain corresponding body motion information of the at least one virtual character.
- 根据权利要求5所述的方法,其特征在于,所述获取第一流式音频片段生成的动作特征信息,包括:The method according to claim 5, characterized in that the step of obtaining the motion feature information generated by the first streaming audio segment comprises:基于所述第一融合特征信息、所述第三关联特征信息和所述第二流式音频片段生成的动作特征信息,得到多个虚拟角色对应的动作编码信息;Obtaining action coding information corresponding to a plurality of virtual characters based on the first fused feature information, the third associated feature information, and the action feature information generated by the second streaming audio segment;基于所述动作编码信息得到与所述多个虚拟角色的动作编码信息对应的动作特征信息。Based on the action coding information, action feature information corresponding to the action coding information of the multiple virtual characters is obtained.
- 根据权利要求5或6所述的方法，其特征在于，还包括：The method according to claim 5 or 6, characterized in that it also includes:基于所述第一流式音频片段生成的动作特征信息，计算得到所述多个虚拟角色之间的动作相似度；Based on the action feature information generated by the first streaming audio segment, calculating the action similarity between the multiple virtual characters;以所述多个虚拟角色之间的动作相似度作为权重，对每个所述虚拟角色的对应的所述动作特征信息进行调整，得到所述多个虚拟角色在动作上的第四关联特征信息，所述第四关联特征信息用于生成所述第三流式音频片段对应的所述至少一个虚拟角色对应的肢体动作信息。The action feature information corresponding to each of the virtual characters is adjusted by taking the action similarity between the multiple virtual characters as a weight to obtain fourth associated feature information on the actions of the multiple virtual characters, and the fourth associated feature information is used to generate the body movement information corresponding to the at least one virtual character corresponding to the third streaming audio segment.
- 根据权利要求5至7中任一项所述的方法,其特征在于,还包括:The method according to any one of claims 5 to 7, further comprising:获取标签信息,所述标签信息用于指示所述虚拟角色的面部表情信息和/或肢体动作信息;Acquire tag information, where the tag information is used to indicate facial expression information and/or body movement information of the virtual character;基于所述标签信息对每个虚拟角色各自对应的面部表情信息和/或肢体动作信息进行调整,得到调整后的每个所述虚拟角色各自对应的面部表情信息和/或肢体动作信息。The facial expression information and/or body movement information corresponding to each virtual character is adjusted based on the tag information to obtain adjusted facial expression information and/or body movement information corresponding to each virtual character.
- 根据权利要求1所述的方法,其特征在于,所述第一流式音频片段的特征信息包括文本特征信息,所述基于所述第一融合特征信息生成虚拟角色的面部表情信息,包括:The method according to claim 1, characterized in that the feature information of the first streaming audio segment includes text feature information, and the generating of the facial expression information of the virtual character based on the first fused feature information comprises:基于所述文本特征进行音素提取,获取音素特征信息;Extract phonemes based on the text features to obtain phoneme feature information;获取所述第二流式音频片段生成的面部表情信息;Obtaining facial expression information generated by the second streaming audio segment;基于所述第一融合特征信息、所述音素特征信息和所述第二流式音频片段生成的面部表情信息,得到所述虚拟角色对应的基础表情的权重信息;Obtaining weight information of a basic expression corresponding to the virtual character based on the first fused feature information, the phoneme feature information, and the facial expression information generated by the second streaming audio segment;基于所述虚拟角色对应的基础表情的权重信息,对所述虚拟角色的基础表情进行调整,得到所述虚拟角色的面部表情信息。Based on the weight information of the basic expression corresponding to the virtual character, the basic expression of the virtual character is adjusted to obtain the facial expression information of the virtual character.
- The method according to claim 1 or 9, characterized in that generating the body movement information of the virtual character based on the first fused feature information comprises: adjusting the first fused feature information to obtain adjusted first fused feature information; acquiring the motion feature information generated from the first streaming audio segment; and adjusting the motion feature information generated from the first streaming audio segment by using the adjusted first fused feature information as an offset, to obtain the body movement information corresponding to the virtual character.
- A device for displaying performance content of a virtual character, characterized by comprising: a streaming audio segment acquisition module, configured to acquire a first streaming audio segment; a fused feature information generation module, configured to obtain first fused feature information based on feature information of the first streaming audio segment and motion feature information generated from a second streaming audio segment, the first streaming audio segment and the second streaming audio segment being included in the same streaming audio, and the playback time of the second streaming audio segment being before the playback time of the first streaming audio segment; and a performance content generation module, configured to generate facial expression information and body movement information of the virtual character based on the first fused feature information.
- The device according to claim 11, characterized in that: the streaming audio segment acquisition module is further configured to acquire a third streaming audio segment; the fused feature information generation module is further configured to obtain second fused feature information based on feature information of the third streaming audio segment and motion feature information generated from the first streaming audio segment, the third streaming audio segment and the first streaming audio segment being included in the same streaming audio, and the playback time of the first streaming audio segment being before the playback time of the third streaming audio segment; and the performance content generation module is further configured to generate facial expression information and body movement information of the virtual character based on the second fused feature information.
- The device according to claim 11 or 12, characterized in that the feature information of the first streaming audio segment includes text feature information, and the performance content generation module is further configured to: perform phoneme extraction based on the text feature information to obtain phoneme feature information; acquire first associated feature information on the expressions of a plurality of virtual characters, the first associated feature information being generated based on the second streaming audio segment; obtain weight information of a basic expression corresponding to at least one virtual character based on the first fused feature information, the phoneme feature information, and the first associated feature information; and adjust the basic expression corresponding to the at least one virtual character based on the weight information of the basic expression corresponding to the at least one virtual character, to obtain the facial expression information corresponding to the at least one virtual character.
- The device according to claim 13, characterized in that the performance content generation module is further configured to: extract expression feature information corresponding to a plurality of virtual characters based on the facial expression information corresponding to the plurality of virtual characters; calculate expression similarity between the plurality of virtual characters based on the expression feature information corresponding to the plurality of virtual characters; and adjust the expression feature information corresponding to each virtual character by using the expression similarity between the plurality of virtual characters as weights, to obtain second associated feature information on the expressions of the plurality of virtual characters, the second associated feature information being used to generate the facial expression information corresponding to at least one virtual character corresponding to the third streaming audio segment.
- The device according to any one of claims 11 to 14, characterized in that the performance content generation module is further configured to: perform detail adjustment on the first fused feature information to obtain adjusted first fused feature information; acquire third associated feature information on the movements of the plurality of virtual characters, the third associated feature information being generated based on the second streaming audio segment; acquire motion feature information generated from the first streaming audio segment; and adjust the motion feature information generated from the first streaming audio segment by using the adjusted first fused feature information and the third associated feature information as offsets, to obtain the body movement information corresponding to the at least one virtual character.
- The device according to claim 15, characterized in that acquiring the motion feature information generated from the first streaming audio segment comprises: obtaining motion coding information corresponding to the plurality of virtual characters based on the first fused feature information, the third associated feature information, and the motion feature information generated from the second streaming audio segment; and obtaining, based on the motion coding information, motion feature information corresponding to the motion coding information of the plurality of virtual characters.
- The device according to claim 15 or 16, characterized in that the performance content generation module is further configured to: calculate motion similarity between the plurality of virtual characters based on the motion feature information generated from the first streaming audio segment; and adjust the motion feature information corresponding to each of the virtual characters by using the motion similarity between the plurality of virtual characters as weights, to obtain fourth associated feature information on the movements of the plurality of virtual characters, the fourth associated feature information being used to generate the body movement information corresponding to the at least one virtual character corresponding to the third streaming audio segment.
- The device according to any one of claims 15 to 17, characterized by further: acquiring tag information, the tag information being used to indicate facial expression information and/or body movement information of the virtual character; and adjusting the facial expression information and/or body movement information corresponding to each virtual character based on the tag information, to obtain adjusted facial expression information and/or body movement information corresponding to each virtual character.
- The device according to claim 11, characterized in that the feature information of the first streaming audio segment includes text feature information, and the performance content generation module is further configured to: perform phoneme extraction based on the text feature information to obtain phoneme feature information; acquire the facial expression information generated from the second streaming audio segment; obtain weight information of a basic expression corresponding to the virtual character based on the first fused feature information, the phoneme feature information, and the facial expression information generated from the second streaming audio segment; and adjust the basic expression of the virtual character based on the weight information of the basic expression corresponding to the virtual character, to obtain the facial expression information of the virtual character.
- The device according to claim 11 or 19, characterized in that the performance content generation module is further configured to: adjust the first fused feature information to obtain adjusted first fused feature information; acquire the motion feature information generated from the first streaming audio segment; and adjust the motion feature information generated from the first streaming audio segment by using the adjusted first fused feature information as an offset, to obtain the body movement information corresponding to the virtual character.
- A computing device cluster, characterized by comprising at least one computing device, each computing device comprising a processor and a memory; wherein the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method according to any one of claims 1 to 10.
- A computer program product comprising instructions, characterized in that, when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 10.
- A computer-readable storage medium, characterized by comprising computer program instructions which, when executed by a computing device cluster, cause the computing device cluster to perform the method according to any one of claims 1 to 10.
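The three sketches below are editorial illustrations, not part of the claimed subject matter. First, the offset-style adjustment recited in claims 5, 10, 15 and 20 can be read as adding the detail-adjusted fused features (and, where present, the associated features) to the motion features of the current streaming audio segment. A minimal Python sketch, in which the array shapes, the scaling used for "detail adjustment", and the purely additive combination are all assumptions:

```python
import numpy as np

def adjust_motion_features(fused_feat, assoc_feat, motion_feat, detail_scale=0.1):
    """Hypothetical offset-based adjustment (claims 5, 10, 15, 20).

    fused_feat:  first fused feature information, shape (T, D)
    assoc_feat:  third associated feature information, shape (T, D)
    motion_feat: motion features generated from the first streaming
                 audio segment, shape (T, D)
    """
    # "Detail adjustment" is modelled here as a fixed scaling; in practice
    # it would be a small learned network.
    adjusted_fused = detail_scale * fused_feat

    # Use the adjusted fused features and the associated features as offsets.
    return motion_feat + adjusted_fused + assoc_feat

# Example call with placeholder data.
T, D = 30, 64
body_motion = adjust_motion_features(np.zeros((T, D)), np.zeros((T, D)),
                                     np.random.randn(T, D))
```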
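Second, the similarity-as-weight step in claims 7, 14 and 17 can be illustrated as re-weighting each character's features by its pairwise similarity to the other characters. Cosine similarity and the row-wise softmax normalisation below are assumptions; the claims do not fix a particular similarity measure or normalisation:

```python
import numpy as np

def associated_features(char_feats):
    """Hypothetical similarity-weighted association (claims 7, 14, 17).

    char_feats: per-character feature vectors, shape (N, D) for N characters.
    Returns associated feature information, shape (N, D).
    """
    # Pairwise cosine similarity between characters.
    unit = char_feats / (np.linalg.norm(char_feats, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                                   # (N, N)

    # Row-wise softmax keeps the weights positive and summing to one.
    w = np.exp(sim)
    weights = w / w.sum(axis=1, keepdims=True)

    # Each character's associated features are a similarity-weighted mix
    # of all characters' features.
    return weights @ char_feats
```

The softmax here is only a convenience so that each character's weights form a valid mixture; any other normalisation would fit the claim language equally well.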
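Third, claims 9, 13 and 19 predict weight information for a set of basic expressions and adjust the basic expressions accordingly, which resembles blendshape mixing. A sketch of that final blending step, assuming the weights have already been produced by some model and that the first basis entry is the neutral face (both assumptions):

```python
import numpy as np

def blend_basic_expressions(basis, weights):
    """Hypothetical blendshape mixing (claims 9, 13, 19).

    basis:   basic expressions of the character, shape (K, V, 3)
             (K expressions over V mesh vertices).
    weights: predicted weight information, shape (K,).
    Returns facial expression information as a deformed mesh, shape (V, 3).
    """
    neutral = basis[0]
    offsets = basis[1:] - neutral            # per-expression vertex offsets
    # Weighted sum of offsets added to the neutral face.
    return neutral + np.tensordot(weights[1:], offsets, axes=1)
```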
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310126607 | 2023-02-16 | ||
CN202310126607.3 | 2023-02-16 | ||
CN202310544883.1 | 2023-05-15 | ||
CN202310544883.1A CN118537452A (en) | 2023-02-16 | 2023-05-15 | Virtual character performance content display method and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024169207A1 true WO2024169207A1 (en) | 2024-08-22 |
Family
ID=92388423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/124424 WO2024169207A1 (en) | 2023-02-16 | 2023-10-13 | Method of displaying performance content of virtual character and related device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118537452A (en) |
WO (1) | WO2024169207A1 (en) |
- 2023
- 2023-05-15 CN CN202310544883.1A patent/CN118537452A/en active Pending
- 2023-10-13 WO PCT/CN2023/124424 patent/WO2024169207A1/en unknown
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111489424A (en) * | 2020-04-10 | 2020-08-04 | 网易(杭州)网络有限公司 | Virtual character expression generation method, control method, device and terminal equipment |
CN114554111A (en) * | 2022-02-22 | 2022-05-27 | 广州繁星互娱信息科技有限公司 | Video generation method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN118537452A (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021104110A1 (en) | Voice matching method and related device | |
JP2021192222A (en) | Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program | |
JP7479750B2 (en) | Virtual video live broadcast processing method and device, electronic device | |
CN112883149B (en) | Natural language processing method and device | |
WO2023284435A1 (en) | Method and apparatus for generating animation | |
CN113064968B (en) | Social media emotion analysis method and system based on tensor fusion network | |
WO2022179603A1 (en) | Augmented reality method and related device thereof | |
CN117152843B (en) | Digital person action control method and system | |
CN112634413B (en) | Method, apparatus, device and storage medium for generating model and generating 3D animation | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
Gao | A two-channel attention mechanism-based MobileNetV2 and bidirectional long short memory network for multi-modal dimension dance emotion recognition | |
WO2023284634A1 (en) | Data processing method and related device | |
CN116524924A (en) | Digital human interaction control method, device, electronic equipment and storage medium | |
US20230130287A1 (en) | Light-weight machine learning models for lip sync animation on mobile devices or other devices | |
CN115222847A (en) | Animation data generation method and device based on neural network and related products | |
Mehmood et al. | Automatically human action recognition (HAR) with view variation from skeleton means of adaptive transformer network | |
CN117877125B (en) | Action recognition and model training method and device, electronic equipment and storage medium | |
WO2024066549A1 (en) | Data processing method and related device | |
CN117857892B (en) | Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence | |
WO2024109910A1 (en) | Generative model training method and apparatus and data conversion method and apparatus | |
WO2024169207A1 (en) | Method of displaying performance content of virtual character and related device | |
WO2024067113A1 (en) | Action prediction method and related device thereof | |
Usman et al. | Skeleton-based motion prediction: A survey | |
WO2023142886A1 (en) | Expression transfer method, model training method, and device | |
Zhang et al. | Skeleton‐Guided Action Recognition with Multistream 3D Convolutional Neural Network for Elderly‐Care Robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23922325; Country of ref document: EP; Kind code of ref document: A1 |